Johan and I were overjoyed: last week Last.fm sent us to the Hadoop Summit 2008 in Santa Clara, California. Under Johan's wing, Last.fm became one of the earliest adopters of Doug Cutting's Hadoop, and I'm a frequent user myself.
And we had an excellent time. The conference was great, as expected, and we had lots of interesting conversations with people from all kinds of backgrounds. We spent the rest of our trip meeting people from other companies (Facebook, Powerset, and others), discussing technology (we're currently really interested in HBase), the various issues that arise from having to cope with increasingly large data sets, and so on.
It was very apparent that we're witnessing the emergence of a new culture of data teams at Internet startups and corporations that manage larger and larger data sets and want better mechanisms for storage, offline processing and analysis. Many are unhappy with existing solutions: they solve the wrong problems, are based on ancient storage/processing models, are too expensive, or are built on unsuitable infrastructure designs. The ideal computing model in this context is a distributed architecture: if your current system is at its limits, you can just add more machines.
One current trend within the Hadoop community is the emergence of processing models on a higher level of abstraction; these usually incorporate a unified model to manage schemas/data structures, and data flow query languages that often bear a striking resemblance to SQL. But they're not trying to imitate relational databases -- e.g. nobody is interested in transactions or low latency. These are offline processing systems.
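To illustrate the gap in abstraction levels, here's a toy word count written in the explicit map-and-reduce style (plain Python standing in for the model -- this isn't Hadoop's actual API, and the example names are mine):

```python
from itertools import groupby
from operator import itemgetter

# Toy word count in the raw MapReduce style: the programmer writes
# the map and reduce functions by hand, and the framework handles
# the shuffle/sort in between (simulated here with sorted+groupby).

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # The "shuffle": bring identical keys together, as the framework would.
    grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
    for word, group in grouped:
        yield (word, sum(count for _, count in group))

lines = ["the quick fox", "the lazy dog", "the fox"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)  # {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

A SQL-like layer reduces that same job to something on the order of `SELECT word, COUNT(*) ... GROUP BY word` and compiles it down to map and reduce steps behind the scenes -- which is the appeal of these higher-level languages.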
My personal favourite among these is Facebook's Hive, which could be described as their approach to a data warehousing model on top of MapReduce; it may see an open source release this year (but you never know with these projects). Then there's Pig, Jaql, and others.
Microsoft has a research project along those lines called Dryad (I think we saw Michael present at one of last year's Google Open Source Jams in London), and I'm quite impressed by their approach. Since they can rely on an existing, well-integrated infrastructure, they can concentrate on solving the core issues: Dryad implements an execution engine for arbitrary distributed processing jobs that self-optimises by transforming a data flow graph, and it integrates with LINQ, their embedded query language. Programs can then load data from arbitrary sources (SQL Server, file stores, ...).
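The core idea behind LINQ-style integration is deferred execution: the query builds a description of the computation, and the engine can rewrite that plan before running it. A crude sketch in Python (my own toy class, nothing to do with Dryad's real API):

```python
# Deferred execution in miniature: method calls only record operations
# in a plan; nothing runs until run() is called, which is where an
# engine like Dryad gets the chance to optimise the data flow graph.

class Query:
    def __init__(self, source, ops=()):
        self.source = source
        self.ops = ops  # the "plan": a tuple of (kind, function) steps

    def where(self, pred):
        return Query(self.source, self.ops + (("filter", pred),))

    def select(self, fn):
        return Query(self.source, self.ops + (("map", fn),))

    def run(self):
        # A real engine would inspect and rewrite self.ops here.
        items = iter(self.source)
        for kind, fn in self.ops:
            items = filter(fn, items) if kind == "filter" else map(fn, items)
        return list(items)

q = Query(range(10)).where(lambda x: x % 2 == 0).select(lambda x: x * x)
print(q.run())  # [0, 4, 16, 36, 64]
```

The nice part is that the same program text can target a single machine or a cluster, because execution strategy is decided at run() time rather than baked into the code.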
So Microsoft is already working at much higher levels of abstraction, whereas everybody else has to start by first building up some basic infrastructure. It's quite clear that the lack of integration between many of the open source projects in this field results in a duplication of effort; but there was also a clear aversion among the attendees towards such proprietary systems. (Maybe not surprising at a conference for an open source project.)
I also didn't realise that Yahoo played such a big part in Hadoop development. They obviously regard it as a core component of their infrastructure roadmap. Doug Cutting was employed by Yahoo when the project was in its infancy, and 80% of the project's commits are by Yahoo employees. In other words, the biggest beneficiary of Google's publication of the MapReduce paper turned out to be their largest competitor.
Another issue that came up in conversations was the impact a Microsoft/Yahoo merger may have on Yahoo's open source projects -- apparently there is a good chance that MS may decide to switch Yahoo over to their own distributed processing and search infrastructures. (And I wouldn't judge them for it, cf. above.)
Update: Ah, I almost forgot: Last.fm is hiring people to build data warehouses and stuff! :)