Last week, the Register announced that Google moved “away from MapReduce.” Given that several companies adopted MapReduce (hence copying Google), is Google moving a step ahead of its copycats? Moreover, Tony Bain is asking today whether Stonebraker was right in stating that MapReduce was “a giant step backward.” Is MapReduce itself any good?
As reported by the Register, one problem with MapReduce is that it is essentially batch-processing oriented. Once you start the process, you can’t easily update the input data and expect the output to be sane. Thus, MapReduce is poor at real-time processing. Yet, it will remain fine for latency-oblivious applications such as Extract-Transform-Load or number crunching.
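To make the batch constraint concrete, here is a minimal sketch (plain Python, not Hadoop or Google code; the names and data are mine) of a MapReduce-style word count. The whole input is consumed in a single pass, so any change to the input means re-running the entire job.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) pairs for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: group by key and sum the counts.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# The input is processed as one batch: if a document is added or
# modified afterwards, the whole job must be re-run to get a
# consistent result. There is no incremental update path.
documents = ["the web as a library", "the web as a stream"]
print(reduce_phase(map_phase(documents)))
```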
I now expect Google to index my blog posts within minutes after I post them. Google had to update its batch-oriented architecture for a real-time indexing approach. However, it is unclear whether this puts Google technologically ahead of, say, Microsoft Bing.
The big picture is maybe more interesting. We used to view the Web as a large collection of documents—as a library. Indexes updated daily were just fine. We now view the Web as an endless stream of data—like a live meeting between billions of people.
Further reading: Julian Hyde, Data in Flight, ACM Queue, 2009.
It does seem like we need methods that refine incrementally.
With the effervescence of activity in social media (many true experts now getting into that game) and the increasingly rapid creation of new knowledge, models, and frameworks, knowledge is becoming obsolete faster. It seems more and more important to know about activity that occurs in the present, even as it builds on the past.
(My argument is that there is a necessary coevolution between search and the kind of fine-grained open collaboration that is now emerging on the Web.)
It’s true you don’t want to use a system designed for large scale batch processing for tasks that aren’t large scale batch processing.
For example, there was an article a while back talking about how Gmail ran into problems because its data storage was ultimately layered on top of GFS, which isn’t designed for random-access workloads:
http://glinden.blogspot.com/2010/03/gfs-and-its-evolution.html
That being said, I think the Register article is badly overstated. Incremental index updates are run out of Bigtable, but full index rebuilds are probably still run out of MapReduce/GFS. Moreover, Bigtable itself is layered on top of GFS.
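(For what it’s worth, here is a toy sketch, in Python and with names of my own invention, of the hybrid model this reading suggests: a live index that takes incremental per-document updates, plus an occasional full batch rebuild of the same index.)

```python
def build_index_batch(documents):
    # Full rebuild: process every document in one pass (MapReduce-style).
    index = {}
    for doc_id, text in documents.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index

def update_index_incremental(index, doc_id, text):
    # Incremental update: fold a single new or changed document into the
    # existing index (conceptually, row mutations against a live table).
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

documents = {"d1": "map reduce is batch", "d2": "bigtable is incremental"}
index = build_index_batch(documents)                        # periodic full rebuild
update_index_incremental(index, "d3", "new post arrives")   # real-time path
print(sorted(index["is"]))
```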
1. Stonebraker, IMHO, created a huge mess, because MapReduce is nowhere close to a database system. The whole comparison does not make much sense.
2. Most data is still static. Dynamic data, of course, needs special treatment.
Google insiders’ reactions to that article were “What? That’s a weird reading. Buh? Lipkovitz was misquoted!” and “what a crappy article.”
The article is, apparently, very misleading and, in places, downright wrong. Google built something cool and new, but in no way are they moving away from MapReduce.
And, on a personal note, Stonebraker seems like an ass.
I know exactly what is happening but can’t say because of an NDA; the article is not far from the truth.
(a) MapReduce is a wonderful tool for a large open class of problems.
(b) from the very start, Google had other ways to run distributed processing atop GFS (not just MapReduce)
(c) if I believed in conspiracy theories, I’d state that Google’s original paper about MapReduce was a smart decoy meant to create confusion among “fast followers” and send them down the wrong trail by downplaying the importance of GFS (compared to MapReduce).
One of the reasons for Hadoop’s success is that (unlike other similar attempts) it focused on its file system from the very start.