MapReduce — a Major Step Backwards? 157
The Database Column has an interesting, if negative, look at MapReduce and what it means for the database community. MapReduce is a software framework developed by Google to handle parallel computations over large data sets on cheap or unreliable clusters of computers. "As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is: a giant step backward in the programming paradigm for large-scale data intensive applications; a sub-optimal implementation, in that it uses brute force instead of indexing; not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago; missing most of the features that are routinely included in current DBMS; incompatible with all of the tools DBMS users have come to depend on."
may be missing the (data)points (Score:5, Insightful)
I don't know why this article is so harshly critical of MapReduce. They base their criticism on the following five tenets, which they elaborate in detail in the article:
If you take the time to read the article, you'll find they use axiomatic arguments with lemmas like "schemas are good" and "separation of the schema from the application is good", etc. First, they assume these points are relevant and germane to MapReduce. But, mostly, they aren't.
Also taking the five tenets listed, here are my observations:
they don't offer any proof, merely their view... However, the fact that Google used this technique to regenerate their entire internet index leads me to believe that if this were indeed a giant step backward, we must have been pretty darned evolved to step "back" into such a backwards approach
Not sure why brute force is such a poor choice, especially given what this technique is used for. From wikipedia:
Again, not sure why something "old" represents something "bad". The most reliable rockets for getting our space satellites into orbit are the oldest ones.
I would also argue their bold approach to applying these techniques in such a massively aggregated architecture is at least a little novel, and based on results of how Google has used it, effective.
They're mistakenly assuming this is for database programming
See previous bullet
Are these guys just trying to stake a reputation based on being critical of Google?
Re:may be missing the (data)points (Score:4, Insightful)
Re:may be missing the (data)points (Score:5, Funny)
It's also terrible for painting.
Re: (Score:2)
I am a little surprised by the parallelization methods, though. Informix developed some parallel database methods a while back, which is partly why IBM bought them. I'm sure that the cutting-edge parallel database techniques that exist today have advanced beyond heuristics.
Re:may be missing the (data)points (Score:5, Informative)
Re:may be missing the (data)points (Score:5, Interesting)
Re: (Score:3, Insightful)
Re: (Score:2)
While a good point, it's still irrelevant to Google and Map-Reduce, because Google's search engine is NOT an RDBMS. It's almost pure indexing, and what they are doing is comparing, say, Oracle to a specific B+-tree implementation. They are seeing the Map-Reduce algorithm purely from an RDBMS perspective, not from a "let's solve this specific problem" perspective.
I'
Re:may be missing the (data)points (Score:5, Funny)
6. New things are scary.
7. Google is on their lawn.
8. Matlock is the best television show ever.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
I don't know much about database theory, but do know that Michael Stonebraker already has a reputation.
Re:may be missing the (data)points (Score:4, Insightful)
Re: (Score:2)
I've bumped into this attitude in the little bit of time I spent as a developer: people who think that every last bit of configuration and data can (and must!) be crammed into a relational model, whether it belongs there or not. Performan
Re: (Score:2)
Re: (Score:2)
Wait, are they "experts", though? (Score:2)
That does not sound at all like a database expert to me. It's a simple many-to-many relationship!
Re:may be missing the (data)points (Score:4, Interesting)
The primary grounds for complaint seem to be "this isn't the way we do things in the database world". Each of the complaints (except #3) boils down to this:
1. The database community had its arguments a few decades back and settled, at the time, on a set of conventions; MapReduce doesn't follow them and is, therefore, bad.
2. All databases use one of two kinds of indexes to accelerate data access; MapReduce doesn't and is, therefore, bad.
3. Databases do something like MapReduce, so MapReduce isn't necessary.
4. Modern databases tend to offer a variety of support utilities and features that MapReduce doesn't, so MapReduce is bad.
5. MapReduce isn't out-of-the-box compatible with existing tools designed to work with existing databases and is, therefore, bad.
And it's from The Database Column, a blog which, by its own "About" page, is made up of experts from the database industry.
I suspect part of the reason they are harshly critical is that this is a technology whose adoption and use in large, data-centric tasks is (regardless of efficiency) a threat to the market value of the skills in which they've invested years and $$ developing expertise.
At the end, they note (as an afterthought) that they recognize that MapReduce is an underlying approach, and that there are ongoing projects to build DBMSs on top of it. Considered for more than a second, that fact explodes all of their criticism, which is premised entirely on the idea that MapReduce is intended as a general-purpose replacement for existing DBMSs. In fact it is a lower-level technology, currently used stand-alone for applications where current RDBMSs do not provide adequate performance (regardless of their other features), and one on which DBMS implementations (with all the features they complain MapReduce lacks) might, in the future, be built.
Re: (Score:3, Insightful)
Re: (Score:3)
Indexing is useless here. (Score:5, Insightful)
This works well if you can create such a slice: a piece of data you will match against. It becomes increasingly unwieldy when there are many ways to match the data, since multiple columns mean multiple indices. And if you remove columns entirely, making each record just a long string, and start matching random words within it, an index becomes useless: the hashes grow bigger than the chunks of data they match against, and indexing every combination of words you might search for produces an index bigger than the database itself. In short, indexes don't work well against free-form data searched in arbitrary ways.
Imagine a database whose main column is VARCHAR(255), using nearly the full length of it, searched with lots of LIKE and AND clauses picking various short pieces out of that column, with the database terabytes in size. Try to invent a way to index that.
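A minimal sketch of the brute-force alternative being described, in the map/reduce shape (the records, chunking, and search terms here are all made up for illustration):

```python
from functools import reduce

# Hypothetical free-form records: one long string each, no fixed columns.
records = [
    "error 500 on node-7 while flushing cache to disk",
    "user alice logged in from 10.0.0.3",
    "error 404 for /index.html from 10.0.0.3",
]

def matches(record, terms):
    # Brute force: test every term against the whole string
    # (the moral equivalent of LIKE '%...%' AND LIKE '%...%').
    return all(t in record for t in terms)

def map_chunk(chunk, terms):
    # Map step: scan one chunk sequentially; no index is consulted.
    return [r for r in chunk if matches(r, terms)]

def reduce_results(a, b):
    # Reduce step: merging partial results is trivial concatenation.
    return a + b

chunks = [records[:2], records[2:]]  # pretend these live on different machines
partials = [map_chunk(c, ["error", "10.0.0.3"]) for c in chunks]
hits = reduce(reduce_results, partials, [])
print(hits)  # ['error 404 for /index.html from 10.0.0.3']
```

Each chunk is scanned independently, so the scan parallelizes across as many machines as you have chunks, with no index to build or keep current.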
Re:Indexing is useless here. (Score:4, Funny)
Re: (Score:2)
This is also an old problem, with well-understood solutions. Nearly all modern RDBMS systems ship with a built-in, well-documented solution to this.
Google = statistical database? (Score:3, Insightful)
The thing is if Google uses this to create their index-like structure of the internet for their search engine, and it is not exactly like a RDBMS, well, so what? The MapReduce thing seems to be targeted at large sets of data and semi-accurate data mining, not exact results. No one really cares if there are 3,000
Re:may be missing the (data)points (Score:5, Interesting)
> Are these guys just trying to stake a reputation based on being critical of Google?
Um... yes?
The Database Column is being coy about being a corporate blog for Vertica, a high-performance database product, but in fact it is. Vertica is a commercial implementation of C-Store and was founded by Michael Stonebraker, the most prominent proponent of column-based databases (get it? "the database column"). So yes, they have a very good reason to be hostile to Google.
http://www.vertica.com/company/leadership [vertica.com]
http://en.wikipedia.org/wiki/C-Store [wikipedia.org]
http://en.wikipedia.org/wiki/Michael_Stonebraker [wikipedia.org]
http://www.databasecolumn.com/2007/09/contributors.html [databasecolumn.com]
Re:may be missing the (data)points (Score:4, Insightful)
1) No indexing.
Which means
2) Certain types of constraints probably don't work (such as UNIQUE constraints)
Which also means
3) Referential integrity checking and other things don't work.
This leads to the conclusion that the idea is good for certain types of data-intensive but not integrity-intensive applications (think Ruby on Rails-type apps) but *not* good for anything Edgar Codd had in mind....
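A toy sketch of the chain of reasoning above (names and data are invented): without any index, enforcing something like a UNIQUE constraint means a full scan on every insert, which is exactly what integrity-intensive applications cannot afford.

```python
rows = ["alice", "bob"]

def insert_no_index(table, key):
    # No index: the only way to check uniqueness is a full scan, O(n) per insert.
    if any(existing == key for existing in table):
        raise ValueError("duplicate key")
    table.append(key)

index = set(rows)  # a set standing in for a unique index

def insert_with_index(table, key):
    # With an index: an O(1) amortized hash lookup replaces the scan.
    if key in index:
        raise ValueError("duplicate key")
    index.add(key)
    table.append(key)

insert_no_index(rows, "cleo")
print(rows)  # ['alice', 'bob', 'cleo']
```

For bulk, append-mostly workloads (the MapReduce sweet spot) the scan never happens because the constraint is never declared; for OLTP-style integrity checking, the index is what makes the constraint affordable.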
Re: (Score:3, Interesting)
ObDilbert (Score:2)
Re: (Score:2)
Relational databases are so routinely used in applications they aren't a good fit for that I suspect many relational database "experts" aren't even acquainted with the practice of selecting computing paradigms and technologies that are actually a good fit for the application at hand.
Re: (Score:2)
A giant step backward in the programming paradigm for large-scale data intensive applications
they don't offer any proof, merely their view...
Actually they do. Their proof is that this sort of approach was tried several decades ago, and has been found to be a fundamentally lacking approach for general purpose data processing. They provide several examples.
Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
Again, not sure why something "old" represents something "bad". The most reliable rockets for getting our space satellites into orbit are the oldest ones.
It's not bad because it's 'old'. It's bad because it was rejected by the community consensus a long time ago as a general-purpose solution. Their argument isn't that old = bad. Their argument is that ignoring the lessons of the past is bad.
Missing most of the features that are routinely included in current DBMS
They're mistakenly assuming this is for database programming
They never state that MapReduce is for database pr
Re:may be missing the (data)points (Score:4, Interesting)
Google's contribution (and yes, the technique does predate them by a long time) is to point out that MapReduce is a bit more than an algorithm: it is a design pattern. Design patterns help us write clean code by establishing a consistent vocabulary (e.g. actors, containers, operators), and, more importantly, they make algorithms accessible to programmers. Right now we badly need more well-defined design patterns in parallel computing, because parallel computing is essentially the future of programming.
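The pattern itself fits in a few lines. Here is a toy word count in the MapReduce shape (the function names are illustrative, not Google's API; the "shuffle" step that a real framework performs across machines is simulated with a dictionary):

```python
from collections import defaultdict

def mapper(document):
    # Map: emit (key, value) pairs independently for each input record.
    return [(word, 1) for word in document.split()]

def reducer(key, values):
    # Reduce: combine all values that share a key.
    return key, sum(values)

def map_reduce(documents):
    # Shuffle: group emitted pairs by key (the framework's job in a real system).
    groups = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

counts = map_reduce(["the map step", "the reduce step"])
print(counts)  # {'the': 2, 'map': 1, 'step': 2, 'reduce': 1}
```

The point of the pattern is that the programmer writes only `mapper` and `reducer`, both of which are trivially parallelizable; everything else (distribution, grouping, retry) belongs to the framework.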
Re: (Score:2, Interesting)
Re: (Score:2)
Just watch. (Score:2, Insightful)
And watch. It'll be massively successful because it works.
Blink blink (Score:4, Funny)
Re:Blink blink (Score:5, Funny)
Re: (Score:2)
Not that I'm accusing you of anything like that.
I'm fired, aren't I?[/obligatory simpsons quote]
Fnord? (Score:2)
Does that mean Paradigm is a Fnord? As in, I can now say stuff you won't be able to consciously read, because it has the Fnord Paradigm in it?
Databases? WTF? (Score:5, Insightful)
Since when did MapReduce have anything to do with databases? It's actually about parallel computations, which are entirely different.
Re: (Score:2)
Re: (Score:2)
MapReduce is a tool, one of whose principal applications is conducting queries on large bodies of data consisting of records of similar structure. It, therefore, competes with traditional DBMSs to a degree.
Now, (largely because of the limitations the authors note), it generally is only used currently for the kind of applications where setting up a traditional RDBMS to handle them would be impractical: Google developed their implementation of MapRed
Re: (Score:2)
Responding to myself is bad form, but:
Obviously, this only makes sense taking "similar structure" extremely loosely; still, the point is that MapReduce was developed to fill a niche for which RDBMSs were being used previously, in the absence of a more specialized tool, so it clearly competes with them, to
Re: (Score:2)
And no, it is not really a step backwards for databases. It is actually something which offers a niche solution for large-scale, single-purpose, semi-accurate databases.
This is almost but not entirely unlike what Codd had in mind when he wrote his seminal paper: "A Relational Model of Data for Large Shared Data Banks."
If it were a paradigm shift, it would be a step backward. However, as "one more tool in the toolbox" it is useful in some cases where RDBMS's are not.
I figured it out. Really. (Score:2)
Re: (Score:2)
Re: (Score:3, Insightful)
1) The fact that MapReduce is being used for specific low level applications does not make it intrinsically different or uncomparable to an RDBMS, although it may n
Re: (Score:3, Insightful)
I think TFA is being silly in trying to compare MapReduce to DBMSs. Yes, of course MapReduce compares unfavorably, because it isn't a DBMS. The comment that MapReduce is "A
Re: (Score:2)
I don't think it's pointless. Database theory actually explains why RDBMSes are so popular: they are flexible enough to solve many problems, and people find it easy to think in those terms and apply that viewpoint.
Re: (Score:2)
The article seems to assume that MapReduce is trying to compete with RDBMSs, and even attacks the authors of MapReduce, suggesting that they should read up on database theory. An article which simply argued that
Re: (Score:2)
Re: (Score:3, Insightful)
Money, meet mouth (Score:4, Insightful)
Now, this is not to say that a more sophisticated approach wouldn't work. It's just that when you have thousands of boxes on a few Ethernet segments, communication overhead becomes quite large: so large, in fact, that brute-force computation is usually worth whatever it wastes. Consider that, from what I've heard, at Google these thousands of boxes are mostly containers for RAM modules, so there is rather a lot of computation power per gigabyte available to spend on a brute-force approach.
Also, I would like to point out that map/reduce is demonstrated to work. Apparently quite well too. Certainly better than any hypothetical "better" massively parallel RDBMS available in a production quality implementation today.
Re: (Score:3, Interesting)
I recently read somewhere (if only I could recall the link...) that on average Google's MapReduce jobs process something on the order of 100 GB/second, 24/7/365.
I've got nothing against RDBMS... but how can you be critical about a tool that scales and performs so well? It's just a matter of selecting and using the right tool for the job.
Re: (Score:2, Informative)
Re: (Score:2)
> their paradigms to datasets that are measured in the tens of terabytes
> and stored on thousands of computers.
I'm not defending the authors of the article, but...
1. this article wasn't written by rdbms people, but rather by column database people. There's nothing traditional or relational about their background.
2. Every solution is a result of a variety of compromises, there is no such thing as the perfect solution outside of
Re: (Score:2)
I'm not defending the authors of the article, but...
1. this article wasn't written by rdbms people, but rather by column database people. There's nothing traditional or relational about their background.
A column database [wikipedia.org] is a particular way of implementing relational databases. David DeWitt (faculty homepage [wisc.edu], wikipedia entry [wikipedia.org]) is best known for object-relational work, which is in the traditional relational area. Michael Stonebraker [wikipedia.org], the other author, is probably the best known relational database person living today (though C.J. Date might be another candidate).
Eivind.
As one of the comments on the blog ... (Score:4, Insightful)
"You seem to not have noticed that mapreduce is not a DBMS."
Exactly. These are the same sort of criticisms that you hear around memcached [danga.com] - the feature set is smaller, etc - and they make the same mistake. It's not a DBMS, and it's not supposed to be. But it does what it does quite well nonetheless!
distributed indexes? (Score:2)
Isn't the overhead of a distributed index usually not worth the bother? This scheme sounds similar to the way Teradata handles its distribution and it manages to get a lot done with hardly any secondary indexes. I think the thinking in the article indicates standalone database server box thinking.
Ideas ahead of their time? (Score:5, Insightful)
There are many classic/old techniques which are only now being used - and very successfully - precisely because the hardware simply wasn't there. A recent
Sometimes the "old" methods are best - you just need the horsepower to pull it off. Clever improvements only scale so long.
Re: (Score:2)
The rest of the article is just DB-centric whining.
Bad Perspective (Score:2)
Certainly, if you were to implement map/reduce within the confines of the relational database world, there are steps you would need to take to make it easier for the RDBMS developer to work with the storage and querying mechanisms.
The article implies that map-reduce is bad because it doesn't place restrictions common to the dat
A completely uninformed analysis (Score:3, Insightful)
Even more importantly, you can create schemas with MapReduce through how you write your Map and Reduce functions. This is a matter of data/function duality (all data can be represented as a function; likewise, all functions can be represented as data). I admit ignorance of how this MapReduce system works, but I would be surprised if you couldn't get a relational database back out.
The advantage you get with MapReduce is that you aren't necessarily tied to a single representation of the data. Especially for companies like Google, which may want to create dynamic groupings of data, this could be a big win. Again, this is all speculative, as I have very little experience with these systems.
A Very Human Response (Score:3, Insightful)
Re: (Score:2)
That's not the "problem" (from the perspective of the authors of TFA), really.
The problem is that it provides an alternative to work that has been going on in industry, and in particular that it provides a way to end-run some of the limitations of traditional databases that
belly acres (Score:2)
As though these were the exclusive choices. TFA goes on to complain about implementing 25-year-old ideas, though they are actually rather older than that; they just didn't strike the RDB types until the eighties. They proceed to insist that the system cannot scale. Arguing against Google's scalability is like arguing against gravity.
FTFA (Score:5, Insightful)
That's a joke, right?
I think Google's already taken care of all the experimental evaluations you'd need.
Re: (Score:2)
I know, that's what I thought.
But then again... a few weeks ago I was involved in a phone call with one of our clients. They're a huge client for us, to a degree that they can significantly influence the future direction of our product by complaining loud enough, and our first client to use some new "high-availability" features we're gradually rolling out.
In the course of our conversation, one of the client's guys essentially pooped on a large part of our product roadmap, basically beca
Re: (Score:2)
A step from where? (Score:4, Funny)
Re: (Score:2)
Re: (Score:2)
Translation: (Score:2)
Missing the forest for the trees... (Score:4, Insightful)
Comparing it to a DBMS on fanciness is pointless, because the DBMS solution fails where MapReduce succeeds.
Step backward? (Score:2)
Huh? (Score:2)
Vertica (Score:4, Interesting)
Vertica launches database-focused blog (Score:2)
Re: (Score:2)
The are afraid... (Score:2)
Traditional RDBMSes have their place, but we're going to see a lot more applica
like Spider Robinson sang.. (Score:2, Funny)
So I could shift my pair 'a dimes..."
Article really misses the point (Score:5, Insightful)
Also, I had a major WTF moment when I read this:
Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale.
Empirical evidence to date suggests that MapReduce scales insanely well. Exhibit A: Google, which uses MapReduce running on literally thousands of servers at a time to chew through literally hundreds of terabytes of data. (Google uses MapReduce to index the entire World Wide Web!)
This in turn suggests that the authors of TFA are firmly ensconced in the ivory tower.
They complained that brute-force is slower than indexed searches. Well, nothing about MapReduce rules out the use of indexes; and for common problems, Google can add indexes as desired. (Google uses MapReduce to build their index to the Web in the first place.) And because Google adds servers by the rackful, they have quite a lot of CPU power just waiting to be used. Brute force might not be slower if you split it across thousands of servers!
Likewise, they complain that one can't use standard database report-generating tools with MapReduce; but if the Reduce tasks insert their results into a standard database, one could then use any standard report-generating tools.
MapReduce lets Google folks do crazy one-off jobs like ask every single server they own to check through their system logs for a particular error, and if it's found, return a bunch of config files and log files. Even if you had some sort of distributed database that could run on thousands of machines, any of which might die at any moment, and if you planned ahead and set the machines to copy their system logs into the database, I don't see how a database would be better for that task. That's just a single task I just invented as an example; there are many others, and MapReduce can do them all.
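That log-grep job sketches out very naturally in the map/reduce shape (the hostnames, log lines, and error string below are all invented; in a real deployment each map task would run on its own host against local disk):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-host system logs.
logs = {
    "host-1": ["boot ok", "ECC error in DIMM 2"],
    "host-2": ["boot ok", "all checks passed"],
    "host-3": ["ECC error in DIMM 0", "fan failure"],
}

def map_task(host):
    # Map: each host greps its own log locally and ships back only the matches.
    return [(host, line) for line in logs[host] if "ECC error" in line]

# Threads stand in for the cluster; a dead host would just get its task re-sent.
with ThreadPoolExecutor() as pool:
    partials = pool.map(map_task, logs)

# Reduce: flatten the per-host matches into one report.
report = [hit for partial in partials for hit in partial]
print(report)
```

The one-off nature of the job is the point: nothing had to be loaded into a database ahead of time, and the "query" is just an arbitrary function shipped to where the data already lives.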
And one of the coolest things about MapReduce is how well it copes with failure. Inevitably some servers will respond very slowly, or will die and not respond; the MapReduce scheduler detects this and sends the Map tasks out to other servers so the job still finishes quickly. And Google keeps statistics on how often a computer is slow. At a lecture, I heard a Google guy explain how there was a BIOS bug that made one server in 50 disable some cache memory, thus greatly slowing down server performance; the MapReduce statistics helped them notice they had a problem, and isolate which computers had the problem.
MapReduce lets you run arbitrary jobs across thousands of machines at once, and all the authors of the article seem to be able to see is that it's not as database-oriented as a real database.
steveha
Re: (Score:2)
Well, I didn't RTFA, but I also had a major WTF moment when I read that line. I don't understand what all the buzz is about; map/reduce is an old approach that is known to work well. Also, it scales the best way any algorithm could scale; its only bottleneck is data
Article misses the point of MapReduce/RDBMS (Score:2)
The whole point of a relational DBMS is to store, link and maintain the integrity of data in tables based on the relationships among the data.
MapReduce is about processing data... it's not focused on maintaining integrity, and the kinds of datasets suitable for MapReduce probably don't have well defined relationships.
Re: (Score:2)
Or maybe talking the (free!) competition from a blog launched by a company trying to sell a different alternative to traditional databases (but one which outwardly looks more like a traditional RDBMS) for an overlapping problem domain (that is, column-oriented databases, which address some of the same distribution and parallelization issues that MapReduce addresses, and target some of the same areas [e.g., "big science"] where it has been suggested that MapReduce mig
Re: (Score:2)
Re: (Score:2)
In related news: Screwdrivers suck because... (Score:5, Funny)
2) They don't work like hammers,
3) You can already drive in a screw with a hammer,
4) They aren't good at ripping out nails, and
5) They aren't good at driving nails.
Brought to you by The Hammer Column, a blog written by experts in the hammer industry, and launched by Hammertron, makers of a revolutionary new kind of hammer [vertica.com].
They have a point. And it matters (Score:2)
I understand what they're getting at. What makes modern SQL-driven databases so useful is that they optimize queries. If you're asking for every entry in A that's also in B, any modern database will check whether it's faster to look up every A in B, every B in A, or do a match where both databases are read through sequentially by the same key. The best choice depends on the database record counts, available indices, and key types and lengths. The database system figures that out; it's not in the SQL que
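The optimizer choice being described ("every entry in A that's also in B") can be sketched in a few lines. This is a deliberately toy version with a made-up cost threshold, not how any real optimizer prices plans:

```python
def hash_join(small, large):
    # Probe the larger side against a hash of the smaller side: O(|small| + |large|).
    lookup = set(small)
    return [x for x in large if x in lookup]

def merge_join(a, b):
    # Walk two sorted inputs in lockstep: sort cost up front, then one linear pass.
    a, b = sorted(a), sorted(b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect(a, b):
    # A toy "optimizer": hash the smaller side when the sizes are lopsided,
    # otherwise sort and merge. Real planners weigh indices and key types too.
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    if len(small) * 10 < len(large):  # made-up cost threshold
        return hash_join(small, large)
    return merge_join(a, b)

print(intersect([3, 1, 2], [2, 3, 4]))  # [2, 3]
```

The point of the comment stands either way: in SQL this choice is the engine's problem, whereas in raw MapReduce the programmer picks a strategy by how they write the map and reduce functions.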
Crawl is concurrent database update, not batch (Score:2)
No, crawling the web isn't a map/reduce type problem. It's a large number of long-running processes feeding a database-like engine.
Map/reduce is for batch-like jobs. Long-running systems with intercommunication have to be organized differently.
This coming from the DB Community? (Score:2)
When rotating HD disks will be replaced by SSDs and
Re: (Score:2)
What, pray tell, would you replace SQL with? Bear in mind you have to make it capable of replacing BILLIONS of lines of stored procedures and db code. And whatever magnificent replacement you invent also has to be a great enough improvement over SQL to be worth the TRILLIONS of dollars necessary to retrain the entire SQL economy.
Go ahead, let's hear your awesome idea! Or were you just being an idiot?
Re: (Score:2)
You know all these frameworks out there that are gaining traction? Rails, Django, CakePHP, etc.? Well, there is this one, Symfony, that uses a powerful PHP DB abstraction layer called 'Propel'. One of its perks is that you don't write any SQL anymore. None. Meaning: you write your transactions and persistence-layer interactions in the same programming language in which you write everything else as well. If SQL is so cool, then why don't we have a different PL for each task? We cou
The only thing wrong with map-reduce... (Score:2)
Stream processing. (Score:2)
It's an apples and oranges comparison, and the author's never eaten an orange.
Index Every Column? (Score:2)
For Query-by-Example-like tools, you often cannot predict which columns need indexing: they ALL do. At some point it just seems easier to split the data sets up onto dozens or hundreds of hard drives and do a sequential search on each one in parallel. I cannot say whether it is clearly faster than indexing every column, but it is certainly simpler from a technical standpoint. And it would possibly require less disk space.
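A toy sketch of that approach (the rows, shards, and column names are invented): when any subset of columns may be constrained, each shard just scans everything and the partial results are concatenated.

```python
# Query-by-example over rows with many columns: no single index helps,
# so each "drive" performs a full sequential scan over its own shard.
shards = [  # pretend each list lives on its own disk
    [{"name": "ada", "city": "london", "role": "eng"},
     {"name": "bob", "city": "paris", "role": "ops"}],
    [{"name": "cleo", "city": "london", "role": "ops"}],
]

def scan_shard(shard, example):
    # Sequential scan: test every row against the example's constrained columns.
    return [row for row in shard
            if all(row.get(col) == val for col, val in example.items())]

def query_by_example(example):
    # Each shard is scanned independently (in parallel on real hardware),
    # then the partial results are concatenated.
    results = []
    for shard in shards:
        results.extend(scan_shard(shard, example))
    return results

print(query_by_example({"city": "london", "role": "ops"}))
```

Adding a drive adds both capacity and scan bandwidth, which is exactly the trade this comment is pointing at: linear hardware instead of an index per column.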
Q: Implementation issues (Score:2)
Re: (Score:2)
Also: How's a DBM supposed to profit off that? (Score:2)
And, if mapreduce doesn't generate vast license income for Oracle, it must suck. Imagine the per-processor charges Google would be paying!
What is MapReduce SPECIFICALLY useful for? (Score:2)
Proponents of MapReduce highlight two advantages:
1. MapReduce makes it very easy to program data transformations, including ones to which relational structures are of little relevance.
2. MapReduce runs in massively parallel mode "for free," without extra programming.
Based on those advantages, MapReduce would indeed seem to have significant uses, including:
* Specialized indexing of large quantities of data.
Re: (Score:2)