(Note: thanks to Daniel for letting me use his blog. All comments, questions and criticisms are appreciated, but in this case Daniel is not the guilty party! Antonio)
As a database researcher, I was “brought up” to think of databases as the ideal tool for data management. There was simply no good reason for anyone with data to manage to use something else. It was with dismay, then, that little by little I came to realize that the importance of databases for data management is diminishing tremendously out there in the real world. The advent of NoSQL was a clear signal for anyone paying attention (although in theory NoSQL does not necessarily mean non-relational, in practice it does). The fact that scientists are using anything but (relational) databases for e-science is another sign that the Apocalypse is upon us (just kidding).
It seems clear to me that there are situations in which a (relational) database is simply not the appropriate tool. This immediately brings forth a question: When is a (relational) database the right tool? My idea for attacking this question was to look at the problems people are trying to solve and the processes they use to solve them, and to classify the data those processes deal with along relevant dimensions. Of course, what is relevant is in the eye of the beholder; but there are some characteristics that have traditionally been challenging: dealing with very large data sets, and dealing with irregular data. So I analyzed processes and their data along those two dimensions.
- data size: to classify data by size, I used an entirely pragmatic measure. Roughly, if a data set fits into memory, it is small (up to 2GB or so in today’s computers); if it fits on a single disk (or, to be exact, a disk array with a single controller), it is medium (up to 1TB or so); if it needs several disks/parallel/distributed systems, it’s large (note that this last category perhaps includes too much, since it goes from a few TB to PBs and beyond). The motivation for this division is clear: algorithms that can assume that all data fits in memory are ‘qualitatively’ different from those that need disk access. And adapting algorithms for parallel/distributed environments presents non-trivial challenges. So yes, these categories are very much dependent on current architecture and technology, and may have to be revised once a great solid-state device is developed. But until a revolution comes, that will just mean moving the exact ‘boundaries’. And I think the division is very relevant for people using the processes, especially for those that require interactive (or near-interactive) response time.
- data complexity: intuitively, this has to do with the data’s structure or organization, but admittedly, it is a much harder nut to crack than the previous one. After some thinking, I decided to subdivide the problem along three ‘sub-dimensions’:
- Conceptual complexity: Intuitively, this is the number of elements (entities, relationships) in a conceptual model of the process’ data. This is clearly a bit vague, since one can build several conceptual models of the same data, but what matters here is not so much which entities one “sees” in the data as how many. Of course, both issues are related, but still there is an intuitive sense in which this matters. So, if a conceptual model identifies one main entity, everything else being attributes or entities “strongly connected” to it, then that data would be considered simple: the idea is that you could keep all the data in one file, with minimal redundancy. If you can identify several entities, then the complexity is medium. How many? There is no clear-cut number here. One possibility is to require that it not be too much for one person to comprehend “at once”. Based on the famous finding of “the magical number seven, plus or minus two”, but taking into account that one can get help from ‘external memory devices’ (i.e. pencil and paper), we could set an upper limit of 10 to 20 entities. More than that, and we consider the conceptual complexity large.
- regularity: relational databases assume that you can define a schema for your data beforehand. However, this is not always possible. There has been much research on dealing with somewhat irregular (semistructured) data, but here I make another distinction that I consider more relevant to today’s processes: whether the vocabulary is ‘closed’ (i.e. all possible entities/attributes/relations can be enumerated once and for all) or ‘open’. Schemas (both relational and object-oriented), taxonomies, ontologies, even DTDs and XML Schema-compliant data are closed. Of course, within those there are distinctions. When the schema comes first (relational and object-oriented) I consider that regularity is high (and complexity low); when schema and data may be decoupled (as in semistructured data), I consider that regularity is medium (and so is complexity). On the open-vocabulary front, regularity is very low (and complexity is very high). Note that key-value stores are considered ‘open’, since in many of them the value is opaque (i.e. it could be anything). This is not really a schema; it’s just a convenient way to distribute data (keys are made up and used for hashing or sorting; they are not inherent to the data).
- schema rate of change: even if one has a schema, or at least a closed vocabulary, one may not be able to take full advantage of this fact if the schema keeps on changing. Relational databases implicitly assume that once a schema is created, there is going to be little to no change. At the other extreme, in some modern processes nothing is assumed about how data evolves, and there is complete freedom: two objects in a collection may have completely different attributes. Also, an object may change to the point that it has nothing in common with the original (except the key, of course). If there is no change, or only very infrequent change, I consider complexity low; if some change is allowed, but within certain limits, I consider it medium; if any type of change is allowed, I consider it high. (The sketch right after this list makes the closed/open contrast concrete.)
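To make that contrast concrete, here is a minimal sketch in Python (my own illustration, with made-up record and field names) of a closed vocabulary with a schema defined up front versus an open, key-value-style collection where objects are free to diverge:

```python
# Closed vocabulary, schema first: every row has exactly these columns,
# declared before any data is loaded (the classic relational assumption).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'London')")

# Open vocabulary, key-value style: the key is just a handle used for hashing
# or sorting, and the value is opaque. Two objects in the same collection can
# have nothing in common, and either can change shape at any time.
store = {
    "user:1": {"name": "Ada", "city": "London"},
    "user:2": {"nickname": "Grace", "projects": ["compiler"], "karma": 42},
}
store["user:2"] = {"archived": True}  # same key, completely different attributes
```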
Thus, we can analyze a process by looking at the data it consumes and/or generates, to decide what kind of system better supports it. The funny thing is, under this analysis, RDBMSs seem adequate only when size and complexity are medium: in particular, data size is medium, conceptual complexity can be handled, but regularity must be high and the schema rate of change must be low. For processes with other characteristics, an RDBMS may not be well suited. Some people will argue that most RDBMSs nowadays can handle very large data and irregular data, since commercial systems come with ‘cluster’-based extensions, as well as extensions to handle XML. But many of the problems facing large datasets are not solved by throwing more hardware at them. For high availability, replication is needed, and this brings issues for transaction support. For complex data analysis, the approaches required are often not supported (or not supported well) by SQL; hence, just being able to store the data is not enough. So for many processes, using a distributed RDBMS or a cluster-based RDBMS will still not do. As for extensions to handle objects, XML, and even text, the problem is that each of these extensions was basically a compromise that yielded unwieldy systems, lost much of the simplicity of the original relational model, and in exchange gave clumsy tools.
What happens in other situations? When your data is small in size and low in complexity, some processes make do with files and some domain-specific programs. This seems to be the case with much e-science. The overhead that a database brings is just not worth it. As for very large data, this seems to be the niche of all these shiny new NoSQL systems.
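Putting the last two paragraphs together, the mapping could be written as a toy decision function. This is just my sketch of the rubric described above, with informal labels, not a definitive rule:

```python
def suggest_storage(size, regularity, schema_change):
    """Toy rubric: size in {'small', 'medium', 'large'},
    regularity in {'high', 'medium', 'low'},
    schema_change in {'low', 'medium', 'high'}."""
    if size == "large":
        # Needs parallel/distributed processing regardless of regularity.
        return "parallel/distributed (NoSQL-style) system"
    if size == "medium" and regularity == "high" and schema_change == "low":
        # The narrow sweet spot argued for in the post.
        return "RDBMS"
    if size == "small" and regularity == "high":
        # Fits in memory and is regular: files plus domain-specific code,
        # as in much e-science.
        return "plain files + domain-specific programs"
    # Everything else: semistructured/document-oriented territory.
    return "semistructured or document store"

print(suggest_storage("medium", "high", "low"))  # RDBMS
print(suggest_storage("large", "low", "high"))   # parallel/distributed system
```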
So, all in all, it seems that RDBMSs are being relegated to a very narrow niche. To be sure, it is still a very profitable one, so there is no big market pressure on database companies (yet, although some of them are already responding, like Oracle with its NoSQL database). One can argue that RDBMSs have been ‘under threat’ before, and have reacted to it, but as a result they seem more bloated than ever to some users, while still failing to satisfy their information needs. Maybe it’s time to rethink the whole thing from scratch?
NoSQL is getting all the attention. It’s new and exciting and all. I myself am using a home-cooked column-oriented DB that’s not relational.
Having said all that, relational is very far from dead. For one thing, relational DBMSs do very well for business-critical OLTP stores. There, they are unassailable. In fact, in any company that uses NoSQL for some task, likely a mission-critical one, you will find numerous uses of relational DBs for day-to-day internal operations. For those uses, NoSQL is simply not a replacement.
In short, NoSQL does well on some rather vertical uses. Relational/SQL does well on pretty much everything else, which is the majority of what’s going on.
Phil Jones
Nah. He’s missed the point entirely. However much you dress it up in fancy names (NoSQL, object databases, etc.), non-relational databases are just hierarchical databases.
Hierarchical databases have the following feature: they’re quick to search *down* the tree, but bloody expensive to search up the tree.
For example, if you have addresses as subtrees of users then, given a user, it’s fast to find where she lives. But it’s correspondingly expensive to find which people live in London.
Relational databases, by keeping everything flat, rebalance that. It’s (more or less) as fast to find who lives in London as it is to find where I live.
The key difference, then, is this: if you can predict in advance the kind of queries you are going to make, and you know they always go in one direction, then NoSQL will beat relational (at any scale).
But if you *don’t* know which way you’ll be asking the question, or you have to ask questions that go both ways (from users to locations AND from locations to users), then relational is better. (Or, at least, it makes all queries equally efficient/inefficient.)
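A tiny sketch of the asymmetry Phil describes, using hypothetical nested documents versus a flat table (Python, purely for illustration):

```python
# Hierarchical / document style: addresses live *under* users.
users = {
    "alice": {"address": {"city": "London"}},
    "bob":   {"address": {"city": "Paris"}},
}

# Down the tree: given a user, finding her city is a couple of lookups.
city = users["alice"]["address"]["city"]

# Up the tree: finding who lives in London means scanning every subtree.
londoners = [u for u, rec in users.items() if rec["address"]["city"] == "London"]

# Relational style: keep it flat, and both directions look the same
# (and an index on either column makes either query fast).
rows = [("alice", "London"), ("bob", "Paris")]
city = next(c for u, c in rows if u == "alice")
londoners = [u for u, c in rows if c == "London"]
```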
How does this discussion tie into the problems with WinFS (http://en.wikipedia.org/wiki/WinFS)?
@Turney
It is hard to tell exactly what WinFS is, but if it is, as some described it, a project that aims to store everything in a big relational database, then you would expect both performance and complexity problems, because data tends to be irregular.
I think there’s an interesting story behind the failure of WinFS. I hope the story goes public some day.
Ehhhh…. It is just time for hybrid approaches: relational is too simple and requires denormalization for efficient processing. Column storage plus the ability to handle arbitrary objects is not a completely new paradigm; it can also be handled within a relational DB. In-memory or not does not matter, because it is not a model characteristic. Likewise, the transaction and persistence models can be more relaxed than the classic ones. It does not mean that RDBMSs are gonna die.
And we have this just in time! 🙂
Facebook shares some secrets on making MySQL scale | Read http://gigaom.com/cloud/facebook-shares-some-secrets-on-making-mysql-scale/