(Note: thanks to Daniel for letting me use his blog. All comments, questions and criticisms are appreciated, but in this case Daniel is not the guilty party! Antonio)
As a database researcher, I was “brought up” to think of databases as the ideal tool for data management. There was simply no good reason for anyone with data to manage to use something else. It was with dismay, then, that little by little I came to realize that the importance of databases for data management is diminishing tremendously out there in the real world. The advent of NoSQL was a clear call for anyone paying attention (although in theory NoSQL does not necessarily mean non-relational, in practice it does). The fact that scientists are using anything but (relational) databases for e-science is another sign that the Apocalypse is upon us (just kidding).
It seems clear to me that there are situations in which a (relational) database is simply not the appropriate tool. This immediately brings forth a question: When is a (relational) database the right tool? My idea for attacking this question was to look at the problems people are trying to solve, the processes they use to solve them, and to classify the data those processes deal with along relevant dimensions. Of course, what is relevant is in the eye of the beholder; but there are some characteristics that have traditionally been challenging: dealing with very large data sets, and dealing with irregular data. So I analyzed processes and their data based on those two dimensions.
- data size: to classify data by size, I used an entirely pragmatic measure. Roughly, if a data set fits into memory, it is small (up to 2GB or so in today’s computers); if it fits on a single disk (or, to be exact, a disk array with a single controller), its size is medium (up to 1TB or so); if it needs several disks/parallel/distributed systems, it’s large (note that this last category includes perhaps too much, since it goes from a few TB to PBs and beyond). The motivation for this division is clear: algorithms that can assume that all data fits in memory are ‘qualitatively’ different from those that need disk access. And adapting algorithms for parallel/distributed environments presents non-trivial challenges. So yes, these categories are very much dependent on current architecture and technology, and may have to be revised once a great solid-state device is developed. But until a revolution comes, that will just mean changing the exact ‘boundaries’. And I think the division is very relevant for people using the processes, especially for those processes that require interactive (or near-interactive) response time.
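The division above can be sketched as a tiny classifier. This is purely illustrative: the thresholds (2GB for memory, 1TB for a single disk) are the pragmatic, technology-dependent cutoffs from the text, not fixed constants, and the function name is made up.

```python
GB = 2**30
TB = 2**40

def size_category(num_bytes):
    """Classify a data set by the hardware it forces you onto."""
    if num_bytes <= 2 * GB:
        return "small"   # fits in memory; in-memory algorithms apply
    elif num_bytes <= 1 * TB:
        return "medium"  # fits on a single disk (or single-controller array)
    else:
        return "large"   # needs parallel/distributed systems

print(size_category(500 * 2**20))  # a 500MB data set -> small
print(size_category(100 * GB))     # -> medium
print(size_category(5 * TB))       # -> large
```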
- data complexity: intuitively, this has to do with the data’s structure or organization, but admittedly, it is a much harder nut to crack than the previous one. After some thinking, I decided to subdivide the problem along 3 ‘sub-dimensions’:
- Conceptual complexity: Intuitively, this is the number of elements (entities, relationships) in a conceptual model of the process’ data. This is admittedly a bit vague, since one can build several conceptual models of the same data, but what matters here is not so much which entities one “sees” in the data as how many. Of course, both issues are related, but there is still an intuitive sense in which the count matters. So, if a conceptual model identifies 1 main entity, with everything else being attributes or entities “strongly connected” to it, then that data would be considered simple - the idea is that you could keep all the data in 1 file, with minimal redundancy. If you can identify several entities, then the complexity is medium. How many? There is no clear-cut number here. One possibility is to say that it’s not too much for one person to comprehend “at once”; based on the famous finding of “the magical number seven, plus or minus two”, but taking into account that one can get help from ‘external memory devices’ (i.e. pencil and paper), we could set an upper limit of 10 to 20 entities. More than that, and we consider the conceptual complexity large.
- regularity: relational databases assume that you can define a schema for your data beforehand. However, this is not always possible. There has been much research on dealing with somewhat irregular (semistructured) data, but here I make another distinction that I consider more relevant to today’s processes: whether the vocabulary is ‘closed’ (i.e. all possible entities/attributes/relations can be enumerated once and for all) or ‘open’. Schemas (both relational and object-oriented), taxonomies, ontologies, even DTDs and XML Schema-compliant data are closed. Of course, within those there are distinctions. When the schema comes first (relational and object-oriented), I consider that regularity is high (and complexity low); when schema and data may be decoupled (as in semistructured data), I consider that regularity is medium (and so is complexity). On the open-vocabulary front, regularity is very low (and complexity is very high). Note that key-value stores are considered ‘open’, since in many of them the value is opaque (i.e. it could be anything). This is not really a schema; it’s just a convenient way to distribute data (keys are made up and used for hashing or sorting; they are not inherent to the data).
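The closed-vs-open distinction can be made concrete with a small sketch. Everything here is hypothetical (the schema, the field names, the `user:N` key convention): a closed vocabulary enumerates all attributes up front and rejects anything else, while in a key-value store the key exists only for hashing or sorting and the value is opaque.

```python
# Closed vocabulary: every attribute is declared once and for all,
# as in a relational schema; rows that deviate are rejected.
PERSON_SCHEMA = {"name", "email", "age"}

def validate_row(row):
    """Accept a row only if it uses exactly the declared attributes."""
    return set(row) == PERSON_SCHEMA

# Open vocabulary: a key-value store. The keys are made up for
# distribution; the values are opaque and could be anything at all.
kv_store = {
    "user:1": {"name": "Ada", "email": "ada@example.com", "age": 36},
    "user:2": b"\x89PNG...",             # a binary blob under the same key scheme
    "user:3": ["an", "arbitrary", "list"],
}

print(validate_row({"name": "Ada", "email": "a@b.c", "age": 36}))  # True
print(validate_row({"name": "Ada", "nickname": "countess"}))       # False
```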
- schema rate of change: even if one has a schema, or at least a closed vocabulary, one may not be able to take full advantage of this fact if the schema keeps on changing. Relational databases implicitly assume that once a schema is created, there is going to be little to no change. At the other extreme, in some modern processes nothing is assumed about how data evolves, and there is complete freedom: two objects in a collection may have completely different attributes. Also, an object may change to the point that it has nothing in common with the original (except the key, of course). If there is no change, or only very infrequent change, I consider complexity low; if some change is allowed, but within certain limits, I consider it medium; if any type of change is allowed, I consider it high.
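The “complete freedom” extreme can be illustrated with a minimal sketch (collection layout and field names are invented for the example): two objects in the same collection share nothing but a key, and an object may mutate until only its key survives from the original.

```python
# Two objects in one collection with completely different attributes;
# the only thing they have in common is that each carries a key.
collection = [
    {"_key": 1, "title": "On Databases", "pages": 12},
    {"_key": 2, "latitude": 48.85, "longitude": 2.35},
]

shared = set(collection[0]) & set(collection[1])
print(shared)  # {'_key'} - no attribute overlap besides the key

# An object evolving over time: after the rewrite, only "_key"
# remains from the original version.
obj = {"_key": 7, "status": "draft", "author": "A."}
obj = {"_key": obj["_key"], "payload": b"\x00\x01", "version": 2}
print(obj["_key"])  # 7
```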
Thus, we can analyze a process by looking at the data it consumes and/or generates, to decide what kind of system better supports it. The funny thing is, under this analysis, RDBMS seem adequate only when size and complexity sit in the middle of the scale - in particular, data size is medium, conceptual complexity is manageable, but regularity must be high and the schema rate of change must be low. For processes with other characteristics, an RDBMS may not be well suited. Some people will argue that most RDBMS nowadays can handle very large data and irregular data, since commercial systems come with ‘cluster’-based extensions, as well as extensions to handle XML. But many of the problems facing large datasets are not solved by throwing more hardware at them. For high availability, replication is needed; and this brings issues for transaction support. For complex data analysis, the approaches required are often not supported (or not supported well) by SQL - hence, just being able to store the data is not enough. So for many processes, using a distributed RDBMS or a cluster-based RDBMS will still not do. As for extensions to handle objects, XML, and even text, the problem is that each of these extensions was basically a compromise that yielded unwieldy systems, lost much of the simplicity of the original relational model, and gave clumsy tools in exchange.
What happens in other situations? When your data is small in size and low in complexity, some processes make do with files and some domain-specific programs. This seems to be the case with much e-science. The overhead that a database brings is just not worth it. As for very large data, this seems to be the niche of all these shiny new NoSQL systems.
So, all in all, it seems that RDBMS are being relegated to a very narrow niche. To be sure, it is still a very profitable one, so there is no big market pressure on database companies (yet, although some of them are already responding, like Oracle with its NoSQL database). One can argue that RDBMS have been ‘under threat’ before, and have reacted to it - but as a result, RDBMS seem more bloated than ever to some users, while still failing to satisfy their information needs. Maybe it’s time to rethink the whole thing from scratch?