Daniel Lemire's blog

Do relational databases evolve toward rigidity?

The Hanson law of computing states that:

Any software system, including advanced intelligence, is bound to decline over time. It becomes less flexible and more fragile.

I have argued at length that Hanson is wrong. My main argument is empirical: we build much of our civilization on old software, including a lot of open-source software.

We often build new software to do new things, but that’s on top of older software. Maybe you are looking at your smartphone and you think that you are using software built in the last 4 years. If you think so, you are quite wrong.

So it is not the case that old software becomes obviously less useful or somehow less flexible with time. Yet, to adapt to new conditions, old software often needs “rejuvenation” which we typically call “refactoring”. Old database systems like MySQL were designed before JSON and XML even existed. They have since been updated so that they can deal with these data types efficiently.

So old widely used software tends to get updated, refactored, reengineered…

Viewed at a global scale, software evolves by natural selection. Old software that cannot adapt tends to die off.

There has been a fair amount of work in software aging. However, much of the work is of an interested nature: they want to provide guidance to engineers as to when they should engage into refactoring work (to rejuvenate their software). They are less interested in the less practical problem of determining how software evolves and dies.

Software often relies on database tables. These tables are defined by the attributes that make them up. In theory, we can change these attributes, add new ones, remove old ones. Because open-source software gives us access to these tables, we can see how they evolve. Vassiliadis and Zarras recently published an interesting empirical paper on this question.

Their core result is that tables with lots of attributes (wide tables) tend to survive a long time unchanged. Thinner tables (with fewer attributes) die young. Why is that? One reason might be that wide tables covering lots of attributes tend to have lots of code depending on it. Thus changing these tables is expensive: it might require a large refactoring effort. Thus these wide tables tend to stick around and they contribute to “software rigidity”. That is, old software will accumulate these wide tables that are too expensive to change.

I believe that this “evolution toward rigidity” is real. But it is less of a general feature of software, and more of a particular defect of the relational database model.

This defect, in my view, is as follows. The relational model makes the important recognition that some attributes depend on other attributes (sometimes called “functional dependencies”). So if you have the employee identifier, you can get his name and his rank. From his rank, you might get his salary. From this useful starting point, we get two problems:

  1. Instead of simply treating these dependencies between attributes as first-class citizens, the relational model does away with them, by instead representing them as “tables” where, somehow, attributes need to be regrouped. So, incredibly, the SQL language has no notion of functional dependency. Instead, it has keys and tables. These are not the same ideas!

    Why did functional dependencies get mapped to keys and tables? Simply because this is a natural and convenient way to implement functional dependencies. So we somehow get that “employee identifier, name, rank” get aggregated together. This arbitrary glue leads to rigidity as more and more attributes get lumped together. You cannot reengineer just one dependency or one attribute, without possibly affecting a lot of code.

  2. Functional dependencies are nice, but far more inflexible and limited than it seems at first. For example, some people have more than one name. People change name, actually quite often. Some information might be unknown, uncertain. To cope with uncertain or unknown data, the inventor of the relational model added “null” markers to his model, and some kind of three-value logic that is not even consistent. In a recent paper with Badia, I showed that it is not even possible, in principle, to extend functional dependencies to a open-world model (e.g., as represented by disjunctive tables).

So I would say that relational databases tend to favor rigidity over time.

There are some counterpoints that may contribute to explain why the sky is not falling despite this very real problem:

Credit: Thanks to Antonio Badia and Peter Turney for providing me with references and insights for this post.