I took a real beating with my previous post where I argued against excessive normalization on the grounds that it increases complexity and inflexibility, and thus makes the application design more difficult. Whenever people get angry enough to post comments on a post of mine, I conclude that I am onto something. So, let’s go at it again.
On the physical side, developers use normalization to avoid storing redundant data. While this might be adequate with modern data database systems, I do not think this is well founded, in principle. Consider this Java code:
String x1="John Smith";
String x2="Lucy Smith";
String x3="John Smith";
Is this code inefficient? Won’t the Java compiler create 3 strings whereas only 2 are required? Not at all. Java is smart enough to recognize that it needs to store only 2 strings. Thus, there is no reason for this non-normalized table to be inefficient storage-wise even though Jane Wright appears twice:
|Customer ID||First Name||Surname||Telephone Number|
Nevertheless, the First normal form article on wikipedia suggests normalizing the data into two tables:
|Customer ID||Telephone Number|
|Customer ID||First Name||Surname|
Pros of the normalized version:
- It does look like the normalized version uses less storage. However, the database engine could compress the non-normalized version so that both use the same space (in theory).
- We can enforce the constraint that a customer can have a single name by requiring that the Customer ID is a unique key (in the second table). The same constraint can be enforced on the non-normalized table, but less elegantly.
Pros of the non-normalized version:
- A customer could have different names depending on the phone number. For example, Edward Buttress could use his real name for his work phone number, but he could report the name Ed Butt for his home phone number. The power to achieve this is entirely in the hands of the software developer: there is no need to change the schema to add such a feature.
- If you start with existing data or try to merge user accounts, the non-normalized version might be the only sane possibility.
- The database schema is simpler. We have a single table! You can understand the data just by reading it. Most of your queries will be shorter and more readable (no join!).
To a large extent, it seems to me that the question of whether to normalize or not is similar to the debate of static versus dynamic typing. It is also related to the debate between the proponents of the waterfall method versus the agile crowd. Some might say that working with a non-normalized table is cowboy coding. In any case, there is a trade-off. Flexibility versus safety. But to my knowledge, the trade-off is largely undocumented. What are the opportunity costs of complex database schemas? How many databases do get unusable by lack of normalization?
I think that part of the NoSQL appeal for many developers is strong data independence. Having to redesign your schema to add a feature to your application is painful. It may even kill the feature in question because the cost of trying it out is too high. Normalization makes constraints easier, but it also reduces flexibility. We should, at least, be aware of this trade-off.
Note: Yes, my example goes against the current practice and what is taught in all textbooks. But that is precisely my intent.
Update: Database views can achieve a related level of data independence.
17 thoughts on “Over-normalization is bad for you”
I don’t think I am missing the point. I do ask however whether this possibility you evoke—that the names will go out of sync whereas they shouldn’t—is a real threat.
We used to think that catching bugs at design time was essential. Our experience in the last 20 years has shown that belief to be somewhat overblown. Facebook was built with PHP, Twitter with Ruby.
Many NoSQL databases work just like my example. There is no trace of normalization. And people seem to get real work done.
Which is fine if that’s the correct application logic, and not fine if it isn’t.
I’d prefer to have strong data independence. That is, I would prefer to be able to change the application logic without having to change the database schema.
It seems best if the developer rarely needs to change the database schema, because that’s an expensive operation. In practice, I suspect a lot of features are not implemented, or they are badly implemented, because they don’t match the database schema.
I would rather make the database engine work harder to adapt, rather than have the developer redesign the tables.
I see lots of flattened data where the flattening has made it impossible or onerous to answer very reasonable and valuable questions.
In my example, there does not need to be any performance difference between the two databases. (I admit that there might be in the real world.) It is a matter of physical layout, but I’m only discussing the logical representation.
Joins should be an internal issue below the query layer.
I think this is almost exactly what I am saying.
I think you are missing the point here. The problem with non-normalized data is not the added storage (which happens, but disk is cheap), but the anomalies that happen when data is stored redundantly: insertion, deletion, and update anomalies. If the Jane Wright in your example is one person, and it is represented by two in the non-normalized table, what stops an application from changing one row without changing the other (say, change the customerid in one and not the other), thus leaving the database in an inconsistent state? This, I think, it’s the real reason to normalize.
“there is no need to change the schema to add such a feature”
More accurately, this feature matches the logical structure of this database. So it’s not that it’s easy to “add” such a feature, it’s that this database already has it. Which is fine if that’s the correct application logic, and not fine if it isn’t. And since the main pressure towards denormalization is bad tools for dealing with normalization, there’s no reason to believe that denormalized data structures will reliably coincide with correct features you just hadn’t thought about yet.
“most of your queries will be shorter and more readable”
But shorter and more readable is no victory if the results are wrong. And when you flatten a non-flat structure, you lose the ability to get reliable answers to questions along any dimension other than the one you used for flattening. If your application only ever needs to ask questions in one direction, that may be acceptable. But I almost never see non-trivial non-flat datasets where there actually is only one dimension anybody ever really cares about.
I should turn that around: I see lots of flattened data where the flattening has made it impossible or onerous to answer very reasonable and valuable questions.
disk is cheap
Yes, but you do want to load as much as possible into your L2 and L1 caches. Larger databases are often slower on that account.
“Data independence” = full logical normalization. It’s tempting to wish that wasn’t the case, because we’ve spent decades working with bad tools that make you pay for every bit of normalization with relentless pain. But it’s really true, and shouldn’t be scary. All the principles of relational database construction (I mean the logical principles, not any tactical crap about specific nominally-relational implementations) are precisely about making your database accurately express the logical structure of the data, such that your use of it will be constrained only by what it *is*, rather than how it’s been stored. If you want to be able to change the application, with ebullient impunity, without changing the database schema, then you *have* to care about normalization. Normalization is to data what engineering is to bridges. Data modeling (calling it “normalization” was probably a mistake to begin with) is to data what physics is to things. You can’t be against it. It *is*.
And yet, you’re absolutely undeniably irrevocably right (I think) about at least two things:
1. Normalization orthodoxy encourages people to focus on database rules rather than scrutinizing actual logic, which often leads to them making bad modeling decisions. I think it’s possible that the biggest single cause of this is the now-established ass-backwards pattern of teaching “normalization” as a series of pedantically phrased pennances you are guilted into doing to a nice, normal spreadsheet. If data modeling was taught by explaining how to construct a logical model the right way, the first time, I think it would seem vastly more comprehensible and a lot less frightening.
2. In SQL, normalization = pain. It makes perfect sense to hate joins, and want to not have to do them. But you should never have had to do them to begin with. Joins should be an internal issue below the query layer. You should be able to ask for a book’s author without caring whether author is single or multiple or optional, and without knowing or caring how many “tables” are involved in answering your query.
Or, for my longer attempts at attacking this topic myself, see
Taking your example one step further, what if you want to add an email address, and Jane Wright has three email addresses. Do you have three rows (one with a blank telephone number?) or six rows for each combination of telephone number and email address. The latter is effectively how you’ve dealt with two telephone numbers.
What about other attributes ? Perhaps Jane Wright has made 10 orders each with 3 items. Do you have those six email/telephone number combinations against all 30 order lines ?
You might have to add lots of rows, but remember that I am specifically saying that these don’t need to be physical rows. They are merely logical. So you don’t necessarily have more data to store than if the database was normalized. (And I’ll admit that I don’t know of a database engine that can do this. So my point is a bit theoretical.)
Oh! If you are not concerned about storage, then it is a different question.
I am not saying your web application should run out of a single table. I am saying however that it shouldn’t use 15 tables if five will do. The sweet spot is not necessarily with the tables in third normal form. The normal forms are a dogma. (Knowledge without evidence.)
I never suggested they would have to by physical (and I suspect the compression mechanism in Oracle, which works on repeated values, would minimise storage space). That isn’t the point. The point is that the code logic needs to deal with those rows. You need to make deliveries of those orders so your code need to work out whether you are delivering one item that happens to relate to two phone numbers or two items that have different phone number attributes.
That is the problem normalisation addresses. Normalisation is NOTHING to do with physical storage. It is to do with identifying dependencies between data values.
I absolutely agree with you post.
May be there is a trade-off in certain applications, but not for web – and this is what is driving NoSQL.
As soon as web application starts receiving some decent trafic – hundred users per minute, per second etc. Properly denormalized EBNF tables become very heavy liability and very quickly tables have to be denormalised into caching tables – one common table for view output and then denormalized data pushed into cache like memcached.
Pure SQL approach just doesn’t work in real world web – you have to add features, you have to change DB schema as you add them and users can think of more ways to exploit you website then anticipated by fixed database schema.
I once worked for a company that had an amazingly, beautifully hyper-normalized DB for storing e-mail on an embedded device. Efficiency / storage was one big concern. But the tradeoff was *impossible to maintain*, let alone properly understand or debug SQL queries. Which made the UI to the DB really nasty to work with (and often slow too!)
So I agree with you – Normal forms are a dogma and ultimately an expensive one too (in situations where they are not really needed).
I’m using these two blog posts in my next DB class!
Thanks for the testimonial.
I have a question. I’m loving the idea of ditching normalization. But I’m not quite sure how to solve one problem. I can picture a site with profile pages and forums all being denormalized. But what if you don’t use their real name and a username… and they want to change their username? There would be thousands of records with the username hardcoded into the db. Or let’s say you don’t allow the user to change their username. But there is an avatar_url field. And they change their avatar?
(1) My claim is not that normalization is bad. Certainly master-detail schemas are useful and a good practice. I oppose “over-normalization” which leads to overly complicated schemas.
(2) Do you actually want the name to be updated in old posts? What about the reported home page and email addresses? The answer is maybe not so obvious. If you send an email today, and change your name tomorrow, email systems won’t go back and update old emails.
(3) It can become quite wasteful to join the tables each and every time just for the rare event where someone might change his name. From an engineering point of view, it might make a lot more sense to take a big hit when the user does change his name, possibly doing it in an eventually consistent manner (for a time, both names may appear), than paying a price all the time for a feature that most users may never use. Which is better for your application is a matter of trade-off. You should maybe compare the two options experimentally.
To me, what your comments above (such as: I do ask however whether this possibility you evokeâ€”that the names will go out of sync whereas they shouldn’tâ€”is a real threat) suggest is that you’ve never been responsible for anything where you have to provide real data to real people with real results behind them.
Web apps have an advantage in this regard: they often don’t matter. The company I work for does data processing and manipulation for law firms, and we provide a hosted data review platform for their convenience. It is a web app, but the data from it is provided, in the end, to courts and counsel.
So it matters greatly that what comes out relates to what goes in in a deterministic fashion. When a client decides to use redundant tags on objects, the result becomes easily apparent at production time, when it becomes a web to untangle to determine what is REALLY relevant/privileged/redacted/whatever.
Making sure that your information is normalized is ESSENTIAL if you are managing any data of actual consequence. While over-normalization is possible to some degree, it is really an esoteric idea suited only to highly performance-driven situations (and much of the time this can be fixed either by calling functions built into the database engine itself rather than reimplementing them in the application code… or by optimizing properly the applications database calls and caching the appropriate info… this all assumes that indexes are assigned right in the first place, of course.)
UNDER-normalization is a far, far more common situation, and one with real consequences.
You may subscribe to this blog by email.