The relational model, and by extension the SQL language, supports the notion of a NULL marker. It is commonly used to indicate that some attribute is unknown or not applicable. NULL markers are a bit strange because they are not values per se. Hence, the predicate 1 = NULL is neither true nor false. Indeed, the inventor of the relational model, E. F. Codd, proposed a 3-value logic: predicates are true, false or unknown. This lives on even today. Our entire civilization runs on database systems using an unintuitive 3-value logic. Isn’t that something!
Unfortunately, in real life, predicates either evaluate to true, or they don’t. C. J. Date showed that NULL markers end up giving you inconsistent semantics. So our civilization runs on database systems that can be inconsistent!
Yet NULL markers were introduced for a reason: some things do remain unknown or are not applicable. We could handle these cases with more complicated schemas, but that is often impractical. So database designers do allow NULL markers.
How did Codd react when it was pointed out to him that NULL markers make his model inconsistent? He essentially told us that NULL markers are in limbo:
(…) the normalization concepts do NOT apply, and should NOT be applied, globally to those combinations of attributes and tuples containing marks. (…) The proper time for the system to make this determination is when an attempt is made to replace the pertinent mark by an actual db-value.
So the mathematical rigor does not apply to NULL markers. Period.
This sounds pretty bad. I am rather amazed that Codd could get away with this.
But how bad is it in real life?
Let us consider WordPress, the blog engine I am using. As part of the core database schema, only the tables wp_postmeta, wp_usermeta and wp_commentmeta allow NULL markers. These tables are exclusively used to store metadata describing blog posts, users and comments. If this metadata is somehow inconsistent, the blog engine will not fall apart. It may hurt secondary features, such as advanced navigation, but the core data (posts, users and comments) will remain unaffected.
Date was repeatedly asked to prove that NULL markers were indeed a problem. I do not think that he ever conclusively showed that they were a real problem. Anyhow, our civilization has not collapsed yet.
Does anyone have any evidence that NULL markers are a bona fide problem in practice? Oh! Sure! Incompetent people will always find a way to create problems. So let us assume we are dealing with reasonably smart people doing reasonable work.
Credit: This post is motivated by an exchange with A. Badia from Louisville University.
Example of SQL’s inconsistency:
We are given two tables: Suppliers (sno, city) and Parts (pno, city). Each table holds a single row: (S1, ‘London’) and (P1, NULL) respectively. That is, we have one supplier in London as well as one part for which the location is left unspecified (hence the NULL marker).
We have the following query:
Select sno, pno
From Suppliers, Parts
Where Parts.city <> Suppliers.city
Or Parts.city <> 'Paris';
In SQL, this query returns nothing due to Codd’s 3-value logic: comparisons with NULL evaluate to unknown, and the where clause only selects rows for which the predicate is true.
Yet we know that if a physical part is actually located somewhere, it is either not in London or not in Paris, so the pair (S1, P1) ought to be returned. The answer is wrong.
Let us consider another interpretation: maybe the part P1 is fictitious and not physically available anywhere. In such a case, the SQL query still fails to return the correct answer: a part that is nowhere is certainly not in London, so the pair should again be returned.
Maybe we could assume instead that the part P1 is available everywhere. This latter interpretation is also incorrect, because the query
Select * from Parts where Parts.city = 'Paris'
will return nothing.
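For readers who want to check this themselves, the example is easy to reproduce with any SQL engine; here is a minimal sketch using Python’s built-in sqlite3 module, which follows the same 3-value logic:

```python
import sqlite3

# Build the two one-row tables from the example.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Suppliers (sno TEXT, city TEXT);
    CREATE TABLE Parts (pno TEXT, city TEXT);
    INSERT INTO Suppliers VALUES ('S1', 'London');
    INSERT INTO Parts VALUES ('P1', NULL);
""")

rows = con.execute("""
    SELECT sno, pno
    FROM Suppliers, Parts
    WHERE Parts.city <> Suppliers.city
       OR Parts.city <> 'Paris'
""").fetchall()

# Both comparisons with NULL are unknown, so the WHERE clause keeps nothing.
print(rows)  # []
```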
20 thoughts on “Do NULL markers in SQL cause any harm?”
Tony Hoare called null references his billion dollar mistake (http://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare), and similar reasoning applies to 3-valued logic. It’s very easy to get bugs in queries when you forget to handle nulls, and such bugs usually aren’t noticed until very late in the development cycle. It also leads to things like automatic conversion of values to NULL and vice versa (e.g. empty varchars are NULL in Oracle).
Many-valued logic is not inherently inconsistent (see http://plato.stanford.edu/entries/logic-manyvalued/). Why not allow three values but use a consistent logic?
I think people unfairly treat clustering of their bugs around features as indications that those features are a huge problem. NULL is a classic case of this.
Yes NULL is error prone. Yes NULL is inconsistent. Yes a lot of errors occur because of NULL.
That’s because NULL is a necessary tool to address error prone, inconsistent aspects of reality. We need it.
Assuming that you do want 3-value logic, there is a computational cost to pay if you want consistency. Yet efficiency is very important for a database system, maybe more so than theoretical rigor.
Let me quote John Grant from “Null Values in SQL”:
What is amazing to me is that Codd got away with an inconsistent 3-value system and not only that, but this became the de facto standard for database systems.
3-valued logic is “flattened” in SQL: in SELECT statements, if the whole WHERE expression evaluates to “unknown”, it is treated as “false”. In contrast, in CHECK constraints, if the expression evaluates to “unknown” (e.g. CHECK X>0 with X set to NULL), it is treated as “true”.
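A small sketch of this asymmetry, again with Python’s sqlite3 (SQLite, like the standard, treats an unknown CHECK result as a pass):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER CHECK (x > 0))")

# In a CHECK constraint, unknown is flattened to true: the NULL row is accepted.
con.execute("INSERT INTO t VALUES (NULL)")

# In a WHERE clause, unknown is flattened to false: the same row is filtered out.
rows = con.execute("SELECT * FROM t WHERE x > 0").fetchall()
print(rows)  # [] -- the row satisfied the CHECK but fails the WHERE
```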
Some aggregate operators behave in a counter-intuitive way in the presence of NULL marks.
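For instance (a sketch with Python’s sqlite3; the table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (3,), (None,)])

# Aggregates silently skip NULL marks, so AVG over 3 rows averages only 2 values,
# and COUNT(*) disagrees with COUNT(x).
avg, cnt_star, cnt_x = con.execute(
    "SELECT AVG(x), COUNT(*), COUNT(x) FROM t").fetchone()
print(avg, cnt_star, cnt_x)  # 2.0 3 2
```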
However, there is, in my opinion, a situation where NULLs make sense: when representing a N:0..1 relationship between two tables and you don’t want to have an extra table to represent such a relationship.
I should clarify my previous post: obviously there are ways to get the job done without NULL. However, we need to model and address the bits of reality that give rise to NULL, and while there are approaches that avoid Date’s theoretical problem with NULL, I don’t think there is an approach which avoids all the bugs that so often occur and are attributed to NULL. Unlike, say, C’s buffer overflows or NULL-terminated strings, the bugs are caused by the nature of the problem, not the nature of the solution.
I’d even argue NULL handles those problems better than most other solutions I’ve seen, but perhaps someone has a better alternative.
An interesting side note – in IEEE floating point representation, there is “NaN” (not a number). If I have
double a = std::numeric_limits<double>::quiet_NaN();
bool e = (a == a);
then e will evaluate to false.
@Daniel I agree it follows the standard. I was contrasting its behaviour to SQL’s null.
@Daniel – you may be right – I wasn’t commenting on the rightness of the approach, just the different choice. Sorry I wasn’t clear.
To look at Codd’s approach – I think there are two problems here:
one is “this answer is not defined for the arguments given”, and the other is “this data item has not been assigned a value”.
NaN is supposed to be non-reflexive under IEEE 754. C++ supports IEEE 754.
PostgreSQL and Oracle do claim to support IEEE 754 but I think that their NaN is reflexive (thus violating the standard).
IEEE 754 is logically consistent, and relies on simple 2-value logic (false or true). Elements in IEEE 754 do not form equivalence classes, however, but that is not required for logical consistency.
Well, the designers of IEEE 754 did not see a need to use 3-value logic. I think they were right. I think Codd was wrong.
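The non-reflexivity is easy to observe from any language with IEEE 754 floats, for instance in Python:

```python
import math

a = math.nan
print(a == a)         # False: comparisons involving NaN are never true
print(a != a)         # True: unlike SQL's NULL, the negation is well defined (2-value logic)
print(math.isnan(a))  # True: the proper way to test for NaN
```

Note the contrast with SQL: `a != a` is plainly true, so every predicate on NaN still has a definite true-or-false answer.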
Actually “1 = NULL” might not be false, but it is falsy. That is, SELECT * FROM table WHERE field = null; will simply select nothing, because the predicate is always falsy, even if field itself contains null for a record.
The trouble is when you expect negation of falsy. The negation of something undefined is still undefined. Hence “not null” is still null in SQL. And “1 <> null” is still falsy. (But funny enough, “1 <> (null OR false)” is true because “null OR false” is false and not null, but “null AND true” is null.)
If anything, SQL could benefit from a value that represents undefined and a value that represents nothing. But maybe today’s situation is better, otherwise nobody would understand which value to use in what situation. And now everybody gets burned by SQL null at least once and then learns 🙂
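The point about negation can be made concrete (a sketch with Python’s sqlite3):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.execute("INSERT INTO t VALUES (NULL)")

# NOT unknown is still unknown, so the NULL row escapes both a
# predicate and its negation.
hit = con.execute("SELECT COUNT(*) FROM t WHERE x = 1").fetchone()[0]
miss = con.execute("SELECT COUNT(*) FROM t WHERE NOT (x = 1)").fetchone()[0]
print(hit, miss)  # 0 0
```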
Yes. Codd did, in fact, propose to handle different markers to represent just the kind of distinctions you are making, but it was never picked up.
The trouble is when you expect negation of falsy
I think that the trouble is deeper than that. You do end up with inconsistent results. Period. It is not just that NULL is difficult to understand, it is mathematically improper. Please see the detailed explanation Grant gave: http://www09.sigmod.org/sigmod/record/issues/0809/p23.grant.pdf
Even Codd admitted as much: to him, the NULL markers were outside of the relational model and not subject to normalization. He knew of the problems, because they were reported to him, but he viewed them as irrelevant.
In some sense, history proved him right. Or did it?
In the example of the paper, P1 has NULL for a city, and the author posits that whatever value you fill in for NULL, the predicate must become true. However, maybe the part P1 is omnipresent: it is in every city all at once. In that case the predicate should stay false on two accounts: P1 is in the same city as S1, and P1 is in Paris.
NULL is not a value, it is outside the domain of the values that could have been in its place.
Is that mathematically improper? I am not convinced; it is practical, however. So yes, I think Codd was proven right.
Going back to my first statement. Notice how there are no inconsistencies if you don’t use negations (not or unequal). If need be use “select inverse”.
So how do you deal with this as a programmer? Just don’t use negations unless you think about how you want to deal with NULL. Basically, you need to deal with the fact that the SQL engine does not know that the domain of cities is finite and void of quantum values.
And surely not having null is much more of a burden… it is like saying because division by zero is such a problem that we don’t want zero in math.
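One way to follow this advice is to spell out what NULL should mean whenever a negation is involved; a sketch using Python’s sqlite3, with made-up rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Parts (pno TEXT, city TEXT);
    INSERT INTO Parts VALUES ('P1', NULL), ('P2', 'London');
""")

# A naive negation silently drops the row with the NULL marker:
naive = con.execute(
    "SELECT pno FROM Parts WHERE city <> 'Paris'").fetchall()

# Stating explicitly that an unknown city should also count keeps it:
explicit = con.execute(
    "SELECT pno FROM Parts WHERE city <> 'Paris' OR city IS NULL").fetchall()

print(naive)     # [('P2',)]
print(explicit)  # [('P1',), ('P2',)]
```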
Yes, it is mathematically inconsistent. There is no way around it. I’ll update my blog post right away with a more elaborate discussion. Your interpretation falls apart too as you’ll see.
I see what you mean, but your reasoning is incorrect: NULL might mean nowhere, or everywhere, or only Paris, or only London. Or it might even be a quantum value that is London if you compare it to Paris, and Paris when you compare it to London. (This is easier to see if you stop using singular values and use sets of cities instead: now what does the NULL set mean, or a set containing Paris and NULL mean?)
The only sensible answer to a comparison with null is to return null again. The inconsistency is the pragmatic choice to let null be a falsy value when it comes to actually doing work (selecting records).
The sound thing to do in sql is not to return a list of some records when encountering null, instead to collapse the whole result into null itself. As in, I don’t know the answer to your question. But that is not very pragmatic …
The sound thing to do in sql is not to return a list of some records when encountering null, instead to collapse the whole result into null itself. As in, I don’t know the answer to your question.
But that is not what SQL does. It does return an answer even when nulls are involved. SQL itself does not offer any consistent view.
Of course, the programmer can check (painfully) to see if any nulls are involved. Or he can forbid nulls. Or he can use other, more complicated schemas, or special values… Anyhow, the burden is on the programmer to do the right thing with nulls… SQL will not help you. Arguably, it makes it hard for you to do the right thing.
Just so we are clear: I am not arguing against NULLs.