Why do we need database joins?

In a recent post, I argued that the current NoSQL trend could be called NoJoin. My argument boils down to the fact that SQL entices you to normalize your data, which creates complicated schemas. Meanwhile, NoSQL database systems use simple schemas and are therefore easier to scale out.

Curt Monash has a reasonable post where he points out that we need joins because we normalize. Furthermore, he offers reasons for normalization:

  • To simplify the programming of the updates. Simply put, if the string “Montreal” appears once in your database, and the city changes its name, it is trivial to do the update. This applies mostly when you have complex schemas.
  • For faster updates. Updating a single entry in a database is much faster than searching for and updating all occurrences of the value “Montreal”. This is mostly applicable when you have large update volumes.

However, the case against joins is also strong:

  • Normalization makes your schemas complex. I have seen university databases made of hundreds of tables. The average query is well over 256 characters and involves dozens of joins. It is simply impossible to make sense of the content of any one table. Building new applications on top of this mess is expensive and bug-prone. Complexity is bad for your health.
  • Database engines can compress the data automagically so normalization to save space is a waste of time.

The dogma of normalization too often leads to over-engineering. We are so afraid that a programming error could leave the database in a wrongful state that we invest massively in inflexible schemas. In turn, this over-engineering comes back to haunt us when we need to be more agile, or to scale out.

Example:

Suppose you want to design a database of research papers. Let us simplify the problem by omitting the paper identifiers, the dates, and so on. Let us also assume that there is only one author per paper. Maybe your main table looks like this:

authorID  | author name      | publisher | title
smith01   | John Smith       | Springer  | Databases are bad
lampron01 | Nathalie Lampron | IEEE      | The other guy is wrong, databases are good

Being helpful, your friendly database expert points out that your database schema is not even in second normal form. Clearly, you are an amateur. So he creates a secondary table which maps the authorID field to an author name. And voilà! You have saved storage, and you won’t ever get someone’s name wrong. Updates to someone’s name will be much faster in the future.

But wait?!? What if Nathalie gets married and changes her name? And indeed, people change their names all the time. Yet we never retroactively change the names of the authors on a paper. Maybe you never thought about it, but many women hold two or more names in their lifetime. Did the bunch of guys in IT know about this? (As an aside, are the digital librarians worried at all about researchers changing their name and seeing their publication list cut in half? Yes: see the update below.)

My point is that normalization effectively enforces dependencies decided upon when you created the schema. These envisioned dependencies break down all the time. Life is complicated. I could come up with hundreds of examples. Strict normalization makes as much sense as the waterfall model.

What about the physical layer? Because normalization has removed entire fields from the main table, you might think that normalization will save storage! That may well be true in the database engine you are using. However, other database engines will automatically detect the dependencies and compress the data accordingly. In this case, it is trivial to discover that there is a bijective (1-to-1) mapping between author ID and author name. And if the bijectivity breaks down, the database engine will simply have to work a bit harder to compress the data. Your code won’t break down. It won’t need to be retested. (To be fair, I don’t know if any database system gets this right.)
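To make the idea concrete, here is a minimal sketch (my own illustration, not any particular engine) of how a storage layer could exploit the authorID-to-name dependency: it stores each distinct name once in a dictionary, and only falls back to storing names per row if the dependency breaks down. The table values are the ones from the example above; everything else is made up.

#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Row { std::string authorID, name, publisher, title; };

int main() {
    std::vector<Row> table = {
        {"smith01", "John Smith", "Springer", "Databases are bad"},
        {"lampron01", "Nathalie Lampron", "IEEE", "The other guy is wrong, databases are good"}};
    // Try to build a dictionary mapping authorID to author name.
    std::map<std::string, std::string> dictionary;
    bool dependency_holds = true;
    for (const Row& r : table) {
        std::map<std::string, std::string>::iterator it = dictionary.find(r.authorID);
        if (it == dictionary.end())
            dictionary[r.authorID] = r.name;   // first time we see this ID
        else if (it->second != r.name)
            dependency_holds = false;          // same ID, two names: dependency broken
    }
    if (dependency_holds)
        std::cout << "Store each name once: " << dictionary.size()
                  << " dictionary entries for " << table.size() << " rows.\n";
    else
        std::cout << "Dependency broken: keep the name column as is.\n";
}

A real engine would do this transparently at the storage layer, so the logical schema and the application code would not have to change.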

Update: Apparently, Otfried Cheong—a Computer Science professor in Korea—once published as Otfried Schwarzkopf. At least, the two names are merged on DBLP. It suggests that DBLP can cope with researchers changing their name.

Why you may not like your job, even though everyone envies you

In a provocative post, Matt Welsh, a successful tenured professor at Harvard, announced that he was leaving his academic job for an industry position. It created a serious malaise: his department chair (Michael Mitzenmacher) wrote a counterpoint answering the improbable question “why am I staying at Harvard?” To my knowledge, it was the first time a department chair from a prestigious university answered such a question publicly. Michael even went as far as arguing that, yes, indeed, he could get a job elsewhere. These questions are crazy if we consider that for every job advertised at Harvard, there are probably hundreds of highly qualified applicants.

But let me get back to Matt’s reasons for leaving a comfortable and prestigious job at Harvard:

(…) all of that extra work only takes away from time spent building systems, which is what I really want to be doing. (…) At Google, I have a much more direct route from idea to execution to impact. I can just sit down and write the code and deploy the system, on more machines than I will ever have access to at a university. I personally find this far more satisfying than the elaborate academic process.

In other words, Matt is happier when his work is more immediately useful. But where does the malaise about his decision come from? After all, he will probably make as much or even more money at Google. Matt is not alone, by the way: Matthew Crawford, a Ph.D. in philosophy, left a high-paying job in an American think tank for a job repairing motorcycles. His book Shop Class as Soulcraft tells his story.

I think that Matt’s decision might be hard to understand (at least, his department chair feels the need to explain it to us) because he is calling into question the very core values of our society. These core values were explored by Veblen in his unconventional book The Theory of the Leisure Class. He argued that we are not driven by utility, but rather by social status. In fact, our society pushes us to seek high-prestige jobs, rather than useful and productive jobs. In effect, a job doing research in Computer Science is more prestigious than an industry job building real systems, on the mere account that it is less immediately useful. Here are some other examples:

  • The electrician who comes and wires your house has a less prestigious job than the electrical engineer who manages vague projects within a large organization.
  • The programmer who outputs useful software has a less prestigious job than the software engineer who runs software projects producing software that nobody will ever use.
  • The scientist who tinkers in his laboratory has a less prestigious job than the scientist who spends most of his time applying for research grants.

Note how money is not always immediately relevant. Manual labor normally pays less, but the pay is almost beside the point. And indeed, plumbers make more than software developers in some parts of the world (like Montreal), even though software jobs are usually considered more desirable.

There are at least three problems with this social-status system:

  • Nature is the best teacher. Working on real problems makes you smart. The German philosopher Heidegger famously made this point with a hammer. To paraphrase him, it is not by staring at a hammer that we learn about hammers. Similarly, scientists who do nothing but abstract work in the context of funding applications are missing out. The best scientists work in the laboratory, in the field; they tinker.
  • By removing ourselves from the world, we risk becoming alienated. We become strangers to the world around us. Instead, we construct this incoherent virtual reality which often has much to do with Soviet-era industrialism. We must constantly remain vague because truth has become subjective. Whereas the hammer hits, whereas the software crashes, whereas the experiment fails… projects are always successful, marketing is always right and truth is arrived at by consensus. Yet, we know deep down that this virtual reality is unreal and we remain uneasy, trapped between reality and virtuality. The perfect example is the financial markets, which create abstract products with agreed-upon values. As long as everyone plays along, the system works. Nobody must ever say that the emperor is naked. Everyone must accept the lies. Everything becomes gray.
  • Human beings like to make their own stuff. We value considerably more what we did ourselves. You may be able to buy computers for $200, but nothing will ever replace the computer you made yourself from scratch. It may be more economical to have some Indian programmers build your in-house software, but the satisfaction of building your own software is far greater than what you get by merely funding it. Repairing your own house is a lot more satisfying than hiring handymen.

To summarize: trading practical work for high-level positions is prestigious, but it may make you dumber, alienated and unhappy. Back when I was a graduate student, we used to joke about the accident. The accident is what happens to successful professors: they suddenly become uninteresting, pompous, and… frankly… a tad stupid.

Thankfully, there is hope. The current financial crisis, mostly because it couldn’t happen according to most economists, was a wake-up call. The abstract thinkers may not be so reliable after all! The millions of college graduates who are underemployed in wealthy countries all around the globe have unanswered questions. Weren’t these high-level abstract college degrees supposed to pay for themselves?

How do we fix this broken caste system and bring back a healthier relationship with work? Alas, we cannot all become plumbers and electricians. But it seems to me that more and more people are realizing that the current system, with its neat white collar jobs and rising inequalities, could be improved upon drastically. The do-it-yourself (or do-it-with-others) wave has been a revelation for me. Yes: Chinese factories can build digital thermometers much cheaper than I can. But making your own digital thermometer is far more satisfying. Saving money by abstracting out reality is not a good deal. And of course, building real systems is not the same as finding money for your students to do it for you.

Further reading: Working long hours is stupid and Formal definitions are less useful than you think.

You probably misunderstand XML

When I took my current position, I was invited to teach a course on unstructured data. It is a sensible topic for a course: some say that between 80% and 90% of all enterprise data is unstructured. But I objected to the title for marketing reasons. How many students would take a course on unstructured data? I can hear the students asking “what’s that course about?” Thus, I proposed a better title for the course: information retrieval and filtering. Indeed, everyone wants to filter and retrieve data, right?

Meanwhile, there were already courses on structured data (that is, on databases and information systems). However, there was no course on semi-structured data. So I proposed one. But I couldn’t call it semi-structured data as hardly any student would know what the title meant. Instead, I proposed a course which, roughly translated, is called “Information Management with XML.”

Immediately, I got into trouble: how could I dare omit SOAP and web services from a course on XML? I was annoyed by these comments. With some sense of irony, I decided to start dumping on my students some SOAP examples so that they could see the “beauty” [I’m being ironic] of using XML for data exchange on the web. So, there I was, trying to teach my students about semi-structured data, and I was asked to tell them about remote procedure calls, an irrelevant topic for my purposes.

Thankfully, it appears that history is on my side. Developers got tired of receiving these annoying XML payloads. In time, they started using JSON, a much more appropriate format for passing small payloads of structured data between a server and an ECMAScript client. It uses fewer bytes and, more importantly, JSON is an order of magnitude faster than XML. When you ask on Stack Overflow whether you should be using SOAP, you are told to avoid SOAP at all costs. The developers have spoken. And as a result, the organization behind the SOAP stack decided to close shop.
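To see the difference, compare a hypothetical remote call in both formats (the operation name, namespace and values below are invented for illustration). The SOAP request is mostly envelope:

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetAuthor xmlns="http://example.com/papers">
      <authorID>lampron01</authorID>
    </GetAuthor>
  </soap:Body>
</soap:Envelope>

The equivalent JSON payload says the same thing in one short line:

{ "method": "GetAuthor", "authorID": "lampron01" }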

Where does that leave XML? Precisely where it started. XML is a great meta-format for semi-structured data. And it is just as useful as ever. Want to deal with documents? DocBook and OpenDocument are great formats. Want to add semantic information to web pages? Microformats can do it. You want to exchange complex business data? The Universal Business Language probably does what you need. Some people are having luck with the SVG image format. You want to subscribe to my blog? Grab my Atom feed. For these applications, you couldn’t easily replace XML with flat files or JSON. Nor should you try.

Alas, we ended up torturing XML by applying it to ill-suited purposes. We must learn how to select the best format. Does your data look like a table? Can a flat file do the job? Do you need a key-value format like JSON? Or maybe a simple text file? Or is your data more like an XML document? Take a good look at your data before picking a format for it.

Further reading: Indexing XML and Native XML databases: have they taken the world over yet?

Update: I don’t include configuration files in my list of proper XML applications.

How do search engines handle special characters? Should you care?

Matt Cutts is Google’s search engine optimization expert. He runs a great YouTube channel called Google Webmaster Central. He was recently asked how Google handles special characters such as ligatures, soft hyphens, interpuncts and hyphenation points. His answer? He doesn’t know.

Being a scientist, I decided to compare Google to the young upstart (Bing). Using Google, the result set for Kurt Goedel differs from the one for Kurt Gödel. For example, I could only find the Kurt Goedel article from the Uncyclopedia when searching for Kurt Goedel. Similarly, Google fails to realize that cœur and coeur are the same word. However, Bing knows that Goedel and Gödel are the same person. Bing knows that cœur and coeur are the same word.

While the consequences are small, they are nevertheless real:

  • Students may fail to find great references on Kurt Gödel because they search for Kurt Goedel. Indeed, most academic papers seem to prefer Gödel to Goedel.
  • A writer who tries to be typographically correct and writes cœur may get penalized when people search for coeur.

Score: Bing 1. Google 0.

Source: Will Fitzgerald. Special thanks to Mark Reid, Marek Krajewski, Jeff Erickson and Christer Ericson for the debate on Twitter motivating this blog post.

Who is going to need a database engine in 2020?

Given the Big Data phenomenon, you might think that everyone is becoming a database engineer. Unfortunately, writing a database engine is hard:

  • Concurrency is difficult. Whenever a data structure is modified by different processes or threads, it can end up in an inconsistent state. Database engines cope with concurrency in different ways: e.g., through locking or multiversion concurrency control (see the sketch after this list). While these techniques are well known, few programmers have had a chance to master them.
  • Persistence is also difficult. You must somehow keep the database on a slow disk, while keeping some of the data in RAM. At all times, the content of the disk should be consistent. Moreover, you must avoid data loss as much as possible.
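As a minimal illustration of the locking approach (my own sketch, not any real engine), here is an in-memory key-value store protected by a single mutex. Real engines use much finer-grained locking or multiversion concurrency control, and they must also persist the data to disk:

#include <map>
#include <mutex>
#include <string>

class TinyStore {
    std::map<std::string, std::string> data;
    std::mutex lock;  // one big lock: correct, but a bottleneck under contention
public:
    void put(const std::string& key, const std::string& value) {
        std::lock_guard<std::mutex> guard(lock);  // no two threads mutate the map at once
        data[key] = value;
    }
    bool get(const std::string& key, std::string& value) {
        std::lock_guard<std::mutex> guard(lock);  // even readers take the lock in this design
        std::map<std::string, std::string>::const_iterator it = data.find(key);
        if (it == data.end()) return false;
        value = it->second;
        return true;
    }
};

Even this toy example shows the tension: the simplest correct design serializes every operation, and doing better (finer locks, lock-free structures, snapshots) is where the real difficulty begins.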

So, developers almost never write their own custom engines. Some might say that it is an improvement over earlier times when developers absolutely had to craft everything by hand, down to the B-trees. The result was often expensive projects with buggy results.

However, consider that even a bare-metal language like C++ is getting support for concurrency and threads, as well as esoteric features like regular expressions. Moreover, Oracle’s hard work at killing the Java Community Process will incite Java developers to migrate to better languages.

Meanwhile, in-memory databases are finally practical and inexpensive. Indeed, whereas a 16 GB in-memory database was insane ten years ago, you can order a desktop with 32 GB of RAM from Apple’s web site right now. Moreover, memory capacity grows exponentially: Apple will sell desktops with 1 TB of RAM in 2020. And researchers predict that non-volatile Resistive RAM (RRAM) may replace DRAM. Non-volatile internal memory would make persistence much easier.

But why would you ever want to write your own database engine?

  • For speed, some engines force you to use nasty things like stored procedures. It is a drastically limited programming model.
  • The mismatch between how the programmer thinks and how the database engine works can lead to massive overhead. As crazy as it sounds, I can see a day when writing your engine will save time. Or, at least, save headaches.
  • Clever programmers can write much faster specialized engines.

Obviously, programmers will need help. They will need great libraries to help with data processing, data compression, and data architecture. Oh! And they will need powerful programming languages.

The future is already here – it’s just not very evenly distributed

It is not 9am yet. Nevertheless, I got a lot done:

  • I attended the thesis proposal of my student Eduardo via Skype. I was literally in my basement with a fresh cup of coffee, attending a presentation hundreds of kilometers away. Besides me, professors from two different cities were attending.
  • I watched a recorded concert of the latest Japanese rockstar, Hatsune Miku, on YouTube. Oh! And she is a synthetic character.
  • I read my morning news on my Star Trek-like iPad. I received my news from a hand-picked lot of 50 or so people worldwide. It has been months since I last read a newspaper.

Though I tried to convince my wife that the future has finally arrived, she insists that we always live in the present.

Well. Some people certainly insist on living in the past:

  • Canada’s largest science funding agency (NSERC) sent me a funding proposal to review. They had to send it to me on paper, by mail. The file has dozens of pages. I had to carry it home. I will have to find a way to recycle it after shredding it. What is wrong with giving me access to a PDF or HTML copy of the proposal? Can’t the person in charge of printing the proposal find another occupation?
  • Even though I have not had to look up a research article on physical paper in nearly a decade, a leading journal asked me to include page numbers in all my references. But these page numbers only make sense if you are holding the physical copy of the journal. Why would anyone look up articles from the physical copy when they can find and download research papers in seconds? I’m even told that graduate students in several schools are trained in how to use a (physical) library to look up references. Oh! The pain! Why do librarians insist on teaching obsolete skills? There is no shortage of useful skills a librarian could teach students in 2010. Indeed, I am amazed at how little most graduate students know about tracking research papers online.
  • My employer still sends me, every two weeks, a little envelope with a single sheet of paper in it which tells me how much money they put into my bank account. Of course, I know exactly what this amount is, as I can see it listed through online banking. So what is the point of this piece of paper? To create employment at a local printing press? Tracking pieces of paper is time consuming. Worse: the information is incomplete, even incomprehensible at times, because of the lack of physical space on the paper. Why can’t Human Resources set up a web site where I could get all the information I need to understand all the mysterious deductions appearing on my fake pay check? Why must they insist on printing useless pieces of paper? Why can’t they provide useful and modern tools instead?

Source: The title of my post is a quote from William Gibson.

Can you trust fixed-bit computer arithmetic?

Suppose that you have 10 pictures, and all lined up, they take 100 pixels. Is it safe to say that each picture has a width of x pixels if 10 x = 100?

We all know that a x = b has a unique solution x as long as a is non-zero. If you work with integers, then you can say that there is at most one solution.

Unfortunately, computer arithmetic is uglier. Over 32-bit integers, the equation a x = 0 has 2^31 solutions if a = 2^31. Indeed, as long as x is even, 2^31 x = 0.

But notice how I had to choose a large value of a to make computer arithmetic look bad. What if a is small? Certainly, if a = 1, then there is exactly one solution. If a = 2, then there are at most two solutions: 2 x can only take even values, and the most significant bit of x is discarded.

We can generalize this result a bit: if a = 2^L, then there are at most 2^L solutions x.

But we can generalize it even further! We have that a x = b has at most 2^L integer solutions x if a is an L-bit non-zero integer. That is, as long as a is small, a x = b has few solutions. The proof is technical, but not difficult:

Proof (Sketch). Let a be an L-bit non-zero integer and suppose we work with K-bit arithmetic. We consider the equation a x div 2^L = b modulo 2^(K-L). Let j be the position of the first non-zero bit of a; e.g., if a = 7 then j = 1. Because a_i = 0 for i < j, the value of a x div 2^L is independent of the last j-1 bits of x. Moreover, the most significant bit of x which matters (x_(K-j+1)) only matters for the last bit of a x. We can solve for this last effectual bit x_(K-j+1) in (a x)_K = b_(K-L) as a function of b_(K-L), a and the lesser bits of x. We can continue solving for the lesser bits of x by considering a x = b minus the last bit. (In effect, we have reduced the problem in an integer ring with K bits to a similar problem in an integer ring with K-1 bits.) After K-L steps, the process terminates (K = L), leaving L bits in x that can be set freely, generating 2^L different solutions. QED

Getting back to our original problem: because the integer 10 can be written using 4 bits (0b1010), the equation 10 x = 100 has at most 2^4 = 16 solutions. So if x was just picked at random, chances are good that the assertion would fail: for 32-bit integers, the probability of a false positive is at most 15/2^32, which is much, much less than 1%.
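The bound is also easy to check empirically. Here is a minimal sketch (my own, with made-up test values) working over 16-bit arithmetic: for each test value a, it counts how many x map to each b = a x modulo 2^16 and reports the worst case, which never exceeds 2^L.

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const unsigned int size = 1u << 16;  // we work modulo 2^16
    const unsigned int testvalues[] = {1, 2, 3, 7, 10, 12, 255};
    for (unsigned int a : testvalues) {
        unsigned int L = 0;  // number of bits needed to write a
        for (unsigned int t = a; t != 0; t >>= 1) ++L;
        std::vector<unsigned int> count(size, 0);
        for (unsigned int x = 0; x < size; ++x)
            count[(a * x) % size]++;  // histogram of b = a x modulo 2^16
        unsigned int worst = *std::max_element(count.begin(), count.end());
        std::printf("a = %u (L = %u): worst case %u solutions, bound 2^L = %u\n",
                    a, L, worst, 1u << L);
    }
    return 0;
}

For a = 10, for instance, the worst case is 2 solutions, comfortably below the bound of 16.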

Further reading: M. Dietzfelbinger, Universal hashing and k-wise independent random variables via integer arithmetic without primes, in: STACS’96, 1996.

Write a Twitter application in 5 minutes

I spend much time alone, writing and thinking. Twitter helps me stay connected. I love the platform.

On Friday, I wanted to find the intersection between the users followed by any two individuals. Indeed, suppose that you like both Joe and Jill, and they have similar interests. Maybe whoever they both read is also interesting? I could not find a tool to do it, so I built it with python-twitter.

Anybody with a working knowledge of Python can do it in less than 5 minutes. I used only twenty lines of code (in total!!!). The code proved immediately useful.

If you do not know Python or Ruby, learn one or the other. Tonight. It is powerful stuff.

Update: A few days after I wrote this post, Twitter made my script obsolete by changing their authentication requirements.

Working long hours is stupid

We do too much. We carry too many projects. This overproduction creates problems which we try to fix by working even more.

We value most what we create (see Made by hand and The upside of irrationality). To be happy, you want to focus on making interesting stuff. This takes time and dedication. Yet as Graham’s essay The top idea in your mind stresses, we often fall into the trap of thinking mostly about money and personal disputes. These thoughts pull us away from our interests and prevent us from doing great work. As an example, I hear that Tiger Woods isn’t playing great golf. I bet he is either stuck in money problems or personal disputes, or both.

It is hard to be overworked by writing a book, by writing research articles or by playing golf. People are overworked dealing with email, context switching, money, and touchy relationships. This abundance of work makes people sad and boring. And this type of work tends to reproduce. The more you have, the more you will have.

Unemployment and pollution are visible results of our overproduction. Yet there are many more negative side effects. In academia, we train more and more Ph.D.s every year. Yet we have had too many Ph.D.s in the job market since the seventies. We write more and more research papers every year, and spend more and more time applying for research grants… but professors spend less and less time on curiosity-driven research.

It is cool to produce great work, but it is not cool to work 60 hours a week unless it is out of passion. And nobody is passionate about grant applications, marking papers or handling difficult people. Moreover, working long hours does not scale: you can’t increase your output continuously.

Our productivity will keep improving. I can write software faster and better than ever. I can research prior work with ease. I can ask fancy mathematical questions on the Web and get answers in minutes. Instead of investing back this productivity into more silly work, we need to get smarter:

  • Focus on the essential: programming great software, writing a fun book, a set of inspiring lecture notes or an insightful article.
  • Automate, reduce or delegate. Reduce is best: doing fewer things is cool!
  • A focus on money or on personal disputes makes you stupid. Yet that’s where success often takes you. Watch out!
  • Airplanes pollute. Travel takes you away from your family. Cars pollute and make you fat. Do you need all that junk?

Is multiplication slower than addition?

Earlier, I asked whether integer addition was faster than bitwise exclusive or. My tests showed no difference, and nobody contradicted me.

However, everyone knows that multiplication is slower than addition, right? In cryptography, there are many papers on how to trade multiplications for additions to speed up software.

So? Can you predict which piece of code runs faster?

scalar product (N multiplications):

for (int k = 0; k < N; ++k)
    answer += vector1[k] * vector2[k];

scalar product two-by-two (N multiplications):

for (int k = 0; k < N; k += 2)
    answer += vector1[k] * vector2[k]
            + vector1[k + 1] * vector2[k + 1];

non-standard scalar product (N/2 multiplications):

for (int k = 0; k < N; k += 2)
    answer += (vector1[k] + vector2[k])
            * (vector1[k + 1] + vector2[k + 1]);

just additions (no multiplication):

for (int k = 0; k < N; ++k)
    answer += vector1[k] + vector2[k];

Answer: Merely reducing the number of multiplications has no benefit, in these tests. Hence, simple computational cost models (such as counting the number of multiplications) may not hold on modern superscalar processors.
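For reference, here is a minimal sketch of how such a loop can be timed. It is not the actual benchmark (the real source code is available, as noted below), and the array size and fill values are made up.

#include <cstdio>
#include <ctime>
#include <vector>

int main() {
    const int N = 50 * 1000 * 1000;  // large enough to get measurable times
    std::vector<int> vector1(N), vector2(N);
    for (int k = 0; k < N; ++k) {    // arbitrary small values, just to fill the arrays
        vector1[k] = k % 1000;
        vector2[k] = (k + 1) % 1000;
    }
    long long answer = 0;
    clock_t start = clock();
    for (int k = 0; k < N; ++k)      // the standard scalar product
        answer += vector1[k] * vector2[k];
    clock_t stop = clock();
    std::printf("answer = %lld, time = %f s\n", answer,
                (stop - start) / (double)CLOCKS_PER_SEC);
    return 0;
}

In practice, you want to repeat each loop several times and make sure the compiler does not optimize the computation away (printing the answer helps).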

My results using GNU GCC 4.2.1 on both a desktop and a laptop:

algorithm              Intel Core i7    Intel Core 2 Duo
scalar product         0.30             0.39
scalar product (2×2)   0.25             0.39
fewer multiplications  0.25             0.39
just additions         0.16             0.23

Times are in seconds. The source code is available without pointer arithmetic. The same test with pointer arithmetic gives faster results, but the same conclusion. I tried a similar experiment in Java. It confirms my result.

Code: Source code posted on my blog is available from a github repository.