NoSQL or NoJoin?

Several major players built alternatives to conventional database systems: Google created BigTable, Amazon built Dynamo and Facebook initiated Cassandra. There are many other comparable open source initiatives such as CouchDB and MongoDB. These systems are part of a trend called NoSQL because it is not centered around the SQL language. While there has always been non SQL-based database systems, the rising popularity of these alternatives in industry is drawing attention.

In The “NoSQL” Discussion has Nothing to Do With SQL, Stonebraker opposes the NoSQL trend in those terms:

(…) blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management.

In effect, Stonebraker says that all of the benefits of the NoSQL systems have nothing to do with ditching the SQL language. Of course, because the current breed of SQL is Turing complete, it is difficult to argue against SQL at the formal level. In theory, all Turing complete languages are interchangeable. You can do everything (bad and good) in SQL.

However, in practice, SQL is based on joins and related low-level issues like foreign keys. SQL entices people to normalize their data. Normalization fragments databases into smaller tables which is great for data integrity and beneficial for some transactional systems. However, joins are expensive. Moreover, joins require strong consistency and fixed schemas.

In turn, avoiding join operations makes it possible to maintain flexible or informal schemas, and to scale horizontally. Thus, the NoSQL solutions should really be called NoJoin because they are mostly defined by avoidance of the join operation.

How do we compute joins? There are two main techniques :

  • When dealing with large tables, you may prefer the sort merge algorithm. Because it requires sorting tables, it runs in O(n log n). (If your tables are already sorted in the correct order, sort merge is automatically the best choice.)
  • For in-memory tables, hash joins are preferable because they run in linear time O(n). However, the characteristics of modern hardware are increasing detrimental to the hash join alternative (see C. Kim, et al. Sort vs. Hash revisited. 2009).

(It is also possible to use bitmap indexes to precompute joins.) In any case, short of precomputing the joins, joining large tables is expensive and requires source tables to be consistent.

Conclusion: SQL is a fine language, but it has some biases that may trap developers. What works well in a business transaction system, may fail you in other instances.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

15 thoughts on “NoSQL or NoJoin?”

  1. Dan, there are more formal arguments for and against MySQL than just its Turing completeness. Come on!

    BTW, your spam protection is very elitist. But that’s what you want, right?

  2. Thanks for the overview on NoSQL. I’m not convinced that imposing a strict structure on data or even how we access it is suitable for all circumstances, even though it seems that as an industry and a research field, we seem to have convinced ourselves that it is.

    @Olivier: I don’t find the spam protection elitist at all. If Daniel wanted elitists, he’d ask questions only researchers could answer.

  3. Yes, some people believed, e.g., that you can use SQL DB to handle a search engine efficiently. All search engines fall short in terms of performance.

  4. This is one thing I keep failing to understand. Every database-powered application I have ever written, SQL or not, a join has been necessary at some point. If it’s a no join database, then I have to write the join in my application, by hand.

    I would assume that the writers of such databases as PostgreSQL and MySQL are much better than I am. Their join should definitely be faster and better than my join. Therefore, what’s the benefit of using a nojoin database if I have an application that requires joins? And if every application requires joins, what’s the point of ever using a nojoin database?

    Now for the hardest question. What if the join I wrote in my application using a nojoin database is somehow faster than the ones in MySQL and Postgres? In that case, why do we even need join databases at all? And why don’t the nojoin databases add a join command?

  5. If it’s a no join database, then I have to write the join in my application.

    No, it is not necessarily true. If you don’t normalize, you may not need a join. Second, a join can be precomputed. Third, a relational DB is a compromise (OLTP vs retrieval) solution. If you don’t need all fancy transaction staff you can do much better.

  6. Nice article. On this topic, I have a question which I have been trying to figure out for some time now. Whenever someone looks into NoSQL as a means to avoid costly join operations, there is usually a suggestion of using column-oriented databases (like HBase or Cassandra) to fix that issue. I’ve been thinking for quite some time on how column-oriented should be able to fix expensive join operations and I haven’t figured it out. Anyone care to enlighten me?

  7. Hm is that it? I know Cassandra also allows you to have more than one value on the same field (sorry if the terminology is not correct), which allows you to avoid many-to-many relationships therefore avoiding join operations.

  8. @André Silva
    Yes, it does not require normalization, which, can also be seen as a precomputed join.

    In addition, I believe that some of the column-oriented DBs work with compacted semi-static data. At least, BigTable does.

    I don’t remember how they handle transactions, I just remember that their data is mostly static. Therefore, it can be compressed and retrieved efficiently.

  9. Interesting article; I found this point a bit puzzling though:

    “SQL is based on joins and related low-level issues like foreign keys”.

    I think when it comes to models and languages for discussing data, many would agree that the relational model (and SQL, as an approximation of it) is *higher-level* than object-based database models. The model is purer and works at a higher level of abstraction, not being concerned with lower-level optimisations for particular physical storage layouts and data access paths, but instead modelling the data declaratively in a very pure flexible normalised form without redundancies.

    Also: SQL is declarative language – be careful to distinguish ‘join’ as a logical expression in a declarative language, and the implementation techniques used by some DMBSes to compute these joins. DBMSes are by no means compelled to use any one technique to compute a query. Sometimes they need to be given hints. But it’s worth separating the logical model from the physical storage model and any physical storage optimisations, at least in your mind when reasoning about these concepts.

  10. NoSQL systems could do joins (in principle) but they can’t garantuee its consistency.

    So “they” say that if you want to do a join you have to do it in the application. Doing a join in the application is always slower because it means that you have to call the db twice instead of once. There is also a greater change of inconsistency because stuff can change between the first call and the second call to the db.

    It is all about who the problem owner is when it comes to inconsistency? The database or the application?

    Cassandra (R + W > N) only offers consistency when you retrieve one “record”. Oracle offers consistency when you retrieve one, two or a million (aggregation) records.

    Cassandra and MongoDB “records” can contain a lot of information so there is less need to join.

    Read here: http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/

  11. Joins are necessary and v e r y f a s t on SSD drives (speaking about “the characteristics of modern hardware are increasing detrimental to the hash join alternative”).
    Sure it also helps if the implementation is a cache oblivious algorithm.

Leave a Reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.