Several major players built alternatives to conventional database systems: Google created BigTable, Amazon built Dynamo and Facebook initiated Cassandra. There are many other comparable open source initiatives such as CouchDB and MongoDB. These systems are part of a trend called NoSQL because it is not centered around the SQL language. While there has always been non SQL-based database systems, the rising popularity of these alternatives in industry is drawing attention.

In The “NoSQL” Discussion has Nothing to Do With SQL, Stonebraker opposes the NoSQL trend in those terms:

(…) blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management.

In effect, Stonebraker says that all of the benefits of the NoSQL systems have nothing to do with ditching the SQL language. Of course, because the current breed of SQL is Turing complete, it is difficult to argue against SQL at the formal level. In theory, all Turing complete languages are interchangeable. You can do everything (bad and good) in SQL.

However, in practice, SQL is based on joins and related low-level issues like foreign keys. SQL entices people to normalize their data. Normalization fragments databases into smaller tables which is great for data integrity and beneficial for some transactional systems. However, joins are expensive. Moreover, joins require strong consistency and fixed schemas.

In turn, avoiding join operations makes it possible to maintain flexible or informal schemas, and to scale horizontally. Thus, the NoSQL solutions should really be called NoJoin because they are mostly defined by avoidance of the join operation.

How do we compute joins? There are two main techniques :

  • When dealing with large tables, you may prefer the sort merge algorithm. Because it requires sorting tables, it runs in O(n log n). (If your tables are already sorted in the correct order, sort merge is automatically the best choice.)
  • For in-memory tables, hash joins are preferable because they run in linear time O(n). However, the characteristics of modern hardware are increasing detrimental to the hash join alternative (see C. Kim, et al. Sort vs. Hash revisited. 2009).

(It is also possible to use bitmap indexes to precompute joins.) In any case, short of precomputing the joins, joining large tables is expensive and requires source tables to be consistent.

Conclusion: SQL is a fine language, but it has some biases that may trap developers. What works well in a business transaction system, may fail you in other instances.

I often come across the following type of arguments in research papers:

  • You could save 3 bits of storage for every value in your database. Surely that’s irrelevant. Nobody cares about saving 3 bits!
  • You can sort arrays in 10 ms. Surely, that cannot be improved upon? You are already down to 10 ms and nobody cares about such small delays.

I hope you can see what is wrong with these statements?

I call it the fallacy of absolute numbers: you express a measure or a gain in absolute value, and then conclude to optimality or near optimality because the number appears small (or large).

Remember: Saving 3 bits of storage out of 6 bits is a 2:1 compression ratio. Sorting in 5 ms instead of 10 ms doubles the speed.

Disclaimer: I am sure that someone else has documented this fallacy, but I could not find any reference to it.

It is often important to index XPath queries. Not only is XPath useful on its own, but it is also the basis for the FLWOR expressions in XQuery.

A typical XPath expression will select only a small fraction of any XML document (such as the value of a particular attribute). Thus, a sensible strategy is to represent the XML documents as tables. There are several possible maps from XML documents to tables. One of the most common is ORDPATH.

In the ORDPATH model, the root node receives the identifier 1, the first node contained in the root node receives the identifier 1.1, the second one receives the identifier 1.2, and so on. Given the ORDPATH identifiers, we can easily determine whether two nodes are neighbors, or whether they have a child-parent relationship.

As an example, here’s an XML document and its (simplified) ORDPATH representation:


<liste temps="janvier" >
<bateau />
<bateau >
<canard />
</bateau>
</liste>

ORDPATH name type value
1 liste element -
1.1 temps attribute janvier
1.2 bateau element -
1.3 bateau element -
1.3.1 canard element -

Given a table, we can easily index it using standard indexes such as B trees or hash tables. For example, if we index the value column, we can quickly process the XPath expression @temps=”janvier”.

Effectively, we can map XPath and XQuery queries into SQL. This leaves relatively little room for XML-specific indexes. I am certain that XML database designers have even smarter strategies, but do they work significantly better?

Reference: P. O’Neil, et al.. ORDPATHs: insert-friendly XML node labels. 2004.

Further reading: Native XML databases: have they taken the world over yet?

A common advice given out to young researchers is to find a niche. (See Michael’s Branding Your Research). That is certainly good advice. Instead of being another young researcher, you can be the new guy working on topic X. But it always seems to happen no matter what: most Ph.D. thesis address a narrow topic. I believe that the real advice people would like to give is: find yourself a nice topic, and make sure this topic becomes fashionable. Of course, this implies that you can somehow predict the future, or have a thesis supervisor with enough clout that he can either initiate new trends, or have inside knowledge regarding the upcoming trends.

A more interesting question is what you should do with the rest of your career, assuming you landed a research job, somehow. Should you find yourself one or two niche topics and stay there for the rest of your life? That is a common strategy. You save precious time: instead of having to skim 100 research articles a year, you may get by with 20 or 30 research articles, or even less. Moreover, because you are the leading authority on one or two topics, you can never be caught unaware. You never have to worry about finding new topics: you just keep on iteratively improving whatever you are doing right now. With some luck, you can reuse your funding proposals year after year. Finally, you can quickly get to know everyone that matters regarding these narrow topics. And that is a perfectly good strategy.

The problems begin when we associate the lack of a steady trajectory with failure. Encouraging static research topics leads to conservatism. Meanwhile, some of the most innovative researchers have cultivated varied interests. Von Neumann was a set theorist, but he wrote 20 papers in Physics, and even in Mathematics, he covered a wide range of topics (set theory, logic, topological groups, measure theory, ergodic theory, operator theory, and continuous geometry). Would we have been better off had von Neumann remained a pure set theorist?

And I tend to have more trust in researchers who have their eggs in different baskets. They can afford to be a bit more critical.

Warning: I am not urging Ph.D. students to change topic repeatedly while writing up their thesis. Finish whatever you start. And be aware that approaching a new research topic can be costly.

Technological progress tends to increase the available information. Thus, our capacity to manage this information becomes overloaded (hence the term information overload). As Clay Shirky explained: it is not so much an information overload, as a filter failure. The abundance of information is never a problem. The real problem is the lack of efficient strategies to index, summarize, filter, cross-reference and archive information.

But information overload is nothing new. In Reading Strategies for Coping With Information Overload ca. 1550-1700, Blair surveys the techniques our ancestors invented to cope with the abundance of books :

  • the alphabetical index;
  • the reference book,
  • copy and paste (with actual scissors) to save time in note-taking.

What I find fascinating is the historical perspective: while still useful, the alphabetical index is hardly exciting anymore. It has been supplanted by full text search (in e-books). There are still reference books (such as dictionaries), but they are being replaced with online tools. Information overload continues to generate many inventions: the search engine (such as Google), the recommender system (as on Amazon.com), and the social networks (such as Twitter). Literally, these tools expand our minds. We become smarter.

Yet, every time I finish writing a research article, I am amazed at how old fashioned the format is.

  • Research journals still ask for silly metadata such as keywords, even though most researchers rely on full text search.
  • The format is clearly meant for paper, even though most of my collaborators browse research articles on their computers.
  • We have silly things like page limitations.
  • It is excessively difficult to correct or improve a “published” article.

There is hope. The PLoS One journal presents research articles in an innovative format. The article is interactive: anyone can rate and comment it. Many journals allow the authors to upload supplementary material. Yet, I predict that in 20 years, we will look back and think that academic publishing in 2010 was archaic. (I admit that it is not a daring prediction.) There is much room for innovation.

Source: Erik Duval.

Next Page »

Powered by WordPress