Over-normalization is bad for you

I took a real beating with my previous post where I argued against excessive normalization on the grounds that it increases complexity and inflexibility, and thus makes the application design more difficult. Whenever people get angry enough to post comments on a post of mine, I conclude that I am onto something. So, let’s go at it again.

On the physical side, developers use normalization to avoid storing redundant data. While this might be adequate with modern database systems, I do not think it is well founded, in principle. Consider this Java code:

String x1 = "John Smith";
String x2 = "Lucy Smith";
String x3 = "John Smith";
Is this code inefficient? Won’t the Java compiler create 3 strings whereas only 2 are required? Not at all. Java interns string literals, so it is smart enough to store only 2 distinct strings (a quick check follows the table below). Thus, there is no reason for this non-normalized table to be inefficient storage-wise even though Jane Wright appears twice:

Customer ID | First Name | Surname   | Telephone Number
123         | Robert     | Ingram    | 555-861-2025
456         | Jane       | Wright    | 555-403-1659
456         | Jane       | Wright    | 555-776-4100
789         | Maria      | Fernandez | 555-808-9633
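
(You can check the claim about Java string literals yourself; this is a minimal sketch, and the class name is mine.)

public class InternCheck {
    public static void main(String[] args) {
        String x1 = "John Smith";
        String x2 = "Lucy Smith";
        String x3 = "John Smith";
        // String literals are interned: x1 and x3 refer to the same object.
        System.out.println(x1 == x3); // true
        System.out.println(x1 == x2); // false
    }
}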

Nevertheless, the First normal form article on Wikipedia suggests normalizing the data into two tables:

Customer ID | Telephone Number
123         | 555-861-2025
456         | 555-403-1659
456         | 555-776-4100
789         | 555-808-9633

Customer ID | First Name | Surname
123         | Robert     | Ingram
456         | Jane       | Wright
789         | Maria      | Fernandez

Pros of the normalized version:

  • It does look like the normalized version uses less storage. However, the database engine could compress the non-normalized version so that both use the same space (in theory).
  • We can enforce the constraint that a customer can have a single name by requiring that the Customer ID is a unique key (in the second table). The same constraint can be enforced on the non-normalized table, but less elegantly.

Pros of the non-normalized version:

  • A customer could have different names depending on the phone number. For example, Edward Buttress could use his real name for his work phone number, but he could report the name Ed Butt for his home phone number. The power to achieve this is entirely in the hands of the software developer: there is no need to change the schema to add such a feature.
  • If you start with existing data or try to merge user accounts, the non-normalized version might be the only sane possibility.
  • The database schema is simpler. We have a single table! You can understand the data just by reading it. Most of your queries will be shorter and more readable (no join!).

To a large extent, it seems to me that the question of whether to normalize or not is similar to the debate of static versus dynamic typing. It is also related to the debate between the proponents of the waterfall method and the agile crowd. Some might say that working with a non-normalized table is cowboy coding. In any case, there is a trade-off: flexibility versus safety. But to my knowledge, the trade-off is largely undocumented. What are the opportunity costs of complex database schemas? How many databases become unusable for lack of normalization?

I think that part of the NoSQL appeal for many developers is strong data independence. Having to redesign your schema to add a feature to your application is painful. It may even kill the feature in question because the cost of trying it out is too high. Normalization makes constraints easier, but it also reduces flexibility. We should, at least, be aware of this trade-off.

Note: Yes, my example goes against the current practice and what is taught in all textbooks. But that is precisely my intent.

Update: Database views can achieve a similar level of data independence.

Who is going to need a database engine in 2020?

Given the Big Data phenomenon, you might think that everyone is becoming a database engineer. Unfortunately, writing a database engine is hard:

  • Concurrency is difficult. Whenever a data structure is modified by different processes or threads, it can end up in an inconsistent state. Database engines cope with concurrency in different ways: e.g., through locking or multiversion concurrency control. While these techniques are well known, few programmers have had a chance to master them (a minimal locking sketch follows this list).
  • Persistence is also difficult. You must somehow keep the database on a slow disk, while keeping some of the data in RAM. At all times, the content of the disk should be consistent. Moreover, you must avoid data loss as much as possible.
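
To give a flavor of the locking approach, here is a minimal sketch of a thread-safe key-value store in Java. The class and method names are hypothetical, and a real engine would also need logging, recovery, and much finer-grained concurrency control.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TinyStore {
    private final Map<String, String> data = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    String get(String key) {
        lock.readLock().lock();      // many readers can proceed at once
        try { return data.get(key); }
        finally { lock.readLock().unlock(); }
    }

    void put(String key, String value) {
        lock.writeLock().lock();     // writers get exclusive access
        try { data.put(key, value); }
        finally { lock.writeLock().unlock(); }
    }

    public static void main(String[] args) {
        TinyStore store = new TinyStore();
        store.put("123", "Robert Ingram");
        System.out.println(store.get("123")); // Robert Ingram
    }
}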

So, developers almost never write their own custom engines. Some might say that it is an improvement over earlier times when developers absolutely had to craft everything by hand, down to the B-trees.  The result was often expensive projects with buggy results.

However, consider that even a bare-metal language like C++ is getting support for concurrency and threads, and esoteric features like regular expressions. Moreover, Oracle's hard work at killing the Java Community Process will incite Java developers to migrate to better languages.

Meanwhile, in-memory databases are finally practical and inexpensive. Indeed, whereas a 16 GB in-memory database was insane ten years ago, you can order a desktop with 32 GB of RAM from Apple’s web site right now. Moreover, memory capacity grows exponentially: at this pace, Apple will be selling desktops with 1 TB of RAM by 2020. And researchers predict that non-volatile Resistive RAM (RRAM) may replace DRAM. Non-volatile internal memory would make persistence much easier.

But why would you ever want to write your own database engine?

  • For speed, some engines force you to use nasty things like stored procedures. It is a drastically limited programming model.
  • The mismatch between how the programmer thinks and how the database engine works can lead to massive overhead. As crazy as it sounds, I can see a day when writing your engine will save time. Or, at least, save headaches.
  • Clever programmers can write much faster specialized engines.

Obviously, programmers will need help. They will need great libraries to help with data processing, data compression, and data architecture. Oh! And they will need powerful programming languages.

NoSQL or NoJoin?

Several major players have built alternatives to conventional database systems: Google created BigTable, Amazon built Dynamo, and Facebook initiated Cassandra. There are many other comparable open source initiatives, such as CouchDB and MongoDB. These systems are part of a trend called NoSQL because they are not centered around the SQL language. While there have always been non-SQL database systems, the rising popularity of these alternatives in industry is drawing attention.

In The “NoSQL” Discussion has Nothing to Do With SQL, Stonebraker opposes the NoSQL trend in those terms:

(…) blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management.

In effect, Stonebraker says that all of the benefits of the NoSQL systems have nothing to do with ditching the SQL language. Of course, because the current breed of SQL is Turing complete, it is difficult to argue against SQL at the formal level. In theory, all Turing complete languages are interchangeable. You can do everything (bad and good) in SQL.

However, in practice, SQL is based on joins and related low-level issues like foreign keys. SQL entices people to normalize their data. Normalization fragments databases into smaller tables which is great for data integrity and beneficial for some transactional systems. However, joins are expensive. Moreover, joins require strong consistency and fixed schemas.

In turn, avoiding join operations makes it possible to maintain flexible or informal schemas, and to scale horizontally. Thus, the NoSQL solutions should really be called NoJoin because they are mostly defined by avoidance of the join operation.

How do we compute joins? There are two main techniques:

  • When dealing with large tables, you may prefer the sort merge algorithm. Because it requires sorting tables, it runs in O(n log n). (If your tables are already sorted in the correct order, sort merge is automatically the best choice.)
  • For in-memory tables, hash joins are preferable because they run in linear time O(n). However, the characteristics of modern hardware are increasingly detrimental to the hash join alternative (see C. Kim et al., Sort vs. Hash revisited, 2009). A hash-join sketch follows this list.
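
As an illustration of the hash-join approach, here is a minimal in-memory sketch in Java. The table contents are made up, and a real engine would stream rows rather than hold everything in memory.

import java.util.HashMap;
import java.util.Map;

public class HashJoinSketch {
    public static void main(String[] args) {
        // Build phase: hash the smaller table (customer id -> name).
        Map<Integer, String> customers = new HashMap<>();
        customers.put(123, "Robert Ingram");
        customers.put(456, "Jane Wright");
        customers.put(789, "Maria Fernandez");

        // Probe phase: stream over the larger table and look up matches.
        int[][] phones = { {123, 2025}, {456, 1659}, {456, 4100}, {789, 9633} };
        for (int[] row : phones) {
            String name = customers.get(row[0]);
            if (name != null) {
                System.out.println(name + " -> extension " + row[1]);
            }
        }
    }
}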

(It is also possible to use bitmap indexes to precompute joins.) In any case, short of precomputing the joins, joining large tables is expensive and requires source tables to be consistent.

Conclusion: SQL is a fine language, but it has some biases that may trap developers. What works well in a business transaction system may fail you in other instances.

Indexing XML

It is often important to index XPath queries. Not only is XPath useful on its own, but it is also the basis for the FLWOR expressions in XQuery.

A typical XPath expression will select only a small fraction of any XML document (such as the value of a particular attribute). Thus, a sensible strategy is to represent the XML documents as tables. There are several possible maps from XML documents to tables. One of the most common is ORDPATH.

In the ORDPATH model, the root node receives the identifier 1, the first node contained in the root node receives the identifier 1.1, the second one receives the identifier 1.2, and so on. Given the ORDPATH identifiers, we can easily determine whether two nodes are neighbors, or whether they have a child-parent relationship.

As an example, here’s an XML document and its (simplified) ORDPATH representation:


<liste temps="janvier">
  <bateau/>
  <bateau>
    <canard/>
  </bateau>
</liste>

ORDPATH | name   | type      | value
1       | liste  | element   |
1.1     | temps  | attribute | janvier
1.2     | bateau | element   |
1.3     | bateau | element   |
1.3.1   | canard | element   |
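
Given such labels, structural checks reduce to simple string manipulations. Here is a minimal sketch in Java, using dotted strings rather than the compact binary encoding of the real ORDPATH scheme; the method names are mine.

public class OrdpathSketch {
    // true if parent is the direct parent of child, e.g. "1.3" and "1.3.1"
    static boolean isParentOf(String parent, String child) {
        return child.startsWith(parent + ".")
                && child.indexOf('.', parent.length() + 1) == -1;
    }

    // true if the two nodes share the same parent label (siblings)
    static boolean areSiblings(String a, String b) {
        return !a.equals(b) && parentLabel(a).equals(parentLabel(b));
    }

    static String parentLabel(String label) {
        int i = label.lastIndexOf('.');
        return i < 0 ? "" : label.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(isParentOf("1.3", "1.3.1")); // true
        System.out.println(isParentOf("1", "1.3.1"));   // false (grandchild)
        System.out.println(areSiblings("1.2", "1.3"));  // true
    }
}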

Given a table, we can easily index it using standard indexes such as B-trees or hash tables. For example, if we index the value column, we can quickly process the XPath expression @temps="janvier".

Effectively, we can map XPath and XQuery queries into SQL. This leaves relatively little room for XML-specific indexes. I am certain that XML database designers have even smarter strategies, but do they work significantly better?

Reference: P. O’Neil et al., ORDPATHs: insert-friendly XML node labels, 2004.

Further reading: Native XML databases: have they taken the world over yet?

Sorting is fast and useful

I like to sort things. If you should learn one thing about Computer Science, it is that sorting is fast and useful.

Here’s a little example. You want to check quickly whether an integer belongs to a set. Maybe you want to determine whether a userID is valid. Obvious solutions include a HashSet and binary search over a sorted array.

I wrote a Java benchmark to compare the solutions.

Binary search over a sorted array is only 10% slower than the HashSet. Yet, the sorted array uses half the memory. Hence, using a sorted array is the clear winner for this problem.
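
Here is a minimal sketch of the sorted-array approach in Java (the user IDs are made up):

import java.util.Arrays;

public class SortedMembership {
    static boolean contains(int[] sorted, int key) {
        // Arrays.binarySearch returns a negative value when the key is absent.
        return Arrays.binarySearch(sorted, key) >= 0;
    }

    public static void main(String[] args) {
        int[] userIds = { 90001, 42, 1001, 17, 5003, 128 };
        Arrays.sort(userIds); // sort once, up front

        System.out.println(contains(userIds, 1001)); // true
        System.out.println(contains(userIds, 999));  // false
    }
}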

If you think that’s a little bit silly, consider that column-oriented DBMSes like Vertica use binary search over sorted columns as an indexing technique.

Code: Source code posted on my blog is available from a github repository.

Run-length encoding (part 2)

(This is a follow-up to my previous blog post; there is also a part 3.)

Any run-length encoding requires you to store the number of repetitions. In my example, where AAABBBBBZWWK becomes 3A-5B-1Z-2W-1K, we must store 5 counters (3, 5, 1, 2, 1) and 5 characters.

Storing counters using a fixed number of bits

Most programming languages represent integers using a fixed number of bits in binary format. For example, Java represents integers (int) using 32 bits. However, in my example (3A-5B-1Z-2W-1K) storing the counters using 32 bits and the characters using 8 bits means that I use 25 bytes which is more than twice the length of the original string (AAABBBBBZWWK).

Thus, we have a simple optimization problem: determine the best number of bits. In practice, it might be better to store the data in a byte-aligned way. That is, you should be using 8, 16, 32 or 64 bits. Indeed, reading numbers represented using an arbitrary number of bits may involve CPU processing overhead.

If you use too few bits, some long runs will have to count as several small runs. If you use too many bits, you are wasting storage. Unfortunately, determining on a case-by-case basis the best number of bits requires multiple scans of the data. It also implies added software complexity.
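
As a sketch of the byte-aligned option, here is a small Java encoder that stores each counter in a single byte and simply splits any run longer than 255 into several runs (the class and method names are mine):

import java.io.ByteArrayOutputStream;

public class ByteAlignedRle {
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int run = 1;
            while (i + run < s.length() && s.charAt(i + run) == c && run < 255) {
                run++; // cap the run at 255 so the counter fits in one byte
            }
            out.write(run); // 8-bit counter
            out.write(c);   // 8-bit character (ASCII assumed)
            i += run;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 5 runs x 2 bytes = 10 bytes, versus 12 bytes for the original string
        System.out.println(encode("AAABBBBBZWWK").length);
    }
}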

But you don’t have to use the binary format!

You can still use a fixed number of bits for your counters, but with quantized codes instead of the binary format. For example, using 3 bits, you could allow only the counter values 1, 2, 16, 24, 32, 128, 256, 1024. In this example, the sequence of bits 000 is interpreted as the value 1, the sequence of bits 001 as the value 2, the sequence 010 as 16, and so on. Determining the best codes implies that you must scan the data, compute the histogram of the counters, and then apply some optimization algorithm (such as dynamic programming). The decoding speed might be slightly slower as you need to look up the codes in a table.

Using variable-length counters for optimal compression

If you are willing to sacrifice coding and decoding speed, then you can represent the counters using universal codes. Thus, instead of using a fixed number of bits and optimizing the representation (as in the quantized coding idea), you seek an optimal variable-length representation of the counters. With this added freedom, you can be much more efficient.

To illustrate the idea behind variable-length codes, consider unary codes: the code 1 represents 1, the code 01 represents 2, the code 001 represents 3, the code 0001 represents 4, and so on. Thus, we use x bits to represent the number x.

Fortunately, we can do much better than unary codes: Elias gamma codes represent the number x using roughly 2 log x bits. First, write x as x = 2^N + (x modulo 2^N) where N = floor(log2 x). We store N+1 in unary (using N+1 bits), followed by (x modulo 2^N) in binary format (using N bits). Thus, we can represent the number x using 2 floor(log2 x) + 1 bits.
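
Here is a minimal sketch of such an encoder in Java, writing the bits as a string of '0' and '1' characters for readability; a real implementation would pack the bits into words.

public class GammaCode {
    static String encode(int x) { // x >= 1
        String binary = Integer.toBinaryString(x); // starts with a 1 bit
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < binary.length(); i++) {
            sb.append('0'); // floor(log2 x) zeros...
        }
        return sb.append(binary).toString(); // ...followed by x in binary
    }

    public static void main(String[] args) {
        for (int x = 1; x <= 6; x++) {
            System.out.println(x + " -> " + encode(x));
        }
        // 1 -> 1, 2 -> 010, 3 -> 011, 4 -> 00100, 5 -> 00101, 6 -> 00110
        // The code for x uses 2*floor(log2 x) + 1 bits.
    }
}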

Unfortunately, the current breed of microprocessors is not kind to variable-length representations, so the added compression comes at the expense of decoding speed.

Continue with part 3.

References and further reading: Holloway et al., How to Barter Bits for Chronons, 2007. See also the slides of my recent talk Compressing column-oriented indexes.

Run-length encoding (part 1)

(This is part 1; there is also a part 2 and a part 3.)

Run-length encoding (RLE) is probably the most important and fundamental string compression technique. Countless multimedia formats and protocols use one form of RLE compression or another.

RLE is also deceptively simple. It represents repeated values as a counter and a character. Thus, the string AAABBBBBZWWK becomes 3A-5B-1Z-2W-1K.
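
Here is a minimal Java sketch of this basic encoder (the dashes are only there to make the output readable):

public class BasicRle {
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int run = 1;
            while (i + run < s.length() && s.charAt(i + run) == c) {
                run++;
            }
            if (sb.length() > 0) sb.append('-');
            sb.append(run).append(c); // counter followed by the character
            i += run;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("AAABBBBBZWWK")); // 3A-5B-1Z-2W-1K
    }
}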

If that is all there was to RLE, then the wikipedia page on run-length encoding would be just fine. Yet, I think it needs help.

Why do we use RLE?

  • You can read and write RLE data in one pass, using almost no memory.
  • Given a vector V compressed with RLE, you can apply any scalar function f to its components in time O(|V|) where |V| is the compressed size of the vector.
  • Given two vectors V and V' compressed with RLE, you can do arithmetic (e.g. V+V') in time O(|V|+|V'|).

(Some RLE formats have worse complexity bounds.)

Any downsides to RLE?

  • Random access is slower. Sometimes, only sequential read (from the beginning) is possible. Updating an RLE-compressed array can be difficult.
  • You need long runs of identical values.
  • Some RLE formats negatively affect CPU vectorization. Thus, if the compression rates are modest, it could actually take longer to process an RLE-compressed array.

What is the RLE format?

There is no unique RLE format. How you use the RLE idea depends on your goals, such as (1) maximizing the compression rate or (2) maximizing the processing speed.

Here are some common variations:

  • Instead of using a counter for each run of characters, you only add a counter after a value has been repeated twice. For example, the string AAABBBBBZWWK becomes AA1-BB3-Z-WW0-K. Thus, if many characters are not repeated, you will rarely use an unnecessary counter.
  • You can use a single bit to decide whether a counter is used. For example, the string AAABBBBBZWWK becomes A-True-3, B-True-5, Z-False, W-True-2, K-False. Again, this may avoid many unnecessary counters if values are often not repeated.
  • Instead of a counter, you may store the location of the run in the array. For example, the string AAABBBBBZWWK becomes 1A-4B-9Z-10W-12K. The benefit of this approach is to allow random access in logarithmic time using binary search (see the sketch after this list). However, it is also incompatible with some techniques to avoid unnecessary counters. So, it is a compression-speed trade-off. For even more speed, you can store both the location of the run and its length (thus avoiding a subtraction).
  • To help vectorization, you can group the characters into blocks of k characters. For example, using blocks of two characters, the string AAABBBBBZWWK becomes 1AA-1AB-2BB-1ZW-1WK. Again, this is a compression-speed trade-off because there will be fewer runs to compress after grouping the characters into long blocks.
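
Here is a minimal Java sketch of the location-based variant from the third bullet: random access to position i becomes a binary search over the run start positions (I use 0-based positions, and the names are mine).

import java.util.Arrays;

public class PositionRle {
    // starts[j] is the index where run j begins; values[j] is its character
    static char valueAt(int[] starts, char[] values, int i) {
        int pos = Arrays.binarySearch(starts, i);
        if (pos < 0) {
            pos = -pos - 2; // the run whose start precedes position i
        }
        return values[pos];
    }

    public static void main(String[] args) {
        // AAABBBBBZWWK: runs start at positions 0, 3, 8, 9 and 11
        int[] starts = { 0, 3, 8, 9, 11 };
        char[] values = { 'A', 'B', 'Z', 'W', 'K' };
        System.out.println(valueAt(starts, values, 5));  // B
        System.out.println(valueAt(starts, values, 11)); // K
    }
}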

Continue reading to part 2.

More database compression means more speed? Right?

Current practical database compression techniques stress speed over compression:

  • Vectorwise is using Super-scalar RAM-CPU cache compression which includes a carefully implemented dictionary coder.
  • C-store—and presumably Vertica—is using similar compression techniques as well as simple run-length encoding of projection indexes.
  • Oracle is compressing its bitmaps using Byte-aligned bitmap compression—a variation on run-length encoding.
  • Wu et al.’s FastBit as well as my very own Lemur Bitmap Index C++ Library use less aggressive compression techniques, for faster results. In fact, one of my recent empirical results is that on a two-CPU dual-core machine, using 64-bit words instead of 32-bit words in word-aligned compression—which nearly halves the compression—can make the processing faster.
  • LucidDB similarly compresses its bitmap indexes with a simple variation on run-length encoding.

In a comment to my previous blog post, Rasmus Pagh asks more or less this question:

Given that we have more and more CPU cores—and generally more powerful CPUs—shouldn’t we compress the data more aggressively?

As CPUs get more powerful, shouldn’t all databases become I/O bound? That is, shouldn’t the main difficulty be bringing enough data to the CPUs?

Apparently not.

  • As we have more CPU cores, we also have more bandwidth to bring data to the cores. Otherwise, CPU cores would be constantly data-starved in most multimedia and business applications.
  • Multicore CPUs are not the only game in town: RAM and storage have also been revolutionized. In 2009, it is not impractical to run database applications entirely from RAM. After all, many business databases fit in 16 GB of RAM. And even when we do not have enough RAM, we have SSDs.
  • Lightweight compression techniques often favor vectorization which is also getting more and more important and powerful.

Hence, for most database applications fast decompression remains preferable to aggressive compression.

Which should you pick: a bitmap index or a B-tree?

Morteza Zaker sent me a pointer to their work comparing bitmap indexes and B-trees in the Oracle database. They examine the folklore surrounding bitmap indexes—which are often thought to be mostly useful over low cardinality columns (columns having few distinct values, such as gender). Their conclusion is that the bitmap index is the conclusive choice for data warehouse design for columns with high or low cardinality. Their claim is backed by experiments using columns with millions of distinct values. This confirms the observation made in my earlier post: The mythical bitmap index.

They make an interesting distinction between full cardinality columns, columns where each value appears only once, and high cardinality columns where at least a few values are often repeated (think about storing last names). A bitmap index is inadequate for a full cardinality column because there is no compression possible, and this is probably how the folklore around bitmap indexes came about. Yet, unlike transaction processing, data warehousing is usually not concerned with full cardinality columns.

Reference: Zaker et al., An Adequate Design for Large Data Warehouse Systems: Bitmap Index Versus B-Tree Index, IJCC 2 (2), 2008.

Further reading: Trading compression for speed with vectorization, To improve your indexes: sort your tables! and my research paper Sorting improves word-aligned bitmap indexes.

Trading compression for speed with vectorization

Bitmap indexes are used by search engines (such as Apache Lucene), and they are available in DBMSes such as Oracle and PostgreSQL. They are used in column stores such as the Open Source engines Eigenbase and C-Store, as well as by many commercial solutions such as Vertica.

Bitmap indexes are silly data structures. Map each value to an array of booleans. Hence, if you have n rows in your table, and k distinct values, you get an n by k matrix containing booleans. Thus, some people falsely assume that bitmap indexes are only adequate when there are few distinct values (e.g., the gender column, male and female being the only two options). However—using techniques based on run-length encoding—the total size of your bitmaps is proportional to the size of the original table, irrespective of the number of distinct values!

Bitmap indexes are fast because they benefit from vectorization. Indeed, suppose the predicate “sex=male” is satisfied on rows 1, 5, 32, 45, 54 and 63. I can determine which rows satisfy the extended predicate “(sex=male) AND (city=Montreal)” using a single instruction! The secret? A bitwise AND between the bitmaps “sex=male” and “city=Montreal”. You can compute unions, differences and intersections between sets of integers in [1,N] using only N/64 operations. All microprocessors have built-in parallelism because they operate on several bits at once.
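
Here is a minimal Java sketch of the idea, packing one bit per row into 64-bit words; the row numbers follow the example above.

public class BitmapAnd {
    public static void main(String[] args) {
        int n = 128; // number of rows
        long[] male = new long[n / 64];
        long[] montreal = new long[n / 64];

        for (int row : new int[] { 1, 5, 32, 45, 54, 63 }) {
            male[row / 64] |= 1L << (row % 64);     // sex=male
        }
        for (int row : new int[] { 5, 45, 100 }) {
            montreal[row / 64] |= 1L << (row % 64); // city=Montreal
        }

        // (sex=male) AND (city=Montreal): one bitwise AND per 64 rows
        long[] both = new long[n / 64];
        for (int w = 0; w < both.length; w++) {
            both[w] = male[w] & montreal[w];
        }

        for (int row = 0; row < n; row++) {
            if ((both[row / 64] & (1L << (row % 64))) != 0) {
                System.out.println(row); // prints 5 and 45
            }
        }
    }
}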

To benefit from vectorization, you need to store the data in a word-aligned manner: that is, you store consecutive segments of bits uncompressed. The longer the words, the less compression. Roughly speaking, 64-bit bitmap indexes are nearly twice as large as 32-bit bitmap indexes. What is the effect on the processing speed? We found that despite being much larger, 64-bit bitmap indexes were faster. That is right: it was faster to load twice as much data from disk!

Yet, we often equate concise data structures with more speed. This assumption can be misguided. Given a choice between more compression, or more vectorization, I would choose more vectorization.

Further reading: See my posts Compressed bitmaps in Java, To improve your indexes: sort your tables!, and The mythical bitmap index.

Column stores and row stores: should you care?

Most database users know row-oriented databases such as Oracle or MySQL. In such engines, the data is organized by rows. Database researcher and guru Michael Stonebraker has been advocating column-oriented databases. The idea is quite simple: by organizing the data into columns, we can compress it more efficiently (using simple ideas like run-length encoding). He even founded a company, Vertica, to sell this idea.

Daniel Tunkelang is back from SIGMOD: he reports that column-oriented databases have grabbed much mindshare. While I did not attend SIGMOD, I am not surprised. Daniel Abadi was awarded the 2008 SIGMOD Jim Gray Doctoral Dissertation Award for his excellent thesis on Column-Oriented Database Systems. Such great work supported by influential people such as Stonebraker is likely to get people talking.

But are column-oriented databases the next big thing? No.

  • Column stores have been around for a long time in the form of bitmap and projection indexes. Conceptually, there is little difference. (See my own work on bitmap indexes.)
  • While it is trivial to change or delete a row in a row-oriented database, it is harder in column-oriented databases. Hence, applications are limited to data warehousing.
  • Column-oriented databases are faster for some applications. Sometimes faster by two orders of magnitude, especially on low selectivity queries. Yet, part of these gains is due to the recent evolution of our hardware. Hardware configurations where reading data sequentially is very cheap favor sequential organization of the data, such as column stores. What might happen in the world of storage and microprocessors in the next ten years?

I believe Nicolas Bruno said it best in Teaching an Old Elephant New Tricks:

(…) some C-store proponents argue that C-stores are fundamentally different from traditional engines, and therefore their benefits cannot be incorporated into a relational engine short of a complete rewrite (…) we (…) show that many of the benefits of C-stores can indeed be simulated in traditional engines with no changes whatsoever.  Finally, we predict that traditional relational engines will eventually leverage most of the benefits of C-stores natively, as is currently happening in other domains such as XML data.

That is not to say that you should avoid Vertica’s products or do research on column-oriented databases. However, do not bet your career on them. The hype will not last.

(For a contrarian point of view, read Abadi and Madden’s blog post on why column stores are fundamentally superior.)

Why senior researchers and managers should analyze data themselves…

Scientists, businessmen and even spies are supposed to analyze data collaboratively. Are they?

If you are a scientist, you are familiar with the following type of research collaboration: a lowly student collects the data, crunches the numbers and plots the data. Other collaborators—such as the professor—merely comment on the tables and plots. Similarly, the CEO sees the pie chart, while the assistant crunches the numbers. That is vertical collaboration: you clean the basement and I will clean the main floor.

Yet, reliable data analysis requires horizontal collaboration.  Indeed, there are downsides to task specialization:

  • By never looking at the data, senior scientists and managers rely on experience and hearsay. Their incoming bandwidth is dramatically reduced. Nature is the best coauthor. Consider how the best American spies were fooled prior to 9/11 while all the data to catch the terrorists was available. Bandwidth is a requirement to be smart.
  • When a single person crunches the numbers, hard-to-detect errors creep in. The problem is serious: Ioannidis showed that most research findings are wrong.
  • With nobody to review the source data, the sole data analyst is more likely to cheat. Why rerun these tests properly, when you can just randomly dismiss part of the data? People are lazy: when given no incentive, we take the easy way out.

The common justification for task specialization is that senior researchers and managers do not have the time. Yet, 30 years ago, researchers and managers did not type their own letters. Improve the tools, and reduce task specialization.

With Sylvie Noël, I decided to have a closer look. My preliminary conclusions are as follows:

  • There are adequate tools to support rich collaboration over data analysis. Collaboratories have been around for a long time. We have the technology! Yet, we may need a disruption: inexpensive, accessible and convenient tools. The current migration toward Web-based applications might help.
  • Given a chance, everyone will pitch in. To make our demonstration, we collected user data from sites such as IBM Many Eyes and StatCrunch. We then ran an Ochoa-Duval analysis. We find that the network of users within web-based data analysis tools is comparable to other Web 2.0 sites.

As a database researcher, I think that further progress lies with loosely coupled data (no big tables! no centralized tool!) and flexible visualization tools (stop the pie charts! go with tag clouds!). I am currently looking for new research directions on this problem. Any ideas?

Quantum databases: what are they good for?

Hu et al. just posted An efficient quantum search engine on unsorted database. They refer to an older paper by Patel (2001), Quantum database search can do without sorting. Apparently without any data structure or preprocessing, logarithmic-time search queries are possible in quantum databases.

Even if we did have affordable quantum computers, this would not be a big selling point. Building B-trees or sorting tables is hardly prohibitive.

I would be more interested in how quantum databases can handle low selectivity queries. For example, can a quantum computer sum up a large array of numbers in near-constant time? Our current technology solves these problems with a mix of parallelism, compression and brute-force.

All I know about quantum computers is that we do not think they can solve NP-hard problems in polynomial time. Is there any reason to see a bright future in quantum databases?

It’s official: the standard programming language for multidimensional databases is MDX

Julian Hyde just announced that Oracle will support MDX: they were the last vendor to resist this Microsoft technology. MDX is to multidimensional databases what SQL is to relational databases.

I am disappointed that MDX is the best we could come up with. I don’t find MDX enjoyable.

Further reading: What are the computer language people waiting for?

Compressed bitmaps in Java

A bitmap is an efficient array of boolean values. They are commonly used in bitmap indexes. The Java language has a bitmap class: BitSet.

Unfortunately, the Java BitSet class will not scale to large sparse bitmaps—the Sun implementation does not use compression.
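
A quick way to see the problem (a minimal sketch; the numbers are only illustrative):

import java.util.BitSet;

public class SparseBitSetDemo {
    public static void main(String[] args) {
        BitSet sparse = new BitSet();
        sparse.set(1_000_000_000); // a single bit is set...
        System.out.println(sparse.cardinality()); // 1
        // ...but the backing array now holds about 1_000_000_000 / 64 words,
        // on the order of 125 MB, almost all of them zero.
        System.out.println(sparse.size()); // bits of allocated space
    }
}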

I published a new free alternative: JavaEWAH.

Credit:

  • Glen Newton gave me the idea for this project.

The Claremont report on database research

Every few years, the database research community prepares a report listing the most promising research directions. The previous one was called the Lowell report, and I was inspired by it. The latest one is called the Claremont report.

Some bits I found interesting:

  • There is a call to exploit remote RAM and Flash as persistent media, rather than relying solely on magnetic disk. Indeed, Solid State Drives are an order of magnitude faster than our spinning disks, and large pools of RAM are becoming affordable. Are external-memory algorithms no longer a hot topic? (Yes, it is not that simple.)
  • Web 2.0-style applications bring new database workloads. I did some work on merging Web 2.0 and OLAP and can testify that the Social Web is a good source of new database problems.
  • They recommend integrating compression and query optimization. My work on compression in bitmap indexes suggests that there are still open issues regarding compression in databases. Mostly, whereas information theory has taught us much about how to optimally compress, we have learned relatively little about how to use compression to save CPU cycles.

Native XML databases: have they taken the world over yet?

Some years ago, the database research community jumped into XML. Finally, something new to work on! For about 5 years now, I have seen predictions that the XML databases would take the world over. Every organization would soon have its XML database. People would run web sites out of XML databases. Countless start-ups emerged ready to become the next Oracle.

What happened in practice is a bit underwhelming. Oracle, Microsoft, MySQL and others all included some XML support in their relational databases, but native XML databases failed to grab any market share.

Where are we?

  • Regarding programming languages, XQuery finally became a W3C recommendation in January 2007. More or less, XQuery together with XPath specifies the equivalent of a SELECT statement in SQL.
  • What if you want to update your XML database?
    XUpdate has been around for some time, but it is not widely supported. The W3C is working on something called XQuery Update Facility.
  • Interfacing XQuery with your favorite programming language is still awkward. We have an API for XML databases (XML:DB), but I am not sure how well it is supported by the various vendors.

Want to take an XML database out for a spin? Some XML databases worthy of mention:

  • eXist is open source and free.
  • Sedna is another free XML database.

My take: Once again, the relational data model shows great resilience in the marketplace. It is entirely possible that XML databases may go the way of the object-oriented databases: useful for some niche problems, but nothing more. We could blame the lack of standards for the failure of XML databases, but SQL was never truly standardized in practice and it still took off.

I like XML. I like CSS. I like XSLT/XPath. But I have always been less certain about XQuery.

XML databases look too much like a solution in search of a problem.

Reference: The W3C publishes the result of their XQuery conformance testing. There is a lot of room for improvement!

How to speed up retrieval without any index?

John Cook gives us a nice recipe to quickly find all squares in a set of integers. For example, given 3, 4, 9, 15, you want your algorithm to identify 4 and 9 as squares.

The naïve way to solve this problem goes as follows:

  1. For each element x…
  2. check whether sqrt(x) is an integer.

This may prove too expensive since the square-root operation must be computed using a floating-point algorithm.

A better way is to look at the last 4 bits of each integer (its value modulo 16). If the integer is a square, then these 4 bits must have value 0, 1, 4, or 9. If you have a random distribution of numbers, this means that you can quickly discard 3 out of 4 numbers.
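
Here is a minimal Java sketch of the filter; the exact check for the surviving candidates is mine, not John’s.

public class SquareFilter {
    static boolean isSquare(long x) {
        long low = x & 15; // cheap test on the last 4 bits (x modulo 16)
        if (low != 0 && low != 1 && low != 4 && low != 9) {
            return false;  // dismissed without computing a square root
        }
        long r = (long) Math.sqrt((double) x); // exact check for survivors
        return r * r == x || (r + 1) * (r + 1) == x; // guard against rounding
    }

    public static void main(String[] args) {
        for (int x : new int[] { 3, 4, 9, 15, 16, 1024, 1025 }) {
            if (isSquare(x)) System.out.println(x); // 4, 9, 16, 1024
        }
    }
}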

It is not immediately obvious that you will speed up the retrieval because inserting this check will add some overhead. However, it doubles the speed according to John. It is even less obvious that checking the last 8 or 16 bits instead of just the last 4 bits can help. John says it does not help in one C++ implementation, but it does in a C# implementation.

This sort of strategy is entirely general. The research question is how much work should you do on fast dismissal? Too much effort toward dismissing lots of candidates might be counterproductive. Too little and your performance might not improve optimally.

Recently I started to wonder whether we could make it multipass: you first dismiss a few candidates with a cheap test, then on the survivors you use a more expensive test and so on. For example, you first check the first 4 bits, and if you cannot dismiss the candidate, you check the next 4 bits and so on. It is not a surprising idea, but figuring out whether it is worth the effort is a research question.

To make my point, I have worked on fast retrieval under the Dynamic Time Warping (DTW) distance, a nonlinear distance measure between time series. The DTW does not satisfy a triangle inequality. It is commonly used as a pattern recognition technique when comparing time series. It was initially designed to compare voice samples, allowing for changes in voice rhythm.

Eamonn Keogh from UCR has come up with a simple but nearly optimal way to compute a lower bound to the DTW between any two time series, called LB_Keogh (named after himself). Just like the cheap check in the John Cook example, this lower bound quickly discards poor candidates. If you are interested, Eamonn has applied LB_Keogh to just about every time series problem you can think of. (Update: one hundred people or more also used LB_Keogh in their work, see comments below.)

I improved over LB_Keogh as follows. If LB_Keogh is not good enough (and only if it is not good enough), I compute a tighter lower bound (called LB_Improved). Surprisingly, in many cases, I can improve the retrieval time by a factor of two or more.

I have published my work as a software library, but also as the following paper:

Daniel Lemire, Faster Retrieval with a Two-Pass Dynamic-Time-Warping Lower Bound, to appear in Pattern Recognition.

This sort of work is much more difficult than it appears. I could have easily made my method look good by optimizing it, while leaving the competing methods unoptimized. By publishing my implementation, I go a long way toward keeping myself honest. If I fooled myself and the reviewers, someone might find out by reviewing my source code.

Full text search in SQL with LuSql

MySQL natively supports full text search; many database engines do. However, few databases can match a dedicated search engine library like Lucene. Moreover, even if you do not need the power of Lucene, sometimes you are forced to work with a data source that does not support full text search at all (like raw CSV files).

It would be nice to be able to combine a true search engine with any database engine.

If you are willing to use Java, then Glen Newton from NRC has the solution: LuSql. It allows you to index with Lucene any database accessible by Java (through JDBC). He says it has been extensively tested. It is open source and free.

To improve your indexes: sort your tables!

Many database indexes, including bitmap indexes, are sensitive to the order of the rows in your table. Many data warehousing practitioners urge you to sort your tables to get better results, especially with Oracle systems. In fact, column-oriented database systems like Vertica are built on sorted tables.

What do I mean by sorting the rows? It turns out that finding the best row reordering is an NP-hard problem. You might as well use something cheap like lexicographical sort as a heuristic. Finding an equally scalable technique that works much better is probably impossible. Ah! But there are different ways to sort lexicographically!
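
Here is a minimal Java sketch of the effect on a toy two-column table (the rows are made up): sorting the rows lexicographically lengthens the runs in the leading column, which is exactly what run-length-based indexes exploit.

import java.util.Arrays;

public class SortRows {
    static int countRuns(String[] rows, int column) {
        int runs = 0;
        String previous = null;
        for (String row : rows) {
            String value = row.split(",")[column];
            if (!value.equals(previous)) runs++;
            previous = value;
        }
        return runs;
    }

    public static void main(String[] args) {
        String[] rows = { "Montreal,F", "Toronto,M", "Montreal,M",
                          "Toronto,F", "Montreal,F", "Toronto,M" };
        System.out.println(countRuns(rows, 0)); // 6 runs in the city column
        Arrays.sort(rows); // lexicographic sort of the rows
        System.out.println(countRuns(rows, 0)); // 2 runs after sorting
    }
}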

We wrote a few papers on this issue including one for DOLAP 2008. Here are the slides of our presentation: