I started 2009 with an interest in Web 2.0 OLAP and collaborative data processing. The field of collaborative data processing has progressed tremendously. Last year, we got Google Fusion Tables, and data warehousing products are getting more collaborative.
In 2010, my research might focus more on database theory—while maintaining a strong experimental bias. Specifically, I am currently thinking about:
- Lightweight compression. The goal of lightweight compression is to save CPU cycles, not storage! Of course, the CPU architecture is critical. Thus, you have to worry about instruction-level parallelism. Consequently, measuring the quality of the compression by the compression ratio is wrong.
- Row reordering. Some compression formats, such as run-length encoding, are sensitive to the order of the objects in a database. Finding the order that compresses best is NP-hard, so we must rely on heuristics. The efficiency of these heuristics depends on the compression format. I will continue my earlier work on this topic.
- Concurrency and parallelism. Some believe that multicore CPUs can be used to compress data even more aggressively. This might be misguided. Instead, we must focus on embarrassingly parallel problems. Already, we can scan a large in-memory table quite fast using several CPU cores. In 2010, we should empower our users so that they can explore their data more freely.
- String hashing. I have argued on this blog that universal hashing of long strings is impossible. While hashing strings is textbook material, our understanding of hashing can be improved further.
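
To make the row-reordering point concrete, here is a minimal Python sketch of how row order changes the number of runs that a run-length encoder has to store. The toy table, the helper names `count_runs` and `total_runs`, and the choice of a plain lexicographic sort are all assumptions made for illustration. Sorting is only one simple heuristic, and nothing here is meant to represent the specific methods studied in the earlier work mentioned above.

```python
from itertools import groupby


def count_runs(column):
    """Count the runs of identical consecutive values in one column."""
    return sum(1 for _ in groupby(column))


def total_runs(rows):
    """Total number of runs when each column is run-length encoded on its own."""
    if not rows:
        return 0
    # zip(*rows) transposes the list of rows into a sequence of columns.
    return sum(count_runs(column) for column in zip(*rows))


# Hypothetical toy table with two columns: (city, product).
rows = [
    ("Montreal", "apple"),
    ("Toronto", "pear"),
    ("Montreal", "pear"),
    ("Toronto", "apple"),
    ("Montreal", "apple"),
    ("Toronto", "pear"),
]

runs_before = total_runs(rows)         # original row order
runs_after = total_runs(sorted(rows))  # rows sorted lexicographically

print(runs_before, runs_after)  # prints "10 6"
```

On this toy table, sorting cuts the total from 10 runs to 6. All of the gain comes from the first column, while the second column keeps its 4 runs, which hints at why the best ordering depends on both the data and the compression format.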
Further reading: Search Questions for 2010: What’s On My Mind by Daniel Tunkelang