Who is going to need a database engine in 2020?

Given the Big Data phenomenon, you might think that everyone is becoming a database engineer. Unfortunately, writing a database engine is hard:

  • Concurrency is difficult. Whenever a data structure is modified by different processes or threads, it can end up in an inconsistent state. Database engines cope with concurrency in different ways: e.g., through locking or multiversion concurrency control. While these techniques are well known, few programmers have had a chance to master them.
  • Persistence is also difficult. You must somehow keep the database on a slow disk, while keeping some of the data in RAM. At all times, the content of the disk should be consistent. Moreover, you must avoid data loss as much as possible.
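To make the two difficulties above concrete, here is a minimal Python sketch of a toy key-value store. It is an illustration under stated assumptions, not how any real engine works: the names `TinyStore` and `demo` are hypothetical, concurrency is handled with a single coarse lock (the simplest form of locking, not multiversion concurrency control), and persistence is an append-only log that is flushed to disk before memory is updated and replayed on startup.

```python
import json
import os
import tempfile
import threading


class TinyStore:
    """Hypothetical toy key-value store: one lock, one append-only log."""

    def __init__(self, log_path):
        self._lock = threading.Lock()  # serializes all readers and writers
        self._data = {}                # the in-RAM copy of the database
        self._log_path = log_path
        self._replay()
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        # Rebuild the in-memory state from the log after a restart.
        if not os.path.exists(self._log_path):
            return
        with open(self._log_path, encoding="utf-8") as f:
            for line in f:
                try:
                    key, value = json.loads(line)
                except (ValueError, TypeError):
                    break  # unreadable record (e.g. torn by a crash): stop
                self._data[key] = value

    def _apply(self, key, value):
        # Caller must hold the lock. Log first, then update memory,
        # so the disk never lags behind what readers can observe.
        self._log.write(json.dumps([key, value]) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self._data[key] = value

    def put(self, key, value):
        with self._lock:
            self._apply(key, value)

    def increment(self, key, amount=1):
        # Read-modify-write under the lock, so no update is lost.
        with self._lock:
            new_value = self._data.get(key, 0) + amount
            self._apply(key, new_value)
            return new_value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def close(self):
        with self._lock:
            self._log.close()


def demo():
    """Four threads increment one counter; then the store is reopened."""
    path = os.path.join(tempfile.mkdtemp(), "tiny.log")
    store = TinyStore(path)

    def worker():
        for _ in range(25):
            store.increment("hits")

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    store.close()

    reopened = TinyStore(path)  # state comes back from the log alone
    result = reopened.get("hits")
    reopened.close()
    return result
```

Even this toy shows the trade-offs: the single lock keeps the data consistent but allows no parallelism, and syncing the log on every write is safe but slow. Real engines spend most of their complexity relaxing exactly these two choices.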

So, developers almost never write their own custom engines. Some might say that this is an improvement over earlier times, when developers had to craft everything by hand, down to the B-trees. The result was often expensive, buggy projects.

However, consider that even a bare-metal language like C++ is getting support for concurrency and threads, along with esoteric features like regular expressions. Moreover, Oracle's efforts to kill the Java Community Process will push Java developers to migrate to better languages.

Meanwhile, in-memory databases are finally practical and inexpensive. Indeed, whereas a 16 GB in-memory database was insane ten years ago, you can order a desktop with 32 GB of RAM from Apple’s web site right now. Moreover, memory capacity grows exponentially: Apple will sell desktops with 1 TB of RAM in 2020. And researchers predict that non-volatile Resistive RAM (RRAM) may replace DRAM. Non-volatile internal memory would make persistence much easier.

But why would you ever want to write your own database engine?

  • For speed, some engines force you to use nasty things like stored procedures. It is a drastically limited programming model.
  • The mismatch between how the programmer thinks and how the database engine works can lead to massive overhead. As crazy as it sounds, I can see a day when writing your own engine will save time. Or, at least, save headaches.
  • Clever programmers can write much faster specialized engines.
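As a rough illustration of what a specialized engine might look like, here is a small Python sketch. It assumes a workload that this post does not describe: a single integer column that is loaded once, never modified, and only queried for membership and range counts. The class name `SortedIntColumn` and its methods are hypothetical. Under those assumptions, the whole "engine" reduces to a sorted array and binary search, with no locking, logging, or query planner.

```python
from array import array
from bisect import bisect_left, bisect_right


class SortedIntColumn:
    """Hypothetical read-only engine for one column of 64-bit integers."""

    def __init__(self, values):
        # Sort once at load time; the column is never modified afterwards,
        # so no concurrency control is needed for readers.
        self._values = array("q", sorted(values))

    def __len__(self):
        return len(self._values)

    def contains(self, value):
        """True if value is stored, found by binary search."""
        i = bisect_left(self._values, value)
        return i < len(self._values) and self._values[i] == value

    def count_between(self, low, high):
        """Number of stored values v with low <= v <= high."""
        if low > high:
            return 0
        return bisect_right(self._values, high) - bisect_left(self._values, low)


if __name__ == "__main__":
    column = SortedIntColumn([5, 1, 9, 3, 3, 7])
    print(column.count_between(3, 7))
    print(column.contains(4))
```

The design choice is to give up generality (no updates, one column, one data type) in exchange for a compact layout and logarithmic-time queries. Whether that beats a general-purpose engine depends entirely on the workload.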

Obviously, programmers will need help. They will need great libraries to help with data processing, data compression, and data architecture. Oh! And they will need powerful programming languages.

Published by

Daniel Lemire

A computer science professor at the Université du Québec (TELUQ).

12 thoughts on “Who is going to need a database engine in 2020?”

  1. Wondering what language you recommend as alternatives to Java?

    Scala springs to mind of course as an obvious choice for reforming Java devs.

    Could it be something more unusual like Haskell, Erlang, O’Caml, Scheme, Google Go or the likes of Ruby/Python/JS?

  2. I think that the C++ of today has nothing in common with the one that lost the battle against Java years ago. It only needs standard libs for standard problems (threads, sockets, …). There are a few good libs like Qt, but they are not standard.
    I believe something like JSRs for C++ could be a solution…

  3. Just as a sidenote, regular expressions are not an esoteric feature, but a standard tool for many programmers (on Unix systems: every programmer).

  4. Especially data compression, and in particular content-aware compression methods designed specifically for structured data.

    Database volumes are growing faster than Moore’s law, but the state of the art of database compression has never kept pace. In the absence of systematic methods designed for database records, decades-old, one-dimensional, conventional methods, intended for long-obsolete hardware, are instead being brought to bear ad-hoc on database data.

  5. @Shaw

    Wondering what language you recommend as alternatives to Java?

    I use Python, Java and C++ myself. They are all getting better all the time.

    Scala is interesting, but it feels challenging.

    I won’t make predictions except to say that I expect new and more powerful programming languages to replace the existing ones. I’ll be pretty sad if in 2020, I’m still primarily using Python, Java and C++. There is so much innovation out there that something strong has to emerge out of it.

    For reference, hardly anyone was using Python and Ruby ten years ago. C# didn’t even exist. We are not standing still. (Though some would object that we have never improved over Lisp. I won’t get into this debate.)

    @Zeno

    regular expressions are not an esoteric feature

    No. They are not. I’m sorry I wasn’t clear: it was irony. My point was that 20 years ago, regular expressions would have appeared as an esoteric feature whereas it is now taken for granted. Thus, programmers are much more powerful than they were 20 years ago. A program that had taken months to write can now be written much faster. I conjecture that the trend will continue.

    @Paul

    Even if we switch to RAM storage, the concepts will remain in accessing data across a network.

    Good point.

    I recall a lot of database work being figuring out how to make sure you’re accessing contiguous chunks whenever possible to avoid having to wait around for a spinning platter to get to the next bit you need. Storage advances look to be moving away from that model, which will make writing db code and interacting with them that much easier

    Yes. I agree.

  6. There will always be this shift as “db” problems start fitting in memory and “impossible” problems turn into feasible db problems. Thus there will always be a need for quick, in memory solutions; for general purpose databases; and for specialized solutions. Even if we switch to RAM storage, the concepts will remain in accessing data across a network.

    The one thing I do see changing is the quirks of databases necessitated by spinning disks. From my DB class back in college, I recall a lot of database work being figuring out how to make sure you’re accessing contiguous chunks whenever possible to avoid having to wait around for a spinning platter to get to the next bit you need. Storage advances look to be moving away from that model, which will make writing db code and interacting with them that much easier when, e.g., you don’t need to pick a primary variable to index on, but can have every variable indexed at identical speeds.

  7. @Stanley

    All the database engines currently listed in the Wikipedia article on database engines are related to MySQL. Seems a bit biased to me.

    As for a review of database engines, that would make a blog post of its own. Maybe later… 😉

  8. Given the mention of Java and databases, I’m almost surprised to see no one mention Clojure yet. The basics of the STM system should be familiar to many, while the reliance on the JVM should help some move towards it as well.

    I’m only just getting into it now and really liking what I see. Take a look if you haven’t heard of it: http://clojure.org

  9. Daniel,
    the title of your post is “Who is going to need a database engine in 2020?” But then in the post you go on to talk about a different (albeit related) issue: who would want to write their own database engine? As you point out, that’s kind of crazy (unless you work at Google/Twitter/Amazon/Facebook). There are already many distinct database engines available (SQL and NoSQL, in-memory and on-disk, centralized and distributed). It feels like reinventing the wheel.
    That said, I think the original question (title) is much more interesting. I hope you write a post on that. We can discuss why databases are not used for data analysis/analytics.

  10. Just what is “Big Data”?

    I am curious about the performance of the H2 database. Given the preference for running in-memory, the low impedance when embedded in a Java application, and the increasing size of main memory – when does the problem become too big for a single instance? Put differently, what fraction of “Big Data” problems can be handled in-memory on a single box, using a very fast single-instance SQL database?

    The H2 database is *very* fast when embedded in a Java application, and operating in-memory. I believe you can also write your stored procedures (when needed) in Java.

    If we can partition the problem, we could fire up a herd of single-instances to operate on segments of the data. Using an SQL database we can easily do some fairly complex transformations. How does this compare in performance to non-SQL databases?

    If this approach works, there is no (or less) need to write custom engines.

    Expanding your question, more than answering. 🙂

    The JVM is also one of my concerns. I distrust Oracle. Can we port the JVM used by Google on Android?
