Data processing on modern hardware

If you had to design a new database system optimized for the hardware we have today, how would you do it? And what is the new hardware you should care about? This was the topic of a seminar I attended last week in Germany at Dagstuhl.

Here are some thoughts:

  • You can try to offload some of the computation to the graphics processor (GPU). This works well for some tasks (hint: deep learning) but not so well for generic data processing tasks. There is an argument that if you have the GPU anyhow, you might as well use it even if it is inefficient. I do not like this argument, especially given how expensive the good GPUs are. (Update: Elias Stehle pointed out to me that GPU memory is growing exponentially, which could allow GPUs to serve as smart memory subsystems.)
  • The problem, in general, with heterogeneous data processing systems is that you must, somehow, at some point, move the data from one place to another. Moving lots of data is slow and expensive. However, pushing the processing to the data is appealing. So you could do some of the processing within the memory subsystem (processing in memory, or PIM) or within the network interface controllers. I really like the idea of applying a filter within the memory subsystem instead of doing it at the CPU level. It is unclear to me at this point whether this can be made into generically useful technology. The challenges are great.
  • There are more and more storage technologies, including persistent memory, solid-state drives, and so forth. There is expensive fast storage and cheaper, slower storage. How do you decide where to invest your money? Jim Gray’s 5-minute rule is still relevant, though it needs to be updated. What is this 5-minute rule? It tells you when it is cheaper to keep a page in memory than to keep re-reading it from storage. You might think that slow and inexpensive storage is always cheaper… but being half as fast and half as expensive is no bargain at all! So you always want to compare the price per query: frequent queries justify more expensive but faster storage. (See the back-of-the-envelope sketch after this list.)
  • There is much talk about FPGAs: programmable hardware that can be more power efficient than generic processors at some tasks. Because you have more control over the hardware, you can do nifty things like use all processing units at once in parallel, feed the output of one stage directly into the next, and so forth. Of course, even though it is used more efficiently, the silicon in an FPGA costs more. If your application is a great fit (e.g., signal processing), then it is probably a good idea to go the FPGA route… but it is less obvious to me why data processing in general would be such a fit. And if you only need to offload some of your processing to the FPGA, then you pay a price for moving the data.
  • The networking folks like something called RDMA (remote direct memory access). As I understand it, it allows one machine to directly access the memory of another machine without involving the remote CPU. Thus it should allow you to take several distinct nodes and make them “appear” like a multi-CPU machine. Building software on top of that requires some tooling, and it is not clear to me how good the tooling is right now. You can use MPI if it suits you.
  • There is talk of using “blockchains” for distributed databases. We have been doing a lot of work to save energy and keep computing green, yet that work feels pointless when we are burning CPU cycles like madmen on bitcoin and other cryptocurrencies. People talk about all the new possible applications, but details are scarce. I do not own any bitcoin.
  • We have cheap and usable machine learning; it can run on neat hardware if you need it to. Meanwhile, databases are hard to tune. It seems that we could combine the two to automagically tune database systems. People are working on this. It sounds a bit crazy, but I am actually excited about it and I plan to do some of my own work in this direction.
  • Cloud databases are a big deal. The latest “breakthrough” appears to be Amazon Aurora: you have a cheap, super robust, extendable relational database system. I have heard it described as an “Oracle killer”. The technology sounds magical. Sadly, most of us do not have the scale that Google and Amazon have, so it is less clear how we can contribute. However, we can all use it.
  • Google has fancy tensor processors. Can you use them for things that are not related to deep learning and low-precision matrix multiplications? It seems that you can. Whether it makes sense is another story.
  • People want more specialized silicon, deeper pipelines, and more memory requests in flight. It is unclear whether vendors like Intel are willing to provide any of it. There was some talk about moving toward RISC-V; I am not sure why.
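
To make the 5-minute-rule point concrete, here is a back-of-the-envelope sketch in C++. The formula is the classic break-even interval (device price divided by the random accesses per second it sustains, divided by the price of keeping one page in RAM); all prices and rates below are made-up assumptions, so treat the output as an illustration, not a recommendation.

    // Back-of-the-envelope version of Gray's break-even interval; all prices
    // and rates below are illustrative assumptions, not measurements.
    #include <cstdio>

    // Seconds between accesses at which caching a page in RAM costs the same
    // as re-reading it from the storage device.
    double break_even_seconds(double device_price, double device_iops,
                              double ram_price_per_mb, double pages_per_mb) {
        double dollars_per_io_per_second = device_price / device_iops;
        double dollars_per_cached_page = ram_price_per_mb / pages_per_mb;
        return dollars_per_io_per_second / dollars_per_cached_page;
    }

    int main() {
        const double ram_price_per_mb = 0.005; // assumed: ~$5 per GB of RAM
        const double pages_per_mb = 256;       // 4 KB pages

        // Assumed devices: a $200 disk doing ~200 random reads/s and a
        // $200 SSD doing ~100,000 random reads/s.
        double disk = break_even_seconds(200, 200, ram_price_per_mb, pages_per_mb);
        double ssd = break_even_seconds(200, 100000, ram_price_per_mb, pages_per_mb);

        printf("disk break-even: %.0f s (~%.1f hours)\n", disk, disk / 3600);
        printf("SSD break-even: %.0f s (~%.1f minutes)\n", ssd, ssd / 60);
        // Pages re-read more often than the break-even interval are cheaper
        // to keep in RAM; colder pages are cheaper to leave on the device.
        return 0;
    }

With these made-up numbers, a page on a slow disk pays for its RAM only after hours of inactivity, while on a fast SSD the break-even interval shrinks to a couple of minutes, which is exactly why the rule keeps needing updates as storage gets faster.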

I am missing many things, and I am surely misreporting much of what was said. Still, I will write some more about some of these ideas over the next few weeks.

Some subjects that, as far as I know, were not covered much at the seminar:

  • Data processing on mobile devices or the “Internet of things” was not discussed.
  • I felt that there was relatively little said about large chips made of many, many cores. Vectorization was touched upon, but barely.

The female representation at the event was low in number, but not in quality.

Credit: The Dagstuhl seminar I attended was organized by Peter A. Boncz, Goetz Graefe, Bingsheng He, and Kai-Uwe Sattler. The list of participants includes Anastasia Ailamaki, Gustavo Alonso, Witold Andrzejewski, Carsten Binnig, Philippe Bonnet, Sebastian Breß, Holger Fröning, Alfons Kemper, Thomas Leich, Viktor Leis, myself (Daniel Lemire), Justin Levandoski, Stefan Manegold, Klaus Meyer-Wegener, Onur Mutlu, Thomas Neumann, Anisoara Nica, Ippokratis Pandis, Andrew Pavlo, Thilo Pionteck, Holger Pirk, Danica Porobic, Gunter Saake, Ken Salem, Kai-Uwe Sattler, Caetano Sauer, Bernhard Seeger, Evangelia Sitaridi, Jan Skrzypczak, Olaf Spinczyk, Ryan Stutsman, Jürgen Teich, Tianzheng Wang, Zeke Wang, and Marcin Zukowski. The beer was quite good.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

15 thoughts on “Data processing on modern hardware”

  1. I was looking at sizing a data warehouse on AWS the other day, and it really seems to come down to 10 Gb/s networking, and sharing it for everything: loading, inter-cluster traffic, storage, and query (S3 or EBS). Each hop basically redistributes the data by a different scheme (e.g., the load stream has to be parsed to get the target node for each row), so the network is pretty much guaranteed to be the bottleneck.
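
    To put a rough number on it (a minimal sketch; it assumes the full 10 Gb/s is available to a single stream and ignores protocol overhead):

        // Rough transfer-time arithmetic: assumes the whole 10 Gb/s link is
        // available to one stream and ignores protocol overhead.
        #include <cstdio>

        int main() {
            double link_gbits_per_s = 10.0;                       // network link
            double link_bytes_per_s = link_gbits_per_s * 1e9 / 8; // = 1.25 GB/s
            double data_bytes = 1e12;                             // 1 TB to move

            double seconds = data_bytes / link_bytes_per_s;
            printf("moving 1 TB over 10 Gb/s: %.0f s (~%.1f minutes)\n",
                   seconds, seconds / 60);
            // A single modern SSD, or a handful of cores scanning compressed
            // columns, can go through data faster than 1.25 GB/s, so the
            // network ends up as the limiting resource.
            return 0;
        }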

  2. “so network is pretty much guaranteed to be the bottleneck.”

    Assume that this is true. Then what is the consequence? Do you go for cheap nodes, given that, in any case, they will always be starved? Or do you go for powerful nodes to minimize network usage?

    1. The query end depends heavily on SSD cache; basically terabytes of hot data. This stuff is often joined and aggregated locally, so the network is less of a factor there (part of my job is to maintain this tendency).

      SSD is only available on larger SKUs, so using many small nodes would likely decrease query performance while increasing load or insert throughput. However, there is also a split strategy of having storageless “compute” nodes that handle loading while the main cluster handles storage and querying.

  3. I feel like an even simpler thing we are lacking is the ability to easily coordinate all the cores in a single machine to work on the same data. The cores tend to share a common L3, and it seems agreed that getting data from RAM to L3 is one of the bottlenecks, but there don’t seem to be any efficient low-level ways to help the cores share their common resources rather than compete for them.

    This may seem like a silly complaint, since OSes have smoothly managed threads and processes across cores for decades, but for a lot of problems this is far too high-level, and the coordination is far too expensive. For example, I’d like to be able to design “workflows” where each core concentrates on one stage of a multistage process, and where data can be passed between cores using the existing cache-line coherence mechanisms.

    Is there anything out there that does this? Or otherwise allows inter-core coordination at speeds comparable to L3 access?
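
    For illustration, here is roughly what such a staged handoff can look like today with ordinary atomics: two threads form a two-stage pipeline and pass items through a single-producer/single-consumer ring buffer, i.e., through the shared cache via normal coherence traffic. This is only a sketch of the idea (no core pinning, no batching), not a hardware-assisted mechanism.

        #include <atomic>
        #include <cstdio>
        #include <thread>
        #include <vector>

        constexpr size_t kCapacity = 1024; // power of two

        // Lock-free single-producer/single-consumer ring buffer.
        struct SpscQueue {
            std::vector<int> buffer = std::vector<int>(kCapacity);
            std::atomic<size_t> head{0}; // advanced by the consumer
            std::atomic<size_t> tail{0}; // advanced by the producer

            bool push(int value) {
                size_t t = tail.load(std::memory_order_relaxed);
                if (t - head.load(std::memory_order_acquire) == kCapacity)
                    return false; // full
                buffer[t % kCapacity] = value;
                tail.store(t + 1, std::memory_order_release);
                return true;
            }
            bool pop(int &value) {
                size_t h = head.load(std::memory_order_relaxed);
                if (h == tail.load(std::memory_order_acquire))
                    return false; // empty
                value = buffer[h % kCapacity];
                head.store(h + 1, std::memory_order_release);
                return true;
            }
        };

        int main() {
            SpscQueue queue;
            const int n = 100000;
            std::thread producer([&] { // stage 1: "parse"
                for (int i = 0; i < n; i++)
                    while (!queue.push(i)) { /* spin: queue full */ }
            });
            long long sum = 0;
            std::thread consumer([&] { // stage 2: "aggregate"
                int v;
                for (int received = 0; received < n;)
                    if (queue.pop(v)) { sum += v; received++; }
            });
            producer.join();
            consumer.join();
            printf("sum = %lld\n", sum);
            return 0;
        }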

    1. It seems that our Knights Landing box has some fancy core-to-core communication. You have to have that when the number of cores goes way up.

      And what about AMD’s Infinity Fabric and Intel’s mesh topology for Skylake?

      1. I’d guess that the hardware support is already present. What’s lacking is a reasonable interface that allows access from userspace.

        Consider the MONITOR/MWAIT pair: http://blog.andy.glew.ca/2010/11/httpsemipublic.html. It’s always seemed like there should be a way to use these to do high performance coordination between cores, and apparently they are finally available outside the kernel on KNL, but I’ve never seen a project that makes use of them.

        Alternatively, consider the number of papers you read that provide good whole-processor numbers versus those that provide only single-threaded numbers. Almost everything that shows multicore results does it by running multiple parallel instantiations of the same code. Are there really no further optimizations possible?

  4. Most of the time, you have to move data from main memory to accelerator memory, but a lot of new platforms allow for cache-coherent unified memory that blurs the line between CPU core and accelerator; the accelerator memory, if any, is more of a cache than anything else.

    The fundamental problem is not how to move lots of data, but how to move enough to keep the GPU/FPGA busy.
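
    Some rough numbers behind that (both bandwidth figures are approximate and device-dependent):

        // Rough ratio of accelerator memory bandwidth to host link bandwidth;
        // both figures are approximate and device-dependent.
        #include <cstdio>

        int main() {
            double pcie3_x16_gbytes_per_s = 16.0; // host-to-device link, roughly
            double gpu_hbm_gbytes_per_s = 900.0;  // on-board memory, roughly
            printf("the GPU reads its own memory ~%.0fx faster than the host "
                   "can feed it\n", gpu_hbm_gbytes_per_s / pcie3_x16_gbytes_per_s);
            // Unless each byte shipped across the bus is reused many times,
            // the accelerator waits for data.
            return 0;
        }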

  5. Sorry, I don’t get this exponentiality argument. GPU memory is much faster and much more expensive than commodity RAM. It is just not feasible to use GPU memory to store things: you can do it much more cheaply without a GPU.

        1. There is also NVLink (NVIDIA Volta, IBM POWER9) and CPU+GPU products (e.g., AMD Kaveri, nearly all smartphone processors).

          The discussion should not be about interconnects or unified memory, but about how much latency there is when accessing non-local memory and how you can hide it.

          1. The last time I checked on hybrid GPU/CPU solutions, I found them severely underpowered compared to the lineup of NVIDIA products.

            I don’t get your comment about latency and interconnects. The quality of the interconnect directly affects latency. Also having a fancy interconnect adds complexity.

    1. They most certainly do not record it (they are not presentations but discussions), but there will be a written (free) report. Moreover, I will write more about some of these topics in the coming weeks and months.

  6. I agree wholeheartedly on the issue of moving things into the GPU from RAM. But… it is now possible to transfer directly from SSD to GPU without going through the CPU or main memory. The GPU maps a portion of its memory into the PCIe bus. See pg-strom (though it doesn’t seem to show much activity after 2016).

    A problem: GPUs with lots of memory are quite expensive. An NVIDIA Tesla with 24 GB costs $500-$3000. And the way to use GPUs efficiently is with proprietary interfaces.
