When doing data engineering, it is common for engineers to want to first load all of the data in memory before processing it. If you have sufficient memory and the loaded data is not ephemeral, or if your volumes are small, it is a sensible approach. After all, that is how a spreadsheet typically works: you load the whole spreadsheet in memory.
But what if your application falls into the “extract-transform-load” category: as you scan the data, you discard it? What if you have large volumes of data? Then I consider the “load everything in memory at once” approach a performance anti-pattern when you are doing high-performance data engineering.
The most obvious problem with a big load is scalability. As your data inputs get larger and larger, you consume more and more memory. Though it is true that over time we get more memory, we also tend to get more processing cores, and RAM is a shared resource. Currently, in 2021, some of the most popular instance types on AWS have 4 GB of RAM per virtual CPU. If you have the means, AWS will provide you with memory-optimized virtual nodes that have 24 TB of RAM. However, these nodes have 448 logical processors sharing that memory.
Frequent and large memory allocations are somewhat risky. A single memory allocation failure may force an entire process to terminate. On a server, it might be difficult to anticipate what other processes do and how much memory they use. Thus each process should seek to keep its memory usage predictable, if not stable. Simply put, it is nicer to build your systems so that, as much as possible, they use a constant amount of memory irrespective of the input size. If you are designing a web service, and you put a hard limit on the size of a single result, you will help engineers build better clients.
You may also encounter various limits which reduce your portability. Not every cloud framework will allow you to upload a 40 GB file at once, without fragmentation. And, of course, on-device processing in a mobile setting becomes untenable if you have no bound on the data inputs.
But what about performance? If you have inefficient code (maybe written in JavaScript or bad C++), then you should have no worries. If you are using a server that is not powerful, then you will typically have little RAM, and a big load is a practical problem irrespective of performance: you may just run out of memory. But if you are concerned with performance and you have lots of resources, the story gets more intricate.
If you are processing the data in tiny increments, you can keep most of the data that you are consuming in CPU cache. However, if you are using a big-load, then you need to allocate a large memory region, initialize it, fill it up and then read it again. The data goes from the CPU to the RAM and back again.
The process is relatively expensive. To illustrate the point, I wrote a little benchmark. I consider a function which allocates memory and populates an array of integers with the values 0, 1, 2…
int * content = new int[volume/sizeof(int)];
init(content, volume);
delete[] content;
It is a trivial function: anything useful you would do that involves memory allocation is likely far slower. So how fast is this fast function? I get the following numbers on two of my machines. I pick the best results within each run.
| | 1 MB | 1 GB |
|---|---|---|
| alloc-init (best) – AMD Rome Linux | 33 GB/s | 7 GB/s |
| alloc-init (best) – Apple M1 | 30 GB/s | 9 GB/s |
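For reference, here is a minimal, self-contained sketch of how such numbers could be measured. It is an illustration under assumptions of mine, not my exact benchmark: the init helper matches the snippet above, the 10-trial loop and the read-back (to discourage the compiler from eliding the work) are choices I made for this sketch, and real results depend on the allocator, the compiler and the operating system.

#include <chrono>
#include <cstddef>
#include <cstdio>

// hypothetical init(): fill the array with the values 0, 1, 2, ...
static void init(int *content, size_t volume) {
  for (size_t i = 0; i < volume / sizeof(int); i++) {
    content[i] = int(i);
  }
}

int main() {
  const size_t volumes[] = {size_t(1) << 20, size_t(1) << 30}; // ~1 MB and ~1 GB
  for (size_t volume : volumes) {
    double best = 0;
    for (int trial = 0; trial < 10; trial++) {
      auto start = std::chrono::steady_clock::now();
      int *content = new int[volume / sizeof(int)];
      init(content, volume);
      // touch the data so the compiler cannot easily elide the work
      if (content[volume / sizeof(int) - 1] < 0) { printf("impossible\n"); }
      delete[] content;
      auto stop = std::chrono::steady_clock::now();
      double seconds = std::chrono::duration<double>(stop - start).count();
      double gbps = double(volume) / seconds / 1e9;
      if (gbps > best) { best = gbps; } // keep the best run
    }
    printf("%zu bytes: %.1f GB/s\n", volume, best);
  }
  return 0;
}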
Simply put, allocating memory and pushing data into it gets slower and slower with the volume. We can explain it in terms of CPU cache and RAM, but the principle is entirely general.
You may consider 7 GB/s or 9 GB/s to be a good speed, and indeed these processors and operating systems are efficient. However, consider that this is only your starting point. We haven’t read the data yet. If you need to actually “read” that data, let alone transform it or do any kind of reasoning over it, you must then bring it back from RAM to cache. So you pay for the full cache-to-RAM and RAM-to-cache round trip. In practice, it is typically worse: you load the whole huge file into memory, then you allocate memory for an in-memory representation of the content, then you rescan the file and put it in your data structure, and then you scan your data structure again. Unavoidably, your speed will start to fall… 5 GB/s, 2 GB/s… and soon you will be in the megabytes per second.
Your pipeline could be critically bottlenecked because it is built out of slow software (e.g., JavaScript code) or because you are relying on slow networks and disks. To be fair, if the rest of your pipeline runs in the megabytes per second, then memory allocation might as well be free from a speed point of view. That is why I qualify the big load as an anti-pattern specifically for high-performance data engineering.
In a high-performance context, for efficiency, you should stream through the data as much as possible, reading it in chunks that are roughly the size of your CPU cache (e.g., megabytes). The best chunk size depends on many parameters, but it is typically neither tiny (kilobytes) nor large (gigabytes). If your system’s architecture rules out such an optimization, you may face hard limits on your performance later.
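To make the idea concrete, here is a minimal sketch of chunked reading. It is an illustration under assumptions of mine: process_chunk is a hypothetical placeholder for whatever transformation your pipeline applies, and the 1 MB chunk size is only a reasonable default.

#include <cstddef>
#include <fstream>
#include <vector>

// hypothetical placeholder for the per-chunk work
void process_chunk(const char *data, size_t count);

// read and process a file in cache-sized chunks so that each byte
// travels from disk (or page cache) to the CPU roughly once
void stream_file(const char *path, size_t chunk_size = size_t(1) << 20) {
  std::ifstream in(path, std::ios::binary);
  std::vector<char> buffer(chunk_size);
  while (in) {
    in.read(buffer.data(), std::streamsize(buffer.size()));
    size_t count = size_t(in.gcount()); // may be short on the last chunk
    if (count == 0) { break; }
    process_chunk(buffer.data(), count);
  }
}

The memory footprint stays constant regardless of the file size, which is the property discussed above.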
It is best to view the processor as a dot at the middle of a sequence of concentric circles. The processor is hard of hearing: it can only communicate with those in the inner circle. But there is limited room in each circle. The further you are from the processor, the more expensive it is for the processor to talk to you, because you first need to move to the center, possibly pushing out some other folks. The room close to the processor is crowded and precious. So if you can, you should have your guests come into the center once, and then exit forever. What a big load tends to do is to bring people into the inner circle, then send them out to some remote circle, and then bring them back again into the inner circle. It works well when there are few guests, because everyone gets to stay in the inner circle or nearby, but as more and more people come in, it becomes less and less efficient.
It does not matter how your code looks: if you need to fully deserialize all of a large data file before you process it, you have a case of big load. Whether or not you use fancy techniques such as memory-mapped files does not change the equation. Some parameters, like the size of your pages, may help, but they do not affect the core principles.
Adding more memory to your system is likely to make the problem relatively worse. Indeed, systems with lots of memory can often prefetch or buffer input/output accesses. It means that their best achievable throughput is higher, and thus the relative penalty of the big load is worse.
How may you avoid the big-load anti-pattern?
- Within the files themselves, you should have some kind of structure so that you do not need to consume the whole file at once when it is large. This comes naturally with popular formats such as CSV, where you can often consume just one line at a time (see the sketch after this list). If you are working with JSON data files, you may want to adopt JSON streaming for an equivalent result. Most data-engineering formats will support some concept of chunk or page to help you.
- Consider splitting your data. If you have a database engine, you may consider sharding. If you are working with large files, you may want to use smaller files. You should be cautious not to fall for the small-load anti-pattern. E.g., do not store only a few bytes per file and do not fragment your web applications into 400 loadable resources.
- When compressing data, try to make sure you can uncompress small usable chunks (a few megabytes).
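As an illustration of the line-at-a-time idea, here is a rough sketch of CSV consumption. It is a toy example under assumptions of mine: the file is assumed to have a numeric second column, and quoting and escaping are ignored.

#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>

// sum the (assumed numeric) second column of a CSV file, one line at a
// time: memory use is bounded by the longest line, not the file size
double sum_second_column(const char *path) {
  std::ifstream in(path);
  std::string line;
  double total = 0;
  while (std::getline(in, line)) {
    std::stringstream fields(line);
    std::string cell;
    std::getline(fields, cell, ','); // first column (ignored)
    if (std::getline(fields, cell, ',')) {
      total += std::strtod(cell.c_str(), nullptr);
    }
  }
  return total;
}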
Note: If you are an experienced data engineer, you might object that everything I wrote is obvious. I would agree. This post is not meant to be controversial.
That’s often too small, although if you’re talking about L3 caches on many present-day CPUs, it’s probably okay. The cost of context switching and I/O is significantly worse than RAM speed on most setups, so you often want to tune for I/O as opposed to CPU cache, such as making large read requests to disk (particularly if it’s a hard disk).
What’s probably more important is doing async I/O, so your processing can occur whilst the disk is pulling in data for you. Also, avoiding random I/O, as sequential I/O is generally fastest (i.e. be careful with reading too many things at the same time).
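For illustration, a double-buffering sketch along those lines might look like the following; it is a rough sketch, not production code, and process_chunk is again a hypothetical placeholder for the per-chunk work.

#include <cstddef>
#include <fstream>
#include <future>
#include <utility>
#include <vector>

// hypothetical placeholder for the per-chunk work
void process_chunk(const char *data, size_t count);

// read the next chunk on a background thread while the current chunk
// is being processed, so I/O and computation overlap
void overlapped_stream(const char *path, size_t chunk_size = size_t(1) << 20) {
  std::ifstream in(path, std::ios::binary);
  std::vector<char> current(chunk_size), next(chunk_size);
  auto read_chunk = [&in](std::vector<char> &buf) -> size_t {
    in.read(buf.data(), std::streamsize(buf.size()));
    return size_t(in.gcount());
  };
  size_t count = read_chunk(current);
  while (count > 0) {
    // kick off the next read in the background
    auto pending = std::async(std::launch::async, read_chunk, std::ref(next));
    process_chunk(current.data(), count); // overlaps with the read
    count = pending.get();
    std::swap(current, next);
  }
}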
Since I/O costs typically dwarf RAM speeds, large files are better than smaller ones (fewer random accesses and less overhead from opening files) – it’s actually why many games go to the effort of packing all their assets into archives, because it’s faster to deal with one big file than many smaller ones.
Yes, using many tiny files would be a performance anti-pattern of its own (which I refer to as “small-load”).
Note that in my context, I assume a high-performance SSD. If you are using a spinning drive, then you may be limited to a few megabytes per second (if that) and my whole post is irrelevant.
It’s not just many tiny files, any splitting can slow you down because each file you have incurs overhead with opening/closing it (and can reduce sequential accesses because filesystems are less likely to place these specific files sequentially on disk). Granted, if the files are large enough, the cost is pretty small, but it doesn’t mean you should arbitrarily introduce splitting with the assumption that it improves performance.
Most SSDs are limited by SATA bandwidth, so around 500 MB/s max. If you’re using a cloud hosting platform, you’ll also be limited by network bandwidth, so possibly even less.
Even if you’re using a high-performance modern NVMe drive, you’re still stuck with needing to do kernel traps when issuing reads.
As such, the point stands that you shouldn’t optimise based on CPU cache, but on whatever is most efficient for I/O. 1GB is likely too big, but 1MB is likely too small.
SSDs obviously have different performance characteristics to HDDs, but fortunately, for sequential access, the guidelines for the two are the same.
I take your point that many people’s IO is limited to megabytes per second. Certainly, my home Internet peaks at about 50 MB/s (download).
On the latest servers I purchased, I am measuring bandwidths for sequential reads that are > 3 GB/s (using standard methodologies). AWS provides gigabytes per second of network bandwidth (I measured over 4 GB/s for large files in node-to-node transfer).
AWS scales available network bandwidth based on your instance size. Only the few largest instance sizes will achieve GB/s network bandwidth.
Also note that you’re measuring sequential I/O. This is usually achievable if everything is in one file, however, if it’s spread out across multiple files, you’re less likely to get fully sequential I/O.
That’s true. Small instances also have relatively little RAM (e.g., 4 GB per core).
This is absolutely true and worth noting.
Memory-mapping the file (e.g. with mmap) can give you the best of both worlds. You can code as though the file were all in memory, which is simpler, and you can even backtrack or peek forward if necessary. But the computer doesn’t have to copy the whole file into RAM.
(Depending on the OS, you may lose some performance because the file is read in smaller chunks, possibly as small as 4KB. But in practice the kernel will usually see your access pattern and read ahead in bigger chunks, which btw gives you async I/O for free.)
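For readers who have not used it, a minimal POSIX sketch of this approach might look like the following; it is an illustration with abbreviated error handling, and the newline count is just a stand-in for real processing.

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// map a file and scan it without first copying it into a user buffer;
// pages are faulted in (and read ahead) by the kernel on demand
size_t count_newlines(const char *path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) { return 0; }
  struct stat st;
  if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 0; }
  const char *data = (const char *)mmap(nullptr, size_t(st.st_size),
                                        PROT_READ, MAP_PRIVATE, fd, 0);
  if (data == (const char *)MAP_FAILED) { close(fd); return 0; }
  size_t count = 0;
  for (size_t i = 0; i < size_t(st.st_size); i++) {
    if (data[i] == '\n') { count++; }
  }
  munmap((void *)data, size_t(st.st_size));
  close(fd);
  return count;
}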
Memory mapping is a great approach, but underneath it is limited by the same fundamental principles. It does not bypass the CPU cache and RAM.
(I am aware that you know this, but I am clarifying it.)
I would argue that memory mapping is fundamentally different than the big allocate-read-process approach. Effectively it is like your suggested streaming approach: the memory only takes one trip to the inner circle.
Consider the case where the mapped pages aren’t in the page cache. You map the pages, and only a small structure is created in the kernel recording the mapping (a so-called VMA in the case of Linux). Then you start processing the region: each time you get to a new 4k page, the kernel will map in that page, including reading it from storage. Then, in userspace, the page will be hot in cache.
Some of the details may vary (e.g., the kernel may “read ahead” more than one 4k page to make the IO faster), but these generally make things better, not worse.
You can do the same analysis for the case where the pages are already hot in the page cache, and again this method is competitive with streaming.
Or, as you usually prefer, you can test this. I have found the mmap approach generally somewhat faster than the read()-based streaming approach, but this varies a bit by OS, kernel version, etc. Here’s a bit more on mmap vs read() — more from the side of why mmap isn’t just way better than read(), but rather competitive, with the winner decided by smaller factors.

@Travis
Suppose that I have a CSV file. It is huge. I memory map it. Then I load the CSV file into a tabular data structure, with parsed entries (think Excel spreadsheet).
How does it help that the file was memory mapped?
But I agree, of course, that if you are reading the file, say line by line, a memory map may be effectively equivalent to code that looks like line-by-line reading. (I would expect the memory map to be less efficient in practice though it does not have to be.)
What I am warning people against is the belief that memory mapping is somehow intrinsically different from loading the data into memory.
Sure, let’s be explicit about it using your example.
Let’s read a spreadsheet of size N on a system with a single cache level of size C. As I understand it, your point is that “loading” (reading from disk and/or page cache) the whole file prior to processing the entire file is slow because most bytes in the file must be brought into RAM at least twice, once when the file is loaded and once when it is processed.

To be even more explicit, in pseudo-code, the “big load” scenario looks something like this, I think:
// allocate a buffer of size N (size of the file)
// we assume here this doesn't do much
// work, it doesn't touch N bytes!
byte buffer[] = allocate(file.size);
// read the entire file into the buffer
// this necessarily brings all N bytes of the
// file into RAM, but only at most C bytes
// can remain in cache, since that's the size
// of the cache
read(file.path, buffer);
// now, parse the array of bytes into the
// in-memory DOM or whatever you want
// to call it. This necessarily touches all N
// bytes again, so brings at least N - C bytes
// into cache. We assume N >> C in the "big"
// scenario, so N - C ~= N.
spreadsheet_DOM = parse(buffer);
Does that look right?
As an alternative, you suggest that you process the file in chunks, which might look like:
byte buffer[C / 2]; // a buffer half the size of cache
// start with empty DOM
spreadsheet_DOM = empty();
for (c = 0; c < file.size; c += C / 2) {
// read C / 2 bytes
read(file.path, buffer);
// then parse the new bytes in the buffer into cells
// and add those cells to the spreadsheet
spreadsheet_DOM.add_cells(buffer);
}
Here, we only bring bytes into memory once, in the read call, and they stay in cache for the subsequent add_cells call. Performance will be much better (perhaps surprisingly, even more than 2x better, as your test shows).
So we are calling the second case “fundamentally different” than the first right? You say that using mmap will be like the first in the fundamental sense of memory movement, and I say it will be like the second, right?
We might as well agree on that before looking at the mmap-the-entire-file scenario.
Sorry, when I said “must be brought into RAM at least twice” I meant “brought from RAM to the core at least twice”.
No I do not agree with the position you attribute to me. I agree with the position you attribute to yourself.
You are missing the step here where you actually use the data. You are only deserializing. So try adding a step where you sum up the columns for example.
I see. Yes, then I agree: if you have more than 1 processing operation (including here de-serialization in that category) that wants to iterate over the in-memory data (including output from a previous step), mmap won’t help you avoid bringing the entire data in at least once. It only helps for the phase directly following the mmap’ing (which I think is all you can ask from it).
In that sense it is different: if you can organize your processing, including deserialization, into a single pass, mmap-ing the entire file will allow the whole thing to be streaming, while read()-ing into a large buffer won’t. So my point stands if you are able to write a deserialize_and_sum() operation that works on a large buffer in a single pass.

Indeed, so much so that even for data that is processed by looking at other data in the same file (in your spreadsheet example, another column), we look at how far ahead/behind we need to look, read a window of data only that wide, and then read single columns and process what is in memory.
We typically perform 2-5 different mathematical data processes on each column of data, and each is performed in its own thread, so we observe some multi-processing taking place even against a slow disk read and write at each end.
Not only does this approach allow us to deal with data sets that are as large as can fit on disk, it also allows us to accurately predict memory requirements up front (and keeps them very small).