Is reading memory faster than writing on a PC?

For human beings, reading is much faster than writing. The same is often true for computers: adding a record to a database may take an order of magnitude longer than reading it. But the speed differential can often be explained by the overhead of concurrency. For example, writing to a database may involve some form of locking where the database systems must first acquire the exclusive writing rights over a section of memory.

If you minimize the overhead, by writing straight C/C++ code, it is less obvious that reading and writing to memory should have different speeds. And, indeed, in my tests, the memset instruction in C which initializes a range of bytes with a given value, takes exactly half the running time of memcpy which reads data from one location and writes it to another. Because memset involves just writing to memory, and memcpy does as much reading as it does writing, we can conclude that these tests show symmetry: reading is as fast as writing.

However, these tests (with memset and memcpy) are maybe not representative of how writing and reading is done most commonly. Thus I designed a second batch of tests. In these tests, I have two arrays:

  • A small array (32KB or less) that fits easily in fast L1-L2 CPU cache. It has size M.
  • A larger array (between 1MB to 512MB) that might not fit in CPU cache. It has size M times N so that the small array fits N times in the larger array.

Then I consider two tests that should run at the same speed if reading and writing are symmetrical:

  1. I copy the small array over the large one N times so that all of the larger array is overwritten. In theory, this test reads and writes M times N elements. In practice, we expect the computer to read the small array only once and to keep its data in close proximity with the CPU afterward. Hence, we only need to read the M elements of the small array once and then write N times M elements in the large array. Hence, this test measures mostly writing speed. The simplified source code of this test is as follows:
    for(size_t x = 0; x<N; ++x) {
        bigdata += M;
  2. I copy N segments from the large array to the small array so that at the end, all of the large array has been read. Again, this test reads and writes M times N elements. However, we can expect the processor to delay copying the short array to memory: it might keep it in fast CPU cache instead. So that, in effect, this second tests might be measuring mostly reading speed. The simplified source code is:
    for(size_t x = 0; x<N; ++x) {
        bigdata += M;

Notice how similar the source codes are!

Can you predict which code is faster? I ran these tests on a desktop Intel core i7 processor:

  M     N   test 1 (time)/ test 2 (time)
2048 32 M 3.4
8192 16384 M 1.6

Analysis: The test 1 (writing test) is significantly slower than test 2 (reading test). That is, it is slower to take a small array (located near the CPU) and repeatedly write it to slower memory regions, than doing the reverse. That is, reading is faster than writing.

“But Daniel! That’s only one test and 4 lines of code, it proves nothing!”

Ok. Let us consider another example example. In C/C++, boolean values (bool) are stored using at least one byte. That’s rather wasteful! So it is common to pack the boolean values. For example, you can store 8 boolean values in one byte using code such as this one:

void pack(bool * d, char * compressed, size_t N) {
  for(size_t i = 0; i < N/8; ++i) {
    size_t x = i * 8;
    compressed[i] = d[x] | 
                    d[x+1]<<1 |
                    d[x+2]<<2 |
                    d[x+3]<<3 |
                    d[x+4]<<4 |
                    d[x+5]<<5 |
                    d[x+6]<<6 |
                    d[x+7]<<7 ;

The reverse operation is not difficult:

void unpack(char * compressed, bool * d, size_t N) {
  for(size_t i = 0; i < N/8; ++i) {
    size_t x = i * 8;
    d[x] = compressed[i] & 1;
    d[x+1] = compressed[i] & 2;
    d[x+2] = compressed[i] & 4;
    d[x+3] = compressed[i] & 8;
    d[x+4] = compressed[i] & 16;
    d[x+5] = compressed[i] & 32;
    d[x+6] = compressed[i] & 64;
    d[x+7] = compressed[i] & 128;

These two pieces of code (pack and unpack) are similar. The main difference is that the pack function reads 8 bools (1 byte each) and writes one byte whereas unpack function reads one byte and writes 8 bools. That is, the pack function is a reading test whereas the unpack function is a writing test.

Can you guess which one is faster? In my tests, unpacking is 30% slower than packing. This is consistent with the result of my other analysis: it seems that reading is faster than writing.

Of course, you should only expect to see a relative slowdown if you write much more than you would otherwise read (say by a factor of 8). Also, I expect the results of these tests to vary depending on the compiler and the machine you use. Always run your own tests!

Source code: As usual, you can find my source code online.

Further reading: I know why DRAM is slower to write than to read, but why is the L1 & L2 cache RAM slower to write? via Reverend Eric Ha

Update: Nathanaël Schaeffer points out that the latency of the MOV operation differs depending on whether you are copying data from memory to a register, or from a register to memory. On my test machine, it takes at least 3 cycles to copy data from a register to memory, but possibly only 2 cycles to copy data from memory to a register (Fog, 2012).

Credit: Thanks to L. Boystov, S. H Corey, R. Corderoy, N. Howard and others for discussions leading up to this blog post.

Daniel Lemire, "Is reading memory faster than writing on a PC?," in Daniel Lemire's blog, October 29, 2012.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

5 thoughts on “Is reading memory faster than writing on a PC?”

  1. Interesting. In theory, writing is fundamentally cheaper than reading, because you typically have to wait for the value you read but not for the completion of your write requests.

    However, a cache using an allocate-on-write policy will frequently end up reading a cache line so that it can then write to that cache line (because caches are maintained at line granularity and you can’t just write a word without knowing its “surroundings” or else you’ll eventually overwrite these surroundings with garbage); and then eventually it’ll have to evict the dirty cache line – making writing more work than reading, after all. And if large arrays are involved, high-end CPUs will eliminate the waiting for read results through prefetching.

    An interesting thing to test is scattered reads and writes, where prefetching doesn’t kick in so reading gets more expensive (but writing still involves reading, just not necessarily waiting for its completion).

    A more “extreme” thing to test is writing to non-cachable memory regions (so no reading of “surroundings” is involved), and still more “extreme” would be using some sort of DMA engine and local memory instead of cache; thus, the large latency of read and a lack thereof for write would be very visible. Of course this type of “close to the metal” work can be irrelevant if your target is commodity CPUs and OSes.

  2. Hi Daniel, unpack()’s “d[x] = compressed[x];” needs a mask above; you’ve already fixed it in the code proper.

    Also, Nathan Baum pointed out the code looks wrong elsewhere and I agree. A single index, x, is used in pack() and unpack() and strides on eight at a time. This means the compressed bytes holding eight bools have seven unused bytes between them. Referring to the github code, char comp[N / 8] would be too small given the long stride but fortunately, pack() and unpack() are both wrongly told there’s N / 8 bools in play, rather than N, so comp[]’s big enough but only the first eighth of bool data[] is used. Clearing the source array in the middle of the test loop before unpacking back into it should show the assert failing.

  3. Reading is two times faster than writing, due to DDR design (the R stands for read, not write), assuming sequential access.
    Writing is faster than reading, because the processor does not need to wait, see Yossi.

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.