If you program in C/C++, you have many options to read files:

  • POSIX systems offer a low-level read function, which is a thin wrapper around a system call. It is as simple as it gets.
  • The standard C library offers a higher-level fread function. Unlike read, it goes through a library buffer whose size you can set (with setvbuf). Buffers can be good or bad. On the one hand, they reduce the number of disk accesses. On the other hand, they introduce an intermediate step between the disk and your data. That is, they may cause the data to be copied needlessly. Buffers usually make software faster because copying data in memory is much faster than reading it from disk.
  • In C++, you have ifstream. It is very similar to fread, but with a more object-oriented flavor.
  • Finally, you can use memory mapping. Instead of reading blocks of data, you map the content of the file to a pointer and the operating system is responsible for filling in the data. It has a reputation for being very fast because the data on disk can be mapped directly to memory without any undue copying. However, in my experience, it is also less stable: you are unlikely to cause a bus error with fread or ifstream, but with memory mapping the slightest mistake can crash your program.
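
To make the two buffered options concrete, here is a minimal sketch that sums the 32-bit integers stored in a binary file, once with fread and once with ifstream. It is not the benchmark code: the function names and the 4096-element block size are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

// Sum all 32-bit integers in a binary file using C stdio (fread).
// Returns false if the file cannot be opened.
bool sum_with_fread(const char* path, std::uint64_t& sum) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    std::vector<std::uint32_t> block(4096);  // arbitrary block size
    sum = 0;
    std::size_t n;
    // fread returns the number of complete elements read; 0 means EOF or error.
    while ((n = std::fread(block.data(), sizeof(std::uint32_t), block.size(), f)) > 0) {
        for (std::size_t i = 0; i < n; ++i) sum += block[i];
    }
    std::fclose(f);
    return true;
}

// Same computation with std::ifstream.
bool sum_with_ifstream(const char* path, std::uint64_t& sum) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::vector<std::uint32_t> block(4096);
    sum = 0;
    for (;;) {
        in.read(reinterpret_cast<char*>(block.data()),
                static_cast<std::streamsize>(block.size() * sizeof(std::uint32_t)));
        // gcount() reports the bytes actually read, even when read() hits EOF.
        const std::size_t n = static_cast<std::size_t>(in.gcount()) / sizeof(std::uint32_t);
        for (std::size_t i = 0; i < n; ++i) sum += block[i];
        if (n == 0 || !in) break;
    }
    return true;
}
```

Both functions hand the library a block to fill, and the library decides how many system calls that takes.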

For my work, a lot of the IO is based on sequential access. For this kind of access pattern, I have never found memory mapping to be useful. To support my claim, I created a little program that reads arrays of integers from a file, and does some minor computations on them. Memory mapping is not beneficial:

method          millions of integers per second
C read          70
C fread         124
C++ ifstream    124
mmap            125

As usual, my benchmark code is available for inspection. I used a Linux desktop with an Intel Core i7 processor and GCC 4.7 with the -O3 flag for my tests.
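
For readers who want to reproduce this kind of measurement, the sketch below shows one way to turn a timed read into a "millions of integers per second" figure. It is an illustration only and not the benchmark code: the names millions_per_second and measure are hypothetical, and it uses wall-clock time from std::chrono::steady_clock rather than CPU time.

```cpp
#include <chrono>
#include <cstddef>

// Convert an element count and an elapsed time into millions of integers per second.
double millions_per_second(std::size_t count, double seconds) {
    return static_cast<double>(count) / seconds / 1e6;
}

// Time one call of `reader`, any callable that returns how many integers it processed.
template <typename Reader>
double measure(Reader reader) {
    const auto start = std::chrono::steady_clock::now();
    const std::size_t count = reader();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return millions_per_second(count, elapsed.count());
}
```

A single timed call is noisy, so in practice one would repeat the measurement several times and look at the spread.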

Conclusion: For sequential access, both fread and ifstream are equally fast. Unbuffered IO (read) is slower, as expected. Memory mapping is not beneficial.

Warning: Benchmarking IO reliably is difficult. Results will vary depending on your configuration.

26 Comments

  1. Results on my Core i7 laptop don’t match up:

    fread 52.8416 63.133
    fread w sbuffer 53.9027 64.6808
    fread w lbuffer 55.4619 63.2864
    read2 73.746 49.1903
    mmap 78.9516 84.0752
    Cpp 54.5601 60.8912

    (so mmap actually turns out to be fastest)

    When I add MAP_POPULATE so as to prefault the pages, mmap gets even better:

    fread 49.8951 58.354
    fread w sbuffer 50.2688 60.7751
    fread w lbuffer 52.6344 62.8038
    read2 65.793 48.9292
    mmap 106.522 106.855
    Cpp 47.5949 59.6341

    But your point stands that it’s worth benchmarking these things.

    -Todd

    Comment by Todd Lipcon — 26/6/2012 @ 12:39

  2. Todd, did you mean “mmap gets even WORSE”? This is very-very strange, because in all tests that I have heard about mmap beats everything (by a wide margin). Assuming that you check with files “warmed up” and cached by the OS.

    Comment by Itman — 26/6/2012 @ 12:42

  3. No, mmap gets better – higher numbers are better here, unless I’m misreading the benchmark code.

    Comment by Todd Lipcon — 26/6/2012 @ 12:43

  4. I don’t really get why, but I also have an Intel Core i7, and, like Todd, I find mmap to be the fastest on my machine. Here are the results:

    fread 79.1329 82.2618
    fread w sbuffer 81.6359 82.9706
    fread w lbuffer 78.7614 82.0594
    read2 73.0988 48.8926
    mmap 93.9841 94.7367
    Cpp 86.4751 79.8615

    Now, I ran it on a Linux box using debian-testing. Also, like for Todd, adding MAP_POPULATE makes mmap quite a bit faster:

    fread 85.8116 82.6478
    fread w sbuffer 79.5079 82.5918
    fread w lbuffer 82.6412 79.6012
    read2 70.0466 46.6896
    mmap 110.734 125.265
    Cpp 82.4382 76.3553

    (and as you can see there’s quite a bit of variation from one run to the next).

    -Pierre

    Comment by Pierre Barbier de Reuille — 26/6/2012 @ 12:45

  5. That is correct, the little program reports the speed, so higher numbers are better.

    I don’t get better speed with mmap on any of my machines, but if you read the last paragraph of my blog post, I had expected people to get vastly different results.

    Unfortunately, IO is difficult to benchmark reliably.

    I have changed my program to use MAP_POPULATE. It does improve speed quite a bit, but even so, mmap is slower than fread on my machines.

    Comment by Daniel Lemire — 26/6/2012 @ 13:00

  6. The default policy of mmap() is pretty poor for reading large sequential streams because mmap() has no idea what your access pattern is going to be and conservatively straddles the fence on behavior by default. Defining the behavior and policy with madvise() to something other than default is important if performance matters.

    This is one of those cases where setting madvise() to MADV_SEQUENTIAL|MADV_WILLNEED over the file should make a significant difference. In principle, mmap() with madvise() flags properly set should be as fast as any other mechanism since most other mechanisms are using something like this under the hood.

    I am not sure it is apples-to-apples to modify the default buffering behavior of the fread() case but not altering the default access policy of mmap().

    Comment by J. Andrew Rogers — 26/6/2012 @ 13:00

  7. @Rogers

    Thanks. Even with these hints, I get that mmap is significantly slower. Here is what I get on my desktop:

    $ ./ioaccess

    fread 130.308 122.366
    fread w sbuffer 119.837 122.812
    fread w lbuffer 125.437 122.767
    read2 104.045 71.4784
    mmap 95.8698 43.1566
    fancy mmap 96.5595 77.5446
    Cpp 118.777 116.532

    where fancy mmap is what I get with madvise.

    Of course, there are variations from run to run, but mmap is never faster in my tests.

    I’m testing on a Linux desktop and a Mac laptop. I vary the GCC compiler version, for fun… but no luck. I always find that memory mapping is slower.

    I should stress that another reason to worry about memory mapping is how quickly it can bring down your program. For production code, hard crashes should be a concern.

    Comment by Daniel Lemire — 26/6/2012 @ 13:20

  8. Ohhh, I see. I (as usual) confuse milliseconds (MIs) and millions of integers per seconds (MIs).

    My results actually do match those of Daniel (on Linux/CentOS) and mmap beats everything else, but the difference is small, 10-20% (with and without MAP_POPULATE).

    A more interesting scenario would be to re-use the same file many times and not to re-map the data.

    Comment by Itman — 26/6/2012 @ 13:55

  9. I’m confused what you are trying to measure:

    1) Speed of shuffling data from the buffer cache (i.e. memory) into the process namespace?
    2) or speed of reading data from disk with the generic io scheduler though different interfaces (and therefore presumably different hints to the kernel reg. expected access patterns).

    If the latter, did everyone running the benchmark flush their buffer caches, e.g. with [1]?

    [1]
    echo 3 > /proc/sys/vm/drop_caches

    Comment by Tom — 26/6/2012 @ 15:40

  10. You’re testing with a file in /tmp. I suspect there’ll be a big difference between a tmpfs and a disk-based file system.

    Comment by Peter De Wachter — 26/6/2012 @ 15:45

  11. @Tom

    Because the file is so big, it is very unlikely that it resides in the buffer. This being said, the benchmark could be improved, that is why I post the source code on github.

    @Peter

    You’re testing with a file in /tmp. I suspect there’ll be a big difference between a tmpfs and a disk-based file system

    On my machine, files in /tmp are on disk.

    Comment by Daniel Lemire — 26/6/2012 @ 16:41

  12. Caches, filesystem, I/O scheduler, device readahead settings (/sys/devices/…/readahead_kb), etc. all affect the results heavily.

    As for stability, unless you jump outside the bounds of the mmapped memory, only one thing will crash you. It’s not a SIGSEGV, it’s a SIGBUS. You get this when the memory is validly mapped but can’t be accessed, for example when you get I/O errors on your disk. This can be handled with a SIGBUS signal handler: map in a page of /dev/zero over the page with the problem, set a flag, and check this flag at least once per page read. :) Handle the failure appropriately for your situation.

    Comment by Rasterman — 26/6/2012 @ 18:48

  13. fread 34.9852 37.3454
    fread w sbuffer 33.3594 37.9046
    fread w lbuffer 33.7706 38.4986
    read2 54.0629 27.0563
    mmap 35.7301 50.655
    fancy mmap 36.2806 50.3316
    Cpp 41.2026 39.5535

    I got differing results on Centos 5.2. I suspect it’s misreporting cpu times, as I would see large variations in cpu-based throughput, but similar wallclock speeds run-to-run.

    There were a few questionable things in the loop, like a vector that isn’t used, so I took it out and had minimal improvement. Changing to MAP_SHARED had about a 10% positive effect.

    Linus made some comments here: http://lkml.indiana.edu/hypermail/linux/kernel/0004.0/0728.html

    Single map-and-scan is probably the worst scenario :( .

    Comment by KWillets — 26/6/2012 @ 19:07

  14. I have a laptop with intel i7 and gcc4.7 too, on Fedora Linux:

    fread 92.2916 91.798
    fread w sbuffer 75.9051 75.531
    fread w lbuffer 84.1882 83.7542
    read2 42.3798 42.2044
    mmap 99.2518 67.327
    fancy mmap 90.0927 88.8752
    mmap (shared) 89.4623 88.51
    fancy mmap (shared) 101.197 100.393
    Cpp 95.7135 95.232

    Comment by Neoh — 26/6/2012 @ 21:16

  15. i/o is highly kernel and device dependent. people should post kernel versions (mmap(2) has different code paths for readahead than read(2)) and disk models (mostly because they affect the device drivers being picked up) besides cpu and compiler.

    Comment by vicaya — 26/6/2012 @ 23:26

  16. Ubuntu 12.04, 3.2.0-26-generic, ext4

    fread 91.1748 90.8086
    fread w sbuffer 93.3305 93.0044
    fread w lbuffer 94.3807 94.1626
    read2 55.2302 55.0486
    mmap 109.469 108.818
    fancy mmap 108.408 107.583
    mmap (shared) 109.469 108.861
    fancy mmap (shared) 108.233 107.534
    Cpp 100.909 100.607

    Comment by Cartesius — 29/6/2012 @ 11:43

  17. Well, I’ve tuned a little bit your read(2) implementation and here are my final numbers for the above mentioned architecture. Read(2) on the __proper__ sized buffer cannot be slower than fread.

    fread 94.3807 94.0505
    fread w sbuffer 94.3807 94.098
    fread w lbuffer 94.5136 94.3137
    read2 100.607 100.331
    mmap 107.712 107.07
    fancy mmap 106.515 105.687
    mmap (shared) 107.54 107.07
    fancy mmap (shared) 106.685 105.791
    Cpp 95.4547 95.1243

    Comment by Cartesius — 29/6/2012 @ 11:57

  18. Hi Daniel,
    sorry for three posts but after another tuning I finally have the expected results, that is, the read(2) and mmap(2) should provide pretty comparable results for this specific task. It was said that mmap is much better suited for repeated and random reading.
    Here are my tuned results: :-)

    fread 93.3305 93.0661
    fread w sbuffer 93.5909 93.226
    fread w lbuffer 94.116 93.8494
    read2 105.18 104.814
    mmap 105.345 104.869
    fancy mmap 104.687 104.045
    mmap (shared) 105.51 104.924
    fancy mmap (shared) 104.687 104.047
    Cpp 100.456 100.151

    Comment by Cartesius — 29/6/2012 @ 12:05

  19. Cartesius : What kind of tuning did you do? Could you please share the information with us please!

    Comment by maxime caron — 2/7/2012 @ 23:50

  20. Maxime: In testread:
    1. I changed the for-cycle into a while-cycle and changed the != sizeof(…) conditions, which I guess are not correct because the read(2) syscall can also give you a partial result. Definitely on a network socket, maybe also on a block device.
    2. I removed the first read(2) in the cycle which slows down whole computation.
    3. I read the blocksize in a bigger chunk of data.
    4. All reads are performed with a fixed IO buffer of size, say 64kB, no repeated vector.resize calls.

    And as I’ve said, read(2) by definition, cannot be any slower for this kind of scenario.

    And the results prove it.

    Comment by Cartesius — 3/7/2012 @ 3:49

  21. As I’m learning Go at the moment I converted the code here:

    https://gist.github.com/3172562

    Interestingly it runs at almost identical speed to the “basic sum (C++-like)” using gcc 4.6.3 and g++ -funroll-loops -O3 -o cumuls cumuls.cpp

    Go’s compiler isn’t particularly well optimised at the moment but I thought it did OK here.

    Comment by Nick Craig-Wood — 24/7/2012 @ 15:52

  22. Nick:
    Good to know. But I guess that this particularly simple scenario isn’t a tough job for the compiler because I assume that Go uses many well-implemented library functions, e.g. for I/O.

    Comment by Cartesius — 24/7/2012 @ 15:57

  23. Actually I rather stupidly posted that comment on the wrong blog post so please ignore!

    Comment by Nick Craig-Wood — 25/7/2012 @ 4:14

  24. nmap

    Intel pentium dual-core t2390
    debian 6.0.5
    gcc 4.4.5

    fread 12.7745 12.7489
    fread w sbuffer 12.8726 12.798
    fread w lbuffer 12.7382 12.5046
    read2 10.6465 10.5533
    mmap 15.8191 15.7597
    fancy mmap 15.849 15.7929
    mmap (shared) 15.7931 15.4621
    fancy mmap (shared) 15.6933 15.6158
    Cpp 11.1488 11.0111

    fread 12.8775 12.8624
    fread w sbuffer 12.8676 12.8215
    fread w lbuffer 12.6327 12.1366
    read2 10.5345 10.4342
    mmap 15.8602 15.8455
    fancy mmap 15.8228 15.7794
    mmap (shared) 15.6239 14.9385
    fancy mmap (shared) 15.5227 15.3789
    Cpp 11.1174 10.9043

    Comment by jg — 2/10/2012 @ 18:03

  25. When reading from file with your own buffering you’ve got 3 levels of buffering overall:
    1. Your buffer.
    2. glibc buffer for files
    3. Kernel buffer managing pages
    And then you have the mmap function, which allows you to read directly from pages, avoiding glibc buffering.
    Do you believe that when reading directly from low-level buffers, avoiding library and system calls plus various checks and then reading such small amounts of data is so slow?

    The answer lies in the bad benchmark code, giving false results. Cartesius has already given some hints about fixing read(). Another hint would be to open all files first and set their buffering (setvbuf) so that the library does not allocate the space. Another step would be to start timing only after the files are opened, and to use bigger buffers. With a small amount of data to be read, the obvious winner is mmap, but the situation can change with bigger buffers and sequential reading, which could be an interesting experiment.

    Comment by Jacek — 5/2/2013 @ 6:55

  26. @Jacek

    Do you believe that when reading directly from low-level buffers, avoiding library and system calls plus various checks and then reading such small amounts of data is so slow? (…) The answer lies in the bad benchmark code, giving false results.

    It is one thing to claim that the benchmark is faulty, it is another to propose a better one. The latter action is much more useful.

    Comment by Daniel Lemire — 5/2/2013 @ 9:52
