Compilers align data structures so that if you read an object using 4 bytes, its memory address is divisible by 4 (a short sketch of the compiler's padding follows the list below). There are two reasons for data alignment:

  • Some processors require data alignment. For example, the ARM processor in your phone might crash if you try to access unaligned data. However, your x86 laptop will happily process unaligned data most of the time. Your laptop only needs alignment for fancy operations, such as SSE instructions, where 16-byte data alignment is required.
  • It is widely reported that data alignment improves performance even on processors that support unaligned accesses, such as your x86 laptop. For example, an answer on Stack Overflow states that it is significantly slower to access unaligned memory (as in, several times slower). The top page returned by Google for data alignment states that "if the data is misaligned of 4-byte boundary, CPU has to perform extra work (…) this process definitely slows down the performance (…)".
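
To see concretely what the compiler does, consider a small C++ sketch (the sizes in the comments are typical of x86-64, but they can vary by platform and compiler):

    #include <cstddef>
    #include <cstdint>
    #include <iostream>

    // The compiler pads this struct so that the 4-byte member starts at an
    // offset divisible by 4.
    struct Example {
        char c;     // 1 byte, typically followed by 3 bytes of padding
        uint32_t x; // must sit at an address divisible by 4
    };

    int main() {
        std::cout << "sizeof(Example)      = " << sizeof(Example) << "\n";      // typically 8, not 5
        std::cout << "offsetof(Example, x) = " << offsetof(Example, x) << "\n"; // typically 4
        std::cout << "alignof(uint32_t)    = " << alignof(uint32_t) << "\n";    // typically 4
        return 0;
    }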

So, data alignment is important for performance.

Is it?

I decided to write a little program to test it out. My program takes a long array, initializes it, and computes a Karp-Rabin-like hash value over the result. It repeats this operation on arrays that start at different offsets from an aligned boundary. For example, with 4-byte integers, it tries offsets of 0, 1, 2 and 3 bytes. If aligned data is faster, then the case with an offset of 0 should be fastest.

I repeat all tests 20 times and report the average wall clock time (in milliseconds). My source code in C++ is available.
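
For readers who do not want to dig through the repository, here is a minimal sketch of the experiment rather than the exact code: it initializes and hashes the array once per offset, whereas the real test averages 20 runs. Note that reading through a misaligned pointer is technically undefined behavior in standard C++; x86 hardware tolerates it, and that tolerance is precisely what is being measured.

    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Initialize and hash n 4-byte integers starting "offset" bytes past an
    // aligned base address.
    uint32_t hash_at_offset(char *base, size_t offset, size_t n) {
        uint32_t *data = reinterpret_cast<uint32_t *>(base + offset);
        uint32_t hash = 0;
        for (size_t i = 0; i < n; ++i) {
            data[i] = static_cast<uint32_t>(i); // initialize
            hash = 37 * hash + data[i];         // Karp-Rabin-like accumulation
        }
        return hash;
    }

    int main() {
        const size_t n = 16 * 1024 * 1024;
        // Extra bytes so that every offset stays within the allocation; the
        // allocation itself is at least 4-byte aligned, so offset 0 is the
        // aligned case.
        std::vector<char> buffer(n * sizeof(uint32_t) + sizeof(uint32_t));
        for (size_t offset = 0; offset < sizeof(uint32_t); ++offset) {
            auto start = std::chrono::steady_clock::now();
            uint32_t h = hash_at_offset(buffer.data(), offset, n);
            auto stop = std::chrono::steady_clock::now();
            std::cout << "offset " << offset << ": "
                      << std::chrono::duration_cast<std::chrono::milliseconds>(
                             stop - start).count()
                      << " ms (hash " << h << ")\n";
        }
        return 0;
    }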

4-byte integers

offset   Core i7 (ms)   Core 2 (ms)
0        28.1           34.1
1        28.7           38.1
2        28.7           38.1
3        28.7           38.1

8-byte integers

offset   Core i7 (ms)   Core 2 (ms)
0        33.6           69.1
1        33.6           77
2        33.6           77
3        33.6           77
4        33.6           77
5        33.6           77
6        33.6           77
7        33.6           77

Analysis:

I see no evidence that unaligned data processing could be several times slower. On a cheap Core 2 processor, there is a difference of about 10% in my tests. On a more recent processor (Core i7), there is no measurable difference.

On recent Intel processors (Sandy Bridge and Nehalem), there is no performance penalty for reading or writing misaligned memory operands. There might be more of a difference on some AMD processors, but the busy AMD server I tested showed no measurable penalty due to data alignment. It looks like even the data alignment requirements of SSE instructions will be lifted in future AMD and Intel processors.

Conclusion: On recent Intel processors, data alignment does not make processing measurably faster. Data alignment for speed is a myth.

Acknowledgement: I am grateful to Owen Kaser for pointing me to the references on this issue.


Update: Laurent Gauthier provided a counter-example where unaligned access is significantly slower (by 50%). However, it involves a particular setup where you read words separated by specific intervals.
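
As the comments below explain, the slow case arises when each read straddles a cache-line boundary, so that every access pulls in two lines. Here is a hypothetical sketch of that kind of access pattern (the 64-byte line size and the offsets are my assumptions, not Laurent's exact code):

    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Read one 32-bit word per 64-byte cache line. When "offset" places the
    // word across a line boundary (e.g., offset 62), every read touches two
    // cache lines instead of one.
    uint32_t strided_sum(const char *base, size_t offset, size_t lines) {
        const size_t line = 64; // assumed cache-line size
        uint32_t sum = 0;
        for (size_t i = 0; i < lines; ++i) {
            sum += *reinterpret_cast<const uint32_t *>(base + i * line + offset);
        }
        return sum;
    }

    int main() {
        const size_t lines = 1 << 20; // buffer much larger than the cache
        std::vector<char> buffer(lines * 64 + 64, 1);
        for (size_t offset : {0, 32, 62}) { // 62 straddles the line boundary
            auto start = std::chrono::steady_clock::now();
            uint32_t s = strided_sum(buffer.data(), offset, lines);
            auto stop = std::chrono::steady_clock::now();
            std::cout << "offset " << offset << ": "
                      << std::chrono::duration_cast<std::chrono::microseconds>(
                             stop - start).count()
                      << " us (sum " << s << ")\n";
        }
        return 0;
    }

On processors that penalize line-crossing reads, the offset-62 sweep should stand out, while the other offsets should be roughly equal.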

20 Comments

  1. Hi Daniel,

    You are testing this using new hardware. I asked the very same question in 2009 and in 2011.

    The results speak for themselves. This is not just a fluke: Intel did change the architecture.

    Then in 2009:
    time 33837 us -> 0%
    time 47012 us -> 38%
    time 47065 us -> 39%
    time 47001 us -> 38%
    time 33788 us -> 0%
    time 47018 us -> 39%
    time 47049 us -> 39%
    time 47014 us -> 39%

    Now in 2011:

    time 89400 us -> 0%
    time 90374 us -> 1%
    time 90299 us -> 1%
    time 90365 us -> 1%
    time 89348 us -> 0%
    time 90672 us -> 1%
    time 90372 us -> 1%
    time 90318 us -> 1%

    Comment by Itman — 31/5/2012 @ 12:08

  2. Interesting.
    The version of the myth that I heard said the slowdown was because the processor would have to do two aligned reads and then construct the unaligned value from them. If I read your code correctly, you’re accessing memory sequentially. In that case, the extra reads might not hit memory. If the compiler figures it out, there might not even be extra reads. Well, actually, I don’t really know what I’m talking about with these low-level things. Still. I’m not saying your test is wrong, but it is always important to be careful about what you’re actually testing and how that generalises.

    That said, these low-level tests you’re posting are really cool :) . Thanks.

    Comment by Thomas — 31/5/2012 @ 12:33

  3. Ah, from the page you linked to: “On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except for the fact that it uses more cache banks so that the risk of cache conflicts is higher when the operand is misaligned. Store-to-load forwarding also works with misaligned operands in most cases.”
    Sorry for commenting before reading the linked material ;) .

    Comment by Thomas — 31/5/2012 @ 12:40

  4. @Thomas

    Thanks for the good words.

    I agree that you have to be careful, but that is why I post all my source code. If I’m wrong, then someone can hopefully show it by tweaking my experiment.

    Comment by Daniel Lemire — 31/5/2012 @ 12:52

  5. Testing this is going to be tricky. Last I looked closely, CPUs fetch and cache chunks of words from memory. So unaligned sequential accesses are going to make no extra trips to memory (for the most part). If the cache handles misaligned references efficiently, then you might see little or no cost to misaligned access.

    If a misaligned reference crosses a chunk boundary, so that two chunks are pulled from memory, and only a single word is used of the two chunks, then you might see a considerable cost.

    Without building in knowledge of specific CPUs, you could construct a test. Allocate a buffer much larger than the CPU cache. Access a single word, bump the pointer by a power of two, increasing the exponent on each sweep of the buffer (until the stride is bigger than the cache line). Repeat the set of sweeps, bumping the starting index by one byte, until the starting index exceeds a cache line. (You are going to have to look up the biggest cache line for any CPU you test.)

    What I expect you to see is that most sweeps are fairly uniform, with spikes where a misaligned access crosses a cache line boundary.

    What that means (if true) is that most misaligned access will cost you very little, with the occasional optimally cache misaligned joker. (Requires the right starting offset and stride.)

    Still worth some attention, but much less likely you will get bit.

    Comment by Preston L. Bannister — 31/5/2012 @ 13:06

  6. This cache-chunk business sounds reasonable, but luckily it also sounds like it might be relatively rare. And then you would care about cache-chunk alignment, not something like word alignment.

    I just had a look at the disassembly of an optimised build by Visual Studio 10. It looks to me like it is indeed doing unaligned reads.

    p.s.: It seems to have unrolled the Rabin-Karp-like loop 5(?!) times.

    Comment by Thomas — 31/5/2012 @ 13:15

  7. @Thomas

    5 times?

    Comment by Daniel Lemire — 31/5/2012 @ 13:17

  8. Right, I guess “unrolled 5 times” doesn’t mean what I wanted to say. I mean: it does 5 iterations, then jumps back.

    Comment by Thomas — 31/5/2012 @ 13:21

  9. Thomas,

    BTW, unrolling loops does not get you a performance boost either on new hardware, in most situations.

    Comment by Itman — 31/5/2012 @ 13:33

  10. For most general random logic, this is unlikely to bite. The special cases are a *little* less unlikely than they appear. Power-of-two sized structures are not at all unlikely for some sorts of large problems. Accessing only the first or last word of a block is a common pattern. If the block start is misaligned, and the block is a multiple of a cache line size … you could get bit.

    Comment by Preston L. Bannister — 31/5/2012 @ 14:13

  11. @Bannister

    There is now an example closely related to your analysis on GitHub. I have also updated my blog post accordingly.

    Comment by Daniel Lemire — 31/5/2012 @ 17:36

  12. Daniel,

    I do not agree with your comment saying that it is a cache issue. All the memory used in this test is most likely in the cache.

    It really is a case of two 256-bit wide reads instead of one when the word you are reading crosses the boundary.

    (Note: I am guessing it is 256-bit wide; it might really be 128-bit wide reads. It is just that 256-bit wide reads seem more likely.)

    Comment by Laurent Gauthier — 31/5/2012 @ 17:49

  13. @Laurent

    I have removed the misleading comment.

    Comment by Daniel Lemire — 31/5/2012 @ 20:32

  14. The speed of unaligned access is architecture-dependent. On the DEC Alphas, it would slow down your program by immense amounts because each unaligned access generated an interrupt. Since we don’t know what the future holds, it’s best to design your programs so they have aligned accesses when possible. After all, unaligned access has NEVER been faster than aligned access.

    Comment by A. Non — 2/6/2012 @ 13:39

  15. @A. Non

    Certainly, avoiding unaligned accesses makes your code more portable, but if you are programming in C/C++ with performance in mind, you are probably making many portability trade-offs anyhow (e.g., using SSE instructions).

    Comment by Daniel Lemire — 4/6/2012 @ 8:45

  16. If other operations in your code are slower than the memory accesses, then most of the time is spent on those operations, so you can’t see the difference between aligned and unaligned access.

    In your source code, the other operation is the multiply.

    You can try memory copy instead — use a for loop to copy an array manually.

    Comment by Yifei — 4/6/2012 @ 20:01

  17. @Yifei

    After disabling the multiplication, I get the same sort of result.

    Multiplication over integers is not expensive on modern processors due to pipelining.

    Comment by Daniel Lemire — 4/6/2012 @ 20:31

  18. Interesting article thanks!

    I used to do a lot of ARM coding, and from what I remember, exactly what the ARM does on an unaligned access depends on how the supporting hardware has been set up.

    You can either get an abort, which then gives the kernel an opportunity to fix up the nonaligned access in software (very slow!).

    Or you can read a byte-rotated word: if you read a word at offset 1, you would read the 32-bit word at offset 0 but rotated by 8 bits. That was actually useful sometimes!

    I’m not sure about newer ARMs though.

    Comment by Nick Craig-Wood — 6/6/2012 @ 11:40

  19. DarkShikari offered some great insight on this issue, 4 years ago now.

    Cacheline splits, aka Intel hell (Feb 2008)
    http://x264dev.multimedia.cx/archives/8

    Nehalem optimizations: the powerful new Core i7 (Nov 2008)
    http://x264dev.multimedia.cx/archives/51

    Comment by Alecco — 7/1/2013 @ 11:06

  20. You need to include stdint.h before using uintptr_t.

    Comment by Richard — 18/3/2013 @ 0:52
