Data alignment for speed: myth or reality?

Compilers align data structures so that if you read an object using 4 bytes, its memory address is divisible by 4. There are two reasons for data alignment:

  • Some processors require data alignment. For example, the ARM processor in your phone might crash if you try to access unaligned data. However, your x86 laptop will happily process unaligned data most of the time. It only needs alignment for fancy operations, such as SSE instructions, where 16-byte alignment is required.
  • It is widely reported that data alignment improves performance even on processors that support unaligned accesses, such as your x86 laptop. For example, an answer on Stack Overflow states that it is significantly slower to access unaligned memory (as in, several times slower). The top page returned by Google for data alignment states that “if the data is misaligned of 4-byte boundary, CPU has to perform extra work (…) this process definitely slows down the performance (…)”.

So, data alignment is important for performance.

Is it?

I decided to write a little program to test it out. My program takes a long array, initializes it, and computes a Karp-Rabin-like hash value from the result. It repeats this operation on arrays that have different offsets from an aligned boundary. For example, when it uses 4-byte integers, it tries offsets of 0, 1, 2 and 3. If aligned data is faster, then the case with an offset of 0 should be faster.

I repeat all tests 20 times and report the average wall clock time (in milliseconds). My source code in C++ is available.

4-byte integers

[Table: average wall-clock time (ms) for offsets 0–3, Core i7 vs. Core 2]

8-byte integers

[Table: average wall-clock time (ms) for offsets 0–7, Core i7 vs. Core 2]


I see no evidence that unaligned data processing could be several times slower. On a cheap Core 2 processor, there is a difference of about 10% in my tests. On a more recent processor (Core i7), there is no measurable difference.

On recent Intel processors (Sandy Bridge and Nehalem), there is no performance penalty for reading or writing misaligned memory operands. There might be more of a difference on some AMD processors, but the busy AMD server I tested showed no measurable penalty due to data alignment. It looks like even the data alignment requirements of SSE instructions will be lifted in future AMD and Intel processors.

Intel processors use 64-byte cache lines, and if you need to load a register overlapping two cache lines, it might limit the best speed you can get. But we are not talking about a severalfold penalty. Thus it only matters in very specific code where loading and storing data from the fastest CPU cache is a critical bottleneck, and even then, you should not expect a large difference.

Conclusion: On recent Intel processors, data alignment does not make processing a lot faster. It is a micro-optimization. Data alignment for speed is a myth.

Acknowledgement: I am grateful to Owen Kaser for pointing me to the references on this issue.


Update: Laurent Gauthier provided a counter-example where unaligned access is significantly slower (by 50%). However, it involves a particular setup where you read words separated by specific intervals.

30 thoughts on “Data alignment for speed: myth or reality?”

  1. Hi Daniel,

    You are testing this using new hardware. I asked the very same question in 2009 and in 2011.

    The results speak for themselves. This is not just a fluke: Intel did change the architecture.

    Then in 2009:
    time 33837 us -> 0%
    time 47012 us -> 38%
    time 47065 us -> 39%
    time 47001 us -> 38%
    time 33788 us -> 0%
    time 47018 us -> 39%
    time 47049 us -> 39%
    time 47014 us -> 39%

    Now in 2011:

    time 89400 us -> 0%
    time 90374 us -> 1%
    time 90299 us -> 1%
    time 90365 us -> 1%
    time 89348 us -> 0%
    time 90672 us -> 1%
    time 90372 us -> 1%
    time 90318 us -> 1%

  2. Interesting.
    The version of the myth that I heard said the slowdown was because the processor will have to do two aligned reads and then construct the unaligned read from that. If I read your code correctly, you’re accessing memory sequentially. In that case, the extra reads might not hit memory. If the compiler figures it out, there might not even be extra reads. Well, actually, I don’t really know what I’m talking about with these low-level things. Still. I’m not saying your test is wrong, but it is always important to be careful about what you’re actually testing and how that generalises.

    That said, these low-level tests you’re posting are really cool :). Thanks.

  3. Ah, from the page you linked to: “On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except for the fact that it uses more cache banks so that the risk of cache conflicts is higher when the operand is misaligned. Store-to-load forwarding also works with misaligned operands in most cases.”
    Sorry for commenting before reading the linked material ;).

  4. Daniel,

    I do not agree with your comment saying that it is a cache issue. All the memory used in this test is most likely in the cache.

    It really is a case of two 256-bit wide reads instead of one when the word you are reading crosses the boundary.

    (Note: I guess it is 256-bit wide; it might really be 128-bit wide reads. It’s just that 256-bit wide reads seem more likely.)

  5. @Thomas

    Thanks for the good words.

    I agree that you have to be careful, but that is why I post all my source code. If I’m wrong, then someone can hopefully show it by tweaking my experiment.

  6. Testing this is going to be tricky. Last I looked closely, CPUs fetch and cache chunks of words from memory. So unaligned sequential accesses are going to make no extra trips to memory (for the most part). If the cache handles misaligned references efficiently, then you might see little or no cost to misaligned access.

    If a misaligned reference crosses a chunk boundary, so that two chunks are pulled from memory, and only a single word is used of the two chunks, then you might see a considerable cost.

    Without building in knowledge of specific CPUs, you could construct a test. Allocate a buffer much larger than the CPU cache. Access a single word, bump the pointer by a power of two, increasing the exponent on each sweep of the buffer (until the stride is bigger than the cache line). Repeat the set of sweeps, bumping the starting index by one byte, until the starting index exceeds a cache line. (You are going to have to look up the biggest cache line for any CPU you test.)

    What I expect you to see is that most sweeps are fairly uniform, with spikes where a misaligned access crosses a cache line boundary.

    What that means (if true) is that most misaligned access will cost you very little, with the occasional optimally cache misaligned joker. (Requires the right starting offset and stride.)

    Still worth some attention, but much less likely you will get bit.

  7. This cache-chunk business sounds reasonable, but luckily also sounds like it might be relatively rare. And then you would care for cache-chunk alignedness, not something like word alignedness.

    I just had a look at the disassembly of an optimised build by Visual Studio 10. It looks to me like it is indeed doing unaligned reads.

    p.s.: It seems to have unrolled the Rabin-Karp-like loop 5(?!) times.

  8. Right, I guess “unrolled 5 times” doesn’t mean what I wanted to say. I mean: it does 5 iterations, then jump back.

  9. Thomas,

    BTW, unrolling loops does not get you a performance boost either on new hardware, in most situations.

  10. For most general random logic, this is unlikely to bite. The special cases are a *little* less unlikely than they appear. Power-of-two sized structures are not at all unlikely for some sorts of large problems. Accessing only the first or last word of a block is a common pattern. If the block start is misaligned, and the block is a multiple of a cache line size … you could get bit.

  11. The speed of unaligned access is architecture-dependent. On the DEC Alphas, it would slow down your program by immense amounts because each unaligned access generated an interrupt. Since we don’t know what the future holds, it’s best to design your programs so they have aligned accesses when possible. After all, unaligned access has NEVER been faster than aligned access.

  12. @A. Non

    Certainly, avoiding unaligned accesses makes your code more portable, but if you are programming in C/C++ with performance in mind, you are probably making many portability trade-offs anyhow (e.g., using SSE instructions).

  13. If other operations in your code are slower than memory access, then most of the time is spent on those operations, so you can’t see the difference between aligned and unaligned.

    In your source code, the other operation is the multiply.

    You can try memory copy instead — use a for loop to copy an array manually.

  14. Interesting article thanks!

    I used to do a lot of ARM coding, and from what I remember exactly what the ARM does on a unaligned access depends on how the supporting hardware has been set up.

    You can either get an abort which then gives the kernel an opportunity to fixup the nonaligned access in software (very slow!)

    Or you can read a byte-rotated word, so if you read a word at offset 1, you would read the 32-bit word at offset 0 but rotated by 8 bits. That was actually useful sometimes!

    I’m not sure about newer ARMs though.

  15. This is an interesting post. I did a related analysis a little while ago looking at the specific case of speeding up convolution (very much a real world example):
    and code here:

    Even with modern hardware (an i7-5600), there is a substantial improvement (~20%) when aligned loads are used in the inner loop, at least for SSE instructions, even when additional necessary array copies are factored in.

    In my example, I compared SSE initially but extended it to AVX (in the code), though without repeating the aligned vs. unaligned experiments. It turns out the benefit is not so apparent when going to 256-bit alignment from 128-bit alignment (the half alignment is good enough for AVX).

    1. Interesting. Can you point me directly to two functions where the only difference is that one uses aligned load SSE instructions and the other one uses unaligned load instructions? What I see in convolve.c is a function using unaligned loads, but it seemingly relies on a different (more sophisticated) algorithm.

      1. I extended my code a bit to test some additional cases.

        It seems that the results show that aligned SSE is important (compare `convolve_sse_partial_unroll` and `convolve_sse_in_aligned`) but aligned AVX makes very little difference (`convolve_avx_unrolled_vector` versus `convolve_avx_unrolled_vector_unaligned`).

        The fastest I can achieve is pretty close to the maximum theoretical throughput (about 90% of the peak clock speed * 8 flops), which is using the unaligned load, which agrees with your assessment that alignment is pretty unimportant – this is the `convolve_avx_unrolled_vector_unaligned` case.

        It’s interesting that the SSE operations don’t benefit from the better AVX unaligned loads.

  16. I too did not really believe that misaligned data could significantly affect runtimes until I fixed the alignment of my data structures and the performance jumped by about 750% (from about 3.9 seconds to 0.04 seconds).

    What’s more interesting is that I fixed the alignment by changing from floats to doubles on an Intel i7 (64-bit), which made all my data structures and data (doubles) have the same 8-byte alignment (sometimes more is more). Using floats (4-byte aligned) meant that my data was unaligned with the other data structures. Alignment matters. Intel says so. Please stop suggesting otherwise.

    1. Can you share a code sample showing that by changing data alignment alone, you are able to multiply the speed by 100 (from about 4 s to 0.04 s)?

      Please keep the data types the same. Alignment and data types are distinct concerns.

  17. Coming from the embedded side with older processors such as MIPS32 and PPC, properly aligned data structures were the de facto standard.
    Seeing is believing: both test results showed no noticeable differences on PPC e500v2 and Xeon E5-2660 v2, whether aligned or not.
    However, per one reference, ‘The fundamental rule of data alignment is that the safest (and most widely supported) approach relies on what Intel terms “the natural boundaries.”‘
    Very interesting! Thanks for sharing.

  18. Daniel,

    Thanks for that! I have long felt that alignment requirements spoiled computing because they aren’t just issues for compiler writers; they complicate general programming as well. For example, a structure containing 32-bit integers and doubles is no longer just a collection of adjacent items: gaps are inserted to maintain alignment.

    To me (I am a bit long in the tooth) they remind me of the way many architectures used to divide memory into segments. This structure was supposedly useful, but it gradually became clear that the disruption as you crossed a segment boundary was definitely not useful!

    I hope computer architectures soon evolve into being totally alignment free – including operations such as MULPD, which actually faults if the data isn’t 16-byte aligned, even though the individual data items are 8 bytes long!

  19. Note: unaligned access has an impact beyond performance, though.

    Specifically, the C and C++ standards specify that unaligned access is undefined behavior. This, in turn, means that a C++ compiler can reasonably expect that if `int` is 4-byte aligned, then an `int*` holds an address divisible by 4.

    At the very least, I remember an instance of gcc optimizing away the “misalignment checks” performed on an `int*`, which in turn resulted in accessing an array out-of-bounds (because it did not have a number of bytes divisible by 4).

    I would be very wary of using unaligned access directly in C or C++, unless specifically supported by the compiler (packed structures). Assembly can get away with it, but in C and C++ it’s dangerous.

    1. @Matthieu

      A very good point: it is potentially unsafe in C even if the underlying hardware is happy with unaligned loads and stores. But you can make it safe by calling memcpy, which gcc will translate into a single load on an x64 machine, without performance penalty.
