Compilers align data so that an object occupying 4 bytes is stored at a memory address divisible by 4. There are two commonly cited reasons for data alignment:
- Some processors require data alignment. For example, the ARM processor in your phone might crash if you try to access unaligned data. However, your x86 laptop will happily process unaligned data most of the time. Your laptop only needs alignment for fancy operations, such as SSE instructions that require 16-byte data alignment.
- It is widely reported that data alignment improves performance even on processors that support unaligned processing, such as your x86 laptop. For example, an answer on Stack Overflow states that it is significantly slower to access unaligned memory (as in, several times slower). The top page returned by Google for data alignment states that if the data is misaligned of 4-byte boundary, CPU has to perform extra work (…) this process definitely slows down the performance (…).
So the common wisdom is that data alignment is important for performance.
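To make the notion of alignment concrete, here is a minimal sketch of how one might check whether an address is aligned. The helper names `is_aligned` and `misalignment` are hypothetical and only serve as illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if the address held by p is divisible by `alignment`.
// `is_aligned` is a hypothetical helper name used for illustration.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Returns how many bytes p lies past the previous multiple of `alignment`.
// A result of 0 means the address is aligned.
inline std::size_t misalignment(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment;
}
```

For an ordinary variable such as `int x`, `is_aligned(&x, alignof(int))` should hold because the compiler places it on a suitable boundary. An address one byte past a 4-byte boundary has a misalignment of 1.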
I decided to write a little program to test this claim. My program takes a long array, initializes it, and computes a Karp-Rabin-like hash value from the result. It repeats this operation on arrays that have a different offset from an aligned boundary. For example, when it uses 4-byte integers, it will try offsets of 0, 1, 2 and 3. If aligned data is faster, then the case with an offset of 0 should be faster.
I repeat all tests 20 times and report the average wall clock time (in milliseconds). My source code in C++ is available.
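The procedure can be sketched roughly as follows. This is not the original source code, only an illustration under my own assumptions: the names `fill_at`, `hash_at`, `time_offset` and `Timing` are hypothetical, the multiplier 31 is an arbitrary choice, and `std::memcpy` is used for the unaligned reads and writes so that the code stays well defined in standard C++ (on x86, compilers typically turn it into plain load and store instructions).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Writes the values 0, 1, ..., n-1 as 32-bit integers starting at `base`,
// which may be any byte address (aligned or not).
inline void fill_at(unsigned char* base, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t v = static_cast<std::uint32_t>(i);
        std::memcpy(base + i * sizeof v, &v, sizeof v);
    }
}

// Karp-Rabin-like hash over n 32-bit integers starting at `base`.
// The multiplier 31 is an assumption made for this sketch.
inline std::uint32_t hash_at(const unsigned char* base, std::size_t n) {
    std::uint32_t h = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t v;
        std::memcpy(&v, base + i * sizeof v, sizeof v);
        h = 31 * h + v;  // unsigned arithmetic wraps around, as intended
    }
    return h;
}

struct Timing {
    std::uint32_t hash;   // returned so the compiler cannot discard the work
    double milliseconds;  // average wall clock time per repetition
};

// Initializes and hashes n integers placed `offset` bytes past a 4-byte
// boundary, `repeats` times, and reports the average time.
inline Timing time_offset(std::size_t n, std::size_t offset, int repeats) {
    assert(offset < sizeof(std::uint32_t));
    assert(repeats > 0);
    // Extra 8 bytes: up to 3 to reach a boundary, up to 3 for the offset.
    std::vector<unsigned char> storage(n * sizeof(std::uint32_t) + 8);
    unsigned char* aligned = storage.data();
    while (reinterpret_cast<std::uintptr_t>(aligned) % sizeof(std::uint32_t) != 0) {
        ++aligned;
    }
    unsigned char* base = aligned + offset;

    std::uint32_t h = 0;
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        fill_at(base, n);
        h = hash_at(base, n);
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return Timing{h, elapsed.count() / repeats};
}
```

Calling `time_offset(n, offset, 20)` for offsets 0 through 3 and comparing the `milliseconds` fields mimics the experiment. The hash values should be identical across offsets since the data is the same; only its placement in memory changes.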
|offset||Core i7||Core 2|
I see no evidence that unaligned data processing could be several times slower. On a cheap Core 2 processor, there is a difference of about 10% in my tests. On a more recent processor (Core i7), there is no measurable difference.
On recent Intel processors (Sandy Bridge and Nehalem), there is no performance penalty for reading or writing misaligned memory operands. There might be more of a difference on some AMD processors, but the busy AMD server I tested showed no measurable penalty due to data alignment. It looks like even the data alignment requirements of SSE instructions will be lifted in future AMD and Intel processors.
Intel processors use 64-byte cache lines, and if you need to load a register overlapping two cache lines, it might limit the best speed you can get. But we are not talking about a severalfold penalty. Thus, alignment only matters in very specific code where loading and storing data from the fastest CPU cache is a critical bottleneck, and even then, you should not expect a large difference.
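To see how often a load can overlap two cache lines, here is a small sketch that assumes the 64-byte line size mentioned above. The names `kCacheLine`, `crosses_cache_line` and `count_crossing_offsets` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Cache line size in bytes; 64 is the Intel figure quoted in the text.
constexpr std::size_t kCacheLine = 64;

// True if the byte range [address, address + size) spans two cache lines.
constexpr bool crosses_cache_line(std::uintptr_t address, std::size_t size) {
    return size > 0 &&
           (address / kCacheLine) != ((address + size - 1) / kCacheLine);
}

// Counts how many of the 64 possible starting positions within a cache
// line make a load of `size` bytes spill into the next line.
inline std::size_t count_crossing_offsets(std::size_t size) {
    std::size_t count = 0;
    for (std::uintptr_t start = 0; start < kCacheLine; ++start) {
        if (crosses_cache_line(start, size)) {
            ++count;
        }
    }
    return count;
}
```

For a 4-byte load, only the starting positions 61, 62 and 63 within a line cross into the next one, and a 4-byte load from an address divisible by 4 never does. When scanning an array sequentially at an unaligned offset, only a small fraction of the loads therefore touch two cache lines.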
Conclusion: On recent Intel processors, data alignment does not make processing a lot faster. It is a micro-optimization. Data alignment for speed is a myth.
Acknowledgement: I am grateful to Owen Kaser for pointing me to the references on this issue.
- Recent ARM processors do support unaligned memory accesses though it is unclear what the performance penalty is.
- In 2008, Alexander Sandler reported that unaligned accesses could require twice the number of clock cycles.
Update: Laurent Gauthier provided a counter-example where unaligned access is significantly slower (by 50%). However, it involves a particular setup where you read words separated by specific intervals.