Which is faster: integer addition or XOR?

The bitwise exclusive or (e.g., 1110 XOR 1001 = 0111) looks simpler to compute than integer addition (e.g., 2 + 9 = 11). Some research articles claim that XOR is faster. It appears to be Computer Science folklore. But is it true?

Which line runs faster? (The symbol “^” is the XOR.)

for(int k = 0; k < N; ++k) sum+= k;

for(int k = 0; k < N; ++k) sum^= k;

My result: In C++ and Java, both run at the same speed (within 1%).

Code: My source code is on github.

Daniel Lemire, "Which is faster: integer addition or XOR?," in Daniel Lemire's blog, March 12, 2010.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

20 thoughts on “Which is faster: integer addition or XOR?”

  1. Unless there is more context here, there should be no difference, as both ops map to a single-cycle operation on modern processors. Arbitrary-precision numbers maybe?

    However, at the pure hardware level, XOR is faster than addition since there is no carry bit, but other details obscure that and for all practical purposes, they run at the same speed as instructions.

  2. They’re probably both the same number of machine cycles, if only because there’s no way for an ALU to take advantage of fewer stages of transistors in the XOR versus the ADD (the outputs of the combinatorial logic are just latched into the output after the same amount of time regardless of the operation of that type. Is integer multiplication any different these days (I’m too lazy to check for myself)?

  3. XOr requires less transistor than addition. This is one measure of complexity on which XOr wins.

    XOr would probably be harder to teach to my kid than sums. This is a measure of complexity on which addition wins.

    We have a stalemate…

    (As a sidenote, have you thought of updating your spam protection mechanism to ask for XOrs instead of sums? ;))

  4. The answer is going to depend on the CPU.

    For anything like current desktop and server CPUs, they use a ridiculous amount of hardware to accelerate a single instruction stream. Both operations are common and pretty much dead simple, so I’d expect both to be among the very-fastest instructions.

    This is likely true of embedded CPUs, like those used in the iPad and cell phones.

    Multiplication might show a bigger difference (compared to simpler operations) between high-end and low-end CPUs.

  5. I’m finding that when I compile with gcc, xor is 10-20% faster. When I compile the same code with g++, add is about 10% faster. Here is the code I used: http://pastebin.com/PzmmFsai .

    uname -a ouputs:
    Linux lappy 2.6.32-ARCH #1 SMP PREEMPT Tue Feb 23 19:43:46 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz GenuineIntel GNU/Linux

    And I’m using gcc 4.4.3.

  6. Isn’t this because there are complex carry lookahead adder circuitry already incorporated in the hardware ? XOR is a bit wise parallel simpler operation. So in that sense, ADD finishes at the same
    same as XOR because hardware has been optimised for AND.

  7. @Frank With the same code, and testing over many runs, I get that both run at the same speed.

    I have modified your code so that it loops around more:


    $ g++ -O2 -o code1 code1.cpp
    $ g++ -O2 -o code2 code2.cpp
    $ time ./code1 9999999
    real 0m5.953s
    user 0m5.944s
    sys 0m0.002s
    $ time ./code2 9999999
    real 0m5.974s
    user 0m5.971s
    sys 0m0.002s

  8. @Bannister I have a bunch of counter-measures against spammers for this blog. They work very well, but a small percentage of spam is unavoidable. I prune spam comments every day, believe it or not.

  9. Well Daniel, you have achieved a measure of success with your weblog – the spammers have arrived, despite your protection.

    Either that, or we have someone (a student?) with an odd sense of humor.

  10. Folk, you are wasting your time comparing XOR against addition – both are dirt cheap and going to take the minimum amount of time possible for an instruction that takes two operands and stores a result.

    The difference in complexity between the two operations (if any) is insignificant compared to the transistor budget of the CPU designer, and has been for a couple decades.

    If you find a difference, it will be to either (1) a measurement error, or (2) something wonky in your favorite language interpreter.

    I used to pay a lot of attention to CPU architecture, counting clock cycles per instruction, and writing benchmarks to verify the variants of instruction sequences. As the CPU designers got a bigger transistor budget, they dropped in bigger circuit blocks to perform common operations in a single cycle, or as close as possible.

    I was delighted when CPUs got barrel shifters. Ever try to write efficient graphics ops when bit-shift time is linear to the size of the shift? That was a long time back. (And I found a solution.)

    When the Intel 486 came out, I lost interest. Most common operations were at that point were very fast.

    There is another aspect to this. The unique logic in a CPU is like lines of code in software (and VLSI design became more like software design). Since then CPUs have become very large programs indeed. Large programs almost invariably have bugs. We normally attribute failures to bugs in software, but some are bugs in hardware, and we do not know the proportion.

    With current CPUs, massive hardware and overlapped speculative execution has even removed or minimized the cost of subroutine calls and pointer chasing – in many cases. Instruction clock-cycle counting has not been effective for a long time.

  11. FYI: Any test that counts the number of cpu cycles is going to have issues (this is essentially due to the Heisenberg uncertainty principle). Also, to answer a question from above, multiplication (i.e. the IMULT assembly command) usually runs at about 4 clock cycles. However, this is only when you are doing a lot of multiplication. If you are doing only a few this speed is slower. See an discussion of optimized multiplexers for more understanding. Also, if the two number being multiplied are greater than the size of register (ie. 32 bits or 64 bits) the speed is O(lg(n)) where n is the number of bits. While this may seem odd to mention, it is very common on any system doing cryptological calculation (particular those for the ElGamal and RSA encryption systems).

  12. I have done the same experiment but with xor and sub, since k -= k is 0 and k ^= k is also 0
    the result is :
    sub : 151 secs. (7 tries)
    xor : 171 secs. (also 7 tries)
    this shows that sub is faster



    using namespace std;

    int main()
    int k = 1000;
    for(int i= 0 ; i<10000; i++)
    k -= k;
    cout << i << endl;
    k = 1000;




    using namespace std;

    int main()
    int k = 1000;
    for(int i= 0 ; i<10000; i++)
    k ^= k;
    cout << i << endl;
    k = 1000;


  13. Addition takes longer to compute in *hardware* because the carry bit has to be propogated through N sequential calculations — each of which involves gate delays. When computing XOR, each bit can be calculated in parallel.

    Whether it takes longer on a particular hardware implementation is implementation-dependent. Instruction timing on state-of-the-art Intel processors is complicated, to say the least. But, according to the Intel Architecture Optimization Manula, On intel architectures, The pair of integer ALU pipeline stages can each execute xor and addition operations in one clock cycle. They execute at exactly the same speed.

    That’s true for Pentium and later processors. It’s possible that smaller and leaner processors do have different execution times for XOR and ADD; but I would think that — given the prevalence of addition operations in computing, that even tiny processors would use the propogation delay time of an addition operation would for the low limit for a single pipleline stage in any modern microprocess, and all or almost all ancient ones as well. (Despite having taken high-school latin, I also think your spam filter is inappropriate).

  14. Agner Fog reckons latency and throughput is the same for XOR and ADD on all the CPU’s I’ve looked at, Agner Fog’s word is pretty much Gospel. So if we’re not talking about hardware and transistors, the actual computation times in clock cycles are identical.

  15. First, looping on one instruction is useless. The cost of looping is much larger than the cost of the instruction. There’s no way you can understand whether xor is faster with a single-instruction loop.

    So I put together the test below. It does 16 instructions per loop, so the cost of the loop is reduced by an order of magnitude.

    Compile with -DOP=”^=” for xor and -DOP=”+=” for addition. My results:

    > ./testx
    197000 us
    15:50:19 [lithium] /tmp
    > ./testp
    474000 us

    += is 2.5 times slower.

    The problem is that if you look at the disassembled code, the compiler is doing a *lot* of rearrangement in the xor version (don’t try to take the ++’s out–it will excise almost all the code). In general, xor satisfies so many useful equations that it is incredibibly difficult to convince the compiler not to change radically your code. It is possible that fiddling some more you can convince the compiler to actually do those xor in that order, but I think I made my point: if you need to add, you must add. xor is absolutely fine if you need just to ensure data dependency, but it cannot replace an add.

    uint64_t getusertime() {
    struct rusage rusage;
    getrusage( 0, &rusage );
    return rusage.ru_utime.tv_sec * 1000000ULL + ( rusage.ru_utime.tv_usec / 1000 ) * 1000;
    int main(int argc, char *argv[]) {
    int64_t val0 = time(NULL);
    int64_t val1 = time(NULL);
    int64_t val2 = time(NULL);
    int64_t val3 = time(NULL);
    int64_t val4 = time(NULL);
    int64_t val5 = time(NULL);
    int64_t val6 = time(NULL);
    int64_t val7 = time(NULL);
    int64_t val8 = time(NULL);
    int64_t val9 = time(NULL);
    int64_t val10 = time(NULL);
    int64_t val11 = time(NULL);
    int64_t val12 = time(NULL);
    int64_t val13 = time(NULL);
    int64_t val14 = time(NULL);
    int64_t val15 = time(NULL);
    int64_t x = 0;
    const int64_t start = getusertime();
    for( int64_t i = 100000000LL; i--; ) {
    x OP val0++;
    x OP val1++;
    x OP val2++;
    x OP val3++;
    x OP val4++;
    x OP val5++;
    x OP val6++;
    x OP val7++;
    x OP val8++;
    x OP val9++;
    x OP val10++;
    x OP val11++;
    x OP val12++;
    x OP val13++;
    x OP val14++;
    x OP val15++;
    int64_t volatile do_not_excise = x;
    const int64_t elapsed = getusertime() - start;
    printf("%lld us\n", elapsed );
    return 0;

  16. Yea, I just confirmed Sebastiano Vigna’s results. I’m getting about 2.3x improvement on XOR vs. IADD. I compiled with “g++ testadd.cpp -lprofiler -Wall -O3 -o testadd” and “g++ testxor.cpp -lprofiler -Wall -O3 -o testxor” and optimized with a 4-way unrolled loop for both programs.

Leave a Reply to Eric LaForest Cancel reply

Your email address will not be published.

You may subscribe to this blog by email.