In software, we typically work with binary values. That is, we have arbitrary streams of bytes. To encode these arbitrary stream of bytes in standard formats like email, HTML, XML, JSON, we often need to convert them to a standard format like base64. You can encode and decode base64 very quickly.

But what if you do not care for standards and just want to go fast and have simple code? Maybe all you care about is that the string is ASCII. That is, it must be a stream of bytes with the most significant bit of each byte set to zero. In such a case, you want to convert any 7-byte value into an 8-byte value, and back.

I can do it in about five instructions (not counting stores and moves) both ways in standard C: five instructions to encode, and five instructions to decode. It is less than one instruction per byte. I could not convince myself that my solution is optimal.

// converts any value in [0, 2**56) into a value that // could pass as ASCII (every 8th bit is zero) // can be inverted with convert_from_ascii uint64_t convert_to_ascii(uint64_t x) { return ((0x2040810204081 * (x & 0x80808080808080)) & 0xff00000000000000) + (x & 0x7f7f7f7f7f7f7f); } // converts any 8 ASCII chars into an integer in [0, 2**56), // this inverts convert_to_ascii uint64_t convert_from_ascii(uint64_t x) { return ((0x102040810204080 * (x >> 56)) & 0x8080808080808080) + (x & 0xffffffffffffff); }

Under recent Intel processors, the pdep/pext instructions may serve the same purpose. However, they are slow under AMD processors and there is no counterpart under ARM.

Is the input range actually up to 2

^{56}, rather than 2^{48}as the comment currently suggests?Yes. You are correct.

To encode in four instructions:

`uint64_t convert_to_ascii(uint64_t x) {`

return ((0x2040810204081 * (x & 0x80808080808080)) & 0xff80808080808080) ^ x;

}

If you are willing to use a bigger chunk size, you can probably do it faster with something like:

`uint64_t encode(uint64_t bits, uint64_t& high) {`

high = (high >> 1) | (bits & 0x8080808080808080) ;

return bits & 0x7F7F7F7F7F7F7F7F;

}

`// encode a 64-byte input chunk into 72 output bytes`

uint64_t high = 0;

for (size_t i = 0; i < 8; i++)

result[0] = encode(input[0], high);

result[8] = high;

By my count this is fewer operations and all basic ops: no potentially expensive multiply. Also, it processes 8 bytes at a time, not 7, which keeps all accesses aligned (this matters much more on non-x86).

On x86, you could take advantage of LEA, which can do (bits << 1) + x in a single instruction, but because you’re shifting left, not right, I think you can only process 7 bytes at a time (like your suggested algorithm), losing some benefit. Still, even at 7 bytes i calculate this as fewer total ops.

You must stop your loop one step before, and encode 56bytes in 64bytes, otherwise result[8] will not fit in ascii.

Yes, that’s right: it should only be 7 iterations of the inner loop before storing the upper bits.

The formatter wrecks the indentation in the code samples, but that should be:

`for (size_t i = 0; i < 8; i++) result[i] = encode(input[i], high);`

result[8] = high;

Less instructions is not always faster.

Not sure what the best way to exploit ILP is though.