Suppose that I give you a short string of digits, containing possibly spaces or other characters (e.g., "20141103 012910"). We would like to pack the digits into an integer (e.g., 0x20141103012910) so that the lexicographical order over the string matches the ordering of the integers.
We can use the fact that in ASCII, the digits have byte values 0x30, 0x31 and so forth. Thus as a sequence of bytes, the string "20141103 012910" is actually 0x32, 0x30, 0x31… so we must select the least significant 4 bits of each byte, discarding the rest. Intel and AMD processors have a convenient instruction which allows us to select any bits from a word to construct a new word (pext).
A problem remains: Intel and AMD processors are little endian, which means that if I load the string in memory, the first byte becomes the least significant, not the most significant. Thankfully, Intel and AMD can handle this byte order during the load process.
In C, the desired function using Intel intrinsics might look like this:
#include <x86intrin.h> // Windows: <intrin.h> #include <string.h> // From "20141103 012910", we want to get // 0x20141103012910 uint64_t extract_nibbles(const char* c) { uint64_t part1, part2; memcpy(&part1, c, sizeof(uint64_t)); memcpy(&part2 , c + 7, sizeof(uint64_t)); part1 = _bswap64(part1); part2 = _bswap64(part2); part1 = _pext_u64(part1, 0x0f0f0f0f0f0f0f0f); part2 = _pext_u64(part2, 0x0f000f0f0f0f0f0f); return (part1<<24) | (part2); }
It compiles to relatively few instructions: only 4 non-load instructions. The memcpy calls in my code translate into 64-bit load instructions. The register loading instructions (movabs) are nearly free in practice.
movbe rax, QWORD PTR [rdi] movbe rdx, QWORD PTR [rdi+7] movabs rcx, 1085102592571150095 pext rax, rax, rcx movabs rcx, 1080880467920490255 sal rax, 24 pext rdx, rdx, rcx or rax, rdx
Prior to the AMD Zen 3 processors, pext had terrible performance on AMD processors. Recent AMD processors have performance on par with Intel, meaning that pext has a latency of about 3 cycles, and can run once per cycle. So it is about as expensive as a multiplication. Not counting the loads, the above function could nearly be complete in about 5 cycles, which is quite fast.
For ARM processors, you can do it with ARM NEON like so: mask the high nibbles, shuffle the bytes, then shift/or, then narrow (16->8), extract to general register.
#include <arm_neon.h> // From "20141103 012910", we want to get // 0x20141103012910 uint64_t extract_nibbles(const char *c) { const uint8_t *ascii = (const uint8_t *)(c); uint8x16_t in = vld1q_u8(ascii); // masking the high nibbles, in = vandq_u8(in, vmovq_n_u8(0x0f)); // shuffle the bytes const uint8x16_t shuf = {14, 13, 12, 11, 10, 9, 7, 6, 5, 4, 3, 2, 1, 0, 255, 255}; in = vqtbl1q_u8(in, shuf); // then shift/or uint16x8_t ins = vsraq_n_u16(vreinterpretq_u16_u8(in), vreinterpretq_u16_u8(in), 4); // then narrow (16->8), int8x8_t packed = vmovn_u16(ins); // extract to general register. return vget_lane_u64(vreinterpret_u64_u16(packed), 0); }
It might compile to something like this:
adrp x8, .LCPI0_0 ldr q1, [x0] movi v0.16b, #15 ldr q2, [x8, :lo12:.LCPI0_0] and v0.16b, v1.16b, v0.16b tbl v0.16b, { v0.16b }, v2.16b usra v0.8h, v0.8h, #4 xtn v0.8b, v0.8h fmov x0, d0
Small typo:
“Recent Intel processors have performance on par with Intel”
should be Recent AMD (or Zen) processors have performance…
(offtopic)
I wonder if there is a really fast SIMD-based calculation possible for itoa, maybe something along bitfield multiplication and pext extraction steps. Had to deal with it recently and was quite surprised, how hard the binary number to decimal digits conversion still is.
@Lemire Do you happen to have dealt with that in the past, or have some good pointers? Sorry if I overlooked it.
You may like…
Converting integers to fix-digit representations quickly
Many thanks you for the link. At a first glance it seems that I also went with the small table approach (2 digits lookup after divmod 100). Will see if I can get that somewhat faster without going into big table realms.
Here is the optimized implementation from ClickHouse:
https://github.com/ClickHouse/ClickHouse/blob/master/base/base/itoa.h
No SIMD there, though.