Most of us are familiar with IP addresses: they are strings typically of the form “ddd.ddd.ddd.ddd” where ddd is a decimal number of up to three digits in the range 0 to 255. For example, 127.0.0.1 or 192.168.0.2.
Each of the four number is a byte value, and the address is an IPv4 network address that fits in 32 bits. There is a more recent extension (IPv6) that spans 128 bits, but the conventional IPv4 addresses are common. They can be part of URLs (e.g., http://127.0.0.1/home).
I have a blog post on how you can go from the 32-bit binary value to the string quickly: Serializing IPs quickly in C++. It turns out that you can write the strings very, very fast.
For our fast URL parsing library, I also wrote the counterpart, where we go from the string to the 32-bit value. Our URL parsing library supports various IP address formats and lengths. However, I did not optimize it particularly well because there are many other challenges in efficient URL parsing.
Recently, Jeroen Koekkoek brought to my attention that my friend Wojciech has a recent article on the parsing of IP addresses. As usual, it is brillant. Koekkoek is working on accelerating DNS record parsing in his simdzone project and we expect to parse a lot of IP addresses.
Wojciech’s results are slightly depressing, however. He suggests that you can beat a standard approach by a factor of 2 or 2.5 in speed, which is excellent, but at the cost of relatively large tables.
Let summarize the strategy developed by Wojciech:
- Identify the location of the dots.
- Construct an integer (a dot “mask”) which has its bits set according to the location of the dots (if a dot is present as the second character, we set the second bit to 1). You can check that there are only 81 possible masks, after adding a virtual dot at the end of the address.
- Build a function which maps this dot mask to precomputed values which can be used to reorganize the digits and process them with single instruction, multiple data (SIMD) instructions.
Wojciech implemented many variations. Generally, he distinguishes between the case where we have no 3-digit numbers (e.g., 12.1.0.1), and so forth. I don’t expect that this optimization is very useful in practice. He came up with two different mapping functions, one that is slow but requires little memory, and one that is faster but requires a fair amount of memory. After some effort, I was able to compute with a compromise that is nearly as fast as his fastest approach, but without using much more memory than his slow routine. I end up using about 255 bytes for the mapping function, which is reasonable.
He has also separate steps to validate the input: i.e., ensure that we are only dealing with digits and dots. I feel that this can be accomplished as part of the computation. He loads the expected dot mask as part of the validation, but I expect that you can more simply validate the input by comparing the expected length of the address (which we get in any case). Overall, I only need 81 times sixteen bytes of data to do the main processing. All in all, I use nearly only half as much storage as Wojciech’s fastest routine with the expectation that I can go faster because I have simplified a few steps.
A standard C function (under Linux and similar system) to parse IP addresses is inet_pton. It is similar in performance to the routines that Wojciech tested against. I called my own function (coded in C), sse_inet_aton. Unlike the standard function, mine will always read sixteen bytes, though it will only process the specified number of bytes. This means that the input string must either be part of a large string or you must overallocate. This limitation was already there in Wojciech’s code, and I could not find a good way around it. If you have a fancy new processors with advanced SIMD instructions (AVX-512 or SVE), there are masked load and store instructions which solve this problem nicely. But like Wojciech, I decided to stick with old school SIMD instructions.
For my benchmark, I just generate a lot of random IP addresses, and I try to parse them as quickly as possible. I use an Intel IceLake server with GCC 11 and clang (LLVM 16). My source code is available.
instructions/address | speed | addresses per s | |
---|---|---|---|
inet_pton | 300 | 0.38 GB/s | 35 million |
sse_inet_aton (LLVM 16) | 52 | 2.7 GB/s | 200 million |
sse_inet_aton (GCC 11) | 63 | 2.3 GB/s | 170 million |
So the optimized function is six times faster than the standard one using GCC. Switching to clang (LLVM), you go seven times faster. The fact that LLVM has such an edge over GCC warrants further examination.
I suspect that my code is not nearly optimal, but it is definitively worth it in some cases in its current state. I should stress that the optimized function includes full validation: we are not cheating.
Future work: The code has the hard-coded assumption that you have a Linux or macOS system with an SSE 4.1 compatible processor: that’s virtually all Intel and AMD processors in operation today. The code could be faster on AVX-512 machines, but I leave this for future work. I have also not included support for ARM (through NEON instructions), but it should be easy. Furthermore, I only support the conventional IPv4 addresses in its most common format (ddd.ddd.ddd.ddd) and I do not support IPv6 at all.
Worth to mention. There are several IPv4 notations.
For example:
$ ping 0300.0250.1.0376
PING 0300.0250.1.0376 (192.168.1.254) 56(84) bytes of data.
…
That’s the reason why inet_pton checks “IPv4 field has octet with leading zero” condition.
I am benchmarking against the standard function inet_pton which does not support these variations.
Our URL parsing library supports all of these variations, and IPv6 as well.
Just to point out that inet_pton() differs from inet_addr() in that the latter accepts more formats. inet_ntop(3p) spells that out. This is why one can
ping 127.1
,ping 0x8080808
, orping 2130706433
. Callers of sse_inet_aton() need to be certain their inputs are compatible, especially if taking IP addresses from external users.The documentation of inet_pton is as follows:
I tested on both macOS and Linux, and inet_pton cannot parse 0x8080808. You can verify yourself by running this program:
If you run this program, you get 1 (success), 0 (failure), 0 (failure), and 0 (failure). Meaning that only “127.0.1.1” is recognized as a valid address.
Hi Daniel, you have misread my comment: I never claimed inet_pton() can parse 0x8080808. I’m pointing out it can’t.
I was actually not arguing “against you”. I was just clarifying what the function does, for the benefit of the readers.
IP addresses can also be IPv6…. or in IPv6 notation while being IPv4… So be wary of this 😉
::ffff:12.34.56.78
Yes, my code does not support IPv6 or uncommon IPv4 formats. It is equivalent to “inet_pton(AF_INET,…)”.
Has this been tested exhaustively?
If the perfect hash function hashes an invalid string to one of the valid entries…
then the shuffle would discard bytes without them ever being validated?
We have checks in place to verify that the string length is as expected.
Of course, bugs are always possible.
I think there are some opportunities for optimizing the performance of what you have already.
For string loading, the string length isn’t strictly necessary as long as it’s a null terminated string by comparing to 0 doing _mm_movemask_epi8 and then leading zero count.
I’m sure your hashing algorithm could be improved to include the string length, since there are only so many options, given a dotmask.
Lines 189 – 196 can be simplified by just checking no value is greater than 9, since any other value that has ‘0’ subtracted from it will be greater than 9 by either wrapping around or by having an ascii code greater than ‘9’.
Lines 175 through 182 and 198 through 209 can be simplified by re-arranging the weights to be in most-significant to least-significant order and leaving a gap for each octet like such: (100, 10, 0, 1). This obviates the restriction on leading zero digits. After this, use _mm256_maddubs_epi16 as before, but replace everything after with _mm256_hadd_epi16 and _mm_shuffle_epi8 to move the proper bytes to the least significant i32 for use with _mm_cvtsi128_si32.
I’m certainly no CS professor, so I could be way off on some of these things, but I wanted to take a crack at learning about SIMD and seeing if I could identify and areas for improvement.
Thanks!