Move or copy your strings? Possible performance impacts

You sometimes want to add a string to an existing data structure. For example, the C++17 template ‘std::optional’ may be used to represent a possible string value. You may copy the string into it, as this code does…

std::string mystring;
std::optional<std::string> myoption;
myoption = mystring;

Or you can move it:

std::string mystring;
std::optional<std::string> myoption;
myoption = std::move(mystring);

In C++, when ‘moving’ a value, the compiler does not need to create a whole new copy of the string: it can simply transfer the internal buffer to the new owner. So moving is often much cheaper.

I wrote a little benchmark to assess the performance difference. It is a single test, but it should be illustrative.

Firstly, for relatively long strings (a phrase or a sentence), the move is 5 times to 20 times faster.

                              copy            move
Apple LLVM 14, M2 processor   24 ns/string    1.2 ns/string
GCC 11, Intel Ice Lake        19 ns/string    4 ns/string

Secondly, for short strings (a single word), the move is 1.5 times to 3 times faster but the absolute difference is small (as small as a fraction of a nanosecond). Your main concern should be with long strings.

                              copy            move
Apple LLVM 14, M2 processor   2.0 ns/string   1.2 ns/string
GCC 11, Intel Ice Lake        7 ns/string     2.6 ns/string

My results illustrate that moving your sizeable data structures instead of copying them is beneficial.

But that’s not the fastest approach: the fastest approach is to just hold a pointer. Copying an address is unbeatably fast. A slightly less optimal approach is to use a lightweight object like an std::string_view: copying or creating an std::string_view is cheaper than doing the same with a C++ string.
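For illustration, here is a small sketch contrasting the options (my example, not from a benchmark). The pointer and the std::string_view do not own the data, so the owning string must outlive them:

#include <optional>
#include <string>
#include <string_view>

int main() {
  std::string mystring = "a sentence long enough to require a heap allocation";
  // move: the optional steals the internal buffer, no character is copied
  std::optional<std::string> owned = std::move(mystring);
  // copying an address is unbeatably fast
  const std::string *ptr = &*owned;
  // a string_view is just a pointer and a length: cheap to copy or create
  std::string_view view = *owned;
  (void)ptr; (void)view;
  return 0;
}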

International domain names: where does https://meßagefactory.ca lead you?

Originally, the domain part of a web address was all ASCII (so no accents, no emojis, no Chinese characters). This was extended a long time ago thanks to something called internationalized domain name (IDN).

Today, in theory, you can use any Unicode character you like as part of a domain name, including emojis. Whether that is wise is something else.

What does the standard say? Given a domain name, we should first identify its labels. Labels are normally separated by dots (.): www.microsoft.com has three labels. But you may also use other Unicode characters as separators (the ideographic full stop 。, the fullwidth full stop ．, and the halfwidth ideographic full stop ｡). Each label is further processed. If it is all ASCII, then it is left as is. Otherwise, we must convert it to an ASCII form called “punycode” after doing the following according to RFC 3454:

  • Map characters (section 3 of RFC 3454),
  • Normalize (section 4 of RFC 3454),
  • Reject forbidden characters,
  • Optionally reject based on unassigned code points (section 7).

And then you get to the punycode algorithm. There are further conditions to be satisfied, such as the requirement that the domain name, once in ASCII, cannot exceed 255 bytes.

That’s quite a lot of work. The goal is to transcribe each Unicode domain name into an ASCII domain name. You would hope that this is a well-defined algorithm: given a Unicode domain name, there should be a unique output.
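To make the first step concrete, here is a small sketch that merely splits a domain into labels and flags those that need punycode (my illustration: it only handles the ASCII dot, not the alternative separators, and none of the RFC 3454 steps):

#include <iostream>
#include <string>
#include <string_view>

int main() {
  std::string domain = "www.meßagefactory.ca";
  size_t start = 0;
  while (start <= domain.size()) {
    size_t dot = domain.find('.', start);
    if (dot == std::string::npos) { dot = domain.size(); }
    std::string_view label(domain.data() + start, dot - start);
    bool ascii = true;
    for (unsigned char c : label) {
      if (c >= 0x80) { ascii = false; } // non-ASCII byte: punycode needed
    }
    std::cout << label << (ascii ? " (left as is)" : " (needs punycode)") << "\n";
    start = dot + 1;
  }
  return 0;
}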

Let us choose a common non-ASCII character, the letter ß, called Eszett. Let me create a link with this character: https://meßagefactory.ca.

What happens if you click on this link? The result depends on your browser. If you are using Microsoft Edge, Google Chrome or the Brave browser, you may end up at https://messagefactory.ca/. If you are using Safari or Firefox, you may end up at https://xn--meagefactory-m9a.ca. Of course, your results may vary depending on your exact system. Under iOS (iPhone), I expect that the Safari behaviour will prevail irrespective of your browser.

Not what I expected.

Year 2022: Scientific progress

The year 2022 is over. As with every year that passes, we have made some scientific progress. I found the following achievements interesting:

Science and technology links (January 15 2023)

    1. For under $600, one can buy a 20-terabyte disk on Amazon. Unless you work professionally in multimedia, it is more storage than you need. However, having a lot of storage is, by itself, of little use if you cannot access it. Thankfully, you can buy a 1-terabyte “disk” for $200 that provides over 6 GB/s of bandwidth. I have a similar disk in my game console. Is this as good as it gets? Researchers show that we can transmit data over a distance at more than a petabit per second. According to some estimates, that is more than the total data size of the books in the Library of Congress, per second.
    2. Transplanting rejuvenated blood stem cells extends lifespan of aged immunocompromised mice.
    3. Amazon is using drones for deliveries in California and Texas.
    4. People who think of themselves as less attractive are more willing to wear surgical masks.
    5. Conversations rarely end when they should.
    6. Using legal restrictions, manufacturers are able to prevent their customers from repairing their own products. There may be hope. Farmers in the US will be allowed to repair John Deere tractors.
    7. For most of my life, nuclear power has been viewed with suspicion, and little effort has been made to exploit it further. The tide might finally be turning. The UK government plans to authorize small modular nuclear reactors, and other nuclear-power innovations.
    8. Cancer rates are falling rapidly in both men and women.
    9. There is evidence that vitamin D supplements might help your immune system if your levels are too low. However, you may also require magnesium supplements to benefit from the vitamin D.
    10. In the post-pandemic era, people work fewer hours.
    11. There is evidence that viruses dormant in our genomes can wake up when we are older and harm us.
    12. Research papers and patents are becoming less disruptive over time. However, we are producing many more research papers and patents.
    13. Genetically modified immune cells are used to fight cancer in a patient.
    14. Increasing house insulation may not lead to lower energy usage in the long run. This is not, per se, an argument against better insulation. Yet it suggests that we should plan for increased power production.
    15. In a paper published by Nature, Kleinherenbrink et al. find that global mean sea levels are likely rising according to a linear curve, as opposed to an accelerating curve. The current rate is estimated to be between 3 and 4 millimeters per year. Meanwhile, the most low-lying island nations on the planet are growing.
    16. Antarctica might be getting cooler.
    17. People prefer to stay away from promiscuous men.
    18. Cold temperatures harm the antiviral immunity of your nose. Thus cold weather may make you more prone to catching a virus.
    19. Replacing grades with pass/fail scores in courses leads students to make less effort. In my view, it does not imply that we should not adopt pass/fail scores, because there are diminishing returns to more effort. E.g., if the kids in a country spend much more time perfecting their knowledge of trigonometry, you may not end up with a more prosperous country. In some sense, intense competition may be a net loss.
    20. Using solar power generation in countries such as Switzerland results in a net energy loss: though the solar panels produce energy, they never recoup the energy investment needed to make them and deploy them.

Care is needed to use C++ std::optional with non-trivial objects

We often have to represent in software a value that might be missing. Different programming languages have abstractions for this purpose.

A recent version of C++ (C++17) introduced the std::optional template. It is kind of neat. You can write code that prints a string, or a warning if no string is available, as follows:

void f(std::optional<std::string> s) {
  std::cout << s.value_or("no string") << std::endl;
}

I expect std::optional to be relatively efficient when working with trivial objects (an integer, a std::string_view, etc.). However, if you want to avoid copying your objects, you should use std::optional with care.

Let us consider this example, where A is some non-trivial class that owns a string (I include a minimal definition for completeness):

struct A {
  std::string s;
  A() = default;
  A(const char *m) : s(m) {}
};

A f(std::optional<A> z) {
    return z.value_or(A());
}

A g() { 
    A a("message"); 
    auto z = std::optional<A>(a); 
    return f(z); 
}

How many instances of the string class are constructed when we call the function g()? It can be up to five instances:

  1. At the start of the function we construct one instance of A.
  2. We then copy-construct this instance when passing it to std::optional<A>.
  3. Passing std::optional<A> to the function f involves another copy.
  4. In the value_or construction, we have a default construction of A (which is wasteful work in this instance).
  5. Finally, we copy construct an instance of A when the call to value_or terminates.

It is possible for an optimizing compiler to elide some or all of the superfluous constructions, especially if you can inline the functions, so that the compiler can see the useless work and prune it. However, in general, you may encounter inefficient code.

I do not recommend against using std::optional. There are effective strategies to avoid the copies. Just apply some care if performance is a concern.
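For instance, here is a sketch of such strategies, reusing the class A defined above (my illustration, not code from a benchmark):

A g_fewer_copies() {
  A a("message");
  // move-construct the optional: the string buffer is transferred, not copied
  std::optional<A> z = std::move(a);
  // move out of the optional: again, no copy of the string
  A result = std::move(*z);
  return result; // eligible for copy elision (NRVO)
}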

Transcoding Unicode with AVX-512: AMD Zen 4 vs. Intel Ice Lake

Most systems today rely on Unicode strings. However, we have two popular Unicode formats: UTF-8 and UTF-16. We often need to convert from one format to the other. For example, you might have a database formatted with UTF-16, but you need to produce JSON documents using UTF-8. This conversion is often called ‘transcoding’.

In the last few years, we wrote a specialized library that processes Unicode strings, with a focus on performance: the simdutf library. The library is used by JavaScript runtimes (Node.js and Bun).

The simdutf library is able to benefit from the latest and most powerful instructions on your processors. In particular, it does well with recent processors with AVX-512 instructions (Intel Ice Lake, Rocket Lake, as well as AMD Zen 4).
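For illustration, transcoding with the simdutf API looks roughly like the following sketch (error handling reduced to a single check; consult the simdutf documentation for the details):

#include "simdutf.h"
#include <string>

std::u16string to_utf16(const std::string &utf8) {
  // how many UTF-16 code units will the output need?
  size_t expected = simdutf::utf16_length_from_utf8(utf8.data(), utf8.size());
  std::u16string out(expected, u'\0');
  // convert (with validation): returns 0 if the input is invalid UTF-8
  size_t written = simdutf::convert_utf8_to_utf16le(utf8.data(), utf8.size(), &out[0]);
  if (written == 0) { out.clear(); }
  return out;
}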

I do not yet have a Zen 4 processor, but Velu Erwan was kind enough to benchmark it for me. A reasonable task is to transcode an Arabic file from UTF-8 to UTF-16: it is typically a non-trivial task because Arabic UTF-8 text is a mix of one-byte and two-byte characters that we must convert to two-byte UTF-16 characters (with validation). The steps required (under Linux) are as follows:

git clone https://github.com/simdutf/simdutf && 
cd simdutf && 
cmake -B build && cmake --build build && 
wget --content-disposition https://cutt.ly/d2cIxRx && 
./build/benchmarks/benchmark -F Arabic-Lipsum.utf8.txt -P convert_utf8 

(Ideally, run the last command with privileged access to the performance counters.)

Like Intel, AMD has its own compiler. I did not have access to the Intel compiler for my tests, but Velu had access to the AMD compiler.

A sensible reference point is iconv, as provided by the runtime library. Note that the AMD processor runs at a much higher frequency than the Intel one (5.4 GHz vs. 3.4 GHz). We use GCC 12.

transcoder   Intel Ice Lake (GCC)   AMD Zen 4 (GCC)   AMD Zen 4 (AMD compiler)
iconv        0.70 GB/s              0.97 GB/s         0.98 GB/s
simdutf      7.8 GB/s               11 GB/s           12 GB/s

At a glance, the Zen 4 processor is slightly less efficient on a per-cycle basis when running the simdutf AVX-512 code (2.8 instructions/cycle for AMD versus 3.1 instructions/cycle for Intel) but keep in mind that we did not have access to a Zen 4 processor when tuning our code. The efficiency difference is small enough that we can consider the processors roughly on par pending further investigations.

The big difference is that the AMD Zen 4 runs at a much higher frequency. If I rely on Wikipedia, I do not think that there is an Ice Lake processor that can match 5.4 GHz. However, some Rocket Lake processors come close.

In our benchmarks, we track the CPU frequency, and we get the same measured frequency when running AVX-512 code as when running conventional code (iconv). Thus AVX-512 can be really advantageous.

These results suggest that the AMD Zen 4 matches the Intel Ice Lake in AVX-512 performance. Given that the Zen 4 microarchitecture is AMD’s first attempt at supporting AVX-512 commercially, it is a remarkable feat.

Further reading: AMD Zen 4 performance while parsing JSON (phoronix.com).

Note: Raw AMD results are available: GCC 12 and AOCC.

Credit: Velu Erwan got the processor from AMD France. The exact specification is AMD 7950X, 2x16GB DDR5 4800MT reconfig as 5600MT. The UTF-8 to UTF-16 transcoder is largely based on the work of Robert Clausecker.

Emojis in domain names, punycode and performance

Most domain names are encoded using ASCII (e.g., yahoo.com). However, you can register domain names with almost any character in them. For example, there is a web site at 💩.la called poopla. Yet the underlying infrastructure is basically pure ASCII. To make it work, the text of your domain is first translated into ASCII using a special encoding called ‘punycode’. The poopla web site is actually at https://xn--ls8h.la/.

Punycode is a tricky format. Thankfully, domain names are made of labels (e.g., in microsoft.com, microsoft is a label) and each label can use at most 63 bytes. In total, a domain name cannot exceed 255 bytes, and that is after encoding it to punycode if necessary.

Some time ago, Colm MacCárthaigh suggested that I look at the performance impact of punycode. To answer the question, I quickly implemented a function that does the job. It is a single function without much fanfare. It is roughly derived from the code in the standard, but it looks simpler to me. Importantly, I do not claim that my implementation is particularly fast.
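My exact implementation is not reproduced here, but to give an idea of the work involved, here is a minimal sketch of the RFC 3492 encoding algorithm (my illustration: it omits the overflow checks mandated by the RFC and does not add the “xn--” prefix):

#include <cstdint>
#include <string>
#include <vector>

// digit values 0..25 map to 'a'..'z', 26..35 map to '0'..'9'
static char encode_digit(uint32_t d) {
  return d < 26 ? char('a' + d) : char('0' + d - 26);
}

// bias adaptation (RFC 3492, section 6.1)
static uint32_t adapt(uint32_t delta, uint32_t numpoints, bool firsttime) {
  delta = firsttime ? delta / 700 : delta / 2;
  delta += delta / numpoints;
  uint32_t k = 0;
  while (delta > 455) { delta /= 35; k += 36; }
  return k + (36 * delta) / (delta + 38);
}

// encode a sequence of Unicode code points, e.g.,
// {'b', 0xFC /* ü */, 'c', 'h', 'e', 'r'} becomes "bcher-kva"
std::string punycode_encode(const std::vector<uint32_t> &input) {
  std::string output;
  for (uint32_t c : input) { // copy the basic (ASCII) code points
    if (c < 128) { output += char(c); }
  }
  uint32_t b = uint32_t(output.size()), h = b;
  if (b > 0) { output += '-'; }
  uint32_t n = 128, delta = 0, bias = 72;
  while (h < input.size()) {
    uint32_t m = UINT32_MAX; // smallest code point not yet handled
    for (uint32_t c : input) {
      if (c >= n && c < m) { m = c; }
    }
    delta += (m - n) * (h + 1);
    n = m;
    for (uint32_t c : input) {
      if (c < n) { delta++; }
      if (c == n) {
        uint32_t q = delta; // encode delta as a variable-length integer
        for (uint32_t k = 36;; k += 36) {
          uint32_t t = k <= bias ? 1 : (k >= bias + 26 ? 26 : k - bias);
          if (q < t) { break; }
          output += encode_digit(t + (q - t) % (36 - t));
          q = (q - t) / (36 - t);
        }
        output += encode_digit(q);
        bias = adapt(delta, h + 1, h == b);
        delta = 0;
        h++;
      }
    }
    delta++;
    n++;
  }
  return output;
}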

As a dataset, I use all of the examples from the wikipedia entry of punycode.

GCC 11, Intel Ice Lake 75 ns/string
Apple M2, LLVM 14 27 ns/string

The punycode format is relatively complicated and it was not designed for high speed, so there are limits to how fast one can go. Nevertheless, it should be possible to go faster. Yet it is probably ‘fast enough’: it is unlikely that punycode encoding would ever be a performance bottleneck, if only because domain names are almost always pure ASCII, in which case punycode is unneeded.


Quickly checking that a string belongs to a small set

Suppose that I give you a set of reference strings (“ftp”, “file”, “http”, “https”, “ws”, “wss”). Given a new string, you want to quickly tell whether it is part of this set.

You might use a regular expression, but it is unlikely to be fast in general:

const std::regex txt_regex("https?|ftp|file|wss?");
// later... (v holds the candidate string)
bool match = std::regex_match(v.begin(), v.end(), txt_regex);

A sensible solution might be to create a set and then ask whether the string is in the set. In C++, the default set type is std::unordered_set, so your code might look as follows:

static const std::unordered_set<std::string_view> special_set = {
    "ftp", "file", "http", "https", "ws", "wss"};

bool hash_is_special(std::string_view input) {
  return special_set.find(input) != special_set.end();
}

You can also use a tool like gperf which constructs a perfect-hash function for you.

You might also be more direct about it, and just do several comparisons:

bool direct_is_special(std::string_view input) {
  return (input == "https") || (input == "http") || (input == "ftp") ||
         (input == "file") || (input == "ws") || (input == "wss");
}

I call this a ‘branchy’ version because of the ‘||’ operator, which suggests to the compiler that you want to evaluate the comparisons one by one, exiting once one is true. If you replace the ‘||’ operator with the ‘|’ operator, then the result is ‘branchless’ in the sense that you entice the compiler to evaluate all comparisons.
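The branchless variant is the same function with the operator swapped (shown here for completeness):

bool branchless_is_special(std::string_view input) {
  return (input == "https") | (input == "http") | (input == "ftp") |
         (input == "file") | (input == "ws") | (input == "wss");
}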

If you look at how the code gets compiled, you may notice that the compiler is forced to do comparisons and jumps, because it is not allowed to read in the provided string beyond its reported size.

You might be able to do slightly better if you can tell your compiler that the string you receive is ‘padded’ so that you can read eight bytes safely from it. I could not find a very elegant way to do it, but the following code works:

static inline uint64_t string_to_uint64(std::string_view view) {
  uint64_t val;
  std::memcpy(&val, view.data(), sizeof(uint64_t));
  return val;
}

uint32_t string_to_uint32(const char *data) {
  uint32_t val;
  std::memcpy(&val, data, sizeof(uint32_t));
  return val;
}


bool fast_is_special(std::string_view input) {
  uint64_t inputu = string_to_uint64(input);
  if ((inputu & 0xffffffffff) == string_to_uint64("https\0\0\0")) {
    return input.size() == 5;
  }
  if ((inputu & 0xffffffff) == string_to_uint64("http\0\0\0\0")) {
    return input.size() == 4;
  }
  if (uint32_t(inputu) == string_to_uint32("file")) {
    return input.size() == 4;
  }
  if ((inputu & 0xffffff) == string_to_uint32("ftp\0")) {
    return input.size() == 3;
  }
  if ((inputu & 0xffffff) == string_to_uint32("wss\0")) {
    return input.size() == 3;
  }
  if ((inputu & 0xffff) == string_to_uint32("ws\0\0")) {
    return input.size() == 2;
  }
  return false;
}

Though I did not do it, you can extend the comparison so that it is case-insensitive: simply AND the input with the bytes 0xdf instead of the bytes 0xff.
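For example, the first test in fast_is_special might become the following (my sketch: masking with 0xdf clears the ASCII case bit, so ‘h’ (0x68) and ‘H’ (0x48) both become 0x48):

  if ((inputu & 0xdfdfdfdfdf) == string_to_uint64("HTTPS\0\0\0")) {
    return input.size() == 5;
  }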

You can use a faster approach if you can assume that the input string has been padded with zeros:

  uint64_t inputu = string_to_uint64(input);
  uint64_t https = string_to_uint64("https\0\0\0");
  uint64_t http = string_to_uint64("http\0\0\0\0");
  uint64_t file = string_to_uint64("file\0\0\0\0");
  uint64_t ftp = string_to_uint64("ftp\0\0\0\0\0");
  uint64_t wss = string_to_uint64("wss\0\0\0\0\0");
  uint64_t ws = string_to_uint64("ws\0\0\0\0\0\0");
  if((inputu == https) | (inputu == http)) {
    return true;
  }
  return ((inputu == file) | (inputu == ftp) 
          | (inputu == wss) | (inputu == ws));

Observe how I have selected what I believe are the two most common cases (among URL protocols).

Finally, we can use a hash function to solve the problem with a single comparison:

static const uint8_t shiftxor_table[128] = {
    'w', 's', 0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   'f', 'i', 'l', 'e', 0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   'f', 't', 'p', 0,   0,   0,   0,   0,
    'w', 's', 's', 0,   0,   0,   0,   0,   'h',
    't', 't', 'p', 0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   'h', 't', 't',
    'p', 's', 0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   0};

bool shiftxor_is_special(std::string_view input) {
  uint64_t inputu = string_to_uint64(input);
  // hash the first eight bytes down to a table index (a multiple of 8)
  size_t index = ((inputu >> 28) ^ (inputu >> 14)) & 0x78;
  uint64_t expected;
  std::memcpy(&expected, shiftxor_table + index, sizeof(uint64_t));
  return expected == inputu;
}

If you do not want to assume that the strings are padded (i.e., no cheating), then you can do it the “gperf way” (it is my own code, but it is based on gperf):

std::string_view table_hashnocheat_is_special[] = {"http", "", "https", 
  "ws", "ftp", "wss", "file", ""};

bool hashnocheat_is_special(std::string_view input) {
  if(input.empty()) { return false; }
  int hash_value = (2*input.size() + (unsigned)(input[0])) & 7;
  const std::string_view target = table_hashnocheat_is_special[hash_value];
  return (target[0] == input[0]) && (target.substr(1) == input.substr(1));
}

I am sure that there are faster and more clever alternatives!

In any case, how fast are my alternatives? Using GCC 11 on an Intel Ice Lake server, I get the following results:

regex                            360 ns/string
std::unordered_set               19 ns/string
direct                           16 ns/string
direct (branchy)                 13 ns/string
direct (branchless)              18 ns/string
gperf                            7.1 ns/string
hashnocheat_is_special           7.0 ns/string
fast                             2.6 ns/string
faster                           1.9 ns/string
hashing (shiftxor_is_special)    1.1 ns/string

On an Apple M2 with LLVM 12, I get similar (but better) results:

regex                            450 ns/string
std::unordered_set               15 ns/string
direct (branchy)                 7.5 ns/string
direct (branchless)              4.8 ns/string
gperf                            8.0 ns/string
hashnocheat_is_special           7.2 ns/string
fast                             1.1 ns/string
faster                           0.8 ns/string
hashing (shiftxor_is_special)    0.4 ns/string

Care is needed when optimizing such small functions: whether and how the function gets inlined can be critical to performance. The results will also depend on the data source and on the compiler.

The hashing method (e.g., shiftxor_is_special) has the benefit of being essentially branch-free, which makes its performance largely independent of the distribution of the input. It is also the fastest in these tests.

If you cannot safely pad your strings, I recommend the hashnocheat_is_special function. Having to deal with variable-length strings is significantly slower, but it is sometimes necessary.

My source code is available.

Further reading: I was later informed that my friend Wojciech Muła had looked at the problem earlier in 2022. He did not look at string padding, but for the version with no-cheating (variable-length strings), he comes up with the same conclusion that I do.

Science and Technology links (December 25 2022)

Fast base16 encoding

Given binary data, we often need to encode it as ASCII text. Email and much of the web effectively works in this manner.

A popular format for this purpose is base64. With Muła, we showed that we could achieve excellent speed using vector instructions on commodity processors (2018, 2020). However, base64 is a bit tricky.

A much simpler format is just base16. E.g., you just transcribe each byte into two bytes representing the value in hexadecimal notation. Thus the byte value 1 becomes the two bytes ’01’. The byte value 255 becomes ‘FF’, and so forth. In other words, you use one byte (or one character) per ‘nibble’: a byte is made of two nibbles: the most-significant 4 bits and the least-significant 4 bits.

How could we encode base16 quickly? A reasonable approach might be to use a table. You grab one byte from the input and you directly look up the two corresponding output bytes, which you immediately write out:

void encode_scalar(const uint8_t *source, size_t len, char *target) {
  // each 16-bit entry holds the two ASCII characters for one byte value,
  // stored in little-endian order: table[1] = 0x3130 is "01", and so forth
  const uint16_t table[] = {
      0x3030, 0x3130, 0x3230, 0x3330, 0x3430, ...
      0x6366, 0x6466, 0x6566, 0x6666};
  for (size_t i = 0; i < len; i++) {
    uint16_t code = table[source[i]];
    ::memcpy(target, &code, 2);
    target += 2;
  }
}

It requires a 512-byte table but that is not concerning.

Could we do better?

Milosz Krajewski wrote some good-looking C# code using vector instructions. I wrote something that should be the equivalent using x64 C++. We have both routines for 128-bit and 256-bit vectors. My code is for demonstration purposes but it is essentially correct.

The core idea is not complicated. You must grab a vector of bytes. Then you must somehow expand it out: each nibble must go into its own byte. And then the magic is this: we use the fast vectorized lookup (e.g., the pshufb instruction) to look up each nibble in a 16-byte table containing the characters ‘0’ to ‘9’ and ‘a’ to ‘f’.

Here is the 128-bit code using Intel intrinsics:

  // 16-byte lookup table: entry i is the hexadecimal character for nibble i
  __m128i shuf = _mm_set_epi8('f', 'e', 'd', 'c', 'b', 'a', '9', '8', '7', '6',
                              '5', '4', '3', '2', '1', '0');
  size_t i = 0;
  __m128i maskf = _mm_set1_epi8(0xf);
  for (; i + 16 <= len; i += 16) {
    __m128i input = _mm_loadu_si128((const __m128i *)(source + i));
    // split each byte into its low nibble and its high nibble
    __m128i inputbase = _mm_and_si128(maskf, input);
    __m128i inputs4 =
        _mm_and_si128(maskf, _mm_srli_epi16(input, 4));
    // interleave so each byte expands to (high nibble, low nibble), then look up
    __m128i firstpart = _mm_unpacklo_epi8(inputs4, inputbase);
    __m128i output1 = _mm_shuffle_epi8(shuf, firstpart);
    __m128i secondpart = _mm_unpackhi_epi8(inputs4, inputbase);
    __m128i output2 = _mm_shuffle_epi8(shuf, secondpart);
    // 16 input bytes produce 32 output characters
    _mm_storeu_si128((__m128i *)(target), output1);
    target += 16;
    _mm_storeu_si128((__m128i *)(target), output2);
    target += 16;
  }

The 256-bit code is roughly the same, with one extra instruction to shuffle the bytes to compensate for the fact that 256-bit instructions work ‘per lane’ (in units of 128-bit words). My source code is available.

We might therefore expect the 256-bit version to be roughly twice as fast. My results on an Ice Lake processor with GCC 11:

table lookup       0.9 GB/s
128-bit vectors    6.4 GB/s
256-bit vectors    11 GB/s

We are not quite twice as fast, but close enough. I do not find these speeds very satisfying: I expect that less naive code could be faster.

Milosz gets much poorer results in C#: the 256-bit code is barely faster than the 128-bit code, but he does some relatively complicated computation in the 256-bit code instead of just calling the 256-bit shuffle instruction (vpshufb). (I suspect that he will soon fix this issue if he can.)

Our code would work on ARM as well after minor changes. For AVX-512 or SVE, we may need different approaches. We could add both encoding and decoding.

The size of things in bytes

storing 1 GiB/month on the cloud                        US$0.02
web site of my twitter profile (@lemire), HTML alone    296 KiB
web site of my twitter profile (@lemire), all data      3.9 MiB
Google result for ‘Canada’, HTML alone                  848 KiB
Google result for ‘Canada’, all data                    3.7 MiB
Node.js runtime                                         164 MiB
Java (19) runtime                                       330 MiB
LLVM/clang compiler+runtime                             5.5 GiB
Boost (C++) library (source)                            609 MiB
Go runtime                                              471 MiB

Implementing ‘strlen’ using SVE

In C, the length of a string is marked by a 0 byte at the end of the string. Thus, to determine the length of a string, one must scan it, looking for the 0 byte. Recent ARM processors have a powerful instruction set (SVE) that is well suited for such problems. It allows you to load large registers at once and to do wide comparisons (comparing many bytes at once).

Yet we do not want to read too much data. If you read beyond the string, you could hit another memory page and trigger a segmentation fault. This could crash your program.

Thankfully, SVE comes with a load instruction that would only fault on the ‘first active element’: as long as the first element you are loading is valid, then there is no fault.

With this in mind, a simple algorithm to compute the length of a C string is as follows:

  1. Load a register.
  2. Compare each byte in it to 0.
  3. If any comparison matches, then locate the match and return the corresponding length.
  4. If not, increment by the register size (given by svcntb()), and repeat.

Using intrinsics, the code looks as follows…

size_t sve_strlen(const char *s) {
  size_t len = 0;
  while (true) {
    // first-faulting load: no fault as long as the first byte is valid
    svuint8_t input = svldff1_u8(svptrue_b8(), (const uint8_t *)s + len);
    svbool_t matches = svcmpeq_n_u8(svptrue_b8(), input, 0);
    if (svptest_any(svptrue_b8(), matches)) {
      // find the index of the first zero byte and add it to the length
      return len + svlastb_u8(svbrka_z(matches, matches), svindex_u8(0, 1));
    }
    len += svcntb(); // advance by the register size
  }
}

In assembly, the code looks as follows…

.mainloop:
        ldff1b  { z0.b }, p0/z, [x10, x8]
        add     x8, x8, x9
        cmpeq   p1.b, p0/z, z0.b, #0
        b.eq    .mainloop

I do not discuss the details of my function, but it assumes implicitly that the underlying register size is no larger than 256 bytes which is guaranteed by the specification.

Benchmarking against your system’s strlen function is methodologically difficult. Nevertheless, we can measure the number of instructions retired for sizeable strings as an indication of relative speed. Using GCC 11 on a Graviton 3 system (Amazon), I get the following metrics:

system’s strlen    0.23 instructions per byte
SVE                0.15 instructions per byte

Though I do not advocate adopting SVE as a replacement for strlen at this time, the potential is interesting, considering that I threw together my implementation in minutes.

My source code is available.

Credit: Thanks to Denis Yaroshevskiy for inviting me to look at non-faulting loads.

Update: It turns out that strlen was one of the examples that ARM used in its slides presenting SVE. At a glance, their implementation looks like mine but with more sophistication.

Checking for the absence of a string, naive AVX-512 edition

Suppose you would like to check that a string is not present in a large document. In C, you might do the following using the standard function strstr:

bool is_present = strstr(mydocument, needle) != NULL;

It is simple and likely very fast. The implementation is algorithmically sophisticated.

Can you do better?

Recent Intel and AMD processors have instructions that operate on 512-bit registers. So we can compare 64 bytes using a single instruction. The simplest algorithm to search for a string might look as follows…

  1. Load 64 bytes from our input document, compare them against 64 copies of the first character of the target string.
  2. If we find a match, load the second character of the target string, copy it 64 times within a register. Load 64 bytes from our input document, with an offset of one byte.
  3. Repeat as needed for the second, third, and so forth characters…
  4. Then advance in the input by 64 bytes and repeat.

Using Intel intrinsic functions, the algorithm looks as follows:

  
  // we assume 'in' points to a document of 'in_len' padded bytes, so that
  // reading 64 bytes at any offset we consider is safe
  for (size_t i = 0; i + 64 <= in_len; i += 64) {
    __m512i comparator = _mm512_set1_epi8(needle[0]);
    __m512i input = _mm512_loadu_si512(in + i);
    __mmask64 matches = _mm512_cmpeq_epi8_mask(comparator, input);
    for (size_t char_index = 1; matches && char_index < needle_len; 
         char_index++) {
      comparator = _mm512_set1_epi8(needle[char_index]);
      input = _mm512_loadu_si512(in + i + char_index);
      matches =
          _kand_mask64(matches, _mm512_cmpeq_epi8_mask(comparator, input));
    }
    if(matches) { return true; }
  }
  return false;

It is about the simplest algorithm I could think of.

We are now going to benchmark it against strstr. To make it interesting, I pick a string that is designed to be repeatedly ‘almost’ in the input document, except for its last character. It is a worst-case scenario.

I use GCC 11 on an Intel Ice Lake processor. The source code of my benchmark is available.

number of characters in the string   AVX-512 (naive)   strstr
2                                    33 GB/s           13 GB/s
5                                    22 GB/s           9 GB/s
10                                   13 GB/s           8 GB/s
14                                   10 GB/s           7 GB/s

Unsurprisingly, my naive AVX-512 approach scales poorly with the string length in this benchmark. However, the benchmark is somewhat pessimistic: I would expect better results in a more realistic use case.

It should be possible to do much better with some more sophistication. However, for short strings, we are already twice as fast as strstr which is encouraging.

What is the memory usage of a small array in C++?

In an earlier blog post, I reported that the memory usage of a small byte array in Java (e.g., an array containing 4 bytes) was about 24 bytes. In other words: allocating small blocks of memory has substantial overhead.

What happens in C++?

To find out, I can try to allocate one million 4-byte arrays and look at the total memory usage of the process. Of course, the memory usage of the process will include some overhead unrelated to the 4-byte arrays, but we expect that such overhead will be relatively small.
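A minimal sketch of such an experiment follows (my illustration, Linux-specific: I read VmRSS from /proc/self/status; the actual benchmark may differ):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
  const size_t count = 1000000;
  std::vector<char *> pointers(count);
  for (size_t i = 0; i < count; i++) {
    pointers[i] = new char[4]; // one million tiny allocations
  }
  // VmRSS reports the resident memory of the process in kB;
  // divide by the number of allocations to estimate the per-array cost
  std::ifstream status("/proc/self/status");
  for (std::string line; std::getline(status, line);) {
    if (line.rfind("VmRSS:", 0) == 0) { std::cout << line << std::endl; }
  }
  for (char *p : pointers) { delete[] p; }
  return 0;
}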

From my benchmark, I get the following results…

system                    memory usage (in bytes)
GCC 8, Linux x86          32 bytes
LLVM 14, Apple aarch64    16 bytes

The results will vary depending on the configuration of your system, on your optimization level, and so forth.

But the lesson is that allocating four bytes (new char[4] or malloc(4)) does not use four bytes of memory… it will generally use much more.

Science and Technology links (December 11 2022)

  1. As we focus on some types of unfortunate discrimination (race, gender), we may become blind to other types of discrimination. For example, we tend to discriminate against ugly people, short men, old people, and so on.
  2. Life may have emerged on Earth thanks to ‘aqueous microdroplets’.
  3. Naked mole rats are long-lived mammals. We believe that they experience negligible senescence, meaning that they do not lose fitness with age. One theory to explain their particular biology has to do with the fact that they live in a subterranean setting. Dammann et al. suggest that the fact that they are social creatures might also play a role.
  4. Low-carb diets may help obese individuals who face food addiction symptoms.
  5. Climate change may change how people name their children according to the journal Science.
  6. Offshore wind turbines endanger whales according to a Bloomberg report.
  7. Grip strength (how strongly you can hold on to something with your hands) is a good indication of your biological age.
  8. An mRNA technology might reprogram and rejuvenate your skin (according to a commercial press release).
  9. Many climate models predict that increased CO2 will impact clouds, and that this cloud effect would cause much of the predicted warming, since CO2 by itself is not very potent. New research published in Nature makes these models implausible. According to the authors, this makes it improbable that the climate is highly sensitive to CO2 levels.
  10. We have been told to avoid meat and dairy products (butter, cheese) because they contain saturated fats that might cause heart disease. Teicholz (2022) reports that the rigorous evidence on saturated fats, which showed that they do not cause heart disease, has long been suppressed. The research seems clear: saturated fats have no effect on heart attacks, strokes, cardiovascular mortality, or total mortality.
  11. As you grow older, your cells tend to get larger.
  12. Goklany (2020) estimates that over 60% of our food production is dependent on fossil-fuel technologies. Currently, 12% of the land (excluding Antarctica) is devoted to agriculture (cropland). Without fossil fuel, this percentage would need to move to 33%. The process of cropland expansion would harm biodiversity, increase food prices, and increase food insecurity among the poor.

Fast midpoint between two integers without overflow

Let us say that I ask you to find the number I am thinking about between -1000 and 1000, by repeatedly guessing a number. With each guess, I tell you whether your guess is correct, smaller or larger than my number. A binary search algorithm tries to find a value in an interval by repeatedly finding the midpoint, using smaller and smaller intervals. You might start with 0, then use either -500 or 500 and so forth.

Thus we sometimes need a fast algorithm to find the midpoint of an interval of integers. The following simple routine to find the midpoint is incorrect:

int f(int x, int y) {
  return (x + y)/2;
}

If the integers use a 32-bit two’s complement representation (like a typical int), we could pick 1 for x and 2147483647 for y: the addition overflows and, in practice, the result of the function could be a large negative value.

Efficient solutions are provided by Warren in Hacker’s Delight (section 2.5):

int f(int x, int y) { 
  return (x|y) - ((x^y)>>1); 
}

int f(int x, int y) { 
  return ((x^y)>>1) + (x&y); 
}

They provide respectively the smallest value no smaller than (x+y)/2 and the largest value no larger than (x+y)/2. The difference between the two values is (x ^ y) & 1 (credit: Harold Aptroot).

They follow from the following identities: x+y=(x^y)+2*(x&y) and x+y=2*(x|y)-(x^y).
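As a quick sanity check, here is a small test of both formulas (my example):

#include <cassert>
#include <climits>

int midpoint_up(int x, int y) { return (x | y) - ((x ^ y) >> 1); }
int midpoint_down(int x, int y) { return ((x ^ y) >> 1) + (x & y); }

int main() {
  assert(midpoint_down(0, 5) == 2);  // rounds down
  assert(midpoint_up(0, 5) == 3);    // rounds up
  // no overflow even at the extremes:
  assert(midpoint_down(1, INT_MAX) == 1073741824);
  assert(midpoint_up(1, INT_MAX) == 1073741824); // 1 + INT_MAX is even
  return 0;
}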

Update: Reader BartekF observes that C++20 added a dedicated function for this problem: std::midpoint.

Optimizing compilers reload vector constants needlessly

Modern processors have powerful vector instructions which allow you to load several values at once, and operate (in one instruction) on all these values. Similarly, they allow you to have vector constants. Thus if you wanted to add some integer (say 10001) to all integers in a large array, you might first load a constant with 8 times the value 10001, then you would load elements from your array, 8 elements by 8 elements, add the vector constant (thus do 8 additions at once), and then store the result. Everything else being equal, this might be 8 times faster.

An optimizing compiler might even do this optimization for you (a process called ‘auto-vectorization’). However, for more complex code, you might need to do it manually using “intrinsic” functions (e.g., _mm256_loadu_si256, _mm256_add_epi32, etc.).

Let us consider the simple case I describe, but where we process two arrays at once… using the same constant:

#include <x86intrin.h>
#include <stdint.h>
void process_avx2(const uint32_t *in1, const uint32_t *in2, size_t len) {
  // define the constant, 8 x 10001
  __m256i c = _mm256_set1_epi32(10001);
  const uint32_t *finalin1 = in1 + len;
  const uint32_t *finalin2 = in2 + len;
  for (; in1 + 8 <= finalin1; in1 += 8) {
    // load 8 integers into a 32-byte register
    __m256i x = _mm256_loadu_si256((__m256i *)in1);
    // add the 8 integers just loaded to the 8 constant integers
    x = _mm256_add_epi32(c, x);
    // store the 8 modified integers
    _mm256_storeu_si256((__m256i *)in1, x);
  }
  for (; in2 + 8 <= finalin2; in2 += 8) {
    // load 8 integers into a 32-byte register
    __m256i x = _mm256_loadu_si256((__m256i *)in2);
    // add the 8 integers just loaded to the 8 constant integers
    x = _mm256_add_epi32(c, x);
    // store the 8 modified integers
    _mm256_storeu_si256((__m256i *)in2, x);
  }
}

My expectation, until recently, was that optimizing compilers would keep the constant in a register and never load it twice. Why would they?

Yet you can check that GCC loads the constant twice. You will recognize the assembly sequence:

mov          eax, 10001 // load 10001 in a general register
vpbroadcastd ymm1, eax  // broadcast 10001 to all elements

In this instance, other compilers (like LLVM) do better. However, in other instances, both LLVM and GCC happily load constants more than once. Only the Intel compiler (ICC) seems to avoid this issue with some consistency.

The processor has more than enough vector registers, so it is not a register-allocation issue. Of course, there are instances where it is best to avoid creating the constant, but you can check that even when the compiler ought to know that the constant is always needed, it may still create it twice. AVX-512 introduced new mask types, and they suffer from this effect as well.

Does it matter? In most cases, this effect should have little performance impact. It is almost surely only a few instructions of overhead per function.

It would be interesting to be able to instruct the compiler not to reload the constants. You might think that the static keyword could help, but with LLVM, static vector variables may be protected by a lock, which probably makes your code even heavier.


How big are your SVE registers? (AWS Graviton)

Amazon has some neat ARM-based systems based on Amazon’s own chips (Graviton). You can access them through Amazon’s web services (AWS). These processors have advanced vector instructions able to process many values at once. These instructions are part of an instruction set called SVE, for Scalable Vector Extension. SVE has a trick: it hides its internal register size from you. Thus, to the question “how many values can it process at once?”, the answer is “it depends”.

Thankfully, you can still write a program to find out. The svcntb intrinsic tells you how many 8-bit integers fit in a full register. Thus the following C++ line should tell you the vector register size in bytes:

std::cout << "detected vector register size (SVE): " 
  << svcntb() << " bytes" << std::endl;
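For completeness, here is a full program built around that line (my sketch: it assumes a compiler with SVE support, e.g., gcc -march=armv8-a+sve):

#include <arm_sve.h>
#include <iostream>

int main() {
  std::cout << "detected vector register size (SVE): "
            << svcntb() << " bytes" << std::endl;
  return 0;
}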

And here is what I get currently on an AWS server:

$ ./svesize
detected vector register size (SVE): 32 bytes

It is hard to find ARM processors with such wide registers (32 bytes), and it is unclear whether future iterations will still have 32-byte registers.

My source code is available.

Generic number compression (zstd)

I have done a lot of work that involves compressing and uncompressing data. Most often, I work on data that has specific characteristics, e.g., sorted integers. In such cases, one can do much better than generic compression routines (e.g., zstd, gzip) both in compression ratios and performance.

But how well do these generic techniques do for random integers and floats?

  1. We generate 32-bit floats in the interval [0,1] and store them as double-precision (64-bit) floats. Roughly speaking, it should be possible to compress this data by a factor of two.
  2. We generate 64-bit integers in the range [-127,127]. We should be able to compress this data by a factor of eight (from eight bytes to one byte).

What are the results? I use zstd v1.5.2 (with default flags) and a couple of small programs.
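For instance, the integer test can be as simple as the following sketch (my illustration using the libzstd C API at its default level; my actual programs may differ):

#include <zstd.h>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
  std::mt19937 gen(1234);
  std::uniform_int_distribution<int64_t> dist(-127, 127);
  std::vector<int64_t> input(1000000);
  for (auto &x : input) { x = dist(gen); }
  size_t insize = input.size() * sizeof(int64_t);
  std::vector<char> output(ZSTD_compressBound(insize));
  // ZSTD_CLEVEL_DEFAULT (3) matches the default flags
  size_t outsize = ZSTD_compress(output.data(), output.size(),
                                 input.data(), insize, ZSTD_CLEVEL_DEFAULT);
  printf("compression ratio: %.1fx\n", double(insize) / double(outsize));
  return 0;
}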

source                                     compression ratio
32-bit floats as 64-bit floats             2x
64-bit integers in the range [-127,127]    5x

The compression ratio is pretty good for the floating-point test, nearly optimal. For the 64-bit integers, the results are less exciting but you are within a factor of two of the ideal compression ratio.

Update: As reported in the comments, you can get much better compression ratios if you request more aggressive compression, although it also takes much more time.

Science and Technology links (November 26 2022)

  1. In Molière’s famous play Tartuffe, the main character is outwardly pious but fundamentally deceitful. Are people who insist on broadcasting their high virtue better people, or are they more like Tartuffe? Dong et al. (2022) conclude that people who say that they have good values are not necessarily better people in practice, but they are more likely to be hypocrites. That is, Tartuffe is a realistic character. It is worth pointing out that Molière’s play was censored by the king.
  2. The thymus is a small organ which plays a critical role in your immune system by producing T cells. As you get older, your thymus nearly disappears. The net result is that by the time you are 60 years old, you have few available T cells left and your immune system cannot adapt to new diseases as well. Calum Chace reports on a small clinical trial that showed that we can rejuvenate the thymus inexpensively. Unfortunately, though the clinical trial was completed years ago, nobody seems to care about what is possibly a medical breakthrough in the fight against aging.
  3. When doing exercises to get stronger, it is the extension of the muscles that matters, not the contraction.
  4. Climate change might result in us having more rainbows.
  5. Social scientists are no better at predicting societal change than the rest of us.
  6. In female rats, scientists find that a gene therapy may protect against reproductive aging.
  7. High-speed wireless internet significantly increased teen girls’ severe mental health diagnoses – by 90%.
  8. We can reset the age of a single cell.
  9. Content warnings reduce aesthetic appreciation of visual art.
  10. Highly intelligent people are not more likely to have mental illness.
  11. Psychopaths only do better if others cannot retaliate. They also tend to cooperate less.
  12. Hundreds of thousands of bats are killed per year in countries with many wind turbines.