Replace strings by views when you can

C++ programmers tend to represent strings using the std::string class. Though implementations vary, each std::string instance typically occupies 32 bytes for the object itself, before counting any heap-allocated character data. That is not a large amount of memory on its own, but it adds up.

In the Node.js runtime, as part of the build tools, there is a function that precomputes the string representation of every 16-bit integer, each followed by a comma.

std::vector<std::string> precompute_string() {
  size_t size = 1 << 16;
  std::vector<std::string> code_table(size);
  for (size_t i = 0; i < size; ++i) {
    code_table[i] = std::to_string(i) + ',';
  }
  return code_table;
}

Creating 65536 strings uses about 2 megabytes on most systems: 65536 instances times 32 bytes each.

What could we do instead?

We could create a single string buffer made of the string ‘0,1,2,…,65535,’, and record the offsets within the string so that we can locate quickly where a given value is. That is, given the integer 500, I want to get immediately the location of the substring ‘500,’. I need two offsets: the offset of the current value and the offset of the next value.

The single string requires only 382106 bytes, which is still substantial, but several times less than the 2 megabytes needed by the array of std::string instances.
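As a check on the 382106 figure, the byte count follows directly from how many integers have each digit length:

```cpp
#include <cstdint>

// Each d-digit number takes d + 1 bytes: its digits plus a comma.
constexpr uint32_t total_bytes() {
  return 10 * 2       // 0..9
       + 90 * 3       // 10..99
       + 900 * 4      // 100..999
       + 9000 * 5     // 1000..9999
       + 55536 * 6;   // 10000..65535
}
static_assert(total_bytes() == 382106, "matches the figure above");
```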

In C++, you might code it as follows. For simplicity, we can roll our own integer-to-string routine.

#include <array>
#include <cstdint>
#include <utility>

std::pair<std::array<char, 382106>,
  std::array<uint32_t, 65537>> precompute_string_fast() {
  std::array<char, 382106> str;
  std::array<uint32_t, 65537> off;
  off[0] = 0;
  char *p = &str[0];
  // Writes the decimal digits of 'value' followed by a comma at 's',
  // returning the number of bytes written.
  constexpr auto const_int_to_str =
        [](uint16_t value, char *s) -> uint32_t {
    uint32_t index = 0;
    do { // emit the digits, least significant first
      s[index++] = '0' + (value % 10);
      value /= 10;
    } while (value != 0);

    for (uint32_t i = 0; i < index / 2; ++i) { // reverse into reading order
      char temp = s[i];
      s[i] = s[index - i - 1];
      s[index - i - 1] = temp;
    }
    s[index] = ',';
    return index + 1;
  };
  for (int i = 0; i < 65536; ++i) {
    size_t length = const_int_to_str(i, p);
    p += length;
    off[i + 1] = off[i] + length;
  }
  return {str, off};
}

We do not actually need to store the offsets: we can compute them on the fly quite economically. However, that takes some effort, and unless you are starved for memory, it is likely better to store the offsets, as they require only 256 KB.
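One way to compute an offset on the fly (my formulation, not code from the post): the offset of integer i in the big buffer is a closed-form function of its digit count, since all numbers with the same digit count occupy consecutive, equally sized slots.

```cpp
#include <cstdint>

// Offset of the string for i inside the big buffer.
constexpr uint32_t offset_of(uint32_t i) {
  if (i < 10)    return 2 * i;               // "0," .. "9,"
  if (i < 100)   return 20 + 3 * (i - 10);
  if (i < 1000)  return 290 + 4 * (i - 100);
  if (i < 10000) return 3890 + 5 * (i - 1000);
  return 48890 + 6 * (i - 10000);
}
static_assert(offset_of(0) == 0);
static_assert(offset_of(500) == 1890);
static_assert(offset_of(65535) + 6 == 382106); // end of the buffer
```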

I wrote a benchmark to see how long it takes to precompute the data. On my M2 MacBook, I get the following numbers:

array of std::string: 0.57 ms
one big string + offsets: 0.26 ms

So constructing just one string might be twice as fast.
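For reference, a minimal timing harness along these lines (not the author's actual benchmark code) can be built with std::chrono:

```cpp
#include <chrono>

// Time a callable and return the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F &&f) {
  auto start = std::chrono::steady_clock::now();
  f();
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Usage would look like `double ms = time_ms([] { precompute_string(); });`, ideally repeated several times to reduce noise.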

What about query time? Retrieving a reference to an std::string instance is trivial in the case of the array of std::string instances. For the big-string case, we return a std::string_view instance computed from the stored offsets.

  auto GetCodeFast = [&fast_table, &offsets](uint16_t index) {
    return std::string_view(&fast_table[offsets[index]],
                            offsets[index + 1] - offsets[index]);
  };

Though it looks like a fair amount of work, I find that processing 65536 randomly generated values is significantly faster with the one-big-string approach.

array of std::string (query): 0.12 ms
one big string + offsets (query): 0.08 ms

There are other ways to solve this problem. For example, instead of storing or computing offsets, we can use a fixed number of bytes per element (say 6 or 8 bytes) and either store the number of digits or compute it on the fly. You can also try to use even less memory by not storing a string like ’66,’ multiple times. All these variations have different trade-offs, and which is best depends on your application.
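A sketch of the fixed-width variant (layout and names are mine: 6 bytes per entry, digit count computed on the fly) might look like this:

```cpp
#include <cstdint>
#include <cstdio>
#include <string_view>
#include <vector>

// Digit count of a 16-bit value (1..5), computed on the fly.
constexpr uint32_t digit_count(uint16_t v) {
  return v >= 10000 ? 5 : v >= 1000 ? 4 : v >= 100 ? 3 : v >= 10 ? 2 : 1;
}

// Fixed 6-byte slots: digits at the front, comma right after, the rest
// of the slot unused. The lookup position is then just index * 6.
std::vector<char> precompute_fixed() {
  std::vector<char> table(6 * 65536, ' ');
  for (uint32_t i = 0; i < 65536; ++i) {
    int len = std::snprintf(&table[6 * i], 6, "%u", i);
    table[6 * i + len] = ','; // overwrite the terminating NUL
  }
  return table;
}

std::string_view get_fixed(const std::vector<char> &t, uint16_t i) {
  return {&t[6 * i], digit_count(i) + 1};
}
```

This trades 384 KB of table (more than the 382106-byte string, and without needing offsets) for a trivially computed lookup position.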

Generating many non-trivial objects is a performance anti-pattern. Put all your data together when you can.

Daniel Lemire, "Replace strings by views when you can," in Daniel Lemire's blog, September 9, 2024, https://lemire.me/blog/2024/09/09/replace-stdstring-by-stdstring_view-when-you-can/.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

12 thoughts on “Replace strings by views when you can”

  1. Since we’re generating one string after the other, we could calculate the digits without division by ten: just use five nested loops 0..9, the outer loop generates the first (most significant) digit, the inner loop generates the last digit.

    This would generate the numbers with leading zeros. We could avoid that, but maybe the conditions we’d have to add to the loops would be a bit unwieldy (and possibly slow). If that’s the case, we could instead change the table of offsets to a table of counts of leading zeros, and use it to find the starting position of a number in the string buffer: std::string_view(&fast_table[index * 6 + leading_zeros[index]], 6 - leading_zeros[index])

    Since there are at most 4 leading zeros, leading_zeros could be an uint8_t array, which would only need 64 K instead of 256 K.

    Or we could remove the offsets or leading_zeros array altogether and instead calculate the count of leading zeros on the fly, e.g. similarly to https://lemire.me/blog/2021/06/03/computing-the-number-of-digits-of-an-integer-even-faster/

    Another possible optimization: We don’t really need to store the strings for numbers less than 10,000 – we can reuse the larger numbers. For example, if someone needs the string for 456, we can look up the string “10456” and return a string_view of the “456” part. This would reduce the total string length from 382106 to 55536*6 = 333216 bytes, about 13% less. (But in most cases, that’s probably not worth it, because the necessary index and offset calculations would make GetCodeFast() slower.)
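    That reuse idea might look like the following hypothetical sketch (build_blocks, get_code, and the fixed 6-byte block layout are my assumptions, not the commenter's code):

```cpp
#include <cstdint>
#include <cstdio>
#include <string_view>
#include <vector>

// Digit count of a 16-bit value (1..5).
constexpr uint32_t digits(uint16_t v) {
  return v >= 10000 ? 5 : v >= 1000 ? 4 : v >= 100 ? 3 : v >= 10 ? 2 : 1;
}

// One 6-byte block ("ddddd,") per value in 10000..65535.
std::vector<char> build_blocks() {
  std::vector<char> b(6 * 55536);
  for (uint32_t v = 10000; v < 65536; ++v) {
    char *p = &b[6 * (v - 10000)];
    std::snprintf(p, 6, "%u", v); // five digits; NUL lands in p[5]
    p[5] = ',';                   // overwrite the NUL with the comma
  }
  return b;
}

// For n < 10000, the block for 10000 + n ends with the digits of n.
std::string_view get_code(const std::vector<char> &blocks, uint16_t n) {
  uint32_t block = (n >= 10000 ? n : 10000u + n) - 10000u;
  uint32_t d = digits(n);
  return {&blocks[6 * block + (5 - d)], d + 1}; // digits plus comma
}
```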

  2. Why store the comma as opposed to deriving the length from log10(index)? You really only need to store the 6-digit numbers, and you could use an offset to get the rest.

  3. Nice – thoughtful and clever as always. Thanks Daniel!

    In my own work there are even larger gains to be had encoding floats. We commonly deal with floats in the range [0,1], often generating large files of these.

    Decide on a resolution, for our work 4 digits is often plenty, and precompute all possible fractions (10k), and when called upon to write a fraction, just compute the index and write the string. Just like above.

    While I have never tried it, I have always wondered how it would work if the array were not initialized, and when a request for a value was encountered, it would only then be computed. If all values tended to be used, it would clearly be worse, so hard to know in general.

    The other thing we do with integers is to store only the commonly used values (which happens a lot for us – molecules). Then if a value outside the cache is requested, it is done via normal integer conversion to string.

    I would share code, but it was written a LONG time ago, and contains many non-standard things by today’s thinking. But it is simple.
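    The scheme described here might be sketched as follows (my reconstruction under the stated assumptions: 4 digits of resolution, inputs in [0, 1]; the names are mine, not the commenter's):

```cpp
#include <cstdio>
#include <string>
#include <string_view>
#include <vector>

// Precompute the strings "0.0000" .. "1.0000" once.
std::vector<std::string> build_fraction_table() {
  std::vector<std::string> table(10001);
  for (int i = 0; i <= 10000; ++i) {
    char buf[8];
    std::snprintf(buf, sizeof buf, "%.4f", i / 10000.0);
    table[i] = buf;
  }
  return table;
}

// Map x in [0, 1] to its 4-digit string by rounding to a table index.
std::string_view fraction_to_string(const std::vector<std::string> &t,
                                    double x) {
  return t[static_cast<int>(x * 10000.0 + 0.5)];
}
```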

  4. I wish the C++ standard library (and C standard library, and OS API) had more support for string views. I tend to prefer std::string_view, however they have to be converted to null terminated strings at API boundaries most of the time.

  5. If you go for fixed-size blocks, here is an even faster way to do it:

    std::array<char, 6 * 65536>
    precompute_string_even_faster() {
      std::array<char, 6 * 65536> str = { '0', '0', '0', '0', '0', ',' };
      for (size_t digit = 0, done = 1; digit < 5; ++digit, done *= 10) {
        char *dest = &str[6 * done], *next_dest;
        for (size_t i = 1; i < 10; ++i) {
          char current_char = '0' + i;
          size_t todo = std::min(done, 65536 - done * i);
          next_dest = std::copy_n(str.data(), 6 * todo, dest);
          for (size_t j = 0; j < todo; ++j) {
            dest[6 * j + 6 - 2 - digit] = current_char;
          }
          dest = next_dest;
          if (dest == str.data() + str.size()) {
            return str;
          }
        }
      }
      __builtin_unreachable();
    }

  6. There is no need for a 256KiB lookup table. We can simply pretend that all items have the same length and then use a correction factor for those items that are shorter than the max:

    // include the ','
    constexpr auto length(uint16_t number) {
      return number > 9999 ? 5 + 1
           : number > 999  ? 4 + 1
           : number > 99   ? 3 + 1
           : number > 9    ? 2 + 1 : 1 + 1;
    }
    int correction_factor[5] { 0, -9*3 + 9*2, -99*4 + 99*3 - 9*3 + 9*2, etc. };
    // reduces to a lookup table

    constexpr auto PosAndLen(uint16_t number) {
      const auto len = length(number);
      const auto index = (number * len) + correction_factor[length(number)];
      return {index, length};
    }

    1. Oops, meant to write:

      constexpr auto PosAndLen(uint16_t number) {
        const auto len = length(number);
        const auto index = (number * len) + correction_factor[len];
        return {index, len};
      }

      The idea is to prevent cache eviction and its associated costs.
      I’m sure this trick has been (re)invented many times, except that I’ve not seen it mentioned and don’t know the term for it.
