Sentinels can be faster

A standard trick in programming is to use “sentinel values”: special in-band values that encode metadata, such as where the data ends, without storing it separately.

The C language represents strings as sequences of characters terminated by the null character. The null character is a sentinel that indicates the end of the string, and hence its length. E.g., my name (“Lemire”) would be represented as the string “Lemire\0” in memory when using the C language, where I write the null character as ‘\0’. The length of my name (6 characters) is never stored explicitly.
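This is easy to check; a small C++ snippet (the behaviour is the same in C):

```cpp
#include <cstring>

// "Lemire" occupies 7 bytes in memory: the 6 letters plus the '\0' sentinel.
const char name[] = "Lemire";

// There is no stored length to consult: strlen scans until it finds '\0'.
inline std::size_t name_length() { return std::strlen(name); }
```

Here sizeof(name) is 7 (the sentinel is stored) while strlen(name) is 6 (the sentinel is not counted).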

If you are a long-time C programmer, you probably think that the null sentinel is an “obvious” concept but newcomers to programming often find it counterintuitive. It is a non-trivial idea.

The null sentinel is “space efficient” in C. It is almost surely the case that other programming languages have more memory overhead per string.

Yet I remember reading that null-terminated strings were a design flaw in the C language from the point of view of performance. Indeed, you frequently need to know the length of a string, and when you do, you need to scan the whole string to find the null sentinel and compute the length. Other ancient programming languages like Pascal had string values with length prefixes. Today, I expect most programming languages to have steered clear of null-terminated strings.
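For illustration, a length-prefixed string in the style of classic Pascal might be sketched like this in C++ (the PString name and the helper below are made up for the example):

```cpp
#include <cstdint>
#include <cstring>

// Length-prefixed string: the length is stored explicitly, so querying it
// is O(1) and embedded '\0' bytes are legal. The price is one extra byte
// of storage and, in classic Pascal, a 255-character cap.
struct PString {
  std::uint8_t length;
  char data[255];
};

inline PString make_pstring(const char *s) {
  PString p;
  p.length = static_cast<std::uint8_t>(std::strlen(s));
  std::memcpy(p.data, s, p.length);
  return p;
}
```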

But the use of sentinels can also lead to computationally efficient code, and you do not need null characters per se. Let me illustrate.

Suppose that you are given a string and that you want to skip all of the leading spaces. If you are only given the starting point of the string and either its length or its ending point, then you might be stuck with code like so:

const char * skip_leading_spaces(const char * start, const char * end) {
  while((start != end) && (is_space(*start))) {
    start++;
  }
  return start;
}

For each character, you have to check that you are still within the string, and you have to check whether it is a space. You end up doing two comparisons per character! The C++ programming language has a strong bias in favour of such constructions: the popular functional programming idioms are often built on top of similar code.
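For instance, the idiomatic C++ version of this loop can be sketched with std::find_if (the is_space predicate below is simplified to plain spaces, and the function name is made up); under the hood it does the same two comparisons per character:

```cpp
#include <algorithm>

inline bool is_space(char c) { return c == ' '; }  // simplified for the sketch

// std::find_if must compare the iterator against 'end' *and* test each
// character: the same two comparisons per character as the explicit loop.
inline const char *skip_spaces_stl(const char *start, const char *end) {
  return std::find_if(start, end, [](char c) { return !is_space(c); });
}
```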

Of course, you could get smarter. For example, you could branch on the string size to try to avoid one of the checks at each character. But if the string length is hard to predict, you may end up with small benefits and complicated code.

What if you know that the string is either null-terminated, or that it contains at least one non-space character that can serve as a de facto sentinel? Then you can use simpler code where you only have to load a new character and check whether it is a space:

const char * skip_leading_spaces(const char * start) {
  while(is_space(*start)) {
    start++;
  }
  return start;
}

You end up with half the number of comparisons!

Is it faster?

It can be a lot faster. I wrote a little benchmark where I generate random strings having a random number of leading spaces. The sentinel approach is 40% faster in my tests.
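A minimal sketch of such a benchmark (simplified: is_space only checks for plain spaces, and each generated string ends with a non-space character so the sentinel version is safe; this is not my exact benchmark code):

```cpp
#include <chrono>
#include <cstdio>
#include <random>
#include <string>
#include <vector>

static bool is_space(char c) { return c == ' '; }

// Range version: two comparisons per character.
static const char *skip_range(const char *start, const char *end) {
  while ((start != end) && is_space(*start)) { start++; }
  return start;
}

// Sentinel version: requires a non-space (or '\0') character to stop on.
static const char *skip_sentinel(const char *start) {
  while (is_space(*start)) { start++; }
  return start;
}

// Strings with a random number of leading spaces, each ending in 'x'
// so that the sentinel version always terminates.
static std::vector<std::string> make_data(std::size_t n) {
  std::mt19937 gen(1234);
  std::uniform_int_distribution<int> spaces(0, 32);
  std::vector<std::string> data;
  for (std::size_t i = 0; i < n; i++) {
    data.push_back(std::string(spaces(gen), ' ') + "x");
  }
  return data;
}

template <class F>
static double time_ns(const std::vector<std::string> &data, F f) {
  auto t0 = std::chrono::steady_clock::now();
  std::size_t total = 0;
  for (const auto &s : data) { total += static_cast<std::size_t>(f(s) - s.data()); }
  auto t1 = std::chrono::steady_clock::now();
  std::printf("skipped %zu characters\n", total);  // keep the work observable
  return std::chrono::duration<double, std::nano>(t1 - t0).count();
}
```

You would then compare `time_ns(data, [](const std::string &s) { return skip_range(s.data(), s.data() + s.size()); })` against `time_ns(data, [](const std::string &s) { return skip_sentinel(s.data()); })`.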

Of course, the results depend critically on your data and on the exact problem you are trying to solve. Yet I think it is worth keeping sentinels in mind when you need to write high performance code, and you are dealing with irregular data.

Daniel Lemire, "Sentinels can be faster," in Daniel Lemire's blog, September 3, 2020.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

22 thoughts on “Sentinels can be faster”

  1. NULL terminated strings are potentially also a security vulnerability if not handled properly. Other than making it easier to accidentally do buffer overflows, they’re subject to NULL byte injection attacks. They also can’t properly support binary data in general (since you’re going to have data with NULLs in them), so you need to have support for a length-specified variant anyway.

    Speed wise, they can be problematic, particularly if you’re looking to parallelise code (e.g. SIMD). It can also effectively turn character iteration into a pointer-chasing problem – the CPU only knows it can check the next character if the current character isn’t NULL (and a compiler can’t help you), though, for a lot of string handling stuff, it’s probably the same, unless you have fixed length fields, or known length processing, e.g. strcmp, or are willing to put effort in to work around it, e.g. SIMD.
    Of course, you can always trivially append a value at the end if it would benefit your code.

    1. I agree with most of this comment, but not the “can also effectively turn character iteration into a pointer-chasing problem”. It is true that it prevents easy vectorization because you can’t be sure if you can access byte n+1 until you’ve checked byte n – but it is materially different from pointer chasing since there is no data dependency there, only a well-predicted control dependency. That’s the difference between, say, 1 cycle per byte versus 5 cycles per byte.

      Vectorization is still possible, but you have to make some tough choices: e.g., aligning your reads and using the fact that you won’t cross a page to do a read that is technically out of bounds, but known to be safe. Library code like strlen() does exactly that – but it has to get whitelisted in tools like ASAN and Valgrind, which would otherwise complain. For libraries other than the standard library, this is less palatable since you’ll probably have to convince users to add the whitelist rules themselves.

  2. C newbie here, so apologies for my limited understanding.

    We pass const char * start to the skip_leading_spaces() function, and inside we increment the start pointer. But this is a const char *, so we shouldn’t be allowed to increment it, right?

      1. To clarify for Raj, const char* means the char can’t change, whereas char* const means the pointer can’t change.

        You can even have const char* const where both the pointer and the pointed to char are const.

        Yes, at first this is confusing, but it’ll become natural after some more months writing C or C++.
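        A minimal example showing the difference (the function name is made up):

```cpp
// 'const char *': the characters are const, the pointer itself is not.
inline const char *advance_once(const char *p) {
  p++;          // fine: we are changing the pointer, not the characters
  // *p = 'x';  // error: assignment of read-only location
  return p;
}

// 'char *const q = buffer;' would be the opposite: q++ is an error,
// but *q = 'x' is allowed.
```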

  3. This is a complicated tradeoff. Sentinels can be faster, but they can also be slower.

    In general I found that they are a great tool to micro-optimize parsers, but they make scanning using SIMD hard – unless the CPU provides instructions to load data without triggering fault handlers, or you’re willing to cheat by detecting page boundaries (which doesn’t work with address sanitizer / valgrind), null-terminated strings prevent you from scanning multiple characters at a time.

    E.g. scanning an integer out of a string can be done with a few SIMD ops (load 16 bytes, subtract ‘0’, range check vs 10, movemask, find first 0 bit) if you know the 16 byte load is safe, but that’s hard to establish without an explicit size.
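    As a sketch with SSE2 intrinsics (the function name is made up, __builtin_ctz assumes GCC/Clang, and the caller must guarantee that the 16-byte load is safe):

```cpp
#include <emmintrin.h>  // SSE2

// Count the leading decimal digits among the first 16 bytes at p.
// The caller must guarantee that reading 16 bytes at p is safe.
inline int leading_digits16(const char *p) {
  __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
  // Subtract '0': digits map to 0..9; anything else wraps outside that range.
  __m128i v = _mm_sub_epi8(chunk, _mm_set1_epi8('0'));
  // Saturating subtract 9: the result is zero exactly for bytes in 0..9.
  __m128i above9 = _mm_subs_epu8(v, _mm_set1_epi8(9));
  __m128i digit = _mm_cmpeq_epi8(above9, _mm_setzero_si128());
  int mask = _mm_movemask_epi8(digit);  // bit i set iff byte i is a digit
  int non_digit = ~mask & 0xFFFF;       // first 0 bit = first non-digit
  return non_digit ? __builtin_ctz(non_digit) : 16;
}
```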

  4. Noooo! Byte at a time is a sure way to kill performance.
    – read glibc str/mem routines for your favorite architecture.
    – understand them
    – profit

    I did some of the initial glibc vectorized versions for x86_64 back in 2010 while at AMD, and saw more than a 10x speedup for non-trivial strings. Bigger speedups for longer strings. Glibc developers have significantly improved upon those versions since then.

  5. Unless you are running these tests on a processor without branch prediction, I honestly don’t understand the 40% difference – I would have expected a very small (if any) difference. Actually, in your benchmarking code it looks like you generate the data once…, not two identical copies (in different memory addresses) of the data for the ‘regular’ and ‘sentinel’ cases, so I think that that is the source of the 40%…

    1. Well predicted comparisons are not free: at a minimum there are still instructions and uops to make the comparison and take the branch. It can get worse from there: the compiler might not know which is the likely branch and it can end up organizing code so that the fast path has an always-taken branch which slows things down even more, etc.

      A 40% improvement from removing one compare-and-branch from a loop that is extremely small, basically consisting of a single load and another compare-and-branch is not surprising.

  6. I compiled your code (and changed the order in which the tests are exercised), ran it many times, and I could not get more than a 5% difference between the two with -O2 and -O3. So, I was wrong in that the difference would be small – 5% is significant to me, but for the 40% there must be some reason in how the code is generated or executed (which I don’t have the time to analyze further).

    1. Here are my numbers on an AMD Rome processor:

      [dlemire@rome 09]$ c++ --version
      c++ (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
      Copyright (C) 2018 Free Software Foundation, Inc.
      This is free software; see the source for copying conditions.  There is NO
      [dlemire@rome 09]$ make
      c++ -O3 -std=c++11 -o sentinel sentinel.cpp
      [dlemire@rome 09]$ ./sentinel
       N = 10000
      range 1313.74
      sentinel 892.865
      ratio 1.47138

      Here are the numbers on a 2008 imac:

      $ c++ --version
      Apple clang version 11.0.3 (clang-1103.0.32.62)
      Target: x86_64-apple-darwin19.6.0
      Thread model: posix
      InstalledDir: /Applications/
      lemire@iMac-2018 ~/CVS/github/Code-used-on-Daniel-Lemire-s-blog/2020/09 $ make
      c++ -O3 -std=c++11 -o sentinel sentinel.cpp
      lemire@iMac-2018 ~/CVS/github/Code-used-on-Daniel-Lemire-s-blog/2020/09 $ ./sentinel
       N = 10000
      range 1040.48
      sentinel 725.152
      ratio 1.43484

      Here are my numbers on a Linux-based skylake (Intel) box:

      $ c++ --version
      c++ (Ubuntu 8.4.0-1ubuntu1~16.04.1) 8.4.0
      Copyright (C) 2018 Free Software Foundation, Inc.
      This is free software; see the source for copying conditions.  There is NO
      dlemire@skylake:~/CVS/github/Code-used-on-Daniel-Lemire-s-blog/2020/09$ make
      c++ -O3 -std=c++11 -o sentinel sentinel.cpp
      dlemire@skylake:~/CVS/github/Code-used-on-Daniel-Lemire-s-blog/2020/09$ ./sentinel
       N = 10000
      range 1194.83
      sentinel 871.197
      ratio 1.37148
      1. I tried your latest version in my computer (which is “old”, an Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz, c++/gcc 9.3.0) and got:

        $ make
        c++ -O3 -std=c++11 -o sentinel sentinel.cpp
        $ ./sentinel
        N = 10000
        range 1411.8
        sentinel 1353.81
        ratio 1.04284

        I see that you get the ~40% and I am sure that that outcome is really happening.

        However, it is very difficult to explain the reason why (such a large percentage) in a general way. Both the range condition and is_space() require the new value of start (which just takes an addition), whereas the main latency is incurred in bringing the character from cache/memory and checking the table within is_space(). A modern speculative processor will already have computed the next “start++”, performed the check “start != end” (but not committed), and could in principle even initiate the next memory/cache request before is_space() finishes. Also, since the access is sequential, the data for the speculative path is probably already in the cache.

        If I run the loop without calling is_space(), it takes 16 ns, i.e., the overhead of the range check alone is very small, and we already know the overhead of is_space() (it is the ‘sentinel’ version). I changed the order of the comparisons for the range check to:

        while (is_space(*start) && (start != end))

        and got even a smaller difference (~1%):

        $ ./sentinel-v
        N = 10000
        range 1348.07
        sentinel 1337.33
        ratio 1.00803

        Because I am pretty sure that the 40% is actually happening, this has me puzzled: the underlying reason has to be more complex than just having two comparisons. A possibility that crossed my mind is that the overhead we are seeing with the range comparison could be related to the fact that we are comparing “chars”: maybe this causes a barrier effect as the next character is aliased in the same word, and maybe this messes with speculation, the compiler, or both (I’m just speculating now, as I can’t go much further down that rabbit hole, but it is interesting anyway).

        1. If you email me, I’ll give you access to one of my servers.

          whereas the main latency is incurred in bringing the character from cache/memory

          Memory/cache access latency is almost surely irrelevant for this benchmark. We are reading sequentially.

          while (is_space(*start) && (start != end))

          I expect that this is incorrect code.

          puzzled because the underlying reason has to be more complex than just having two comparisons

          Here is the assembly of the sentinel version…

              movzx   ecx, byte ptr [rax + 1]
              add     rax, 1
              cmp     byte ptr [rcx + is_space(unsigned char)::table], 0
              jne     .LBB0_1

          You can easily retire 4 instructions per cycle. Depending on the processor, there may be a limit on how many jumps you can make (e.g., you could be limited to one every two cycles). I believe however that in this case recent Intel and AMD processors should be able to sustain one jump per cycle given how short the loop is.

          Note that a sandy bridge processor like yours is not something I have recent experience with (it is 10 years old).

          Here is the assembly of the other version…

              movzx   ecx, byte ptr [rax]
              cmp     byte ptr [rcx + is_space(unsigned char)::table], 0
              je      .LBB1_4
              add     rax, 1
              cmp     rsi, rax
              jne     .LBB1_1
              mov     rax, rsi

          I would expect that the compare/jump instructions get fused. However, you surely can’t execute two fused compare/jump instructions per cycle. So the long version has a limit of two cycles per iteration.

          So I get a speed limit of one iteration per cycle for the sentinel version and of one iteration every two cycles for the regular version. Of course, these are speed limits, and we probably do not manage to achieve this ideal performance.

          1. Wow, that was a very valuable deep dive. Many thanks. This exercise was more valuable than I would have thought initially.

            Now I think I see it. It doesn’t matter how much speculative work the processor could have done, in this case, if I understand it correctly, we are gated by the number and type of instructions that can be retired per cycle, and there lies, for me, an important takeaway. It is a very illustrative example to teach about the state of modern microarchitecture.

            Also, I agree with all other points (sequential access, means that accessing the data is not the limiting factor), also, when inverting the comparisons I was sloppy (I cannot think of an excuse for that). My Sandy Bridge still feels very powerful if I don’t compare it against any modern processor 🙂

            1. I have not “reasoned out” the performance to my satisfaction. It would have taken me quite a few more minutes. I just wanted to illustrate that the 40% difference is credible.

              For such a small benchmark, it should be possible to model the performance entirely and know exactly the expected performance.

              A sandy bridge processor is really not very far from current processors as far as the main architectural features are concerned but on microbenchmarks, there can certainly be sizeable differences.

              1. The ways in which complexity leaks out of anything even when it looks short and simple never ceases to overwhelm me.

                At the beginning, the 40% looked to me too extraordinary for a simple explanation – I wanted to know where could it come from, exactly.

                BTW, I came across, which actually contains a section on the bottlenecks of each microarchitecture that I found interesting (assuming it is an accurate report).

  7. mov al,’ ‘
    repz scasb
    jz @F
    dec rdi

    I’d probably inline it. (c: The CISC goodness of x86 has been neglected by CPU manufacturers for decades – with good reason. Instructions like JRCXZ make handling Pascal type strings easy though.

    The main benefit of not using sentinels is that the zero byte can be used for something else – typically, binary data needing the full range of values.

    Each method has its uses: sentinel, count or range.

    1. I inlined the code you suggested and tested it on my laptop (I know that could be considered a sin; an Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz, c++/gcc 9.3.0) and found no difference between the compiler-generated code and the inlined CISC instructions; I am curious whether you get different results on a different machine, and why the results would differ or not (I can imagine ways in which it could be slower or faster, I just don’t know enough about each concrete microarchitecture to figure it out before running the benchmark).

      I did it like this:

      const char *skip_leading_spaces_asm1(const char *start) {
        asm("mov al, ' ';\n\t"
            "repz scasb;\n\t"
            "jz 1f;\n\t"
            "dec rdi;\n\t"
            "1:\n\t"
            : "+D"(start)
            : "m"(*(const char(*)[])start), "c"(-1)
            : "cc", "al");
        return start;
      }

      Or alternatively (effectively the same code, but using Extended ASM’s features better to set ‘al’):

      const char *skip_leading_spaces_asm(const char *start) {
        asm volatile("repz scasb;\n\t"
                     "jz 1f;\n\t"
                     "dec rdi;\n\t"
                     "1:\n\t"
                     : "+D"(start)
                     : "m"(*(const char(*)[])start), "a"(' '), "c"(-1)
                     : "cc");
        return start;
      }

      I believe that the code is correct (I printed ‘start’ at the end of each run and compared it against the C++ function’s output). I would nonetheless welcome corrections if something in the code is dangerous, wrong, not best practice, or could be radically improved.

      I modified the Makefile like this:

      tolower: sentinel.cpp
      c++ -O3 -masm=intel -std=c++11 -o sentinel sentinel.cpp
      c++ -O3 -masm=intel -std=c++11 -S -o sentinel.asm sentinel.cpp

      This was one output:

      N = 10000
      range 1413.99
      sentinel 1340.31
      sentinel_asm 1354.13
      ratio range/sentinel 1.05497
      ratio range/sentinel_asm 1.04421

      1. Thanks for testing. I didn’t expect it to be faster in this context – just as small as the call overhead of a separate function and more featureful. Timing such a small piece of code is error prone. From a practical standpoint: using such a routine is surely to be followed by a response for an empty string – which the forward branch would also handle.

        There is another thing though. Intel’s processor design has relegated the complex CISC instructions to reduced silicon instead of improving their speed – for a couple of decades. This is because compilers don’t generate these instructions. Further examples are LOOP*, JCXZ, etc. Exceptions are REP STOS/MOVS*, which have been optimized in recent generations.

        JCXZ and LOOP seem to have not been neglected on AMD processors.

        Unpopular instructions are not optimized. In extreme cases they are repurposed (like BOUND, and the BCD instructions). The unfortunate consequence has us programming like RISC machines – which are boring, imho.
