Variable-length strings can be expensive

Much of our software deals with variable-length strings. For example, my name “Daniel” uses six characters whereas my neighbor’s name (“Philippe”) uses eight.

“Old” software often shied away from variable-length strings. The Pascal programming language supported fixed-length strings. Most relational databases support fixed-length strings. Flat files with fixed-length lines and fixed-length records were common in the old days.

It seems that people today never worry about variable-length strings. All the recent programming languages I know ignore the concept of fixed-length strings.

Yet fixed-length strings have clear engineering benefits. For example, if I were to store “tweets” (from Twitter) in memory, using a flat 140 characters per tweet, I would be able to support fast direct access without having to go through pointers. Memory allocation would be trivial.
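
For illustration, here is a minimal sketch of such a flat layout; the 140-byte record size comes from the example above, but the TweetStore name and interface are hypothetical:

    #include <cstddef>
    #include <cstring>
    #include <vector>

    // Hypothetical flat store: every tweet occupies exactly TWEET_LEN bytes,
    // null-padded, so the i-th tweet sits at offset i * TWEET_LEN.
    constexpr std::size_t TWEET_LEN = 140;

    struct TweetStore {
      std::vector<char> data; // one contiguous buffer, no per-string pointers

      void push(const char *text) { // assumes strlen(text) < TWEET_LEN
        std::size_t offset = data.size();
        data.resize(offset + TWEET_LEN, '\0');        // trivial, predictable allocation
        std::strncpy(&data[offset], text, TWEET_LEN); // copy, padding with nulls
      }

      const char *get(std::size_t i) const {
        return &data[i * TWEET_LEN]; // direct access by arithmetic, no pointer chasing
      }
    };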

However, maybe processors have gotten so smart, and string operations so optimized, that there is no clear performance benefit to fixed-length strings in today’s software.

I don’t think so.

Even today, it is often possible to accelerate software by replacing variable-length strings by fixed-length ones, when you know ahead of time that all your strings are going to be reasonably short.

To examine this idea, I have created lists of strings made simply of the numbers (as strings) from 0 to 1024. Such strings range between 1 and 4 characters in length. In C, we use null-terminated strings, so the actual storage is between 2 and 5 bytes. C++ has standard strings (std::string), which carry significant overhead: on my system, a single std::string uses 32 bytes, not counting the string content itself.
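
The 32-byte figure is the size of the std::string object itself, not counting any heap allocation, and it is easy to check on a given system; the exact value depends on the standard library:

    #include <cstdio>
    #include <string>

    int main() {
      // Size of the std::string handle alone; long contents live in a separate heap block.
      std::printf("sizeof(std::string) = %zu bytes\n", sizeof(std::string));
      return 0;
    }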

Instead of using variable-length strings, I can “pad” my strings so that they have a fixed length (8 characters), adding null characters as needed.
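
A minimal sketch of this padding step, assuming every input string has at most 7 characters so that at least one null byte always remains:

    #include <cstdio>
    #include <cstring>

    // Copy a short C string into an 8-byte slot, filling the rest with null characters.
    // Assumes strlen(src) <= 7.
    static void pad8(char dest[8], const char *src) {
      std::memset(dest, 0, 8);                  // pad with null characters
      std::memcpy(dest, src, std::strlen(src)); // copy the digits themselves
    }

    int main() {
      static char padded[1025][8];
      for (int i = 0; i <= 1024; i++) {
        char buf[8];
        std::snprintf(buf, sizeof(buf), "%d", i); // "0" ... "1024"
        pad8(padded[i], buf);
      }
      std::printf("%s\n", padded[1024]); // prints: 1024
      return 0;
    }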

How long does it take, in CPU cycles per string, to sort a short array of strings (1024 strings)?

string type          CPU cycles per string
C++ std::string      520
standard C string    300
padded C string      140

So for this experiment, replacing variable-length strings by fixed-length strings more than doubles the performance! And my code isn’t even optimized. For example, to keep the comparison “fair”, I sorted pointers to strings… but my padded C strings fit in a machine word and do not require a pointer. So, in fact, fixed-length strings could be nearly three times faster with enough work.
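
To sketch why the padded representation helps (my illustration, not the benchmark code from the post): each string becomes a plain 8-byte value that can be sorted in place, and the comparison is a single fixed-size memcmp, which the null padding keeps consistent with strcmp order.

    #include <algorithm>
    #include <array>
    #include <cstring>
    #include <vector>

    using Padded = std::array<char, 8>; // one fixed-size value per string, no pointer

    // Lexicographic order on the 8-byte slots; because shorter strings are
    // null-padded, this agrees with strcmp on the original strings.
    static bool less8(const Padded &a, const Padded &b) {
      return std::memcmp(a.data(), b.data(), 8) < 0;
    }

    void sort_padded(std::vector<Padded> &v) {
      std::sort(v.begin(), v.end(), less8); // sorts the values themselves
    }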

To summarize: Variable-length strings are a convenient abstraction. You may hear that string operations are very cheap, unlikely to be a bottleneck and so forth… That might be so…

But I submit to you that the omnipresence of variable-length strings as a universal default can make us blind to very sweet optimization opportunities.

My source code is available.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

20 thoughts on “Variable-length strings can be expensive”

  1. Out of curiosity, which compiler & version did you use for the C++? (The small string optimization in recent C++ makes them very close to fixed length strings for small strings, at the cost of a few more instructions to decide which version is being used.)

    1. Out of curiosity, which compiler & version did you use for the C++?

      $ g++ --version
      g++ (Ubuntu 5.4.1-2ubuntu1~16.04) 5.4.1 20160904
      

      The small string optimization in recent C++ makes them very close to fixed length strings for small strings (…)

      My results are much the same under clang. With the Intel compiler, fixed-length strings are no longer any faster, but that’s because they get to be as slow as the variable-length ones.

      I don’t have easy access to old compilers on this exact machine.

      1. I’ve had a look at your code and I wouldn’t be surprised if the biggest contribution to the speed difference is that the fixed-length, 8-byte padded strings allow you to use a very optimized comparison function. (I don’t have a profiler set up on my new machine yet.)
        This is an interesting result and I agree fixed-length strings are interesting; I’m just trying to figure out what precisely is happening on the machine.

          1. Memory alignment could be playing a role here.

            A useful test might be to allocate memory for std::string using aligned_alloc (alignment = 8).

  2. I suspect that a real application spends very little overall time sorting strings, so optimizing that part will have little impact on the overall performance.

    I recently profiled some data acquisition middleware, and discovered that it spent 93% of its time waiting on the driver to deliver raw data. Optimizing the remaining 7% would be largely a waste of my time.

    When looking closely at the driver, I further uncovered that I was operating at about 70% of the underlying transport mechanism’s capacity, and improving this would require better firmware on the device side (reducing gaps between bytes and strings).

    So, even though I was adding arrays of integers together, and I could see ways to optimize these, it would make no appreciable difference overall.

    1. @Dominic

      You’ll get no argument from me.

      Many systems have their performance bound set by IO, network… or other architectural constraints. That’s why we build massive systems using JavaScript (which does not even support parallelism!).

  3. As you said, it’s application dependent: in MySQL, where a fixed length is required, the performance should be better. And going further, the padded C version where the string fits in a machine word introduces some architecture dependency as well, right, for tuning it to avoid pointers?

  4. GCC strings are notoriously bad, because they are copy-on-write. They even used to have atomic operations involved in simple constructors, not to mention that GNU couldn’t implement a bug-free string implementation for a multi-core environment for many years. Copy-on-read strings are probably better (unfortunately, STLPORT died).

    That said, even with copy-on-read strings, accessing the string requires a dereference operation, which slows things down. Try a variant of the string, which keeps the counter followed by the data buffer. This one could be fast, IMHO.

    1. I think that used to be the case prior to the changes for C++11. (Not that g++’s stuff is brilliant, but it does use the SSO now.)

  5. BTW, Java’s strings are immutable for a good reason. In a multi-threaded environment, it’s often much cheaper to copy a small string than to block a gazillion threads on trivial string operations.

  6. We’ve unfortunately hit the point where most dictionary words will fit into a single machine word; the classic benchmark of sorting the dictionary turns out to be slower with classic algorithms than sorting it as fixed-length strings.

    Some current algorithms use the fixed-length method by caching 4-8 bytes in an int next to the string pointer. For instance here: http://rd.springer.com/chapter/10.1007%2F978-3-540-89097-3_3#page-1 . It is indeed faster.
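
    A minimal sketch of this prefix-caching idea, assuming an 8-byte cached prefix; the struct layout and names are illustrative rather than taken from the linked paper:

        #include <cstdint>
        #include <cstring>

        struct CachedString {
          std::uint64_t prefix; // first 8 bytes, packed big-endian so integer order matches byte order
          const char *full;     // pointer to the complete, null-terminated string
        };

        // Pack the first 8 bytes of s (null-padded) into an integer, most significant byte first.
        static std::uint64_t load_prefix(const char *s) {
          unsigned char buf[8] = {0};
          std::strncpy(reinterpret_cast<char *>(buf), s, 8);
          std::uint64_t p = 0;
          for (int i = 0; i < 8; i++) p = (p << 8) | buf[i];
          return p;
        }

        static bool less_cached(const CachedString &a, const CachedString &b) {
          if (a.prefix != b.prefix) return a.prefix < b.prefix; // fast path: no dereference
          return std::strcmp(a.full, b.full) < 0;               // prefixes tie: compare fully
        }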
