Optimizing compilers deduplicate strings and arrays

When programming, it can be wasteful to store the same constant data again and again. You use more memory, you access more data. Thankfully, your optimizing compiler may help.

Consider the following two lines of C code:

    printf("Good day professor Jones");
    printf("Good day professor Jane");

There is redundancy since the prefix “Good day professor” is the same in both cases. To my knowledge, no compiler is likely to trim this redundancy. However, you can get the desired trimming by breaking the strings:

    printf("Good day professor ");
    printf("Jones");

    printf("Good day professor ");
    printf("Jane");

Most compilers will recognize the constant string and store it once in the program. It works even if the constant string “Good day professor” appears in different functions.

Thus the following function may return true:

    const char * str1 = "dear friend";
    const char * str2 = "dear friend";
    return str1 == str2;

That is, you do not need to manually create constant strings: the compiler recognizes the redundancy (typically).

The same trick fails with extended strings:

    const char * str1 = "dear friend";
    const char * str2 = "dear friend\0f";
    return str1 == str2;

All compilers I tried return false. They create two C strings even if one is a prefix of the other in the following example…

char get1(int k) {
    const char * str = "dear friend";
    return str[k];
}

char get2(int k) {
    const char * str = "dear friend\0f";
    return str[k];
}

Unsurprisingly, the “data compression” trick works with arrays. For example, the arrays in these two functions are likely to be compiled to just one array because the compiler recognizes that they are identical:

int f(int k) {
    int array[] = {1,2,3,4,5,34432,321323,321321,1,
    2,3,4,5,34432,321323,321321};
    return array[k];
}


int g(int k) {
    int array[] = {1,2,3,4,5,34432,321323,321321,1,
    2,3,4,5,34432,321323,321321};
    return array[k+1];
}

It may still work if one array is an exact subarray of the other ones with GCC, as in this example:

int f(int k) {
    int array[] = {1,2,3,4,5,34432,321323,321321,1,
    2,3,4,5,34432,321323,321321};
    return array[k];
}


int g(int k) {
    int array[] = {1,2,3,4,5,34432,321323,321321,1,
    2,3,4,5,34432,321323,321321,1,4};
    return array[k+1];
}

You can also pile up several arrays as in the following case where GCC creates just one array:

long long get1(int k) {
    long long str[] = {1,2,3};
    return str[k];
}

long long get2(int k) {
    long long str[] = {1,2,3,4};
    return str[k+1];
}

long long get3(int k) {
    long long str[] = {1,2,3,4,5,6};
    return str[k+1];
}

long long get4(int k) {
    long long str[] = {1,2,3,4,5,6,7,8};
    return str[k+1];
}

It also works with arrays of pointers, as in the following case:

const char * get1(int k) {
    const char * str[] = {"dear friend", "dear sister", "dear brother"};
    return str[k];
}

const char * get2(int k) {
    const char * str[] = {"dear friend", "dear sister", "dear brother"};
    return str[k+1];
}

Of course, if you want to make sure to keep your code thin and efficient, you should not blindly rely on the compiler. Nevertheless, it is warranted to be slightly optimistic.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

14 thoughts on “Optimizing compilers deduplicate strings and arrays”

  1. It’s suffixes of C-strings that can be shared, not prefixes (recall their length is not encoded except implicitly by the 0 byte terminator).

      1. It wouldn’t be possible. The way suffix string optimization works is by having the block and placing pointers in the middle. Because null termination defines where the string is, you can’t just cut off the end for one of them; however, you can cut off the beginning by placing the pointer of the substring in the middle.

        1. In his example, the second string contains an explicit null character, so I think it COULD be shared with the first one?

          Of course, it’s probably rare enough that this comes up with string constants that compiler authors didn’t bother to add the optimization

  2. I guess most of optimizations are due to deep inlining, but there is indeed lot of compression: very complex assembly in place of trivial lookup.
    Very interesting cat-and-mouse games the compiler plays: when it comes to storage it makes very reasonable (conservative) assumptions, but when it comes to code generation, it just inlines as much as possible, to the point that most of computation is treated as “constexpr”.
    Here are my experiments:
    https://godbolt.org/z/bPqG9f14M

  3. If you can discard the null termination rule though semantic analysis (e.g if the use of the strings is equivalent to c++’s.std::span) then you ought to be able to optimize it – providing the value is not passed to any functions that expect null termination. Have you seen her Sutter’s recent cpp2 talk? In a cpp2-pure context, that sort of optimization ought to become more possible I think

    1. I have seen suggestions to use std::stri g_view instead, which work like span, so that could be opportunity for compiler.

  4. This happens not just within compilation units—but across them too!—given a sufficiently smart linker. Read about the SHF_MERGE flag in ELF object files for more info.

  5. Interesting article, however maybe a mistake?

    Comparing pointers.. only compares pointers, not the strings of bytes they may point to. For that you’d need strcmp()

    Your article writes:

    “The same trick fails with extended strings:
    return str1 == str2”

    1. That’s the whole point of the article: the compiler sometimes reuses the same piece of memory and therefore we compare pointers to detect that.

  6. Can’t share the prefix for C strings since they implicitly have (and require) a null terminator, so the first string is not actually a prefix of the second in terms of what it can store in the data section of the bin.

    You could potentially have that optimization in languages where length is stored out of line (cpp/rust) but to try to break up and print common prefixes of strings with diverging postfixes without having print basically be an intrinsic with a *lot* of magic is hard (as well as significantly less performant, or bigger if you monomorphise which takes away the space savings)

Leave a Reply

Your email address will not be published. The comment form expects plain text. If you need to format your text, you can use HTML elements such strong, blockquote, cite, code and em. For formatting code as HTML automatically, I recommend tohtml.com.

You may subscribe to this blog by email.