How quickly can you break a long string into lines?

Suppose that you receive a long string and you need to break it down into lines. Consider the simplified problems where you need to break the string into segments of (say) 72 characters. It is a relevant problem if your string is a base64 string or a Fortran formatted statement.

The problem could be a bit complicated because you might need consider the syntax. So the speed of breaking into a new line every 72 characters irrespective of the content provides an upper bound on the performance of breaking content into lines.

The most obvious algorithm could be to copy the content, line by line:

void break_lines(char *out, const char *in, size_t length,
  size_t line_length) {
  size_t j = 0;
  size_t i = 0;
  for (; i + line_length <= length; i += line_length) {
    memcpy(out + j, in + i, line_length);
    out[j+line_length] = '\n';
    j += line_length + 1;
  }
  if (i < length) {
    memcpy(out + j, in + i, length - i);
  }
}

Copying data in blocks in usually quite fast unless you are unlucky and you trigger aliasing. However, allocating a whole new buffer could be wasteful, especially if you only need to extend the current buffer by a few bytes.

A better option could thus be to do the work in-place. The difficulty is that if you load the data from the current array, and then write it a bit further away, you might be overwriting the data you need to load next. A solution is to proceed in reverse: start from the end… move what would be the last line off by a few bytes, then move the second last line and so forth. Your code might look like the following C function:

void break_lines_inplace(char *in, size_t length, size_t line_length) {
  size_t left = length % line_length;
  size_t i = length - left;
  size_t j = length + length / line_length - left;
  memmove(in + j, in + i, left);
  while (i >= line_length) {
    i -= line_length;
    j -= line_length + 1;
    memmove(in + j, in + i, line_length);
    in[j+line_length] = '\n';
  }
}

I wrote a benchmark. I report the results only for a 64KB input. Importantly, my numbers do not include memory allocation which is separate.

A potentially important factor is whether we allow function inlining: without inlining, the compiler does not know the line length at compile-time and cannot optimize accordingly.

Your results will vary, but here are my own results:

method Intel Ice Lake, GCC 12 Apple M2, LLVM 14
memcpy 43 GB/s 70 GB/s
copy 25 GB/s 40 GB/s
copy (no inline) 25 GB/s 37 GB/s
in-place 25 GB/s 38 GB/s
in-place (no inline) 25 GB/s 38 GB/s

In my case, it does not matter whether we do the computation in-place or not. The in-place approach generates more instructions but we are not limited by the number of instructions.

At least in my results, I do not see a large effect from inlining. In fact, for the in-place routine, there appears to be no effect whatsoever.

Roughly speaking, I achieve a bit more than half the speed as that of a memory copy. We might be limited by the number of loads and stores. There might be a clever way to close the gap.

Daniel Lemire, "How quickly can you break a long string into lines?," in Daniel Lemire's blog, April 19, 2024.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

3 thoughts on “How quickly can you break a long string into lines?”

  1. You might be able to get a bit more throughput by limiting the size range of copies, ignoring overshoot, instead of arbitrary sized memcpy/memmove. Example:

    template
    __attribute__((target(“avx2”)))
    inline void break_lines_fixlen(char *out, const char *in, size_t length,
    size_t line_length) {
    if(line_length ll_max) {
    break_lines(out, in, length, line_length);
    return;
    }

    size_t j = 0;
    size_t i = 0;
    for (; i + ll_max <= length;) {
    std::memcpy(out + j, in + i, ll_max);
    out[j+line_length] = '\n';
    j += line_length + 1;
    i += line_length;
    }
    if (i < length) {
    std::memcpy(out + j, in + i, length – i);
    }
    }

    I definitely do get better throughput for the inplace variant, on a i7 12700K:

    size 65536
    copy : 23.41 GB/s 0.4 Ma/s 2800.00 ns/d
    copy_fixlen : 24.27 GB/s 0.4 Ma/s 2700.00 ns/d
    memcpy : 59.58 GB/s 0.9 Ma/s 1100.00 ns/d
    inplace : 43.69 GB/s 0.7 Ma/s 1500.00 ns/d
    inplace_fixlen : 46.81 GB/s 0.7 Ma/s 1400.00 ns/d

    (`_fixlen` variants use as the range)

  2. Depending on the context and the further usage of the ‘broken’ lines, maybe no copies at all are required.
    In that cases, std::span or std::string_view could help.

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.