Performance tip: avoid unnecessary copies

Copying data in software is cheap, but it is not at all free. As you start optimizing your code, you might find that copies become a performance bottleneck.

Let me be clear that copies really are cheap. It is often more performant to copy that data than to track the same memory across different threads. The case I am interested in is when copies turn a trivial operation into one that is relatively expensive.

Recently, the fast JavaScript runtime (Bun) optimized its base64 decoding routines.Base64 is a technique used to represent binary data, like images or files, as ASCII text. It is ubiquitous on the Internet (email formats, XML, JSON, HTML and so forth). In Node.js and in Bun, you can decode a base64 string Buffer.from(s, "base64").

Importantly, when decoding base64 strings, you can be almost certain that you are dealing with a simple ASCII string. And systems like Node.js and Bun have optimized string representations for the ASCII case, using one byte per character.

We had optimized base64 decoding in Node.js some time ago (credit to Yagiz Nizipli for his work)… but I was surprised to learn that Bun was able to beat Node.js by a factor of three. Because both Node.js and Bun use the same base64 decoding, I was confused.

I mistakenly thought, based on quick profiling, that Node.js would copy the base64 data from an ASCII format (one byte per character) to a UTF-16 format (two bytes per character) despite our best attempt at avoiding copies. You can review my technically incorrect analysis on X. What I like is the second half of my analysis:

The story illustrates why our software is slower than it should be. We have layers of abstractions to fight against. Sometimes you win, sometimes you lose.

These layers are there for a reason, but they are not free.

To make matters worse… these abstraction layers often thicken over time… and the friction goes up.

To be clear, I do not claim that the Node.js code is optimal. In fact, I know it can be better. But it is not trivial to make it go fast.

I sometimes hear people say… “well, it is C++ and C++ is hard”. No. The C++ part is easy relatively speaking. The difficulty is at a higher level. It is not a matter of syntax. It is a matter of architecture.

I say that I was technically incorrect. Why was I incorrect?

It turns out that the copy was not happening as part of base64 decoding but in a completely separate function. There is an innocuous function in Node.js called  StringBytes::Size which basically must provide an upper on the memory needed by the Buffer.from function. I went back in time and looked at how this function looked in the original Node.js (0.10):

case BASE64: {
  String::AsciiValue value(str);
  data_size = base64_decoded_size(*value, value.length());
  break;
}

Can you see what it does? It takes the string it receives, it makes a copy, and from the copy, it computes how many bytes the decoded version will need.

That original code tried to make an ‘ASCII’ copy so I presume it created a copy using one byte per character. It still was a shame, but not terribly so.

But very soon after (Node.js 0.11), it was changed to a version that converted to 16-byte strings, for increased safety:

case BASE64: {
  String::Value value(str);
  data_size = base64_decoded_size(*value, value.length());
  break;
}

This new version is potentially much more expensive because it may copy an ASCII string to a temporary UTF-16 (2-byte per character) string. It uses more memory, but it is also a more complicated function. Internally, the JavaScript engine implements this with the C++ template std::copy_n. The generated code is probably fine, but it is hardly “down to the metal”.

As long as everything was not too optimized, it ran just fine… but as you optimized the base64 decoding, you found that this simple length computation took up more than half of the running time.

So, in concrete terms, what are we talking about as far as performance goes? Let me consider a JavaScript benchmark:

import { bench, run } from "mitata";
function makeBenchmark(size, isToString) {
  const base64Input = Buffer.alloc(size, "latin1").toString("base64");
  const base64From = Buffer.from (base64Input, "base64");
  bench(`Buffer. from(${size} bytes, 'base64')`, () => {
   Buffer.from(base64Input, "base64");
  });
}
[1024 * 1024 * 8]. forEach (s => makeBenchmark(s, false)) ;
await run();

I install the latest version of Bun (bun upgrade --canary). I compare Node.js 22 with a patched version. I use my Apple MacBook for testing (ARM-based M2 processor). You can see that by simply avoiding the unnecessary copy, I boosted the base64 decoding from 2.5 GB/s to 6.0 GB/s. Not bad for removing a single line of code.

function time speed
Node.js 18 6’400 µs 1.3 GB/s
Node.js 22 3’300 µs 2.5 GB/s
Node.js with fix 1’200 µs 7.0 GB/s
Bun 950 µs 8.8 GB/s

Sometimes people observe at this point that the performance of Node.js 18 was already fine: 1.3 GB/s is plenty fast. It might be fast enough, but you must take into account that we are measuring a single operation that is likely part of a string of operations. In practice, you do not just ingest base64 data. You do some work before and some work after. Maybe you decoded a JPEG image that was stored in base64, and next you might need to decode the JPEG and push it to the screen. And so forth. To have an overall fast system, every component should be fast.

You may observe that Bun is still faster than Node.js, even after I claim to have patched this issue. But there are likely other architecture issues that Bun does not have. Remember that both Node.js and Bun are using the same library in this instance: simdutf.

It is maybe interesting to review Bun’s code (in Zig):

const outlen = bun.base64.decodeLen(slice);
const to = allocator.alloc(u8, outlen) catch return &[_]u8{};
const wrote = bun.base64.decode(to[0..outlen], slice).count;
return to[0..wrote];

It is far simpler than the equivalent in Node where memory is allocated in one function, and then the resulting buffer is passed to another function where the decoding finally happens. It is likely that Bun is faster because it has a simpler architecture where it is easier to see where the unnecessary work happens.

Update. After the publication of this blog post,
Leszek Swirski added v8::String::ValueView to v8 which should reduce the need for copies.

Daniel Lemire, "Performance tip: avoid unnecessary copies," in Daniel Lemire's blog, June 22, 2024.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.