How fast can you validate UTF-8 strings in JavaScript?

When you read textual content from the disk or from the network, you may expect it to be a Unicode string in UTF-8, the most common format. Unfortunately, not all sequences of bytes are valid UTF-8, and accepting invalid UTF-8 without validating it is a security risk.

How might you validate a UTF-8 string in a JavaScript runtime?

You might use the valid-8 module:

import valid8 from "valid-8";
if(!valid8(file_content)) { console.log("not UTF-8"); }

Another recommended approach is to use the fact that TextDecoder can throw an exception upon error:
new TextDecoder("utf8", { fatal: true }).decode(file_content)
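The one-liner above throws on invalid input rather than returning a boolean. A minimal sketch of how you might wrap it follows; the helper name isValidUtf8WithTextDecoder and the sample inputs are my own illustration, not part of any library.

```javascript
// Hypothetical helper (not from any library): returns true when
// `bytes` is well-formed UTF-8, false otherwise.
function isValidUtf8WithTextDecoder(bytes) {
  try {
    // With fatal: true, decode() throws a TypeError on malformed input
    // instead of substituting U+FFFD replacement characters.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// Well-formed by construction: TextEncoder always emits valid UTF-8.
const valid = new TextEncoder().encode("héllo");
// 0xC3 announces a two-byte sequence, but 0x28 is not a continuation byte.
const invalid = new Uint8Array([0xc3, 0x28]);

console.log(isValidUtf8WithTextDecoder(valid));   // true
console.log(isValidUtf8WithTextDecoder(invalid)); // false
```

Note that decode() also builds a string that is then thrown away, which is work that a dedicated validator does not need to do.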

Or you might use the isUtf8 function, which is part of Node.js and Bun:
import { isUtf8 } from "node:buffer";
if(!isUtf8(file_content)) { console.log("not UTF-8"); }
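To make "not all sequences of bytes are valid UTF-8" concrete, here is a small sketch that runs isUtf8 over a few byte sequences. The sample inputs are my own choices for illustration, and I assume a Node.js version recent enough to export isUtf8 from node:buffer.

```javascript
import { isUtf8 } from "node:buffer";

// Illustrative inputs only: each entry pairs a label with raw bytes.
const samples = [
  ["ASCII text", Buffer.from("hello", "utf8")],
  ["multi-byte text", Buffer.from("héllo, 世界", "utf8")],
  ["byte 0xFF (never allowed)", Uint8Array.of(0xff)],
  ["overlong encoding of NUL", Uint8Array.of(0xc0, 0x80)],
  ["encoded surrogate U+D800", Uint8Array.of(0xed, 0xa0, 0x80)],
  ["truncated three-byte sequence", Uint8Array.of(0xe2, 0x82)],
];

for (const [label, bytes] of samples) {
  // isUtf8 accepts a Buffer, an ArrayBuffer, or a TypedArray.
  console.log(`${label}: ${isUtf8(bytes) ? "valid" : "not UTF-8"}`);
}
```

The first two samples should be reported as valid and the remaining four as not UTF-8.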

How do they compare? Using Node.js 20 on a Linux server (Intel Ice Lake), I get the following speeds with three files representative of different languages. The Latin file is just ASCII. My benchmark is available.
              Arabic      Chinese     Latin
valid-8       0.14 GB/s   0.17 GB/s   0.50 GB/s
TextDecoder   0.18 GB/s   0.19 GB/s   7 GB/s
node:buffer   17 GB/s     17 GB/s     44 GB/s
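If you want to get a rough sense of such numbers on your own machine, the following is a minimal sketch of a throughput measurement. It is not my actual benchmark: the measureGBps helper, the repeat count, and the synthetic 8 MiB ASCII input are all assumptions made for illustration.

```javascript
// Hypothetical helper: calls validate(bytes) several times and returns
// the best observed throughput in GB/s (1 GB = 1e9 bytes).
function measureGBps(validate, bytes, repeats = 20) {
  let bestMs = Infinity;
  for (let i = 0; i < repeats; i++) {
    const start = performance.now();
    const ok = validate(bytes);
    const elapsedMs = performance.now() - start;
    if (!ok) throw new Error("input unexpectedly rejected");
    // Keep the fastest run; ignore runs too short for the timer to resolve.
    if (elapsedMs > 0 && elapsedMs < bestMs) bestMs = elapsedMs;
  }
  return bytes.length / (bestMs / 1000) / 1e9;
}

// One of the approaches discussed above, wrapped to return a boolean.
function validateWithTextDecoder(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// Synthetic stand-in for a real file: 8 MiB of the ASCII letter "a".
const ascii = new Uint8Array(8 * 1024 * 1024).fill(0x61);
console.log(`TextDecoder: ${measureGBps(validateWithTextDecoder, ascii).toFixed(2)} GB/s`);
```

Any function that takes the bytes and returns a boolean can be passed as the first argument, so the other approaches can be measured in the same way. Expect the results to vary with the runtime version, the hardware, and the content of the input.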

The current isUtf8 function in Node.js was implemented by Yagiz Nizipli. It uses the simdutf library underneath. John Keiser should be credited for the UTF-8 validation algorithm.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

2 thoughts on “How fast can you validate UTF-8 strings in JavaScript?”

  1. How about limiting it to “in the browser”? Could one compile simdutf to WebAssembly and use its SIMD instruction set to out-do TextDecoder and valid-8? Seems like maybe not yet, there’s an open issue for adding WebAssembly SIMD support.
