String representations are not unique: learn to normalize!

Most strings in software today are represented using the Unicode standard, which can represent most human-readable strings. Unicode works by representing each ‘character’ as a numerical value (called a code point) between 0 and 1 114 111 (0x10FFFF in hexadecimal).

The character é is typically represented as the numerical value 233 (or 0xe9 in hexadecimal). In Python, JavaScript and many other programming languages, you get the following:

>>> "\u00e9"
'é'

Unfortunately, Unicode does not ensure that there is a unique way to represent every visual character. For example, you can combine the letter ‘e’ (code point 0x65) with the ‘combining acute accent’ (code point 0x0301):

>>> "\u0065\u0301"
'é'

In most programming languages, these strings are not considered equal even though they look the same to us:

>>> "\u0065\u0301"=="\u00e9"
False
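To see why the comparison fails, it can help to list the code points behind each string. The following is a minimal sketch using only the Python standard library; the helper name `describe` is hypothetical and chosen for illustration.

```python
import unicodedata

def describe(s):
    """Return (hex code point, Unicode name) pairs for each code point in s."""
    return [(hex(ord(c)), unicodedata.name(c)) for c in s]

precomposed = "\u00e9"       # a single code point
decomposed = "\u0065\u0301"  # 'e' followed by a combining accent

# The two strings render the same way but have different lengths and contents.
print(len(precomposed), describe(precomposed))
print(len(decomposed), describe(decomposed))
```

The first string has one code point and the second has two, which is why a code-point-by-code-point comparison reports them as different.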

For obvious reasons, this can be a problem within a computer system. What if you are searching a database for names with the character ‘é’ in them?

The standard solution is to normalize your strings. In effect, you transform them so that strings that are semantically equal are written with the same code points. In Python, you may do it as follows:

>>> import unicodedata
>>> unicodedata.normalize('NFC',"\u00e9") == unicodedata.normalize('NFC',"\u0065\u0301")
True
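As a sketch of the search scenario, assume a hypothetical in-memory list of names standing in for a database column, where one name was stored with the decomposed é. The helper `nfc` and the sample data are assumptions made for illustration.

```python
import unicodedata

def nfc(s):
    """Shorthand for NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", s)

# Hypothetical data: the first name uses 'e' + combining acute accent.
names = ["Rene\u0301", "Zoe"]
query = "\u00e9"  # precomposed é

# A naive substring search misses the name that looks like it contains é.
naive_hits = [n for n in names if query in n]

# Normalizing both sides makes the search find it.
normalized_hits = [n for n in names if nfc(query) in nfc(n)]

print(len(naive_hits))       # 0
print(len(normalized_hits))  # 1
```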

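There are multiple ways to normalize your strings, and there are nuances: Unicode defines four normalization forms (NFC, NFD, NFKC and NFKD), and they do not all produce the same result.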
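To make the nuances concrete, here is a small sketch comparing the forms accepted by `unicodedata.normalize` on two sample inputs chosen for illustration: the decomposed é from above and the ‘ﬁ’ ligature (code point 0xfb01). The composed forms (NFC, NFKC) merge the accent into a single code point, the decomposed forms (NFD, NFKD) keep it separate, and only the compatibility forms (NFKC, NFKD) rewrite the ligature as two letters.

```python
import unicodedata

decomposed = "\u0065\u0301"  # 'e' + combining acute accent
ligature = "\ufb01"          # 'ﬁ' ligature, a compatibility character

results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    e_acute = unicodedata.normalize(form, decomposed)
    fi = unicodedata.normalize(form, ligature)
    results[form] = (len(e_acute), fi)
    print(form, len(e_acute), [hex(ord(c)) for c in fi])
```

Which form is appropriate depends on the application; the important part is to pick one and apply it consistently.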

In JavaScript and other programming languages, there are equivalent functions:

> "\u0065\u0301".normalize() == "\u00e9".normalize()
true

Though you should expect normalization to be efficient, it is unlikely to be computationally free. Thus you should not repeatedly normalize your strings, as I have done. Rather you should probably normalize the strings as they enter your system, so that each string is normalized only once.
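One way to normalize each string only once is to do it at the boundary where data enters the system. The sketch below is an assumption about how that might look, not a prescribed design: `normalize_on_entry` is a hypothetical helper, and it relies on `unicodedata.is_normalized`, which requires Python 3.8 or later.

```python
import unicodedata

def normalize_on_entry(s, form="NFC"):
    """Normalize a string once, as it enters the system (hypothetical helper)."""
    # Skip the rewrite when the input is already in the requested form.
    if unicodedata.is_normalized(form, s):
        return s
    return unicodedata.normalize(form, s)

# Two visually identical names arriving with different representations.
incoming = ["Rene\u0301", "Ren\u00e9"]

# After normalization on entry, they collapse to a single stored value,
# and later comparisons need no further normalization.
stored = {normalize_on_entry(name) for name in incoming}
print(len(stored))  # 1
```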

Normalization alone does not solve all of your problems, evidently. There are multiple complicated issues with internationalization, but if you are at least aware of the normalization problem, many perplexing issues are easily explained.

Further reading: Internationalization for Turkish: Dotted and Dotless Letter “I”

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

6 thoughts on “String representations are not unique: learn to normalize!”

  1. It’s also a pain dealing with APIs that flip a coin to decide: do I reject an invalid “utf8” sequence with an error, or do I emit (ASCII!) SUB? And maybe multiple adjacent erroneous sequences become multiple SUBs. Or not. Then some other system does a binary comparison of the equivalent (?!) strings. Gah.

    Likewise: overlong encodings are errors? subbed? mapped? passed through to be someone else’s problem?

  2. It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.

    Grapheme cluster boundaries are important for collation, regular expressions, UI interactions, segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text. Word boundaries, line boundaries, and sentence boundaries should not occur within a grapheme cluster: in other words, a grapheme cluster should be an atomic unit with respect to the process of determining these other boundaries.

    https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

  3. One specific problem with normalization is that the concatenation of two normalized strings is not necessarily a normalized string, so normalization only during input is not necessarily sufficient.
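The concatenation issue raised in the last comment can be reproduced with a short sketch. The helper `is_nfc` is hypothetical; it checks whether NFC normalization leaves a string unchanged.

```python
import unicodedata

def is_nfc(s):
    """True when NFC normalization leaves s unchanged (hypothetical helper)."""
    return unicodedata.normalize("NFC", s) == s

left = "e"        # already in NFC
right = "\u0301"  # a lone combining acute accent, also already in NFC
joined = left + right

# Each piece is normalized, yet their concatenation is not: NFC would
# compose the pair into the single code point 0xe9.
print(is_nfc(left), is_nfc(right), is_nfc(joined))  # True True False
```

Under this assumption, code that builds strings from normalized pieces may still need to normalize the result.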
