Most strings in software today are represented using the Unicode standard. The Unicode standard can represent most human-readable text. Unicode works by representing each ‘character’ as a numerical value (called a code point) between 0 and 1 114 111.
>>> "\u00e9"
'é'
Unfortunately, Unicode does not guarantee that there is a unique way to produce each visual character. For example, you can combine the letter ‘e’ (code point 0x65) with the combining acute accent (code point 0x0301):
>>> "\u0065\u0301"
'é'
In most programming languages, these two strings will not be considered equal even though they look the same to us:
>>> "\u0065\u0301" == "\u00e9"
False
For obvious reasons, this can be a problem within a computer system. What if you are searching a database for a name containing the character ‘é’?
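To see how a lookup can silently fail, here is a minimal sketch: the two spellings of ‘é’ are different code-point sequences, so a Python dictionary (like a database index doing exact matching) treats them as distinct keys. The names and values are made up for illustration.

```python
composed = "caf\u00e9"     # 'café' with the precomposed é (U+00E9)
decomposed = "cafe\u0301"  # 'café' with 'e' + combining acute accent

# Hypothetical lookup table keyed by the composed spelling.
ratings = {composed: 5}

# The decomposed spelling hashes and compares differently,
# so the lookup misses even though both render identically.
print(decomposed in ratings)      # False
print(composed == decomposed)     # False
```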
The standard solution is to normalize your strings. In effect, you transform them so that strings that are semantically equal are written with the same code points. In Python, you may do it as follows:
>>> import unicodedata
>>> unicodedata.normalize('NFC', "\u00e9") == unicodedata.normalize('NFC', "\u0065\u0301")
True
There are multiple ways to normalize your strings, and there are nuances. In JavaScript, strings have a normalize method (it defaults to the NFC form):

> "\u0065\u0301".normalize() == "\u00e9".normalize()
true
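One such nuance: NFC composes characters where possible, while NFD fully decomposes them. Both results are canonically equivalent, but they are different code-point sequences, so you should pick one form and use it consistently. A sketch in Python:

```python
import unicodedata

s = "\u00e9"  # precomposed 'é'

nfc = unicodedata.normalize("NFC", s)  # one code point: U+00E9
nfd = unicodedata.normalize("NFD", s)  # two code points: U+0065 U+0301

print(len(nfc), len(nfd))  # 1 2
print(nfc == nfd)          # False: equivalent, but distinct sequences
```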
Though you should expect normalization to be efficient, it is unlikely to be computationally free. Thus you should not repeatedly normalize your strings, as I have done. Rather you should probably normalize the strings as they enter your system, so that each string is normalized only once.
Normalization alone does not solve all of your problems, evidently. There are multiple complicated issues with internationalization, but if you are at least aware of the normalization problem, many perplexing issues are easily explained.
Further reading: Internationalization for Turkish:
Dotted and Dotless Letter “I”
6 thoughts on “String representations are not unique: learn to normalize!”
i wish this could be done at the OS clipboard level 🙁
It’s also a pain dealing with APIs that flip a coin to decide: do I reject an invalid “utf8” sequence with an error, or do I emit (ASCII!) SUB? And maybe multiple adjacent erroneous sequences become multiple SUBs. Or not. Then some other system does a binary comparison of the equivalent (?!) strings. Gah.
Likewise: overlong encodings are errors? subbed? mapped? passed through to be someone else’s problem?
It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
Grapheme cluster boundaries are important for collation, regular expressions, UI interactions, segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text. Word boundaries, line boundaries, and sentence boundaries should not occur within a grapheme cluster: in other words, a grapheme cluster should be an atomic unit with respect to the process of determining these other boundaries.
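Worth noting: most languages count code points, not user-perceived characters, and normalization cannot always compose a cluster into a single code point. A sketch in Python, reusing the “G” + grave-accent example (there is no precomposed G-with-grave in Unicode, so NFC leaves the two code points in place):

```python
import unicodedata

s = "G\u0300"  # 'G' + combining grave accent: one user-perceived character

# len() counts code points, not grapheme clusters.
print(len(s))  # 2

# NFC cannot help here: no precomposed form exists for this cluster.
print(len(unicodedata.normalize("NFC", s)))  # still 2
```

Counting actual grapheme clusters requires a segmentation library (the standard library does not provide one).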
One specific problem with normalization is that the concatenation of two normalized strings is not necessarily a normalized string, so normalization only during input is not necessarily sufficient.
Interesting. Can you give me an example?
There are some in the Unicode TR 15:
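A minimal one can also be sketched in Python: the strings "e" and the lone combining acute accent are each NFC-normalized on their own, yet their concatenation composes to U+00E9 under NFC, so the concatenation is not normalized.

```python
import unicodedata

def is_nfc(s):
    # A string is in NFC form if normalizing it changes nothing.
    return unicodedata.normalize("NFC", s) == s

a = "e"       # NFC on its own
b = "\u0301"  # combining acute accent; also NFC on its own

print(is_nfc(a), is_nfc(b))  # True True
print(is_nfc(a + b))         # False: NFC composes 'e' + accent into U+00E9
```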