Emojis, Java and Strings

Emojis are funny characters that are becoming increasingly popular. However, they are probably not as simple as you might think when you are a programmer. For a basis of comparison, let me try to use them in Python 3. I define a string that includes emojis, and then I access the character at index 1 (the second character):

>>> x = "😂😍🎉👍"
>>> len(x)
4
>>> x[1]
'😍'
>>> x[2]
'🎉'

This works well. It fails with Python 2, however, so please upgrade to Python 3 (it came out ten years ago).

What about Java and JavaScript? They are similar but I will focus on Java. You can define the string just fine…

String emostring = "😂😍🎉👍";

However, that’s where troubles begin. If you try to find the length of the string (emostring.length()), Java will tell you that the string contains 8 characters. To get the proper length of the string in terms of “unicode code points”, you need to type something like emostring.codePointCount(0, emostring.length()) (this returns 4, as expected). Not only is this longer, but I also expect it to be much more computationally expensive.
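To make the difference concrete, here is a minimal, self-contained sketch (the emojis are written as explicit surrogate-pair escapes so the source survives any editor encoding):

```java
public class EmojiLength {
    public static void main(String[] args) {
        // "😂😍🎉👍" spelled out as UTF-16 surrogate pairs.
        String emostring = "\uD83D\uDE02\uD83D\uDE0D\uD83C\uDF89\uD83D\uDC4D";
        // length() counts 16-bit code units, not characters.
        System.out.println(emostring.length()); // prints 8
        // codePointCount() counts Unicode code points.
        System.out.println(emostring.codePointCount(0, emostring.length())); // prints 4
    }
}
```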

What about accessing characters? You might think that emostring.charAt(1) would return the second character (😍), but it fails. The problem is that Java uses the UTF-16 encoding, which means, roughly, that a Unicode character can use either one 16-bit word or two, depending on the character. Thus, if you are given a string of bytes, you cannot tell how many characters it holds without scanning it. Meanwhile, the character type in Java (char) is a 16-bit word, so it cannot represent all Unicode characters. You cannot represent an emoji, for example, using Java’s char. In Java, to get the second character, you need to do something awful like…

new StringBuilder().appendCodePoint(
  emostring.codePointAt(emostring.offsetByCodePoints(0, 1))).toString()

I am sure you can find something shorter, but that is the gist of it. And it is far more expensive than charAt.
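One somewhat shorter route, available since Java 8, is the codePoints() stream (my sketch; note it is still a linear scan under the hood):

```java
public class SecondEmoji {
    public static void main(String[] args) {
        String emostring = "\uD83D\uDE02\uD83D\uDE0D\uD83C\uDF89\uD83D\uDC4D"; // "😂😍🎉👍"
        // Skip the first code point, grab the next one, and turn it back into a String.
        int cp = emostring.codePoints().skip(1).findFirst().getAsInt();
        String second = new String(Character.toChars(cp));
        System.out.println(second); // prints the heart-eyes emoji (U+1F60D)
    }
}
```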

If your application needs random access to unicode characters in a long string, you risk performance problems.

Other language implementations like PyPy use UTF-32 encoding which, unlike Java’s UTF-16 encoding, supports fast random access to individual characters. The downside is increased memory usage. In fact, it appears that PyPy wants to move to UTF-8, the dominant format on the Web right now. In UTF-8, characters are represented using 1, 2, 3 or 4 bytes.
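The trade-off is easy to measure from Java, since String.getBytes can re-encode into any supported charset (a sketch; the JDK exposes UTF-32 under the charset name "UTF-32BE"):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String emostring = "\uD83D\uDE02\uD83D\uDE0D\uD83C\uDF89\uD83D\uDC4D"; // "😂😍🎉👍"
        // For emojis, all three encodings happen to need 4 bytes per character.
        System.out.println(emostring.getBytes(StandardCharsets.UTF_8).length);      // prints 16
        System.out.println(emostring.getBytes(StandardCharsets.UTF_16BE).length);   // prints 16
        System.out.println(emostring.getBytes(Charset.forName("UTF-32BE")).length); // prints 16
        // For ASCII, however, UTF-32 pays a 4x memory penalty over UTF-8.
        System.out.println("hello".getBytes(StandardCharsets.UTF_8).length);        // prints 5
        System.out.println("hello".getBytes(Charset.forName("UTF-32BE")).length);   // prints 20
    }
}
```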

There are different trade-offs. If memory is no object, and you expect to use emojis, and you want fast random access in long strings, I think that UTF-32 is superior. You can still get some good performance with a format like UTF-8, but you will probably want to have some form of indexing for long strings. That might be difficult to implement if your language does not give you direct access to the underlying implementation.

More annoying than performance is just the sheer inconvenience to the programmer. It is 2018. It is just unacceptable that accessing the second emoji in a string of emojis would require nearly undecipherable code.

Appendix: Yes. I am aware that different code points can be combined into one visible character so that I am simplifying matters by equating “character” and “code point”. I am also aware that different sequences of code points can generate the same character. But this only makes the current state of programming even more disastrous.

25 thoughts on “Emojis, Java and Strings”

  1. “I imagine this could be more computationally expensive” … clearly you have no idea what you’re talking about, and it astounds me that you felt the need to even write this out, being how erroneous it is. What’s to stop you from simply doing length / 2? Are you autistic or something? This is not ok, and quite frankly, you should be extremely embarrassed right now. If you actually knew programming you would never feel the need to write this.

    1. What’s to stop you from simply doing length / 2?

      Given an arbitrary UTF-16 string, and its length in bytes, I cannot know how many unicode characters there are without examining the content of the bytes. So no, dividing by two is not good enough. It will work in this case, but not in general.
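A minimal Java illustration of this point (my own example, not from the comment): mix a one-unit ASCII character with an emoji and the code-unit length is odd, so halving it cannot be right in general.

```java
public class NotHalf {
    public static void main(String[] args) {
        // "a" is one UTF-16 code unit; the emoji is a surrogate pair (two units).
        String s = "a\uD83D\uDE02"; // "a😂"
        System.out.println(s.length());                      // prints 3
        System.out.println(s.codePointCount(0, s.length())); // prints 2, not 3/2
    }
}
```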

      1. Daniel, I don’t know why you bothered posting his comment, let alone replying to it. Asking if someone is autistic, telling one how one should feel, saying one doesn’t know how to program. Wow! Even if there was factual merit to what he said, I wouldn’t expect this to pass moderation 🙂

    2. Are you autistic or something?

      I don’t think that word means what you think it means. Good attempt at unprompted flaming, though. I tip my hat to Daniel for a classy response to a blatant troll.

  2. Be careful: just because it is called UTF-16 or UTF-32 does not mean 2 or 4 bytes are used per code point. In fact, the original UTF-8 design even allowed sequences of up to 6 bytes (the current standard, RFC 3629, caps it at 4).
    The compatibility mess was not created by Java though, it just tries to be as compatible as possible in a changing Unicode world where charAt() worked fine until the world changed.

        1. Below is a detailed description I read ages ago when I was trying to figure out why Java was so slow at reading Strings compared to simple ASCII reading. It was when lazy conversion and parallel CSV reading were implemented in Kettle, because you can burn a tremendous amount of CPU cycles properly reading files from all over the world, let alone doing accurate date-time conversions, floating-point number parsing and so on. It put me on the wrong foot, since all my IT life I was told that reading files is IO bound. In the world of ultra-fast parallel disk subsystems and huge caches, I can assure you this is no longer the case. Please note the link is 15 years old, from before the emoji era, but perhaps in another 15 years Unicode will have faced other challenges.

          https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

  3. I totally second that the current state of programming is disastrous. Too bad not too many programmers seem to realize that or express an intent to do something about that.

    I think about all these string problems like this: strings are not random access, period. The fact that strings have been represented as arrays with characters as elements is yet another artifact of programmer nerds’ ignorance, one of the series of “misconceptions programmers have about X.” With that, humanity should have started by inventing efficient abstractions to deal with non-random-access strings instead of the ugliness we see in Java and elsewhere.

    UTF-32 on its own may be considered a hack, in my opinion, as it is an incredibly wasteful representation: it consumes 4x the memory normally needed for an English string, which is kind of ridiculous. I would say, even UTF-16 is already not good with its 2x redundancy. Given that UTF-16 is both inefficient, and not random-access, it seems like a redundant solution in the presence of UTF-8.

  4. But are there any good use-cases for random-access to code-points? It seems like it’ll actually just encourage bugs, since it’ll kind-of sort-of work on some things, but then break when you throw a string with combining characters at it.

    It seems reasonable, perhaps even good, for a language to not provide random access to code points.

    (Tangentially, a great thing about emojis is it flushed out a lot of apps that had shitty unicode support and forced them to fix it.)

  5. But don’t the substring algos work fine operating byte-by-byte on utf8?

    As an example, Go strings are (by convention) utf8, and provide no random access to code-points. It’s AFAIK not something people complain about, and in fact, Go’s support for unicode is generally considered pretty good. (But maybe it’s just because people are too busy complaining about other things, like missing generics!) 🙂

    1. I’m not sure I understand what you are saying.

      Let us compare…

      In Python, if I want to prune the first two characters, I do…

      >>> x = "😂😍🎉👍"
      >>> x[2:]
      '🎉👍'
      

      In Swift, I do…

  var x = "😂😍🎉👍"
  var suf = String(x.suffix(2))
      

      In Go, you do…

      var x = "😂😍🎉👍"
      var suf = string([]rune(x)[2:])
      

      So I can see why people don’t complain too much about Go.

      1. Well, the Go code is doing something a bit different, it’s converting the string into a []rune (aka []int32) and then slicing that. If you’re willing to convert from string into some sort of vector type, then you’re always going to have direct indexability, of course.

        But my bigger point is that AFAIK it is never a good idea to index strings by code-point anyway. Your example, for example, happens to work on the input you’ve given, but breaks on other input.

        E.g., the string “m̀h😂😍” will not print what you expect.

        https://play.golang.org/p/iWjxjpBa-_g

        So I think it’s probably better not to have code-point indexing built into strings, as a gentle nudge towards using more sophisticated algorithms when needing to do actual “character” (i.e. grapheme) level manipulations.

          So I think it’s probably better not to have code-point indexing built into strings, as a gentle nudge towards using more sophisticated algorithms when needing to do actual “character” (i.e. grapheme) level manipulations.

          Should the language include or omit these “more sophisticated algorithms”?

          I mean… do you expect Joe programmer to figure this out on his own… Or do you think that the language should tell Joe about how to do it properly? Or should Joe never have to do string manipulations?

          I would argue that Java provides no help here. It explicitly allows you to query for the character at index j and gives you a “character” which can very well be garbage. How useful is that?

          Code points would be better. Still, I agree that code point indexing is probably not great (even though it is better than whatever Java offers) but… if you want better, why not go with user-perceived characters?

          Swift gives you this…

            1> var x = "m̀h😂😍"
          x: String = "m̀h😂😍"
            2> x.count
          $R0: Int = 4
           3> var suf = String(x.suffix(3))
          suf: String = "h😂😍"
           4> var suf = String(x.suffix(4))
          suf: String = "m̀h😂😍"
          

          What, if anything, do you not like about Swift?

          I think Swift is way ahead of the curve on this one.

          1. Hey, that’s cool! I’m not a swift user, but looking up the docs, Swift is doing the correct thing, giving you “extended grapheme clusters”. Great!

            It’s just the middle-ground of giving you code-points which I’m not a fan of — it leads you toward bugs that are hard to notice.

            (I also still like Go approach of, “a string is a sequence of utf8 bytes; use a unicode library if you want fancy manipulations”. Maybe the Swift approach will turn out to be even nicer, though hard to say w/o experience using it.)

              1. I feel like such a philistine, since I don’t know Rust either, but that is not a surprising result to me!

                Go will give you the same.

                https://play.golang.org/p/jimB5h8WwWn

                The reason is the first two code-points are

                006D LATIN SMALL LETTER M
                0300 COMBINING GRAVE ACCENT

                (Those two code-points combine together to give you the single grapheme “m̀”.)

                Encoded into utf8, they become 3 bytes (109, 204, 128). So if you are treating the string as a sequence of utf8 bytes, slicing the first 3 elements would give you that.

                So it looks like Rust, like Go, takes this approach. And if you care about fancier manipulations, you need to use a library (e.g., https://crates.io/crates/unicode-segmentation).
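The same three bytes are easy to verify from Java, to continue the article's theme (my aside, not from this thread; Java prints bytes as signed, so 204 and 128 appear as -52 and -128):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CombiningBytes {
    public static void main(String[] args) {
        // U+006D LATIN SMALL LETTER M followed by U+0300 COMBINING GRAVE ACCENT
        String m = "m\u0300";
        byte[] bytes = m.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(bytes)); // prints [109, -52, -128]
    }
}
```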

                As a fun aside, that string breaks a couple playgrounds:

                https://play.rust-lang.org/?gist=9958c46c59eff8d655c818e55580d202&version=undefined&mode=undefined

                https://trinket.io/python/8a0742b45e

                Try editing text after the “m̀”; the cursor position doesn’t match correctly. You also can’t select the string in the Rust playground.

                The Go playground works correctly, but probably just because it uses a simple text-entry box w/o syntax highlighting or other niceties. (But would you rather have simple-but-correct or fancy-but-buggy software?)

                Finally, I managed to hang emacs by asking it to describe-char “m̀”.

                Unicode support is still janky in a lot of places!

                  1. Sorry, I was not trying to imply you didn’t understand the result, just provide some explanation/context/motivation for the result.

                    I think what you’re saying is, “I expect a string to look like a sequence of graphemes”.

                    Whereas Go and Rust say, “a string is sequence of utf8 bytes”. So in that sense, it’s not what you expect.

                    I think the Go and Rust approach is still reasonable, since they’re likely to lead to correct software. (Vs, say, Python, which is “almost right” in the default case, making it easier to make subtly-broken software.)

                    (Come to think, perhaps a better test-case to give you would’ve been “👷‍♀️👩‍⚕️🎉👍”.)

                    The Swift approach seems reasonable too, and maybe even better since it does the right thing by default, though at the cost that you’ve got a lot of unicode complexity in your core string class, and it’s non-obvious (at least to me) what your internal string representation is going to be, or what the perf cost of various operations is going to be. (E.g., is something like “.count” on a Swift string constant time, or does it have to run through the whole string calculating the graphemes?)

                    1. I think the Go and Rust approach is still reasonable, since they’re likely to lead to correct software.

                      In what sense?

                      You are still left to do things like normalization on your own. This makes it quite hard to do correct string searches in Go, say.

                      Try this:

                      package main

                      import (
                        "fmt"
                        "strings"
                      )

                      func main() {
                        var x = "Pok\u00E9mon"
                        var y = "Poke\u0301mon"
                        fmt.Println("are ", x, " and ", y, " equal/equivalent?")
                        fmt.Println(x == y)
                        fmt.Println(strings.Compare(x, y))
                      }

                      Sure, you can remember to use a unicode library as you say and never rely on the standard API to do string processing, but Go does not help you. If you don’t know about normalization, and try to write a search function in Go, you will get it flat wrong, I bet.
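For what it is worth, Java has the same sharp edge: the standard library ships a normalizer (java.text.Normalizer), but nothing in the default string API applies it for you. A minimal sketch:

```java
import java.text.Normalizer;

public class Pokemon {
    public static void main(String[] args) {
        String x = "Pok\u00E9mon";  // precomposed é (U+00E9)
        String y = "Poke\u0301mon"; // e followed by a combining acute accent (U+0301)
        System.out.println(x.equals(y)); // prints false: different code point sequences
        // After NFC normalization, the two strings compare equal.
        String nx = Normalizer.normalize(x, Normalizer.Form.NFC);
        String ny = Normalizer.normalize(y, Normalizer.Form.NFC);
        System.out.println(nx.equals(ny)); // prints true
    }
}
```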

                    2. (I think I hit the nesting depth limit for replies; this is a reply to Daniel’s sibling comment at 4:04.)

                      That is a fair point, but I would not view the situation as dimly as you do.

                      I would say software like that has sharp edges, rather than being incorrect. If I, as a user, normalize my input before handing it off to the software, it will function correctly. This is how emacs works, for example. It is an annoyance occasionally, but not in my mind a “bug” per se.

                      Compare this situation to the two code playgrounds I posted above.

                      Once you include a multi-code-point grapheme in your input, they stop working correctly, full stop. The character insert offset is shown incorrectly, and text selection using the mouse is glitchy. There is nothing you can do as a user to avoid this.

                      So that’s the style of bug that’s encouraged by the “almost correct” perspective of a string as a sequence of code-points.

                      I take your point, though, that Swift’s perspective of a string as a sequence of graphemes may be the superior approach, avoiding both types of undesirable behavior.

                      (Though I guess at some perf & complexity price.)

                      So going back to your original post, in my view, Python 3’s behavior is bad, Go and Rust are ok, and Swift is (maybe) the best.

  6. Interesting blog post.

    I wanted to point out that MoarVM (the Perl6 VM) uses a string representation called ‘normalized form grapheme’ that allows efficient random access on unicode grapheme strings. Link to documentation.

    The essence of the trick is to combine all combining marks into a grapheme and map it to a synthetic code point (I believe a negative number), which is then ‘unmapped’ when encoding to an external format (e.g., UTF-8).

    This is obviously not perfect as it incurs extra cost at IO, although that is true of any system that uses anything other than UTF-8 internally. So I think it is a nice solution (until unicode runs out of 31 bit space, that is).
