How do search engines handle special characters? Should you care?

Matt Cutts is Google’s search engine optimization expert. He runs a great YouTube channel called Google Webmaster Central. He was recently asked how Google handles special characters such as ligatures, soft hyphens, interpuncts and hyphenation points. His answer? He doesn’t know.

Being a scientist, I decided to compare Google to the young upstart (Bing). Using Google, the result set for Kurt Goedel differs from the one for Kurt Gödel. For example, I could only find the Kurt Goedel article from Uncyclopedia when searching for Kurt Goedel. Similarly, Google fails to realize that cœur and coeur are the same word. However, Bing knows that Goedel and Gödel are the same person, and that cœur and coeur are the same word.

While the consequences are small, they are nevertheless real:

  • Students may fail to find great references on Kurt Gödel because they search for Kurt Goedel. Indeed, most academic papers seem to prefer Gödel to Goedel.
  • A writer who tries to be typographically correct and writes cœur may get penalized when people search for coeur.
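
Standard Unicode normalization alone does not make these spellings equal. The following is a minimal Python sketch, using only the standard `unicodedata` module, that checks whether any of the four standard normalization forms makes two spellings identical. The helper names `standard_forms` and `equal_under_some_form` are made up for illustration, and nothing here describes how Google or Bing are actually implemented.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def standard_forms(text):
    """Map each standard normalization form to the normalized text."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}

def equal_under_some_form(a, b):
    """True if a and b become identical under at least one form."""
    forms_a, forms_b = standard_forms(a), standard_forms(b)
    return any(forms_a[form] == forms_b[form] for form in FORMS)

for left, right in [("Gödel", "Goedel"), ("cœur", "coeur")]:
    print(left, right, equal_under_some_form(left, right))
```

Both pairs stay different: ö only decomposes into o plus a combining diaeresis, and œ has no decomposition at all. An engine that equates them is therefore applying extra, language-aware rules on top of Unicode normalization.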

Score: Bing 1. Google 0.

Source: Will Fitzgerald. Special thanks to Mark Reid, Marek Krajewski, Jeff Erickson and Christer Ericson for the debate on Twitter motivating this blog post.

10 thoughts on “How do search engines handle special characters? Should you care?”

  1. @John

    Great point.

    In this case, comparing “Kurt Godel” and “Kurt Gödel”, both Bing and Google fail my “this is the same person” test.

    So, if you want people to find your article on Kurt Gödel more easily, maybe you should include a few “Kurt Godel” typos. 😉

  2. Michael’s comments are well stated–at least, it’s what I would have written as my comment!

    The original question to Matt Cutts was about ligatures (e.g. is “Duﬀ’s Beer” the same as “Duff’s beer”) and typographic hyphens.

  3. Apparently Google gives you somewhat different results when you search on “Godel” as well. I would expect that more English speakers simply leave off diacritical marks than, for example, change ö to oe.
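
The two habits mentioned here, dropping the diacritic (ö to o) and transliterating it (ö to oe), can both be covered by generating more than one lookup key per term. This is a minimal Python sketch under stated assumptions: the `TRANSLITERATIONS` table is a deliberately tiny, hypothetical mapping, and `search_keys` is an illustrative name, not part of any search engine's API.

```python
import unicodedata

# Hypothetical, deliberately tiny transliteration table.
TRANSLITERATIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss", "œ": "oe"}

def strip_marks(text):
    """Drop combining marks after canonical decomposition (ö -> o)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def transliterate(text):
    """Replace characters listed in TRANSLITERATIONS (ö -> oe)."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(TRANSLITERATIONS.get(ch, ch) for ch in composed)

def search_keys(text):
    """Return every key under which the term could be indexed or looked up."""
    lowered = text.lower()
    return {strip_marks(lowered), strip_marks(transliterate(lowered))}

print(search_keys("Gödel"))   # contains both "godel" and "goedel"
print(search_keys("Goedel"))  # contains only "goedel"
```

Two terms can then be treated as candidates for the same entity when their key sets overlap, so “Gödel” meets both “Godel” and “Goedel” even though those two do not meet each other.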

  4. It’s actually even more complicated. The search engine isn’t a single entity, but rather many online and offline processes, each of which can implement different rules.

    Let’s assume for a moment that the web page and server code correctly handle and log the Unicode text representation (so that ö isn’t corrupted somewhere along the way); surprisingly many sites already fail this step.

    You’ve got things like autocomplete, stemming, spell correction, and synonym handling, each of which is distinct software that normalizes it’s inputs differently. Then you’ve got all the offline processes that analyze query logs, index documents, extract terms, etc. Some of these can map oe to ö while others don’t.

    In your example, it’s unclear which of these systems is not treating these queries as equivalent. It might even be ranking; maybe uncyclopedia is in the results somewhere, but unnormalized query-dependent ranking changes it’s order of appearance. By playing with additional constraints (eg, you may be able to further probe the implementation.

    Text normalization is still an afterthought in most software, even at Google.
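
The point about separate components normalizing differently can be sketched in Python as follows. Everything here is an assumption made for illustration: the stage names, and which normalizer each stage uses, are invented and do not describe any real search engine. The sketch only shows how two queries can count as equivalent in one stage and not in another.

```python
import unicodedata

def lowercase_only(text):
    """Normalizer that ignores diacritics entirely."""
    return text.lower()

def strip_marks(text):
    """Normalizer that drops combining marks (ö -> o)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def expand_umlauts(text):
    """Normalizer that transliterates German umlauts (ö -> oe)."""
    text = unicodedata.normalize("NFC", text).lower()
    for source, target in (("ä", "ae"), ("ö", "oe"), ("ü", "ue")):
        text = text.replace(source, target)
    return text

# Hypothetical assignment of normalizers to pipeline stages.
STAGES = {
    "autocomplete": lowercase_only,
    "spell_correction": strip_marks,
    "indexing": expand_umlauts,
}

def agreeing_stages(a, b):
    """List the stages that would treat queries a and b as equivalent."""
    return [name for name, normalize in STAGES.items()
            if normalize(a) == normalize(b)]

print(agreeing_stages("Gödel", "Goedel"))
print(agreeing_stages("Gödel", "Godel"))
```

With these made-up stages, “Gödel” and “Goedel” agree only at indexing, while “Gödel” and “Godel” agree only at spell correction, which is the kind of inconsistency that is hard to diagnose from the result page alone.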

  5. Speaking of autocomplete, I typed that on an iPhone, and it “corrected” its to “it’s” without my noticing. But at least it handles umlauts. 🙂

  6. I liked Michael’s post because it pointed out the complications inherent in the whole stack, from source text to web servers and on through any other storage and presentation tools. For engines that don’t normalize both user input and index, I would add to the list the difficulty users have entering diacritical characters, in either Unicode form, in an input text field.

    I found Daniel’s blog post interesting because he’s measuring the search engines’ ability to help users wade through this dirty data. Despite the complications inherent in the whole stack, what can the engines do to get English-speaking users the right information? As the full stack of tools improves its internationalization support, we’ll get improved source data, but for now, dirty data is a fact of life for the search engines.

    Complicating the picture even more is when two separate input concepts normalize to homographs. For example, Russian pisát “to write” vs písat “to piss” (I couldn’t enter the Russian characters successfully, but this version suffices). There are situations in which a non-native Russian speaker could become pretty embarrassed when presenting to a Russian-speaking audience, all because the search engine decided to normalize the two terms to the same thing.

    One thing the experts could look into is how foreign search providers are handling these issues. It’s possible that the answer for English-speaking individuals is to normalize everything, but it may be worth investigating foreign search providers’ tactics for handling the wealth of dirty data out there on the ‘net.
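
The homograph risk described in this comment is easy to reproduce. The sketch below, in Python with the standard `unicodedata` module, folds accents away and then reports which distinct terms end up sharing a key. The function names `fold` and `collisions` are illustrative choices, and accent stripping is only one possible normalization an engine might apply.

```python
from collections import defaultdict
import unicodedata

def fold(text):
    """Lowercase and drop combining marks (í -> i, á -> a)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def collisions(terms):
    """Group distinct terms by folded key; keep only keys with a clash."""
    groups = defaultdict(set)
    for term in terms:
        groups[fold(term)].add(term)
    return {key: sorted(group) for key, group in groups.items()
            if len(group) > 1}

print(collisions(["pisát", "písat", "Gödel"]))
```

Here both transliterated Russian words collapse to the key `pisat`, so an index built only on folded keys can no longer tell them apart. Running a check like this over a vocabulary is one way to find where normalization would merge words with different meanings.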

  7. NLP is fun in that so often you run into cases like this where there just isn’t a universal right answer. Googling Godel, I see a gallery on the front page. What if that’s the Godel I want? Pulling Gödel in would dilute my results even more. Even capitalization can be significant. ‘Papa’ = Spanish for pope, ‘papa’ = Spanish for potato.

    I like Google’s use of “did you mean” to suggest things I might have meant while still letting me see both sets. For researchers who really care about this, ways of explicitly enabling stemming, character normalization, etc. are useful (but overly complex for a Google). In the long run I think the solution is machine intelligence that can differentiate contexts (did you mean the vegetable or the pontiff? Let me find all the pages where it was used in that sense).
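
The idea of letting the searcher explicitly switch normalization on or off can be sketched as an index that stores both exact and case-folded keys. This is a toy Python example under stated assumptions: the class `TinyIndex`, its `fold_case` flag, and the sample documents are all invented for illustration, and whitespace splitting stands in for real tokenization.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index with exact and case-folded postings."""

    def __init__(self):
        self.exact = defaultdict(set)
        self.folded = defaultdict(set)

    def add(self, doc_id, text):
        """Index every whitespace-separated token under both keys."""
        for token in text.split():
            self.exact[token].add(doc_id)
            self.folded[token.casefold()].add(doc_id)

    def search(self, term, fold_case=True):
        """Return matching document ids; fold_case=False demands exact case."""
        if fold_case:
            return set(self.folded.get(term.casefold(), set()))
        return set(self.exact.get(term, set()))

index = TinyIndex()
index.add("pope", "el Papa visita Madrid")
index.add("potato", "una papa frita")
print(index.search("papa"))                   # both documents
print(index.search("Papa", fold_case=False))  # only the "pope" document
```

The cost is a second set of postings, which is one reason such switches tend to appear in specialist tools rather than in a general-purpose search box.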

  8. Great points about homographs, capitalization, etc. affecting interpretation of the search. Of course, the minute we get beyond something “simple” like character equivalence classes, we’re into all the challenges that make search such a fun and exciting space to work in.

    Stop phrases (“The Help” is the title of a very popular book), punctuation (C++, R.S.V.P., WALL-E, Math.rand(), etc.), mixed encodings (especially in the Far East region, where you get URL-escaped GBK mixed with CJK in the same search request), etc.

    And these are all just the cases where we assume perfect queries and corpus. In the real world, spelling errors, encoding errors, and disagreement about canonical form abound. Recent examples include the movie “Kick-Ass” (or is it “Kick ass” or “Kickass”?), the product “iPhone” (or is it “i-Phone” or “eye phone” or oops “iPone” or “iPhome”), etc. Maybe you misplaced the umlaut: Kürt Godel (which seems to “work” in both Google and Bing, with no spell-correction or backout links displayed by either engine).

    Fun stuff!
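
The tension in this comment between collapsing variants such as “Kick-Ass”, “Kick ass” and “Kickass” and preserving punctuation in terms such as C++ can be sketched in Python as below. The `PROTECTED` set and the function name `canonical` are hypothetical; a real system would need a far larger, curated list and would still face the spelling errors mentioned above.

```python
import re

# Hypothetical list of terms whose punctuation carries meaning.
PROTECTED = {"c++", "r.s.v.p.", "wall-e"}

def canonical(query):
    """Case-fold the query and, unless protected, drop spaces and hyphens."""
    lowered = query.casefold().strip()
    if lowered in PROTECTED:
        return lowered
    return re.sub(r"[\s\-]+", "", lowered)

for variant in ("Kick-Ass", "Kick ass", "Kickass", "C++", "WALL-E"):
    print(variant, "->", canonical(variant))
```

The three movie-title variants share one key while C++ and WALL-E keep their punctuation. Misspellings such as “iPone” are untouched by this kind of rule and need spell correction instead.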
