The Google Similarity Distance

I read a paper on the Google Similarity Distance this morning, by Cilibrasi and Vitányi. They measure how often words co-occur using page counts from the Google search engine. Their formula is (G(x,y)-min(G(x,x),G(y,y)))/max(G(x,x),G(y,y)), where G is the “Google code” function. The Google code is defined as G(x,y) = -log g(x,y), where g(x,y) is the normalized number of web pages containing both term x and term y; G(x,x) is therefore simply the code for the single term x. The normalization is such that if you sum g(x,y) over all x,y, you get 1.0. With this simple approach, they seem to be able to translate between English and Spanish, build a thesaurus, and so on. This reminds me a bit of Turney's recent work on analogies.
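
To make the formula concrete, here is a minimal Python sketch of how I read it. The function names, the page counts and the `total` normalizer are all my own assumptions for illustration. In particular, `total` stands in for the exact normalizing constant, and nothing here queries a search engine.

```python
import math


def google_code(count, total):
    """Return G = -log g, where g = count / total is a normalized page count.

    `total` is an assumed stand-in for the normalizing constant.
    """
    if count <= 0:
        # A term (or pair) that never occurs has an infinitely long code.
        return math.inf
    return -math.log(count / total)


def similarity_distance(count_x, count_y, count_xy, total):
    """Compute (G(x,y) - min(G(x,x), G(y,y))) / max(G(x,x), G(y,y)).

    count_x and count_y are the page counts for each term alone,
    count_xy is the page count for both terms together.
    """
    g_x = google_code(count_x, total)
    g_y = google_code(count_y, total)
    g_xy = google_code(count_xy, total)
    if math.isinf(g_xy):
        # The terms never co-occur: treat them as maximally distant.
        return math.inf
    largest = max(g_x, g_y)
    if largest == 0:
        # Both terms appear on every page, so they are indistinguishable.
        return 0.0
    return (g_xy - min(g_x, g_y)) / largest


if __name__ == "__main__":
    # Hypothetical counts, not real search results.
    print(similarity_distance(1000, 1000, 10, 10**6))
```

With these made-up counts the distance comes out at roughly 0.667, and two terms that always appear together get a distance of 0.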

Published by

Daniel Lemire

A computer science professor at the Université du Québec (TELUQ).

