The Google Similarity Distance

This morning I read a paper by Cilibrasi and Vitanyi on the Google Similarity Distance. They measure how closely two words are related by how often they co-occur on web pages, using counts from the Google search engine. Their formula is (G(x,y)-min(G(x,x),G(y,y)))/max(G(x,x),G(y,y)), where G is the “Google code” function. The Google code is defined as G(x,y) = -log g(x,y), where g(x,y) is the normalized number of web pages containing both term x and term y. The normalization is chosen so that g(x,y) summed over all x and y equals 1.0. With this simple approach, they seem to be able to translate between English and Spanish, build a thesaurus, and so on. This reminds me a bit of the recent work done by Turney on analogies.
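
To make the formula concrete, here is a minimal Python sketch, not the authors' code. It assumes you already have raw page counts for x alone, y alone, and both together, plus a normalizing constant `total` that turns counts into g values. The function names and the `total` parameter are my own assumptions for illustration, and how the counts are obtained from a search engine is left out.

```python
import math


def google_code(count, total):
    """Google code G = -log g, with g = count / total (assumed normalization)."""
    if count <= 0:
        # No matching pages: g is 0, so the code length is infinite.
        return math.inf
    return -math.log(count / total)


def google_distance(f_x, f_y, f_xy, total):
    """(G(x,y) - min(G(x,x), G(y,y))) / max(G(x,x), G(y,y)).

    f_x, f_y: hypothetical page counts for each term alone.
    f_xy: hypothetical page count for pages containing both terms.
    total: hypothetical normalizing constant for the counts.
    """
    if not (0 < f_x <= total and 0 < f_y <= total):
        raise ValueError("single-term counts must be in (0, total]")
    g_x = google_code(f_x, total)    # G(x,x)
    g_y = google_code(f_y, total)    # G(y,y)
    g_xy = google_code(f_xy, total)  # G(x,y)
    denom = max(g_x, g_y)
    if denom == 0:
        # Both terms appear on every page, so they are indistinguishable.
        return 0.0
    return (g_xy - min(g_x, g_y)) / denom
```

With made-up counts, `google_distance(100, 1000, 10, 10**6)` comes out at about 0.5, while two terms that always appear together, as in `google_distance(1000, 1000, 1000, 10**6)`, get a distance of 0. Terms that never co-occur get an infinite distance in this sketch.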
