Can you infer tags from text?

Tags are all the buzz these days. Tagyu is an interesting tool that claims to suggest tags based on the text content of a page. I’d like to see a description of the algorithm, but I could not find one.

  • http://www.daniel-lemire.com/ gets the tags “firefox” and “web2.0”.
  • http://www.daniel-lemire.com/en/ gets the tag “job”.
  • http://www.daniel-lemire.com/fr/ gets the tags “france” and “uqam”.

It seems like the tags for my blog make sense, but the tags for my home pages (French and English) are really bad. Tagging my French home page with “france”? Maybe because I use the French language? It is a bit of a stretch. Tagging my English home page with “job”? No. I don’t think so.

The problem is interesting and I bet there are solid solutions, but we are not there yet.

I also question whether collaborative tags have a future. I must admit I don’t use them, so I won’t comment much further, but the approach is a bit too empirical for my taste.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

3 thoughts on “Can you infer tags from text?”

  1. Tagyu doesn’t do a great job on home pages because they are “about” too many things. Blogging home pages typically contain several subjects on one page, and that can confuse the tool.

    Tagyu works by finding documents similar to your text and determining how they have been tagged by others. The tags suggested come from the tags on related documents.

    If I send the text of this blog entry to Tagyu, then I get the following tags:

    tools tagging blog tags del.icio.us

  2. Hi Mr. Lemire,

    I think that part of the problem is the way tags are handled by some systems, not tags in themselves.

    First, suppose we consider tags to be the main topics of a digital document (text, video, etc.). Then a few tags (between 2 and 8) will describe the meaning of that document’s content. They could be keywords present in the document or terms semantically related to it.

    That said, systems that handle tags would have to check the meaning of, and the semantic links between, the tags used to describe a digital document to know what they really describe.

    Salutations,

    Frédérick.
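The first comment describes suggesting tags by finding documents similar to the input text and reusing the tags other people gave them. Here is a minimal sketch of that idea, not Tagyu's actual implementation, which is not documented here. It assumes bag-of-words cosine similarity as the similarity measure, and the function names, the `k` and `max_tags` parameters, and the tiny `TAGGED_DOCS` corpus are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_tags(text, tagged_docs, k=2, max_tags=3):
    """Pool the tags of the k documents most similar to the text.

    tagged_docs is a list of (document_text, tags) pairs. Each neighbour
    votes for its own tags with a weight equal to its similarity.
    """
    query = Counter(tokenize(text))
    scored = [
        (cosine(query, Counter(tokenize(doc))), tags)
        for doc, tags in tagged_docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    votes = Counter()
    for similarity, tags in scored[:k]:
        if similarity > 0:  # ignore documents sharing no words with the text
            for tag in tags:
                votes[tag] += similarity
    return [tag for tag, _ in votes.most_common(max_tags)]


# Hypothetical hand-made corpus standing in for documents tagged by others.
TAGGED_DOCS = [
    ("firefox extensions and browser tips", ["firefox", "browser"]),
    ("tagging web pages with del.icio.us bookmarks", ["tagging", "del.icio.us"]),
    ("a tool that suggests tags for a blog entry", ["tags", "tools", "blog"]),
    ("job openings for software developers", ["job"]),
]

print(suggest_tags("job openings", TAGGED_DOCS, k=1))
```

A sketch like this also hints at why a home page is hard to tag: a page that touches many subjects overlaps weakly with many documents, so a few incidental words can decide which neighbours vote.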
