The missing research tool…

I want to know when a new research paper…

  • is similar to one of my papers;
  • cites one of my papers;
  • is relevant to my current research.

Why can’t I have this already? Why do I have to manually go to Google Scholar to monitor what my peers are doing? And Google Scholar always falls short because it does not know who Daniel Lemire is as a researcher.

Years of research in Natural Language Processing should make the construction of such a tool almost as easy as building a bridge. It is not difficult to parse my research papers, learn from them, and detect similar papers as they appear on the net.
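To make the "detect similar papers" step concrete, here is a minimal sketch in Python. It assumes the papers are already available as plain text (titles or abstracts) and it uses TF-IDF weights with cosine similarity, which is one common baseline and not necessarily what a real tool would use. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words of three or more letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (a dict term -> weight) per document."""
    token_lists = [tokenize(doc) for doc in documents]
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF: a term found in every document keeps a small positive weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_new_papers(my_papers, new_papers):
    """Score each new paper by its best similarity to any of my papers."""
    vectors = tfidf_vectors(my_papers + new_papers)
    mine = vectors[:len(my_papers)]
    candidates = vectors[len(my_papers):]
    scored = [
        (max((cosine(c, m) for m in mine), default=0.0), text)
        for c, text in zip(candidates, new_papers)
    ]
    # Highest score first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank_new_papers(my_abstracts, incoming_abstracts)` returns `(score, text)` pairs, best match first. The hard parts remain getting the text in the first place and resisting authors who game the ranking.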

So we need a new generation of tools:

  • The tool knows who the researcher is, what he published, what he cited…
  • The tool knows what the researcher is currently working on.
  • The tool can filter and aggregate the data automatically and offer it to the researcher without effort.
  • The tool promotes open access content when possible.

I would pay for such a tool.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

18 thoughts on “The missing research tool…”

  1. @Dupuis

    Scopus and/or Web of Science do a few of the things you’re looking for, not perfectly, but it’s a start.

    I would argue it is a *bad* start.

    Scopus knows about 11 of my papers, and one of them is not from me, so it knows about 10 of my papers. It thinks that my 2000 paper “Wavelet time entropy” is my most cited work (with 21 citations).

    Should I trust Scopus? Is this an accurate picture? No. Not even close.

    Google Scholar tells me otherwise. My RACOFI paper from 2003 is cited 39 times according to Google Scholar, yet it does not even exist on Scopus! My Slope One paper from 2005 is cited 36 times according to Google Scholar and it does not exist on Scopus! My Tag-cloud drawing paper from 2007 is cited 20 times according to Google and… it does not even exist on Scopus.

    Even if you don’t trust the numbers Google Scholar gives you, the 3 papers I just listed do exist. They have been repeatedly cited and there is even a Wikipedia page about one of them. Yet, as far as Scopus is concerned, I have hardly been cited for my work after 2000… except for the Scale and translation invariant collaborative filtering systems paper…

    Coverage matters a lot more to a researcher than precision. Missing 3 of my most important contributions is a big deal to me. I don’t care that it reports only 10 of my papers… I care that it misses my most important work though!!!

    A tool that does not know about my important work can’t possibly help me monitor upcoming papers efficiently.

  2. That would be a trivial mashup, if the data were available in machine-readable format. What’s the challenge here? Is the raw data available? Is it parsing the papers to recognize citations?

  3. @Haran

    The data is most certainly not available in structured format. Nor is it available from one place only. Even if you can parse the papers to recognize the citations, you have to link the citations to the papers. That is not easy. There are many ways to cite a paper, and several papers have almost the same titles and almost the same authors.

    There are places to get you started. For example, in Computer Science, DBLP makes available a rather large list of papers as XML. The papers in the arxiv database can also, I presume, be indexed somehow.

    Recognizing similarities between what the researcher is doing and a given paper is also not trivial. It is probably similar to spam filtering. There may even be people who will try to cheat the system to get their papers recommended more often!

    So, it is a difficult challenge, for many reasons. But it seems that as years go by, no progress is being made. I have seen zero progress in the last two years on this problem. None. Nada.

    And it is not just the challenge in getting access to the data. For example, even open access archives (such as arxiv) are hard to monitor!

  4. That’s true but the problem with publisher tools is they apply only to the content owned by the publisher (usually). What’s needed is publisher-neutral tools that don’t care who owns the intellectual property. (Which is related to, in spirit anyway, Daniel’s desideratum “The tool promotes open access content when possible”)

  5. I use Web of Science for this. I have saved searches for relevant terms, and one for any papers that cite key papers. Weekly emails summarise it all.

  6. This is a symptom of a general problem. As long as academic writings are hidden in a maze of pay-for-access ghettos, access to information (and the tools used) will be poor.

    The same base problem makes academic writings less useful (and thus less meaningful) to the entire community.

    Solve the underlying problem.

  7. ISI Web of Science, although incomplete, allows you to trace papers citing a specific paper.

    Also, I don’t know about computer science, but for health sciences in general, many publishers allow you to set citation alerts on papers of interest. That is, each time a paper important to your literature is cited (including yours), it e-mails you the reference.

    I find these tools very valuable for staying informed of what is going on in the specific domain I am working on.

  8. After reading this post, I wrote a little shell script that polls Google Scholar for new citations to my papers. I used wget with the Google Scholar URL and full paper title, egrep -o "Cited by [0-9]+", then stored the counts in a file. If the counts change, the script e-mails me.

    Of course, this misses any citations that Google Scholar doesn’t pick up.

    The link by Suresh, WhatToSee, looks very useful.
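The polling idea in the last comment can be sketched in Python as follows. This is not the commenter's script: it only mirrors the described steps (extract "Cited by N", store the counts, report changes). Fetching the page is left to a caller-supplied function, here called `fetch_page`, which is hypothetical, as are the other names. Sending the e-mail is also left to the caller.

```python
import json
import re
from pathlib import Path

# Same pattern as the egrep expression quoted in the comment.
CITED_BY = re.compile(r"Cited by ([0-9]+)")


def extract_citation_count(page_text):
    """Return the first 'Cited by N' count found in the page, or None."""
    match = CITED_BY.search(page_text)
    return int(match.group(1)) if match else None


def load_counts(store_path):
    """Load the previously stored counts, or an empty dict on the first run."""
    path = Path(store_path)
    return json.loads(path.read_text()) if path.exists() else {}


def changed_counts(old, new):
    """Return {title: (old_count, new_count)} for every count that differs."""
    return {
        title: (old.get(title), count)
        for title, count in new.items()
        if old.get(title) != count
    }


def poll(titles, fetch_page, store_path):
    """Refresh the stored counts and return the ones that changed.

    fetch_page is an assumption: any function mapping a paper title to the
    text of a results page (for example, a wrapper around an HTTP client).
    """
    old = load_counts(store_path)
    new = dict(old)
    for title in titles:
        count = extract_citation_count(fetch_page(title))
        if count is not None:  # keep the old value when nothing was found
            new[title] = count
    Path(store_path).write_text(json.dumps(new, indent=2))
    return changed_counts(old, new)
```

A scheduled job could call `poll(...)` once a day and send a message whenever the returned dictionary is non-empty. As the comment notes, this only sees citations the search engine itself has indexed.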
