From counting citations to measuring usage (help needed!)

We sometimes measure the caliber of researchers by how many research papers they write. This is silly. While there is some correlation between quantity and quality — people like Einstein tend to publish a lot — the count is easily gamed. Moreover, several major researchers have published relatively few papers: John Nash has about two dozen papers in Scopus. Even if you don’t know much about science, I am sure you can think of a few writers who have written only a couple of books but are still world famous.

A better measure is the number of citations a researcher has received. Google Scholar profiles display the citation record of researchers prominently. It is a slightly more robust measure, but it is still silly because 90% of citations are shallow: most authors haven’t even read the paper they are citing. We tend to cite famous authors and famous venues in the hope that some of the prestige will rub off on our own work.

But why stop there? We have the technology to measure the use made of a cited paper. Some citations are more significant than others: a citing paper may, for example, extend the cited work. Machine learning techniques could measure the impact of your papers based on how much later papers build on your results. Why isn’t it done?

People object that defining a metric based on machine learning is troublesome. However, we rely daily on spam filters, search engines and recommender systems that we do not fully understand. Measures that are beyond our ability to compute by hand have repeatedly proven useful. Moreover, identifying important citations can have other applications:

  • Google Scholar says that I am cited about 160 times a year. On average, a paper citing me comes out every two days. What does it mean? I don’t know. I would be interested in identifying quickly which papers make non-trivial use of my ideas. I am sure many researchers would be interested too!
  • I sometimes stumble on older highly cited papers. I want to quickly identify the significant follow-up papers. Yet I am often faced with a sea of barely relevant papers that merely cited the reference in passing. It would be tremendously useful for me to know which papers have cited the reference meaningfully.

Hence, I surveyed the machine learning literature on classifying citations. I found high quality work, but I feel it is an under-appreciated problem. So I got in touch with Peter Turney and Andre Vellino and we decided to promote this problem further.

Our first step is to collect a data set of papers together with their most important references. We believe that the best experts to determine which references are crucial are the authors themselves!

So, if you are a published researcher, we ask you to contribute by filling out our short online form. On this form, you will be asked for your name and a few papers, together with an identification of the crucial references for each paper. The form takes less than 30 seconds to fill out.

In exchange, we will publish the data we collect under the ODC Public Domain Dedication and Licence. If you leave us your email, we will even tell you when the data is publicly available. Such a public high-quality data set should entice a few researchers to write papers. And, of course, I might contribute to such a paper myself.

My long-term goal is simple: I hope that in a couple of years, Google Scholar will differentiate between citations and “meaningful” citations.

Now go fill out the form!

Note: I have an earlier version of this post on Google+ with several insightful comments.

Further reading: Building a Better Citation Index by Andre Vellino

Update: The dataset is available.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

17 thoughts on “From counting citations to measuring usage (help needed!)”

  1. I find your approach interesting but I’m afraid you are going to collect a dataset too sparse to be useful 🙁

    On the other hand, in the form you say “By an essential reference, we mean a reference that was highly influential or inspirational for the core ideas in your paper”. Unfortunately, I respectfully disagree that this is the only way of finding great papers and/or better metrics for influence on other researchers.

    First of all, a paper’s influence on another paper is not always so strong, especially as time goes by: some influential papers come to be considered basic literature that must be cited just to provide context for a current work, not because they are “inspirational” for that work.

    Secondly, some papers do not have any inspiration; they simply appear out of thin air. You have mentioned John Nash, and his PhD dissertation is a great example of this: two references, one of which is a self-citation 🙂

    In my first reading of this post I thought “why don’t they use PageRank or a similar algorithm to compute that?”. Now I have found the comment about it: that approach is quite interesting *but* I simply don’t like the idea of computing the score for journals; why not compute it for individual papers?

    In fact, I think that would be a rather sensible approach: a relevant/interesting/influential paper would be a paper cited by many relevant/interesting/influential papers.

    In other words, once such a score is computed for all papers, you can find the most influential papers cited by any paper and the most influential papers citing a given paper.

    *And* with that information you could train an ML method to “distinguish” between the context surrounding citations of the papers which have been really influential in a given work and the context surrounding the supportive citations.

    Needless to say, there are tons of details missing here and it wouldn’t be as easy as I assume but I’d give it a try (provided the citation graph data was available).
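The commenter’s idea of running PageRank over individual papers rather than journals can be sketched in a few lines. The toy citation graph and paper names below are invented purely for illustration:

```python
# Minimal power-iteration PageRank over a citation graph of individual
# papers. `citations` maps each paper to the list of papers it cites;
# score flows from citing papers to cited papers.
def pagerank(citations, damping=0.85, iterations=50):
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    n = len(papers)
    score = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in papers}
        for paper in papers:
            cited = citations.get(paper, [])
            mass = damping * score[paper]
            if cited:
                for c in cited:
                    new[c] += mass / len(cited)
            else:
                # papers citing nothing spread their mass evenly
                for p in papers:
                    new[p] += mass / n
        score = new
    return score

# hypothetical graph: three papers all cite a "seminal" paper
graph = {
    "A": ["seminal"],
    "B": ["seminal", "A"],
    "C": ["seminal"],
    "seminal": [],
}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # the widely cited paper ranks first
```

As the commenter notes, a real system would need the full citation graph, which is exactly the data that is hard to obtain.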

  2. Applying an ML algorithm is all about building good features. In IR, for instance, it took decades. Considering how controversial this issue is, I would predict that it will take at least 10 years. Besides, to come up with good features, you will probably have to do some nontrivial NLP.
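As a minimal illustration of what such hand-built features might look like, the sketch below extracts a few signals from the sentence containing a citation. The cue-word lists are invented for illustration and are not taken from any published feature set:

```python
import re

# Hypothetical cue words suggesting a citation is built upon vs. shallow.
EXTEND_CUES = {"extend", "extends", "build", "builds", "based"}
SHALLOW_CUES = {"see", "e.g.", "cf.", "surveyed"}

def citation_features(sentence):
    """Toy feature vector for the sentence containing a citation."""
    words = re.findall(r"[a-z.]+", sentence.lower())
    return {
        "has_extend_cue": any(w in EXTEND_CUES for w in words),
        "has_shallow_cue": any(w in SHALLOW_CUES for w in words),
        "first_person": any(w in {"we", "our"} for w in words),
        "length": len(words),
    }

print(citation_features("We extend the algorithm of [12] to the online setting."))
```

A classifier trained on such features is exactly where the nontrivial NLP the commenter mentions would come in: real systems need parsing, coreference, and much richer context than one sentence.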

  3. @Itman

    Sure but there is already some research on this going back to 2000 or even slightly before. Some people even published a Weka-based open source tool for this problem (circa 2010). So it is not exactly like we have no clue on what to do. However, it has remained a relatively obscure topic. I hope we change that.

    But ok, it could take more than two years before it becomes mainstream. I can dream though, can’t I?

  4. Hi Daniel,

    Regarding 1 and 4. Maybe I’m too stubborn but a citation graph would be a great asset in addition to the dataset you are planning to collect.

    Regarding 2. My fault, you did not mention a better metric. It’s me who’s thinking of the need for better metrics 🙂

    And yes, I hope to find time soon to complete the form 🙂

    Best, Dani

  5. * Are you familiar with ?

    It seems to be quite related to what you’re talking about.

    * I don’t know how much sense it makes, and how biased it might be, but I think the people from Mendeley are in a good position to evaluate paper popularity in a more precise way. They already show some rankings based on how many people have a paper in their library, but I guess you can go beyond that and see how much time people actually spend reading the paper.

    If interested, you may want to contact this guy from Mendeley:

  6. @Alejandro

    Yes, I have been a member of Mendeley ever since it started. There are many initiatives to measure the overall impact of a research paper, including counting the number of downloads… the number of times the paper is mentioned, and so on. But my specific interest on “meaningful” citations goes beyond measuring the “importance” of a research paper.

  7. @Daniel

    1) Really, I should stress that all I am doing is taking the initiative, with the help of others, of building a data set that I think researchers (including myself, perhaps) would find useful in their own research. I do this because I am genuinely interested in the tools that might be constructed based on this research.

    2) I am not proposing some “better metric”. I am merely proposing that we make more mainstream the identification of meaningful references.

    By analogy, in the context of the web, that’s like arguing that we should differentiate meaningful links from shallow links by analyzing the text of the web pages. That, in itself, does not tell you how to rank web pages.

    3) I am not solely interested in recognizing influential work.

    4) I am not sure what you mean by “too sparse”. I guess you are thinking in a graph-theoretical sense? The result of this project will not be a graph.

    I hope you will contribute to the data set. I expect you are one of the researchers who might benefit from it.

  8. I’ve just completed the form.

    I must confess that I see things from a different perspective now. Certainly, there are a huge number of papers which are supportive for one’s work and a few which are really relevant (or meaningful in your wording).

    I’m now really curious to know whether ML methods will be able to tell apart one kind of citation from another.

    Good luck in this endeavor.

    Best, Dani

  9. A subject close to my heart, although I am not much for publishing my work…
    Years ago, I found that the best filter for finding relevant information on a subject new to me was what I have often referred to as the “inverse frequency” filter: an author who publishes too frequently is generally contributing nothing new to the body of knowledge on a given subject, just rehashing historical results. On the other hand, an author who publishes at most once or twice a year, or, even better, once every couple of years, is more likely to provide new and useful information.

    In the past, frequency of citation has been useful, but it seems the system is being gamed these days, and the fact that a certain author is frequently cited may be more an indication of how many “agreements” he has for mutual citations (or how strongly a particular publisher is promoting his work). Finding good technical literature is getting as hard as finding pertinent information on the public Internet!

    When one is familiar with a topic, one can quickly identify the key contributors to the knowledge base. However, when one is investigating a new field of interest or endeavor, sorting the wheat from the chaff is a daunting task. I don’t have time to read 10,000 papers (or even scan 10,000 bibliographies of published papers) to find the information crucial to my understanding of a subject. Therefore, any viable filter that can identify the true innovators would be of significant value to me personally (a service I would even willingly pay for!).

    Citation frequency is probably a better measure of the worth of a reference document than the author’s frequency of publication, but, as you point out, there is a crying need for some weighting measure to identify crucial, meaningful work…

  10. One interesting heuristic that a colleague of mine had proposed was to overweight citations within the meat of the paper, as opposed to those in the related work. It would be interesting to see if the data you collect has anything to say about this.
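That heuristic could be expressed as a simple weighting scheme over the sections where a reference is cited. The section names and weights below are invented for illustration only:

```python
# Hypothetical weights: citations in the meat of a paper count more than
# citations made in passing in the related-work section.
SECTION_WEIGHTS = {
    "introduction": 0.5,
    "related work": 0.25,  # often a citation in passing
    "method": 1.0,         # likely builds on the cited work
    "experiments": 1.0,
}

def citation_score(occurrences):
    """occurrences: list of section names where the reference is cited."""
    return sum(SECTION_WEIGHTS.get(section, 0.5) for section in occurrences)

# a reference cited only in related work scores below one used in the method
print(citation_score(["related work"]) < citation_score(["method", "experiments"]))
```

The collected data set could indeed test whether author-identified essential references cluster in the body sections the way this heuristic assumes.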

  11. Are you familiar with the citation ontology (CiTO)?

    This provides an ontology for turning citations into linked data — i.e. the reason for the citation (supports, refutes, uses methods from, etc) is encoded in markup around the citation, so it doesn’t have to be guessed from the context by machine learning. Long-term this is surely a better solution, though until publishers adopt this standard, machine-learning algorithms such as yours may be the best we can do. Perhaps cito-marked up papers could be used as a training set?
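To make the CiTO suggestion concrete, here is a minimal sketch of what such linked-data citations look like in Turtle. The paper URIs are hypothetical; `cito:extends`, `cito:usesMethodIn` and `cito:cites` are actual CiTO properties:

```turtle
@prefix cito: <http://purl.org/spar/cito/> .

# Hypothetical identifiers: a new paper declares *why* it cites each work,
# so no machine learning is needed to recover the citation's role.
<http://example.org/paper/new-result>
    cito:extends      <http://example.org/paper/seminal-work> ;
    cito:usesMethodIn <http://example.org/paper/method-paper> ;
    cito:cites        <http://example.org/paper/background> .
```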

  12. Let’s say, for example, that a Russian paper cites a Chinese paper. It will be hard for an automatic system to analyse that.

    For example, Google Scholar is very weak at analysing Russian papers. It lists only about 20% of them and correctly shows only about 10% of citations.

  13. Do you plan to make the feature extraction code and the processed data (i.e. the data with extracted features) available?
    The linked-to “Dataset” is just the questionnaire with links to the papers.

