Not all citations are equal: identifying key citations automatically

Suppose that you are researching a given issue. Maybe you have a medical condition or you are looking for the best algorithm to solve your current problem.

A good heuristic is to enter reasonable keywords in Google Scholar. This will return a list of related research papers. If you are lucky, you may even have access to the full text of these research papers.

Is that good enough? No.

Scholarship, on the whole, tends to improve with time. More recent papers incorporate the best ideas from past work and correct mistakes. So, once you have found a relevant research paper, you really want a list of all the papers building on it…

Thankfully, a tool like Google Scholar allows you to quickly access a list of papers citing a given paper.

Great, right? So you just pick your research paper and review the papers citing it.

If you have ever done this work, you know that most of your effort will be wasted. Why? Because most citations are shallow. Almost none of the citing papers will build on the paper you picked. In fact, many researchers barely even read the papers that they cite.

Ideally, you’d want Google Scholar to automatically tell apart the shallow citations from the real ones.

This whole problem should be familiar to anyone involved in web search. Lots of people try to artificially boost the ranking of their websites. How did the likes of Google respond? By getting smarter: they use machine learning to learn the best ranking from many signals (not just incoming links).

It seems we should do the same thing with research papers. Last year, I looked into the problem and found that machine-learning folks had not addressed this issue to my satisfaction. To help attract attention to the problem, we asked volunteers to identify which references, in their own papers, were genuinely influential. We recently made the dataset available.

The result? Our dataset contains 100 annotated papers. Each paper cites an average of 30 papers, out of which only about 3 are influential.
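The numbers above come from the post; the arithmetic below is my own back-of-the-envelope check on why manual triage is so wasteful:

```python
# Back-of-the-envelope arithmetic using the numbers reported for the dataset.
refs_per_paper = 30        # average references per annotated paper
influential_per_paper = 3  # of which only about this many are influential
papers = 100               # annotated papers in the dataset

fraction_influential = influential_per_paper / refs_per_paper  # 0.1
total_refs = papers * refs_per_paper                           # 3000 annotations
# If you read every citing paper, ~90% of the effort goes to shallow citations.
wasted_fraction = 1 - fraction_influential                     # 0.9
```

In other words, roughly nine out of ten citations you would wade through are shallow.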

If you can write software to identify these influential citations, then you could make a system like Google Scholar much more useful. The same way GMail sorts my mail into “priority” and “the rest”, you could imagine Google Scholar sorting the citations into “important” and “the rest”. Google Scholar could also offer better ranking. That would be a great time saver.

Though we made the dataset available, we did not want to passively wait for someone to check whether it was possible to do something with it. The result is a paper recently accepted for publication by JASIST, probably to appear in 2014. Some findings that I find interesting:

  • Out of dozens of features, the most important feature is how often the reference is cited in the citing paper. That is, if you keep citing a reference, then you are more likely to be building on this reference.
  • The recency of a reference matters. For example, if you are citing a paper that just appeared, your citation is more likely to be shallow.
  • We have been using citations to measure the impact of a researcher, through the h-index. Could we get a better measure if we gave more weight to influential references? To this end, we proposed the hip-index, and it appears to be better than the h-index. See Andre Vellino’s blog post on this topic.
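On the hip-index: the paper gives the precise definition, but assuming it is computed like the h-index while counting only the citations judged influential (my reading, not a quote from the paper), a minimal sketch with made-up numbers might look like this:

```python
def h_index(citation_counts):
    # Largest h such that h papers have at least h citations each.
    counts = sorted(citation_counts, reverse=True)
    return sum(1 for rank, c in enumerate(counts, start=1) if c >= rank)

# Hypothetical researcher: per-paper totals of (all citations, influential citations).
papers = [(50, 10), (40, 2), (30, 5), (10, 4), (5, 1)]

h = h_index([total for total, _ in papers])              # classic h-index: 5
hip = h_index([influential for _, influential in papers])  # influential-only variant: 3
```

With these invented counts, the ordinary h-index is 5 while the influential-only variant drops to 3, illustrating how the two measures can diverge for a researcher whose citations are mostly shallow.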

You can grab the preprint right away. Though I think the paper does a good job at covering the basic features, there is much room for improvement and related work. We have an extensive future work section which you should check out if you are interested in contributing to this important problem.

Credit: Most of the credit for this work goes to my co-authors. Much of the heavy lifting was done by Xiaodan Zhu.

20 thoughts on “Not all citations are equal: identifying key citations automatically”

  1. Very nice idea!
    To take it one step further: a lot of the citations in the Related Work section of a paper (where most of the citations actually are) tend to be categorized in some way, e.g., past work dealing with the same problem, work that dealt with a similar problem, or work that dealt with related but orthogonal problems. While they might not be very influential, I still think they are useful citations in pointing users towards the relevant literature. What I’m envisioning is a Google Scholar system that does not only categorize citations into “priority” and “rest”, but applies more fine-grained descriptive labels such as “influential work”, “similar work”, “different but related work”, “general surveys”, etc.

  2. @Marie

    More generally, the classification of references into various categories has a long history (see our paper for references)… but there have been very few attempts to automate the process.

    This is really a case where we need machine learning to come in and help us.

  3. Hm. Shallow or deep. Influential or non-influential. Authors’ citation decisions are more complicated than that… Might I draw your attention to Marilyn Domas White and Peiling Wang’s “A Qualitative Study of Citing Behavior: Contributions, Criteria, and Metalevel Documentation” published in Library Quarterly (1997;67:122-154)? There’s a nice table in there of the various motivations of authors for including a citation or not, and some illustrative quotations. It doesn’t boil down as simply as you’ve made it for the purposes of teaching your machine!

    The temptation to look for machine solutions for ranking searches that yield tens of thousands of results is obvious, but I remain skeptical about the direction you’re taking, and the variables you describe in the pre-print are certainly game-able. If this does eventually result in a whizzy new search filter, I’d certainly love to give it a try, and I’d certainly appreciate being able to switch it OFF again, if you know where I’m coming from.

  4. @Douglas

    Shallow or deep. Influential or non-influential. Authors’ citation decisions are more complicated than that…

    Yes, I am aware of this. Allow me to quote a paragraph from our paper:

    The idea that the mere counting of citations is dubious is not new (Chubin & Moitra, 1975): The field of citation context analysis (a phrase coined by Small, 1982) has a long history dating back to the early days of citation indexing. There is a wide variety of reasons for a researcher to cite a source and many ways of categorizing them. For instance, Garfield (1965) identified fifteen such reasons, including giving credit for related work, correcting a work, and criticizing previous work.

    It doesn’t boil down as simply as you’ve made it for the purposes of teaching your machine!

    We did not make it simpler so that it would be easier for the computer; we wanted to make it easier for the authors!

    Previous attempts at automatic classifications have used much richer categorizations. See Garzone and Mercer (2000) who used 35 categories as well as Teufel et al. (2006) who used several categories organized in a two-level tree structure.

    the variables you describe in the pre-print are certainly game-able

    I agree. We allude to this in our paper:

    Moreover, identifying the genuinely significant citations might be viewed as an adversarial problem.

    I would argue however that it would be much harder to game our features than to game citation counts, to say nothing of publication counts.

    But let me draw attention to an interesting related issue: even if you don’t care to identify influential citations, and are happy to count citations, you could still benefit from the identification of influential citations to catch cheaters!

    Again, quoting from the preprint:

    In a survey, Wilhite and Fong (2012) found that 20% of all authors were coerced into citing some references by an editor, after their manuscript had undergone normal peer review. In fact, the majority of authors reported that they were willing to add superfluous citations if it is an implied requirement by an editor. If we could determine that many non-influential references in some journals are citing some specific journals, this could indicate unethical behavior.

  5. Thanks for your generous response. I did download your pre-print, but I perhaps have not yet given its 40 pages the full justice they deserve. I did get as far as your acknowledgement of the complexity of authors’ citation behavior, and I thank you for the references to studies with even more complex classifications than White and Wang’s.
    As we both know, the influence of the impact factor has led to the gaming of citations, but I’m dubious about mathematical remedies: insiders end up in an arms race for search-engine attention, and the poor old public gets ever dumber, more homogeneous search results.
    It’s called the scientific *literature* for good reason, and a reader’s opinion of an author will surely be shaped by their assessment of whether they give good citation or not, particularly as we move into an open-access world where it is easy to get the full text (i.e., call the author’s citation bluff). I suspect that this social pressure not to draw attention to orthogonal or off-topic material may prove an effective remedy…

  6. Xiaodan Zhu is a very talented researcher; I personally follow his work, and I met some of his colleagues in Chicago last year (i2b2): very clever people!

    I haven’t read the paper yet; I will print it and write up my impressions. Based on my limited experience, I would consider the following rules of thumb:
    – How many times has the paper already been cited (before the publication of the current paper)? The higher it is, the higher the probability of it being a “must” citation (maybe because it represents a particular ML technique, a particular definition, or a particular standard).
    – If I clustered the sub-field communities, then I could extract the previously mentioned number, but restricted to a particular sub-field. Highly cited papers in your own community (especially the very old ones) are very likely not crucial. Whereas, maybe, if you cite a highly cited paper from a very different community, it could be that that paper gives you a new perspective.
    – If a citation appears only in the background section, it is very likely not crucial. I would look at things like (#citation_to_this_paper/#citation_in_this_section).
    – In general, I suppose that citing a non-popular paper strongly suggests that the authors searched for it, discovered it, wanted it in the paper, and consequently found it relevant and crucial for their research.
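    A crude sketch of the section-position heuristic, with invented section names and mention counts (none of this is from the paper):

    ```python
    # Toy sketch: discount mentions of a reference that appear only in the
    # background/related-work section. Section names and counts are made up.
    mentions = {               # in-text mentions of reference [12], per section
        "introduction": 1,
        "related work": 2,
        "methods": 3,
        "results": 1,
    }
    total = sum(mentions.values())                         # 7 mentions overall
    outside_background = total - mentions["related work"]  # 5 outside the background
    score = outside_background / total                     # high: cited in the body too
    ```

    A reference mentioned only in “related work” would score 0 under this heuristic, while one used throughout the paper scores close to 1.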

    I’ll let you know my impressions of the paper. Is it possible to contribute to the dataset? I already have this information for my papers.

    Thanks for this post!

    typo:
    Our dataset contains 100 **papers** annotated papers.

  7. @Michele

    Yes, your comment just makes clear that there is a lot more work required… It is my hope that this problem will receive more attention in the future.

    I have no plan at this time to extend the dataset but if you are interested, get in touch with me, I’ll try to help as much as I can. I suppose that there would be great value in producing an extended version.

    Thanks for pointing out the typo in my post.

  8. Excellent blog post; I am going to have to read the article. We in the scientific-database world suffer from the same issue: shallow citations (“there is a database called X, but I am going to tell you why I don’t care and am creating a new one”) and deep citations (“I used the data from database X and got a bunch of results, see figure 2”). The problem is the same as you describe; the solution will need something else to deal with the fact that not all biologists publish papers any longer to showcase their work. Some of us create databases or write machine-learning algorithms for a living, and we would also like to be cited. Any ideas how your solution can help?

    ps. I am going to have to include your data resource in our registry so that I can run my algorithms on it.

  9. @Jo Vermeulen

    The short answer is no. (It shouldn’t be surprising given that we just posted our paper days ago.)

    The long answer is that the core purpose of our paper is to encourage people to build such tools so we can finally do more than citation counting.

  10. Thanks Daniel,
    There is a very vocal group called the Resource Identification Initiative on Force11 that is trying to figure out how to do just that. Your point is a good one though and should enter the discussion.

  11. Thanks for your blog post. Our group at NUS is also very much interested in these topics. Along with Simone Teufel, who has been working on the theory of Argumentative Zoning for many years, we have made a text classification tool that tries to describe the argumentative purpose of each sentence in an input article. You may find it an interesting project to read about, along with our ParsCit project.

    http://wing.comp.nus.edu.sg/raz/

    http://wing.comp.nus.edu.sg/parsCit/

  12. @Jo Vermeulen

    From our work, you can expect that some researchers would benefit from the hip-index because, though they get fewer citations in the current system, they are cited more abundantly within papers.

  13. @Daniel: I would expect this to be reflected in the number of citations of that paper too, over time (if other researchers pick it up).

    However, I can imagine the hip-index for a specific paper might show this effect much sooner, and could then be used to indicate future ‘important’ or ‘influential’ papers.

  14. @Jo Vermeulen

    “I would expect this to be reflected in the number of citations of that paper too, over time”

    Yes, a paper receiving lots of significant citations will eventually be cited quite a bit, but the reverse is not true. Highly cited papers do not necessarily contain new and influential ideas. Take, for example, review articles.

    Lots of articles become highly cited because people need a reference regarding some topic, so they pick whatever reference other people have picked, often without reading it. Thus, some bad researchers get cited a lot.
