Google Scholar provides a ranking of research venues per domain. For databases and information systems, they list the top 20 venues according to their h-index.

As part of their assessment, they chose to include arXiv: a repository of freely available research papers. Almost anyone can post a paper on arXiv. There is some filtering, but there is no scientific review of the papers. This means that if you download a paper from arXiv and it is complete junk, you have nobody to complain to (except the authors).

Nevertheless, it appears that people are willing to post their great papers on arXiv. In the ranking by h-index, the databases section of arXiv comes in 11th place, outranking prestigious journals like the VLDB Journal, Data & Knowledge Engineering, Information Systems, and Knowledge and Information Systems… not to mention all the journals and conferences that do not appear in the top-20 list provided by Google.
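
For readers unfamiliar with the metric: a venue has an h-index of h when h of its papers have at least h citations each, and no larger number satisfies that condition. Google's venue rankings use a five-year variant of it. Here is a minimal Python sketch of the definition; the function name `h_index` and the citation counts are made up for illustration and are not Google's code or data.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have >= rank citations
        else:
            break
    return h


# Hypothetical citation counts for a small venue (illustration only).
toy_venue = [25, 18, 12, 9, 7, 6, 2]
print(h_index(toy_venue))  # prints 6
```

Six papers have at least six citations each, but there are not seven papers with at least seven, so the toy venue gets an h-index of 6.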

One could argue that the good ranking can be explained by the fact that arXiv includes everything. However, it is far from true. There are typically about 30 new database papers every month on arXiv, whereas big conferences often have more than 100 articles (150 at SIGMOD 2013 and over 200 at VLDB 2013). Roughly speaking, the database section of arXiv is equivalent in volume to two big conferences, while there are dozens of conferences and journals.

You can subscribe to arXiv on Twitter. All papers are freely downloadable.

Credit: A post by Rasmus Pagh on Google+ pointed out the good ranking of arXiv in theoretical computer science.

17 Comments

  1. Having many low-citation papers is almost completely irrelevant for the purposes of computing h-indexes. What counts is volume of high-citation papers. So the fact that there’s a lot of junk in arXiv should not cause it to be a surprise that its h-index is high: there’s also a lot of well-cited stuff.

    (I am intentionally writing low-citation and high-citation rather than low-quality and high-quality, because while citations and quality are correlated they are not the same thing.)

    Comment by D. Eppstein — 27/8/2014 @ 13:50

  2. Note the definition: h5-median for a publication is the median number of citations for the articles **that make up its h5-index**.

    So among the 42 top-cited papers, the median is 57 citations (e.g., MADLIB paper). This doesn’t say anything about the junk papers.

    The way citations are counted is weird too — most of the highly cited papers are VLDB/SIGMOD papers, and most of the citations are likely to the conference versions, not arXiv versions.

    - A.

    Comment by Anon — 27/8/2014 @ 17:19

  3. @Anon

    I misinterpreted their numbers.

    Thanks.

    Comment by Daniel Lemire — 27/8/2014 @ 18:30

  4. Now that you see you misinterpreted the numbers, the explanation of the phenomenon is immediate. If the collection of papers X is a subset of the collection of papers Y, then Y will have a higher h-index according to Google.

    Y will also have a higher median citation among papers making up the h-index.

    The vast majority of papers are on the arxiv today, accounting for its very high stats.

    Comment by Different Anon — 27/8/2014 @ 19:42

  5. @Another

    “The vast majority of papers are on the arxiv today, accounting for its very high stats.”

    My post was about the database section. There are about 30 new papers a month on arXiv cs.DB:

    • January 2014: 30
    • February 2014: 38
    • March 2014: 33
    • April 2014: 23
    • May 2014: 27
    • June 2014: 31

    One venue alone (VLDB) had 236 papers in 2013.

    Comment by Daniel Lemire — 27/8/2014 @ 23:34

  6. In support of what Daniel is saying, arXiv Computation and Language (cs.CL) is ranked more highly in the field of computational linguistics than the most prestigious journal (in my subjective opinion), Computational Linguistics:

    http://scholar.google.com/citations?view_op=top_venues&hl=en&vq=eng_computationallinguistics

    As with cs.DB above, the comparison here is between a specific journal (Computational Linguistics) and a specific subset of arXiv (cs.CL).

    Comment by Peter Turney — 28/8/2014 @ 11:32

  7. How many of the top-cited papers appear elsewhere, say, in VLDB? Note that to have a good h5 index, it is not necessary to have a lot of papers.

    For example, in “Computer Vision & Pattern Recognition” arXiv is ranked 12th, with an h5-index equal to 38.

    So what? Can’t you have 38 publications that have a manuscript both on arXiv and elsewhere?

    And, indeed, check the top paper in this category:
    Point-Set Registration: Coherent Point Drift Myronenko, X Song

    It also appears in Pattern Analysis and Machine Intelligence, IEEE.

    So, the arguments are not convincing.

    Comment by Leonid Boytsov — 28/8/2014 @ 17:08

  8. @Leonid

    If you refer to my blog post, can you please point out precisely the arguments you find lacking? I simply cannot respond to “the arguments are not convincing”. I need to know what you disagree with.

    Comment by Daniel Lemire — 29/8/2014 @ 8:10

  9. Good ranking CAN be explained by the fact that ARXIVE includes everything. More precisely, some of the best other papers. Because ARXIVE overlaps with good journals and conferences, you can’t rank it alone. Furthermore, famous authors may be willing to publish on Arxiv because their papers will be read anyway, even if they don’t publish anywhere else. For others this strategy likely won’t work. Yet the h5-index won’t tell you this.

    Comment by Leonid Boytsov — 29/8/2014 @ 8:18

  10. @Leonid

    Some things I am *not* saying:

    1. “Papers posted on arXiv only appear on arXiv.” (Daniel: The opposite is true. It is trivial to check.)

    2. “Posting your paper on arXiv alone is a good strategy to become highly cited.” (Daniel: Though I do not know for sure, I suspect it is a terrible strategy. In fact, submitting to arXiv alone is something I recommend against doing, except for technical reports that are otherwise unpublishable.)

    3. “Posting your papers on arXiv will increase the number of citations you receive.” (Daniel: My default assumption is that this statement is generally false but can be true in specific cases. That is, sometimes arXiv can help if it makes your paper more accessible. However, posting the paper on your web site can also help, probably just as much.)

    What I am saying:

    *) On the whole, the quality of papers on arXiv is comparable to that of good venues, at least if you measure quality by citation. Subjectively, the quality of papers on arXiv is amazingly high. (Daniel: arXiv has an h-index comparable to the major conferences and it is only 2–3 times larger than the big venues. So the top tier on arXiv should be as good as what you get by browsing a leading venue. Given that arXiv is unrefereed, I find this result amazing. Being only 3 times worse than a leading conference, given that anyone can post papers… is a great score. Note that we did not discuss a comparison between arXiv and second-tier venues. I think arXiv would put these second-tier venues to shame.)

    Now, if you were to argue that all the best papers from all the best venues appear on arXiv, you would only make my point stronger. It would not be a counterpoint to what I am saying!!!

    As it stands, if you are an engineer and you do not have access to research papers through your employer, subscribing to arXiv, given that it is free, seems like a great choice. You are going to get slightly more junk than if you subscribe to ACM SIGMOD, say, but not a whole lot more.

    What I did not point out in my blog post but that I should point out is that I do not understand this phenomenon. I do not understand why arXiv is so good. I certainly expected it to be far worse.

    The numbers puzzle me. But facts are facts.

    Comment by Daniel Lemire — 29/8/2014 @ 8:31

  11. @Leonid

    “Good ranking CAN be explained by the fact that ARXIVE includes everything. More precisely, some of the best other papers.”

    Yes. It includes many of the best papers in a given field.

    Read this last sentence again: it includes many of the best papers in a given field.

    Comment by Daniel Lemire — 29/8/2014 @ 8:41

  12. I read some other sentences as well. I don’t know what your intentions were, but the words are highly misleading.

    Starting from the post topic:
    “Though unrefereed, arXiv has a better h-index than most journals…”

    Despite what you say in the last comments, it sends an absolutely wrong signal. It sounds like: ohh look, there is an unrefereed venue and it’s so good. But it doesn’t matter, because this venue doesn’t exist alone.

    Next you essentially say that the high h5 index can’t be explained by people publishing papers elsewhere, because only a small fraction of papers appears on arxive. This may be true, but it is not provable, because the h5-index judges only a few top papers and these papers are published elsewhere.

    If this was all written to support a simple statement that publishing an open-source version of your paper doesn’t diminish its citation index, it’s a lot of misleading words. Because the statement is clearly obvious and has no specific relation to arxiv.

    Comment by Leonid Boytsov — 29/8/2014 @ 8:48

  13. @Leonid

    1) How is my statement misleading:

    “Though unrefereed, arXiv has a better h-index than most journals…”

    It is what Google is telling us. Please follow the link I have offered in my blog post:

    http://scholar.google.com/citations?view_op=top_venues&hl=en&vq=eng_databasesinformationsystems

    2) “Next you essentially say that the high h5 index can’t be explained by people publishing papers elsewhere, because only a small fraction of papers appears on arxive.”

    I disagree with the words you are putting in my mouth. This is not what I wrote. Here is what I wrote:

    “One could argue that the good ranking can be explained by the fact that arXiv includes everything. However, it is far from true.”

    3) “If this was all written to support a simple statement that publishing an open-source version of your paper doesn’t diminish its citation index”

    No, this is not my message at all. My message is that if you consider arXiv as a venue, then it is a good quality venue. This means that as a reader, you can reasonably use arXiv as an information source. I conclude my blog post by encouraging people to subscribe to the list of arXiv new papers.

    We can disagree as to whether people (especially engineers) should use arXiv as an information source regarding new papers. Maybe people should not subscribe to the Twitter feed I recommended. Maybe you have arguments against this… certainly, reasonable arguments can be found.

    But I do not think you can disagree about the fact that arXiv, as a venue, has a high h-index. Well, if you do disagree about this, you need to take your disagreement with Google. I am only reporting on what Google is telling us.

    Comment by Daniel Lemire — 29/8/2014 @ 9:33

  14. >But I do not think you can disagree about the fact that arXiv, as a venue, has a high h-index. Well, if you do disagree about this, you need to take your disagreement with Google. I am only reporting on what Google is telling us.

    Ohhh, I absolutely can. And here is my public disagreement

    http://searchivarius.org/blog/does-arxiv-really-have-high-citation-index

    Should I put it on arXiv to look fancier? :-) Regarding Google: Google is great, but that doesn’t automatically mean that people at Google are always right. For example, Google Flu Trends was recently criticized:

    http://www.forbes.com/sites/stevensalzberg/2014/03/23/why-google-flu-is-a-failure/

    Comment by Leonid Boytsov — 29/8/2014 @ 11:05

  15. ARXIV INCLUDES BOTH UNREFEREED AND REFEREED VERSIONS OF PAPERS: DISTINGUISH CITATION FROM EARLY ACCESS AND DOWNLOAD LOCUS

    Peer-reviewed publication is not the same thing as access-provision:

    Subscription Journals provide peer review as well as access (to subscribers).

    Repositories provide access (to peer-reviewed journal articles and sometimes to earlier unrefereed drafts).

    Hence repositories do not have citation counts or h-indexes:

    Users access whatever version they can access, but they cite the journal article.

    The only exception is unrefereed drafts — but even there, it is the author’s draft that is being cited, and not the repository:

    Unrefereed drafts used to be cited as “name, title, unpublished (or ‘in prep’)” and refereed, accepted drafts used to be cited as “name, title, journal (in press).”

    Adding an OA access-point to the journal citation is becoming an increasingly common (and desirable) practice, but it does not change the fact that what is being cited is the work, and the canonical version of the work is the refereed, published version.

    Hence repositories do not have citation counts; they just have download access counts.

    Some interesting statistics can, however, be done on the citation of unrefereed vs refereed versions.

    Comment by Stevan Harnad — 3/9/2014 @ 5:12

  16. @Stevan

    This distinction between access-point and journal-citation is fine in principle, but in fact many authors write their references as if arXiv were a journal-citation, not an access-point. It seems that, in the minds of many authors and in the computers of Google, the distinction between access-point and journal-citation is being blurred.

    Comment by Peter Turney — 3/9/2014 @ 7:59

  17. @PeterTurney

    Scholarly practices are evolving — in the online era one might even say they are “catching up” with the still mostly untapped potential of the online medium.

    Yes, some authors are citing sloppily, but I assure you they are not doing so in their CVs! A posting to Arxiv is not a refereed publication unless it has been accepted for publication by a refereed journal. And that is the reference authors will cite (as long as peer review continues to be the criterion for peer-reviewed publication).

    In Arxiv, the longstanding users such as HEP physicists have caught up: They cite the Arxiv preprint till the journal reference is available, and from then on they cite the journal reference (though they will still add the Arxiv URL or DOI for access).

    Once Open Access becomes universal, all authors, in all disciplines, will catch up…

    Comment by Stevan Harnad — 3/9/2014 @ 10:35
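
To make the h5-median definition quoted in comment 2 concrete, here is a minimal Python sketch. It assumes the citation counts have already been restricted to the five-year window; the names `h_core` and `h_median` and all the numbers are hypothetical, chosen only for illustration.

```python
from statistics import median


def h_core(citations):
    """Citation counts of the papers that make up the h-index."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return ranked[:h]


def h_median(citations):
    """Median citation count among the papers in the h-core (0 if empty)."""
    core = h_core(citations)
    return median(core) if core else 0


# Hypothetical five-year citation counts (illustration only).
venue = [90, 57, 40, 12, 6, 5, 1]
with_uncited = venue + [0] * 1000  # the same venue plus 1000 uncited papers

print(len(h_core(venue)), h_median(venue))                # prints 5 40
print(len(h_core(with_uncited)), h_median(with_uncited))  # prints 5 40
```

Both lines print the same h-index and median, which is the point made in comments 1 and 2: uncited papers change neither number, so the median says nothing about them.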
