In current work we do on database indexes, we decided to use DBLP as a data source. Among other things, we use the authors’ names as a dimension. From one plot, I noticed that there must be half a million distinct authors. I doubted this number, and Kamel was nice enough to investigate further. It turns out that there are 531,480 different authors in DBLP! (As a basis for comparison, there are about 945,000 articles.)
I don’t know about you, but this feels like a large number. We started to look for explanations. I already reported that the USA is producing 1,500 new Computer Science Ph.D.s a year. Still, there cannot be many more than 100,000 recently active Computer Science authors holding a Ph.D.
Owen pointed us to the recent CACM article “Are your citations clean?” by Lee et al. Alas, while DBLP is certainly dirty, in that some researchers appear under two or more different names, this cannot explain why we end up with half a million authors!
The best explanation so far is that many undergraduate or M.Sc. students have papers on DBLP, so much so that they make up the majority of its authors.
Do you buy this theory? If not, do you have a better explanation?
(As a side effect, it should not be very hard to be in the top 10% of the most prolific DBLP authors!)
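If you want to check a count like the one above yourself, here is a minimal sketch of how distinct authors and papers per author could be tallied. It is not the code we used. It assumes DBLP-style XML in which each record (`article`, `inproceedings`, and so on) carries one `author` element per author. The `SAMPLE` data, the author names in it, and the `papers_per_author` function are made up for illustration, and the real dump is far larger and relies on a DTD for character entities, so it would need a streaming parser set up for that.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy stand-in for a DBLP-style dump (hypothetical records and names).
SAMPLE = """<dblp>
  <article key="a1">
    <author>Ada Example</author>
    <author>Bob Sample</author>
    <title>First toy paper</title>
  </article>
  <inproceedings key="p1">
    <author>Ada Example</author>
    <title>Second toy paper</title>
  </inproceedings>
  <article key="a2">
    <author>Carol Placeholder</author>
    <title>Third toy paper</title>
  </article>
</dblp>"""


def papers_per_author(xml_text):
    """Map each distinct author name string to its number of records."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for record in root:
        # Use a set so a name repeated inside one record counts once.
        names = {a.text.strip() for a in record.findall("author") if a.text}
        counts.update(names)
    return counts


counts = papers_per_author(SAMPLE)
num_authors = len(counts)                  # distinct name strings
num_records = len(ET.fromstring(SAMPLE))   # articles and other records

print(num_authors, "distinct authors over", num_records, "records")
print(counts.most_common(1))               # most prolific name in the sample
```

Note that this counts distinct name strings, not distinct people: a researcher listed under two spellings is counted twice, which is exactly the kind of dirtiness mentioned above.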