Most database research papers use synthetic data sets. That is, they use random-number generators to create their data on the fly. A popular generator is dbgen from the Transaction Processing Performance Council (TPC).

Why is that a problem?

  • We end up working with simplistic models. If we consider the main table generated by dbgen, 7 of its 17 columns have uniform distributions. This almost never happens with real data. Similarly, we often end up with attributes that are perfectly statistically independent. I believe that randomly generated data is not representative of the data you find in real businesses. (See the sketch right after this list.)
  • Because so many people use essentially the same data generators, we risk having solutions that are optimized for this particular synthetic data. This allows researchers to essentially cheat, sometimes unknowingly.
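
To make the first point concrete, here is a small sketch in Python. It is not dbgen itself; the column width, the number of rows and the Zipf parameter are arbitrary choices for illustration. It contrasts a uniformly distributed column, the kind dbgen tends to produce, with a Zipf-skewed column, which is closer to what real business data often looks like.

    # A small illustration (not dbgen itself): a uniform column versus a
    # Zipf-skewed one. The parameters are arbitrary choices for illustration.
    import numpy as np

    rng = np.random.default_rng(42)
    n = 1_000_000
    uniform_col = rng.integers(1, 101, size=n)            # every value equally likely
    skewed_col = np.minimum(rng.zipf(2.0, size=n), 100)   # a few values dominate

    def top_value_share(col):
        """Fraction of rows taken by the single most frequent value."""
        _, counts = np.unique(col, return_counts=True)
        return counts.max() / len(col)

    print("uniform column:", top_value_share(uniform_col))  # about 1% of the rows
    print("skewed column: ", top_value_share(skewed_col))   # about 60% of the rows

On the uniform column, no single value accounts for much more than 1% of the rows; on the skewed column, the most frequent value alone covers roughly 60% of them. Selectivities, index sizes and compression ratios behave very differently in the two cases.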

However, finding suitably large real data sets is difficult. My own research focus is on data warehousing. Most businesses are unwilling to share the data in their data warehouses. Even if they were willing to do so, sharing very large files is inconvenient.

Here are three moderately large data sets that I have used in my research:

Can we do better? I asked on the Internet for large tabular data sets (greater than 20 GB) that can be considered representative of business use. I have not found anything that matches my needs yet, but I am further investigating one data set that I consider especially promising:

  • James Long suggested I have a look at the US Census from 2000. I am currently downloading some of their data. I am unsure how big a table I can construct, but there are gigabytes of freely available data.

I also received other excellent proposals that I am not pursuing for the time being:

  • Jared Webb suggested I grab one of the large health data sets. I ended up getting the SEER data set from the National Cancer Institute, as recommended by Jared. I got a 210 MB zip file filled with relatively small flat files. The data is freely accessible, but you must fax a signed form. It is not nearly large enough for my needs.
  • Howard C. Shaw III suggested the Enron email data set. Unfortunately, it is primarily made of unstructured (text) data. It does not suit my purposes.
  • Howard and Neil Conway also pointed out that I should look at the Amazon public data sets. These are moderately large data sets that Amazon makes available to its web services customers. In particular, Tim Goh suggested I look at the Freebase data dump. Unfortunately, I am not an Amazon customer and I am uneasy about basing my research on data that is only available through an Amazon subscription. Thankfully, many of the Amazon public data sets are available elsewhere. For example, Nicolas Torzec told me that Freebase makes its database dumps available on its own site. He says that the full dump is 4 GB compressed and contains about 50 million objects.
  • Tim Goh reminded me of the Google Books Ngram data set. (Update: this is different from the Google n-gram data set, which is not freely available.) It is not typical data warehousing data, however.
  • Kristina Chodorow pointed me to a page on quora.com containing a long list of publicly available data sets. There is a wealth of information there, but I find that access is often difficult, or the data is unstructured or very specialized (e.g., web search).
  • Jeff Green suggested I look at the NHS public data sets. I spent some time on their site and could only find small data sets.
  • Aaron Newton, Michael Ekstrand and Nicolás Della Penna suggested a snapshot of the Wikipedia database. Unfortunately, again, much of the data is unstructured.
  • Brian McFee pointed me to the Million Song dataset. From what I could tell, you have detailed data about a million songs. It is primarily geared toward music information retrieval.
  • Matt Kevins pointed to the Google public data sets. Alas, I could not find out how to download the data sets and I am not sure how large they are.
  • Aleks Scholz pointed me to the all-sky data set. It is a large, freely available, astronomy data set. Most of the data is made of floating-point numbers so it does not fit my immediate needs, but it looks very interesting.
  • Israel Herraiz suggested I look at the Sourcerer project. It looks like a large collection of Java source code. This could support great research projects, but it is not a good match for what I want to do right now.
  • Axel Knauf suggested web crawling data. Probably not a good fit for my current research though it can be very valuable if you work on web information retrieval.
  • Peter Boothe asked whether I was interested in BGP routing data. It looks like network data. I have not looked at it closely. It could be interesting, but it is probably too specialized.
  • Tracy Harms asked about the Netflix data set. Unfortunately, this data set is no longer publicly available and it was only 2 GB. I used it in my 2010 Data & Knowledge Engineering paper on bitmap indexes.
  • Aaron J Elmore pointed me to oltpbenchmark.com for an online transaction processing benchmark framework. My research is not primarily on transactions (OLTP), but this is a very interesting project. They have collected data sets and corresponding workloads. In particular, they link to Wikipedia access statistics. It could be very important if you are designing back-end systems for web applications.
  • Òscar Celma pointed me to a Twitter social graph that occupies several gigabytes.

(If you answered my queries and I have not included you, I am sorry.)

Update: Mason maintains a list of links to useful data sets. I later found out about the Click Dataset which appears to be interesting and quite large: unfortunately, it is not available for download from a public site.

23 Comments

  1. Seen this compilation?

    http://www.nytimes.com/2012/03/25/business/factuals-gil-elbaz-wants-to-gather-the-data-universe.html?hp

    Comment by Venkat — 27/3/2012 @ 10:43

  2. Sorry, wrong link (though that one is also relevant).

    Right link:

    http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public?__snids__=36983349

    Comment by Venkat — 27/3/2012 @ 10:44

  3. @Venkat

    Thanks. The quora link is already in my post. ;-)

    Comment by Daniel Lemire — 27/3/2012 @ 10:54

  4. Ah, missed it on first pass, since you highlighted the pointer rather than the reference.

    A useful exercise would be to create a table that technically characterizes what the data is like. A database of datasets.

    Comment by Venkat — 27/3/2012 @ 11:34

  5. @Venkat

    Yes. I once had a page on this blog where I maintained a list of data sets within some kind of taxonomy. I gave up because it became unmanageable.
    Similarly, I try to maintain a data warehousing bibliography. Ultimately, you realize that everything is miscellaneous.

    That’s not quite accurate, of course, but I think it is quite a challenge to categorize data sets because they can be as diverse as human knowledge itself.

    Comment by Daniel Lemire — 27/3/2012 @ 12:14

  6. Hi Daniel,
    The US Bureau of Transportation Statistics has a number of large, free, and well-structured data sets. In particular, I recommend the On-Time Performance data set (~140M rows, ~90 columns) and the Ticket Pricing (Market) data set (~320M rows, ~40 columns). They are here:

    On-Time Performance:
    http://www.transtats.bts.gov/tables.asp?Table_ID=236&SYS_Table_Name=T_ONTIME

    Ticket Pricing:
    http://www.transtats.bts.gov/Tables.asp?DB_ID=125&DB_Name=Airline%20Origin%20and%20Destination%20Survey%20%28DB1B%29&DB_Short_Name=Origin%20and%20Destination%20Survey

    These tables are split into separate Zip files, which you can easily pull via ‘wget’ or ‘curl’ in a script. They are comma-separated and well structured. The data is actual airline data, which is the best thing about these data sets.
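
    For what it is worth, a rough sketch of such a download loop in Python (the URLs below are placeholders, not the actual BTS file names) could look like this:

      # Rough sketch of a scripted download; substitute the actual Zip file
      # links from the BTS download pages for the placeholder URLs below.
      import urllib.request

      urls = [
          "http://example.org/ontime/On_Time_Performance_2011_1.zip",  # placeholder
          "http://example.org/ontime/On_Time_Performance_2011_2.zip",  # placeholder
      ]

      for url in urls:
          filename = url.rsplit("/", 1)[-1]
          print("downloading", filename)
          urllib.request.urlretrieve(url, filename)  # plays the role of wget/curl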

    -Robert

    p.s. I found this site through your comments on some Quora post.

    Comment by Robert Morton — 27/3/2012 @ 16:54

  7. @Robert

    Thanks! I’ll check it out.

    Comment by Daniel Lemire — 27/3/2012 @ 17:36

  8. @Robert

    The data is large and nice, but it also appears to be essentially transactional. It seems to me that if I roll it up on key attributes, not much is left of the data.
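
    To illustrate what I mean by rolling it up, here is a sketch using pandas; the file and column names are invented and do not match the actual BTS schema:

      # Hypothetical roll-up: group the per-flight (transactional) rows on a few
      # key attributes and aggregate. The names below are invented.
      import pandas as pd

      flights = pd.read_csv("ontime.csv")  # one row per flight

      summary = (
          flights.groupby(["carrier", "origin", "year", "month"])
                 .agg(flights=("arr_delay", "size"),
                      mean_arr_delay=("arr_delay", "mean"))
                 .reset_index()
      )
      print(summary.head())

    Millions of rows collapse into a small summary table, which is what I meant by “not much is left”.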

    Anyhow, I agree with you that it is very nice, just not what I’m looking for right now.

    Comment by Daniel Lemire — 27/3/2012 @ 19:58

  9. Hi Daniel,
    That hasn’t been my finding from the data. Exploring it with Tableau has exposed all kinds of interesting characteristics, especially in the On-Time Performance data.
    -Robert

    Comment by Robert Morton — 27/3/2012 @ 20:00

  10. @Robert

    The On-Time link you gave leads me to an error page, but I have now found the on-time data set and I am having fun with it:

    http://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time

    It is possible that I just don’t grok the ticket pricing data set.

    (Note that I don’t use something as nice as tableau to explore it right now. I’m hacking.)

    Comment by Daniel Lemire — 27/3/2012 @ 20:39

  11. @Robert

    I still find some oddities in the data set. For example, for some months, I get an empty set… as if there were no delays! That seems improbable, especially since some of the tuples I do get for other months have delays of zero, or no delay indication at all (null value).

    I don’t doubt you are able to exploit this data set, but I find it a bit difficult.

    Still, it is really nice to see the government making such data freely available.

    Comment by Daniel Lemire — 27/3/2012 @ 21:06

  12. For evaluating implementation performance, it would be very helpful to have not just the dataset but also a workload (stream of queries and updates to that dataset). Any options?

    Comment by Jason Eisner — 28/3/2012 @ 3:44

  13. There are several largish Semantic Web data sets, i.e., billions of (subject, predicate, object) RDF triples. These are not your typical data warehouse data either, but you could at least make a large table with subject, predicate, and object columns…
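
    For instance, a naive sketch in Python that reads N-Triples lines into three columns might look like this (it skips blank and comment lines and does not handle every corner of the N-Triples grammar):

      # Naive sketch: turn an N-Triples file into (subject, predicate, object) rows.
      # Subjects and predicates contain no spaces; the object is the remainder of
      # the line minus the terminating " .".
      def read_triples(path):
          with open(path, encoding="utf-8") as f:
              for line in f:
                  line = line.strip()
                  if not line or line.startswith("#"):
                      continue
                  subject, predicate, obj = line.rstrip(" .").split(" ", 2)
                  yield subject, predicate, obj

      for s, p, o in read_triples("btc-sample.nt"):  # placeholder file name
          print(s, p, o)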

    DBpedia has a structured form of the Wikipedia infoboxes; it is a lot like Freebase:
    http://wiki.dbpedia.org/Downloads37

    There has also been a series of Semantic Web “Billion Triple” challenges, which made large crawls of Semantic Web data available (2010 was about a billion triples, 2011 about 3.5 billion, I think).

    http://km.aifb.kit.edu/projects/btc-2010/
    http://km.aifb.kit.edu/projects/btc-2011/

    Comment by Gunnar Grimnes — 28/3/2012 @ 3:56

  14. @Jason

    I think that oltpbenchmark.com might help if you are looking for workloads.

    Comment by Daniel Lemire — 28/3/2012 @ 7:20

  15. The data set from Google is actually the “Google Books Ngram” data set, which is different from the Google NGram data set (available for $150 from the Linguistic Data Consortium).

    Bye,
    michele.

    Comment by Michele Filannino — 28/3/2012 @ 9:49

  16. @Michele

    Quite right. I’ve updated my blog post. Thanks.

    Comment by Daniel Lemire — 28/3/2012 @ 10:04

  17. Hi Daniel,

    The Freebase data set is also directly available from their website; check http://wiki.freebase.com/wiki/Data_dumps

    I think that it is also provided by Infochimps (http://www.infochimps.com/), which btw might be worth browsing for other data sets.

    The Freebase guys also provide WEX (Freebase Wikipedia Extraction), which is a processed dump of the English Wikipedia. You can find more info at http://wiki.freebase.com/wiki/WEX

    Have fun :-)

    da

    Comment by Davide Eynard — 30/3/2012 @ 3:29

  18. OpenStreetMap has 2.7 billion GPS data points, 1.4 billion nodes, and 131 million ways. The data set is a 250 GB XML file (21 GB compressed).

    http://www.openstreetmap.org/stats/data_stats.html
    http://wiki.openstreetmap.org/wiki/Planet.osm

    Comment by Johan Dahlin — 31/3/2012 @ 11:57

  19. The Open Library data set is pretty awesome: http://openlibrary.org/developers/dumps

    Comment by Onkar Hoysala — 31/3/2012 @ 23:28

  20. http://campus.lostfocus.org/dataset/netflix.7z/

    Comment by boya — 7/4/2012 @ 7:26

  21. (Derrick H. Karimi could not pass my Turing test so he asked me to post the following comment:)

    I found your page here very useful, and I would like to contribute. I took off on your US Census idea. I downloaded all index.html files from http://www2.census.gov/, which came out to 10,626 files and 1.3 GB. Then I searched through them for the biggest CSV files, and found a 2.1 GB .csv file here: http://www2.census.gov/acs2010_5yr/pums/csv_pus.zip

    I did not spend much time determining what the data actually means, but I think the answers may be coded according to this:

    http://www.census.gov/acs/www/Downloads/data_documentation/pums/CodeLists/ACSPUMS2006_2010CodeLists.pdf

    One would have to spend some more time to figure out what the data actually is. But it made for a good data exploration exercise for me: starting with almost no contextual information about the data set and trying to visualize the trends.
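
    A sketch of how one might start poking at a CSV file that large without loading it all into memory (the file name and the column name below are placeholders; the real names are documented with the ACS PUMS files):

      # Scan a large coded CSV in chunks and tally one column.
      # "SOME_COLUMN" and the file name are placeholders; consult the code
      # lists above for the real column names and their meanings.
      import pandas as pd

      counts = None
      for chunk in pd.read_csv("extracted_pums_file.csv", usecols=["SOME_COLUMN"],
                               chunksize=1_000_000):
          part = chunk["SOME_COLUMN"].value_counts()
          counts = part if counts is None else counts.add(part, fill_value=0)

      print(counts.sort_values(ascending=False).head(10))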

    Comment by Daniel Lemire — 4/9/2012 @ 9:07

  22. Daniel,

    take a look here:
    http://www.ncdc.noaa.gov/most-popular-data

    Regards,
    Cristian

    Comment by Cristian — 28/11/2012 @ 18:08

  23. Daniel,

    I found the DIMES project, http://www.netdimes.org, and they provide access to data sets:
    http://www.netdimes.org/new/?q=node/65
    =======================================
    DIMES is a distributed scientific research project, aimed to study the structure and topology of the Internet, with the help of a volunteer community (similar in spirit to projects such as SETI@Home).
    =======================================

    Regards,
    Cristian.

    Comment by Anonymous — 15/1/2013 @ 16:05
