Publicly available large data sets for database research

Most database research papers use synthetic data sets. That is, they use random-number generators to create their data on the fly. A popular generator is dbgen from the Transaction Processing Performance Council (TPC).

Why is that a problem?

  • We end up working with simplistic models. If we consider the main table generated by dbgen, out of 17 columns, 7 have uniform distributions. This almost never happens with real data. Similarly, we often end up with attributes that are perfectly statistically independent. I believe that randomly generated data is not representative of the data you find in real businesses.
  • Because so many people use essentially the same data generators, we risk having solutions that are optimized for this particular synthetic data. This allows researchers to essentially cheat, sometimes unknowingly.
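
To make the two properties above concrete, here is a rough Python sketch of how one might check a column for uniformity and a pair of columns for independence. It is only an illustration under my own assumptions: the function names `normalized_entropy` and `mutual_information` are hypothetical (they are not part of dbgen or any library), and the columns are assumed to be plain Python lists of hashable values.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of a column divided by log(number of distinct values).

    A result close to 1.0 means the observed values are spread almost
    uniformly; a result close to 0.0 means a few values dominate.
    """
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0  # an empty or constant column carries no information
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two columns.

    It is 0.0 when the two columns are independent in the sample and
    grows as one column becomes more predictable from the other.
    """
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )
```

On synthetic columns, I would expect `normalized_entropy` to sit near 1.0 and `mutual_information` to sit near 0.0 for most pairs. Real business data tends to be skewed and correlated, so both numbers should move away from those extremes.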

However, finding suitably large real data sets is difficult. My own research focus is on data warehousing. Most businesses are unwilling to share the data in their data warehouses. Even if they were willing to do so, sharing very large files is inconvenient.

Here are three moderately large data sets that I have used in my research:

Can we do better? I asked on the Internet for large tabular data sets (greater than 20 GB) that can be considered representative of business use. I have not found anything that matches my needs yet, but I am investigating one data set that I consider especially promising:

  • James Long suggested I have a look at the US Census from 2000. I am currently downloading some of their data. I am unsure how big a table I can construct, but there are gigabytes of freely available data.

I also received other excellent proposals that I am not pursuing for the time being:

  • Jared Webb suggested I grab one of the large health data sets. I ended up getting the SEER data set from the National Cancer Institute, as recommended by Jared. I got a 210 MB zip file filled with relatively small flat files. The data is freely accessible, but you must fax a signed form. It is not nearly large enough for my needs.
  • Howard C. Shaw III suggested the Enron data set. Unfortunately, the Enron email data set is primarily made of unstructured (text) data. It does not suit my purposes.
  • Howard and Neil Conway also pointed out that I should look at the Amazon public data sets. These are moderately large data sets that Amazon makes available to its web services customers. In particular, Tim Goh suggested I look at the Freebase data dump. Unfortunately, I am not an Amazon customer and I am uneasy about basing my research on data that is only available through an Amazon subscription. Thankfully, many of the Amazon public data sets are available elsewhere. For example, Nicolas Torzec told me that Freebase makes its database dumps available from its own site. He says that the full dump is 4 GB compressed and that it contains about 50M objects.
  • Tim Goh reminded me of the Google Book n-gram data set. (Update: this is different from the Google n-gram data set which is not freely available.) It is not typical data warehousing data however.
  • Kristina Chodorow pointed me to a page containing a long list of publicly available data sets. There is a wealth of information there, but I find that, much of the time, access is difficult, or the data is unstructured or very specialized (e.g., web search).
  • Jeff Green suggested I look at the NHS public data sets. I spent some time on their site and could only find small data sets.
  • Aaron Newton, Michael Ekstrand and Nicolás Della Penna suggested a snapshot of the Wikipedia database. Unfortunately, again, much of the data is unstructured.
  • Brian McFee pointed me to the Million Song dataset. From what I could tell, you have detailed data about a million songs. It is primarily geared toward music information retrieval.
  • Matt Kevins pointed to the Google public data sets. Alas, I could not find out how to download the data sets and I am not sure how large they are.
  • Aleks Scholz pointed me to the all-sky data set. It is a large, freely available, astronomy data set. Most of the data is made of floating-point numbers so it does not fit my immediate needs, but it looks very interesting.
  • Israel Herraiz suggested I look at the Sourcerer project. It looks like a large collection of Java source code. This could support great research projects, but it is not a good match for what I want to do right now.
  • Axel Knauf suggested web crawling data. Probably not a good fit for my current research though it can be very valuable if you work on web information retrieval.
  • Peter Boothe asked whether I was interested in BGP routing data count. It looks like network data. I haven’t looked at it too much. Could be interesting, but it is probably too specialized.
  • Tracy Harms asked about the Netflix data set. Unfortunately, this data set is no longer publicly available and it was only 2 GB. I used it in my 2010 Data & Knowledge Engineering paper on bitmap indexes.
  • Aaron J Elmore pointed me to an online transaction processing benchmark framework. My research is not primarily on transactions (OLTP), but this is a very interesting project. They have collected data sets and corresponding workloads. In particular, they link to Wikipedia access statistics. It could be very important if you are designing back-end systems for web applications.
  • Òscar Celma pointed me to a Twitter social graph which occupies several gigabytes.

(If you answered my queries and I have not included you, I am sorry.)

Update: Mason maintains a list of links to useful data sets. I later found out about the Click Dataset which appears to be interesting and quite large: unfortunately, it is not available for download from a public site.

Robert Seaton has prepared a set of 100+ freely available data sets.

28 thoughts on “Publicly available large data sets for database research”

  1. @Venkat

    Yes. I once had a page on this blog where I maintained a list of data sets within some kind of taxonomy. I gave up because it became unmanageable.
    Similarly, I try to maintain a data warehousing bibliography. Ultimately, you realize that everything is miscellaneous.

    That’s not quite accurate, of course, but I think it is quite a challenge to categorize data sets because they can be as diverse as human knowledge itself.

  2. Ah, missed it on first pass, since you highlighted the pointer rather than the reference.

    A useful exercise would be to create a table that technically characterizes what the data is like. A database of datasets.

  3. Hi Daniel,
    The US Bureau of Transportation Statistics has a number of large, free and well-structured data sets. In particular I recommend the On-Time Performance data set (~140M rows, ~90 columns) and the Ticket Pricing (Market) data set (~320M rows, ~40 columns). They are here:

    On-Time Performance:

    Ticket Pricing:

    These tables are split into separate zip files which you can easily pull via ‘wget’ or ‘curl’ in a script. They’re comma-separated and well structured. The data is actual airline data, which is the best thing about these data sets.


    p.s. I found this site through your comments on some Quora post.

  4. Hi Daniel,
    That hasn’t been my finding from the data. Exploring it with Tableau has exposed all kinds of interesting characteristics, especially in the On-Time Performance data.

  5. @Robert

    The data is large and nice, but it also appears to be essentially transactional. It seems to me that if I roll it up on key attributes, not much is left of the data.

    Anyhow, I agree with you that it is very nice, just not what I’m looking for right now.

  6. @Robert

    I still find some oddities in the data set. For example, for some months, I get an empty set… as if there were no delays! Seems improbable especially since some of the tuples that I do get other months have delays of zero, or just no delay indication (null value).

    I don’t doubt you are able to exploit this data set, but I find it a bit difficult.

    Still, it is really nice to see the government making such data freely available.

  7. For evaluating implementation performance, it would be very helpful to have not just the dataset but also a workload (stream of queries and updates to that dataset). Any options?

  8. There are several largish Semantic Web datasets, i.e. billions of subject,predicate,object RDF triples. These are not your typical data warehouse data either, but you could at least make a large table with subject, predicate and object columns…

    DBpedia has a structured form of the Wikipedia infoboxes; this is a lot like Freebase:

    There has also been a series of Semantic Web “Billion Triple” challenges, which made large crawls of semantic web data available (2010 was about a billion triples, 2011 about 3.5 I think)

  9. The data set from Google is actually “Google Books Ngram”, which is different from the Google NGram data set (available for $150 from the Linguistic Data Consortium).


  10. (Derrick H. Karimi could not pass my Turing test so he asked me to post the following comment:)

    I found your page here very useful, and I would like to contribute. I took off on your US census idea. I downloaded all index.html files from, which came out to 10,626 files and 1.3 GB. Then I searched through them for the biggest CSV files, and found a 2.1 GB .csv file here:

    I did not spend much time determining what the data actually means, but I think the answers may be coded according to this:

    One would have to spend some more time to actually figure out what the data is. But it was a good data exploration case for me to have almost no contextual information about the dataset and visualize the trends.

  11. Daniel,
    I’m looking for a data set of more than 1 GB in CSV format for my project.
    Can you tell me where I can get one?
