Most database research papers use synthetic data sets. That is, they use random-number generators to create their data on the fly. A popular generator is dbgen from the Transaction Processing Performance Council (TPC).
Why is that a problem?
- We end up working with simplistic models. If we consider the main table generated by dbgen, out of 17 columns, 7 have uniform distributions. This almost never happens with real data. Similarly, we often end up with attributes that are perfectly statistically independent. I believe that the randomly generated data is not representative of the real data you find in real businesses.
- Because so many people use essentially the same data generators, we risk having solutions that are optimized for this particular synthetic data. This allows researchers to essentially cheat, sometimes unknowingly.
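To make the two concerns above concrete, here is a minimal sketch of how one might check whether a column is close to uniform and whether two columns look statistically independent. It is not taken from any of my papers: the function names `normalized_entropy` and `mutual_information` are my own for illustration, and the toy columns in the demo are made up. It relies only on the Python standard library.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of a column divided by log2(number of distinct values).

    A result of 1.0 means the observed values are perfectly uniform;
    a result near 0 means the column is highly skewed.
    """
    counts = Counter(values)
    n = len(values)
    if len(counts) < 2:
        return 0.0  # a constant (or empty) column carries no information
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(len(counts))


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two columns.

    A result of 0 means the columns are independent in the sample;
    larger values mean one column helps predict the other.
    """
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


if __name__ == "__main__":
    # Hypothetical toy columns, for illustration only.
    uniform_column = ["a", "b", "c", "d"] * 25
    skewed_column = ["a"] * 85 + ["b"] * 10 + ["c"] * 4 + ["d"]
    print(normalized_entropy(uniform_column))
    print(normalized_entropy(skewed_column))
    print(mutual_information(uniform_column, uniform_column))
```

On real business data, I would expect many columns to score well below 1.0 on the first measure and many pairs of columns to score well above 0 on the second. On small samples, the empirical mutual information is biased upward, so treat small positive values with caution.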
However, finding suitably large real data sets is difficult. My own research focus is on data warehousing. Most businesses are unwilling to share the data in their data warehouses. Even if they were willing to do so, sharing very large files is inconvenient.
Here are three moderately large data sets that I have used in my research:
- I found a table of WikiLeaks-related metadata on Google Fusion. By transforming it into a relational table, I was able to create a table with over a million rows. It is tiny by data warehousing standards, but the data is easily accessible.
- I like the Canadian Census from 1881. You have data about 4 million Canadians. The data is publicly available and convenient. See my paper in Information Sciences from 2011 for details on how we retrieved and processed it.
- There is a weather data set that we have used repeatedly. It is relatively large (9 GB) and freely available as the Edited synoptic cloud reports from ships and land stations over the globe. You can retrieve all of the data files from the ftp directory and aggregate them into a single table.
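Aggregating many downloaded files into a single table can be as simple as the following sketch. It assumes the files are already in a local directory and that each one is plain text with one record per line. The actual layout of the files is not described here, so the function name `aggregate_files`, the `pattern` parameter, and the one-record-per-line assumption are all hypothetical.

```python
from pathlib import Path


def aggregate_files(input_dir, output_path, pattern="*"):
    """Append every non-empty line of every matching file to one output file.

    Files are processed in sorted order so that the result is reproducible.
    Returns the number of records (lines) written.
    """
    rows = 0
    output_path = Path(output_path)
    with output_path.open("w", encoding="utf-8") as out:
        for path in sorted(Path(input_dir).glob(pattern)):
            # Skip directories, and skip the output file itself in case
            # it lives in the input directory.
            if not path.is_file() or path.resolve() == output_path.resolve():
                continue
            with path.open("r", encoding="utf-8", errors="replace") as src:
                for line in src:
                    line = line.rstrip("\n")
                    if line:
                        out.write(line + "\n")
                        rows += 1
    return rows
```

A call such as `aggregate_files("downloads", "table.txt", pattern="*.txt")` (both paths are placeholders) streams the files line by line, so memory use stays small even when the result runs to several gigabytes.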
Can we do better? I asked on the Internet for large tabular data sets (greater than 20 GB) that can be considered representative of business use. I have not found anything that matches my needs yet, but I am further investigating one data set that I consider especially promising:
- James Long suggested I have a look at the US Census from 2000. I am currently downloading some of their data. I am unsure how big a table I can construct, but there are gigabytes of freely available data.
I also received other excellent proposals that I am not pursuing for the time being:
- Jared Webb suggested I grab one of the large health data sets. I ended up getting the SEER data set from the National Cancer Institute, as recommended by Jared. I got a 210 MB zip file filled with relatively small flat files. The data is freely accessible, but you must fax a signed form. It is not nearly large enough for my needs.
- Howard C. Shaw III suggested the Enron data set. Unfortunately, the Enron email data set is primarily made of unstructured (text) data. It does not suit my purposes.
- Howard and Neil Conway also pointed out that I should look at the Amazon public data sets. These are moderately large data sets that Amazon makes available to its web services customers. In particular, Tim Goh suggested I look at the Freebase data dump. Unfortunately, I am not an Amazon customer and I am uneasy about basing my research on data that is only available through an Amazon subscription. Thankfully, many of the Amazon public data sets are available elsewhere. For example, Nicolas Torzec told me that Freebase makes its database dumps available from its own site. He says that the full dump is 4 GB compressed and that it contains about 50M objects.
- Tim Goh reminded me of the Google Book n-gram data set. (Update: this is different from the Google n-gram data set, which is not freely available.) It is not typical data warehousing data, however.
- Kristina Chodorow pointed me to a page on quora.com containing a long list of publicly available data sets. There is a wealth of information there, but I find that a lot of times, access is difficult, or the data is unstructured or very specialized (e.g., web search).
- Jeff Green suggested I look at the NHS public data sets. I spent some time on their site and could only find small data sets.
- Aaron Newton, Michael Ekstrand and Nicolás Della Penna suggested a snapshot of the Wikipedia database. Unfortunately, again, much of the data is unstructured.
- Brian McFee pointed me to the Million Song dataset. From what I could tell, you have detailed data about a million songs. It is primarily geared toward music information retrieval.
- Matt Kevins pointed to the Google public data sets. Alas, I could not find out how to download the data sets and I am not sure how large they are.
- Aleks Scholz pointed me to the all-sky data set. It is a large, freely available, astronomy data set. Most of the data is made of floating-point numbers so it does not fit my immediate needs, but it looks very interesting.
- Israel Herraiz suggested I look at the Sourcerer project. It looks like a large collection of Java source code. This could support great research projects, but it is not a good match for what I want to do right now.
- Axel Knauf suggested web crawling data. Probably not a good fit for my current research though it can be very valuable if you work on web information retrieval.
- Peter Boothe asked whether I was interested in BGP routing data count. It looks like network data. I haven’t looked at it too much. Could be interesting, but it is probably too specialized.
- Tracy Harms asked about the Netflix data set. Unfortunately, this data set is no longer publicly available and it was only 2 GB. I used it in my 2010 Data & Knowledge Engineering paper on bitmap indexes.
- Aaron J Elmore pointed me to oltpbenchmark.com for an online transaction processing benchmark framework. My research is not primarily on transactions (OLTP), but this is a very interesting project. They have collected data sets and corresponding workloads. In particular, they link to Wikipedia access statistics. It could be very important if you are designing back-end systems for web applications.
- Òscar Celma pointed me to a Twitter social graph, which occupies several gigabytes.
(If you answered my queries and I have not included you, I am sorry.)
Update: Mason maintains a list of links to useful data sets. I later found out about the Click Dataset, which appears to be interesting and quite large. Unfortunately, it is not available for download from a public site.