Document identifier data set
Packaged by D. Lemire on April 3rd 2014
Based on data sets prepared by L. Boytsov using software available at https://github.com/searchivarius/IndexTextCollect
Update: As of May 2023, I no longer make the file available. I recommend that you consider Real data sets for bitmap testing and RealisticTabularDataSets.
File name: IntegerCompression2014.3april2014.zip
File size: 70 GB
You can get the file by sftp through [email protected] or [email protected] (password sftpuser). You can find free sftp clients online. If you have a Mac or Linux box, then sftp is installed by default.
Please report any problem you have with downloading the file.
The archive contains a README file describing the content. The data is further described in our papers, see for example:
- Daniel Lemire, Nathan Kurz, Leonid Boytsov, SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience 46 (6), 2016. (arXiv:1401.6399)
- Jeff Plaisance, Nathan Kurz, Daniel Lemire, Vectorized VByte Decoding, International Symposium on Web Algorithms 2015, 2015. (arXiv:1503.07387)
- Daniel Lemire, Leonid Boytsov, Decoding billions of integers per second through vectorization, Software: Practice and Experience 45 (1), 2015. (doi:10.1002/spe.2203)
This data can be used with the SIMDCompressionAndIntersection software library.