For large data sets on disk, indexes are often essential. However, if your data fits in RAM, indexes are often unnecessary. They may even be harmful.
Consider a table with 10,000,000 rows and 10 columns. Using normalization, you can replace each value by a 32-bit integer, for a total of 381 MB. How long does it take to scan it all? (I sketch such a scan below.)
- When the data is on disk, scanning it takes 0.5 s. To maximize buffering, I used a memory-mapped file.
- When the data is in memory, it takes 0.06 s.
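For the curious, here is a minimal Java sketch of the kind of scan I am timing; the query (counting the rows where one column equals a given value), the column layout, and the use of parallel streams are illustrative rather than my exact benchmark code:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

public class ScanSketch {
    static final int ROWS = 10_000_000, COLS = 10;

    public static void main(String[] args) {
        // One flat int array per column: 10 columns x 10M rows x 4 bytes = 381 MB.
        int[][] table = new int[COLS][ROWS];
        for (int c = 0; c < COLS; c++)
            for (int r = 0; r < ROWS; r++)
                table[c][r] = ThreadLocalRandom.current().nextInt(10_000_000);

        long start = System.nanoTime();
        // Split the scan over the available cores and count matching rows.
        long count = IntStream.range(0, ROWS).parallel()
                              .filter(r -> table[0][r] == 42)
                              .count();
        System.out.printf("count = %d in %.3f s%n", count,
                (System.nanoTime() - start) / 1e9);
    }
}
```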
Can you bring the 0.06 s figure down with an index? Maybe. But consider that:
- Indexes use memory, sometimes forcing you to store more data on disk.
- Indexes typically slow down construction and updates.
- Indexes typically only improve the performance of some queries. This is especially true with multidimensional data.
Verify my results: My Java code is available on pastebin. I ran the tests on an iMac with an Intel Core i7 (2.8 GHz) processor.
Source: This blog post was motivated by a question from Julian Hyde, of Mondrian fame.
Further reading: Understanding what makes database indexes work
Code: Source code posted on my blog is available from a github repository.
@Maria
Quite right, Maria. In this post, I took index to mean “supplementary data structure on top of an existing row store”.
Hi, Daniel. I think your general point is a good one, that a database optimized to fit in memory often will beat a database that is not, and that people should pay close attention to getting most or all of their active data in memory.
I’m not sure the example of scanning the entire data set is all that compelling, though, since that usually isn’t necessary to get the data in memory. But I understand you are making an extreme point with that.
By the way, this reminds me of a conversation I had with Udi Manber back in 2002 when he was still at Amazon. I was arguing that his old Glimpse work (which was disk-based and did a coarse-grained index to jump into blocks of compressed data that would then be sequentially scanned) potentially could be applied to newer in-memory systems. At the time, Udi disagreed with me, I remember, but you do see something vaguely similar to that implemented in Google’s index nowadays (see the “block-based” index format now used for Google’s massive in-memory web search indexes as described in Jeff Dean’s WSDM 2009 talk). Fun stuff there.
@Greg
I predicted that we will have servers with 1 TB of RAM by 2020. If you believe this prediction, or even a weaker one, then you must concede that a lot of businesses will be able to keep most of their data in RAM. If you throw in fast persistent RAM on top of this… hard drives may become the new tape backups before long.
How many cores will servers have in 2020? I am guessing 64 cores (but I didn’t do any research).
If you combine these two effects, I think you start to see that software will spend a lot more time scanning data than it does right now.
This will probably be motivated by richer applications which access data in less predictable ways.
As for Manber’s work… I think that it often makes little sense to recover single rows from disk, because there is hardly any difference between reading 1 KB or 1 MB of data, thanks to disk buffering. RAM systems have a similar issue: it is expensive to load data into the L2 cache, but cheap to process it once it is there. So you don’t need to know exactly where your data is, as long as you can find the small block where it is. So, yes, it looks like main memory is the new disk…
@Paul
Your numbers are interesting. Consider also that SQLite was probably written in C. So the gap between my Java implementation (cooked up in less than an hour) and SQLite is quite significant.
As for memory usage, I am pretty sure that Java will use about 700 MB of RAM to store the table, but that is because of the garbage collector.
The specific query I implemented is easy to index. If that is all you need then a hash table-like index is a must.
Obviously, I was thinking about someone who has to support more generic queries. If you have to index your table for several possible types of queries, I think that the memory usage might grow quite large. For example, did you index every column? What is the memory usage like then?
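For concreteness, here is what I mean by a hash-table-like index over a single column; this is just a rough sketch, and the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Maps each distinct value of one column to the list of row ids containing it.
class ColumnHashIndex {
    private final Map<Integer, int[]> postings = new HashMap<>();

    ColumnHashIndex(int[] column) {
        Map<Integer, List<Integer>> tmp = new HashMap<>();
        for (int row = 0; row < column.length; row++)
            tmp.computeIfAbsent(column[row], k -> new ArrayList<>()).add(row);
        // Freeze each list into an int[] to reduce per-entry overhead.
        tmp.forEach((value, rows) -> postings.put(value,
                rows.stream().mapToInt(Integer::intValue).toArray()));
    }

    // Rows whose column value equals 'value'; empty array if there are none.
    int[] lookup(int value) {
        return postings.getOrDefault(value, new int[0]);
    }
}
```

Note that an index like this over every column is essentially a full copy of the table, which is where the memory goes.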
One thing to mention here is that full scans are at the core of many high-performance columnar databases. Whereas with row-based engines you need to know in advance which queries to optimize for, columnar stores often do not require preparing indexes, and in many cases they are also self-optimizing, gradually reorganizing their internal data representations for even better performance (e.g., MonetDB).
@Paul
Thanks for the great numbers.
Unsurprisingly, indexing all columns takes roughly as much memory as the original table. If you think about it, this is unavoidable, because you could reconstruct the whole table from the indexes (on every column) alone. Thus, the indexes carry just as much information as the table, only organized differently. (This opens up the possibility of throwing away the row store and keeping just the indexes… but of course, reconstructing the rows then becomes painful.)
Perhaps the most meaningful tradeoff here is that if your data just barely fits in memory, you’re better off scanning it than indexing it on disk.
I’m sure we can find many counterexamples, but this statement matches my intuition.
It is always good to choose optimal data structures even if your data fits in memory. An index, as I understand it in general, is an optimized data structure.
For example, when we processed Wikipedia data to compute semantic relationships between pairs of concepts, we stored the whole Wikipedia data set in memory (6 GB). We processed text using this information about concept relationships. And it was very slow and needed a lot of optimizations.
Yes, you are right that indexes always optimize for specific queries, for your application.
@Paul
(…) incremental reindexing should be much faster
Yes, but it is also much easier to support concurrency without the indexes.
I put 10M rows of 10 INT columns into a SQLite in-memory db (random values from 1 to 10M). The executable grows to ~750 MB, suggesting a less efficient storage mechanism than your test.
Processor is an Intel Core i7 (2.93 GHz).
Unindexed select was 2.6 seconds, about 50x worse than your results. I find this very interesting; I’m guessing that between ACID and assumptions about usage, they optimized other queries at the expense of this one?
Indexing took 36 s (supporting your implication that indexes may not be worth it if you change data far more often than you read it, although incremental reindexing should be much faster).
Indexed select took 0.5 ms, about 1/100th of your scan time (and 1/5,000th of SQLite’s own scan).
My takeaway from this quick experiment? I can imagine some edge cases where an index isn’t needed, but the 100-5,000x speed-up can offset a lot of penalties from added disk usage/slower updates.
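In case it helps, the setup was roughly equivalent to the following sketch (redone here in Java with the sqlite-jdbc driver; the schema, batch size, and driver choice are illustrative and not necessarily what I actually used):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ThreadLocalRandom;

public class SqliteScanTest {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite::memory:")) {
            Statement st = conn.createStatement();
            st.execute("CREATE TABLE t (c0 INT, c1 INT, c2 INT, c3 INT, c4 INT,"
                     + " c5 INT, c6 INT, c7 INT, c8 INT, c9 INT)");

            conn.setAutoCommit(false); // batch all inserts in one transaction
            PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO t VALUES (?,?,?,?,?,?,?,?,?,?)");
            for (int r = 0; r < 10_000_000; r++) {
                for (int c = 1; c <= 10; c++)
                    ins.setInt(c, ThreadLocalRandom.current().nextInt(1, 10_000_001));
                ins.addBatch();
                if (r % 100_000 == 0) ins.executeBatch();
            }
            ins.executeBatch();
            conn.commit();

            // Time the unindexed (full scan) query.
            long start = System.nanoTime();
            ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM t WHERE c0 = 42");
            rs.next();
            System.out.printf("unindexed count = %d in %.2f s%n",
                    rs.getLong(1), (System.nanoTime() - start) / 1e9);

            // Then build an index and time the same query again.
            st.execute("CREATE INDEX idx_c0 ON t(c0)");
        }
    }
}
```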
Just did the test of per index growth:
Each single-column index adds roughly 160 MB.
Also created an index across all the columns; that one was ~680 MB. So at the low end you’re looking at just over a GB and a half to index every column, more if you want multi-column indexes (although in my experience, just having one part of the query indexed often goes far enough in reducing the domain of possible matches to make the whole query pretty efficient).
I also realized that by setting my random numbers as (1,10M) I was just modeling a sparse range, e.g. ids, time stamps, etc. So I reran with random values from 1 to 100 to model a database where each column has a small range of values.
- Base size: ~750 MB => ~500 MB
- Single-column index: 160 MB => 140 MB
- Index across 10 columns: 680 MB => 400 MB
The size reduction is all/mostly SQLite storing ints in the smallest (power of 2) byte count needed to represent them.
Full scan (searching for 3 values to avoid excessive results) still takes 2.6 s, fully indexed is still 0.5 ms, and using just a single index gives 170 ms.
So it looks like any lessons anybody wants to take away are more or less independent of data sparsity.
Perhaps the most meaningful tradeoff here is that if your data just barely fits in memory, you’re better off scanning it than indexing it on disk. Similarly, because multi-column indexes are the largest, indexing the most common columns is likely to be superior to completely indexing each query in a non-trivial number of cases.
Interesting stuff…
What if you really need a join? That is, a denormalized structure is not an option?
For instance, to merge 10,000 records from one table with another table that is scanned sequentially?
@Itman
I’m not sure I understand the question: I’m just scanning 32-bit integers which can be foreign keys for all I care (it is irrelevant to the test in question).
Fast joins are interesting on their own, of course. One of the best ways to compute joins would be to first sort both tables and then scan them both very fast.
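A minimal sketch of that sort-merge idea on two integer key columns (assuming, for simplicity, that the keys are unique on each side):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortMergeJoin {
    // Sort both key columns, then advance two cursors in one sequential pass.
    static List<Integer> join(int[] left, int[] right) {
        int[] a = left.clone(), b = right.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        List<Integer> matches = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else {               // equal keys: emit the match, advance both sides
                matches.add(a[i]);
                i++;
                j++;
            }
        }
        return matches;
    }
}
```

The merge phase is a pure sequential scan of both sorted arrays, which is exactly the kind of access pattern that modern hardware rewards.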
One of the best ways to compute joins would be to first sort both tables and then scan them both very fast.
Sorting 10,000,000 records may take rather long. In many cases, you don’t have the luxury of performing the full sort before doing the join.
In addition, if you have a high-throughput system, 0.06 s for an operation (and there may be many such operations per single-user request) is also way too long.
@Itman
In many cases, you don’t have the luxury of performing the full sort before doing the join.
According to Intel and Oracle researchers, sort join will soon overtake hash join:
“Our analysis projects that current architectural trends of wider SIMD, more cores, and smaller memory bandwidth per core imply better scalability potential for sort-merge join. Consequently, sort-merge join is likely to outperform hash join on upcoming chip multiprocessors.”
In: Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs by Kim et al. (VLDB 2009)
In addition, if you have a high-throughput system, 0.06 s for an operation (and there may be many such operations per single-user request) is also way too long.
Right. Sure.
You could make your criticism stronger. It is likely that SQLite is only using one core to answer the indexed version of my slice-count query. Meanwhile, I’m keeping 4 cores busy for 0.06 s with my version.
This being said, my point was that you can scan through 10 million rows in 0.06 s on a regular desktop machine (server-class machines can do better). And it is only going to get better over time (more cores, more RAM, and so on).
If you have predictable, high selectivity queries, then indexes are a no brainer. I’ve spent a good chunk of my life working on indexes, so I’m not against them. But as the “Sort vs. Hash Revisited” paper I cite shows, there is a trend that makes data scanning a lot more useful.
@Itman
When sequential searching is slow, one can always divide the data into buckets, locate buckets using a fancy index and search buckets sequentially.
Yes. I’ll write more about this in 2011. 😉 I promise!
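In the meantime, a rough sketch of the bucket idea (keep only the minimum and maximum of each fixed-size bucket, skip buckets that cannot match, and scan the rest sequentially; the bucket size and the equality query are arbitrary choices here):

```java
// Coarse index: one (min, max) summary per fixed-size bucket of rows.
class BucketIndex {
    private final int[] column;
    private final int bucketSize;
    private final int[] mins, maxs;

    BucketIndex(int[] column, int bucketSize) {
        this.column = column;
        this.bucketSize = bucketSize;
        int buckets = (column.length + bucketSize - 1) / bucketSize;
        mins = new int[buckets];
        maxs = new int[buckets];
        for (int b = 0; b < buckets; b++) {
            int lo = b * bucketSize, hi = Math.min(lo + bucketSize, column.length);
            int mn = Integer.MAX_VALUE, mx = Integer.MIN_VALUE;
            for (int r = lo; r < hi; r++) {
                mn = Math.min(mn, column[r]);
                mx = Math.max(mx, column[r]);
            }
            mins[b] = mn;
            maxs[b] = mx;
        }
    }

    // Count the rows equal to 'value': skip buckets whose [min, max] range
    // cannot contain it, scan the remaining buckets sequentially.
    long count(int value) {
        long total = 0;
        for (int b = 0; b < mins.length; b++) {
            if (value < mins[b] || value > maxs[b]) continue;
            int lo = b * bucketSize, hi = Math.min(lo + bucketSize, column.length);
            for (int r = lo; r < hi; r++)
                if (column[r] == value) total++;
        }
        return total;
    }
}
```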
Daniel,
Thank you for the reference. Looks like an interesting one.
Yes, I agree that the modern trend is to rely more on sequential search algorithms, which have fewer branch mispredictions and better locality of reference.
When sequential searching is slow, one can always divide the data into buckets, locate buckets using a fancy index and search buckets sequentially.
Consider two data sets, one with 10 million records and one with 10K; there is a join between the two and a filter on the 10K set. Without sorting, the join amounts to a brute-force search, which is known to be slow.
Hey Daniel. Thanks for this.
Results on my MBP are similar to yours.
I noticed you split the scan among the available cores in the system. Wouldn’t that potentially degrade performance on HDDs?
I also ran into a problem when I set 1 column and 100 million records. It would get stuck while loading into RAM… possibly the GC complaining?
@MarcinK
Wouldn’t that potentially degrade performance on HDDs?
The blog post is entitled “For your in-memory databases, do you really need an index?”
If your data is on a slow disk, you probably need an index.
Hello, I see your blog says: “Using normalization, you can replace each value by a 32-bit integer for a total of 381 MB.” May I ask: is the value a string? Is there a specific way to convert a string into an int? Can I go back from an int to the original string?
Is there a specific way to convert a string into an int?
You can map all strings to (consecutive) integers using a map of some kind. The process is often called “normalization”, but you will also find it under the name “string interning”.
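Here is a minimal sketch of such a dictionary, including the reverse lookup you need to go from an integer back to the original string:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Maps each distinct string to a consecutive integer id, and keeps the
// reverse table so that an id can be turned back into its string.
class StringDictionary {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> values = new ArrayList<>();

    int encode(String s) {
        Integer id = ids.get(s);
        if (id == null) {
            id = values.size();   // next consecutive id: 0, 1, 2, ...
            ids.put(s, id);
            values.add(s);
        }
        return id;
    }

    String decode(int id) {
        return values.get(id);    // reverse mapping: int -> original string
    }
}
```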