In software, you frequently need to check whether some object is in a set. For example, you might have a list of forbidden Web addresses. As someone enters a new Web address, you may want to check whether it is part of your blacklist. Or maybe you have a large list of already used passwords and you want to check whether a proposed new password is part of this list of compromised passwords.
The standard way to solve this problem is to create some key-value data structure. If you have enough memory, you might create a hash table. Or you might use a good old database.
However, such approaches might use too much memory or be too slow. So you want a small data structure that can quickly filter the requests. For example, what if you had a tiny data structure that could reliably tell you that your password is not in the list of compromised passwords?
One way to implement such a filter would be to compute a hash value for each of your objects. A hash value is a random-looking number that you generate from your object, in such a way that the same object always generates the same random-looking number, but other objects are likely to generate different numbers. So the password “Toronto” might map to the hash value 32. You can then store these small numbers in a key-value store, like a hash table. Hence, you do not need to store all of the objects. For example, you might store a 32-bit number for each possible password. If you give me a potential password, I check whether its corresponding 32-bit value is in the list: if it is not, then I tell you that it is safe. So if you give me “Toronto”, I check whether 32 is in my table. If it is, I send your request to a larger database for a full lookup. The probability that I send you in vain for a full lookup is called the “false positive probability”.
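To make this concrete, here is a minimal sketch in Go of that idea (this is only an illustration of the hash-table-of-hashes approach, not the xor filter; the FNV hash and the sample passwords are arbitrary choices):

package main

import (
	"fmt"
	"hash/fnv"
)

// hash32 maps a password to a 32-bit, random-looking value.
func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func main() {
	compromised := []string{"Toronto", "123456", "password"}

	// Store only the 32-bit hash values, not the passwords themselves.
	seen := make(map[uint32]struct{}, len(compromised))
	for _, p := range compromised {
		seen[hash32(p)] = struct{}{}
	}

	// A miss means the password is definitely not in the list;
	// a hit means "maybe", so you would do a full lookup in the big database.
	if _, maybe := seen[hash32("correct horse battery staple")]; maybe {
		fmt.Println("maybe compromised: do a full lookup")
	} else {
		fmt.Println("definitely not in the compromised list")
	}
}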
Though this hash table approach might be practical, you could end up using 4 bytes per value. Maybe that is too much? Bloom filters come to the rescue. A Bloom filter works similarly, except that you compute several hash values from your object. You use these hash values as indexes into an array of bits. When adding an object to the set, you set all of the bits corresponding to the object to one. When you receive a new object and want to check whether it belongs to the set, you just check whether all of these bits are set. In practice, you will use far fewer than 4 bytes per value (say 12 bits) and still achieve a false positive rate of less than 1%.
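As an illustration, a toy Bloom filter in Go might look like the sketch below (this is just the textbook scheme, not the code discussed in this post; it derives k indexes from two base hashes via double hashing):

package bloom

import "hash/maphash"

// Bloom is a minimal Bloom filter: one bit array, k derived hash functions, no deletions.
type Bloom struct {
	bits  []uint64
	m     uint64 // number of bits in the filter
	k     uint64 // number of hash functions
	seed1 maphash.Seed
	seed2 maphash.Seed
}

func New(mBits, k uint64) *Bloom {
	return &Bloom{
		bits:  make([]uint64, (mBits+63)/64),
		m:     mBits,
		k:     k,
		seed1: maphash.MakeSeed(),
		seed2: maphash.MakeSeed(),
	}
}

// indexes derives k bit positions from two base hashes (double hashing).
func (b *Bloom) indexes(key string) []uint64 {
	h1 := maphash.String(b.seed1, key)
	h2 := maphash.String(b.seed2, key)
	idx := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		idx[i] = (h1 + i*h2) % b.m
	}
	return idx
}

// Add sets all of the bits corresponding to the key.
func (b *Bloom) Add(key string) {
	for _, i := range b.indexes(key) {
		b.bits[i/64] |= 1 << (i % 64)
	}
}

// Contains returns false only if the key was definitely never added.
func (b *Bloom) Contains(key string) bool {
	for _, i := range b.indexes(key) {
		if b.bits[i/64]&(1<<(i%64)) == 0 {
			return false
		}
	}
	return true
}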
While the Bloom filter is a textbook algorithm, it has some significant downsides. A major one is that it needs many data accesses and many hash values to check that an object is part of the set. In short, it is not optimally fast.
Can you do better? Yes. Among other alternatives, Fan et al. introduced Cuckoo filters which use less space and are faster than Bloom filters. While implementing a Bloom filter is a relatively simple exercise, Cuckoo filters require a bit more engineering.
Could we do even better while limiting the code to something you can hold in your head?
It turns out that you can with xor filters. We just published a paper called Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters that will appear in the Journal of Experimental Algorithmics.
The following figure gives the number of bits per entry versus the false positive probability. Xor filters offer better accuracy for a given memory budget. With only 9 bits per entry, you can get a false positive probability much less than 1%.
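As a back-of-the-envelope check on these numbers: with k-bit fingerprints, an xor filter has a false positive probability of roughly 2^-k while using about 1.23 × k bits per entry, so 8-bit fingerprints give about 1/256 ≈ 0.39% false positives at about 9.84 bits per entry.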
The complete implementation in Go fits in less than 300 lines and I did not even try to be short. In fact, any semi-competent Go coder can make it fit within 200 lines.
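For example, a typical use of the Go package might look like the following sketch (assuming the Populate(keys []uint64) *Xor8 constructor quoted in the comments below and a Contains(key uint64) bool method; the exact API may differ between versions):

package main

import (
	"fmt"

	"github.com/FastFilter/xorfilter"
)

func main() {
	// In practice, these 64-bit keys would be hashes of your objects
	// (for example, hashed passwords).
	keys := []uint64{1, 2, 3, 1000, 10000}

	filter := xorfilter.Populate(keys)

	fmt.Println(filter.Contains(1000)) // true: the key was added
	fmt.Println(filter.Contains(4))    // almost certainly false
}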
We also have an optimized C version that can be easily integrated into your projects since it is a single header. It is larger than 300 lines, but it contains several alternatives, including an approach with slightly faster construction. I wrote a small demonstration in C around the compromised-password detection problem. The xor filter takes a bit longer to build, but once built, it uses less memory and is about 25% faster.
We also have Java and C++ implementations. We have Rust, Python and Erlang versions.
An xor filter is meant to be immutable. You build it once, and simply rebuild it when needed. Though you can update a Bloom filter by adding keys to it, doing so means either overallocating the initial filter or sacrificing accuracy. These filters are typically not meant to be used as dynamic data structures (unlike a hash table) since they have a fixed capacity. In the case of cuckoo filters, once you approach the maximum capacity (say beyond 94%), the insertion of new values may fail, and the solution when it does is to rebuild the whole thing.
Deletions are generally unsafe with these filters even in principle because they track hash values and cannot deal with collisions without access to the object data: if you have two objects mapping to the same hash value, and you have a filter on hash values, it is going to be difficult to delete one without the other. No matter what, you must somehow keep track of what was added in the filter and what was removed, and the filter itself cannot tell you. The same issue is at play when you are building the filter: the filter itself cannot tell you whether you already added an element: you have to keep track of this information yourself.
Furthermore, all of these filters are frequently used in a concurrent setting, with multiple threads. You typically cannot safely modify the filter while querying it. The simplest way to avoid expensive locks is to make it immutable. Keep in mind that these filters are meant to be small.
The construction of the xor filter requires several bytes of temporary memory per entry, but this memory can be released immediately after construction and a compact data structure remains, using just a few bits per entry.
If you do not have stringent requirements regarding memory usage, other techniques might be preferable, like blocked Bloom filters, which have unbeatable speed (at the expense of higher memory usage). Xor filters are really at their best when you want high speed and tight memory usage.
Credit: This is joint work with Thomas Mueller Graf. Xor filters are basically an efficient implementation and specialization of a theoretical construction, the Bloomier filter.
You had me all the way until immutable. I used a cuckoo filter to speed up checking for existing relationships a while back: https://maxdemarzi.com/2017/07/13/using-a-cuckoo-filter-for-unique-relationships/
Deletions and additions may fail in a Cuckoo filter. Deletions may fail in general, and additions will fail probabilistically if you get close to the capacity (say greater than 94%). If you are using it in a dynamic setting, you need a fallback (possibly a reconstruction) and you need tests. This can be abstracted away, but it requires some engineering. And, of course, if the data structure is dynamic, you need concurrency control, possibly via locking. That cannot be safely omitted.
I’m confused. Deletions work if you know the item is in the set. It’s not safe to delete without that surety, of course.
(Not in response to the thread): Cool stuff! We quite wished there was a practical implementation of Bloomier filters when designing our setsep data structures. It would be fun to see if xor filters could be extended to that setting.
Sure, an xor filter could be used for this: as in the Bloomier filter, you need a second (external) array which is mutable. Then, 2 bits of each xor filter table entry are used to compute the index within that second array (as in the BDZ MPHF). It would be good to better understand the exact use case…
The xor filter currently only stores fingerprints, but other data can be stored as well. In the “known password database” use case, you could reserve one bit per entry to indicate whether it is a very common password.
Which way?
A cuckoo filter can even contain duplicates (a small number of them).
Therefore, if you delete an element that was certainly inserted into the cuckoo filter, the deletion will not fail.
It is not a problem for data structures or file formats like ORC (https://orc.apache.org/) which are generated once and can’t be modified.
@David and @Yura
If you know what has been added and what has been removed from the set, then you can delete from the cuckoo filter. But the cuckoo filter does not give you this information. You have to keep track of it. In effect, you must have access, somehow, to the true content of the set. And, of course, if you delete content, you don’t recover the memory until you rebuild the filter.
Note that this is an issue with all probabilistic filters. Deletions somehow require you to keep track of what is in the set, but the filter cannot do that for you. You need to have this information some other way.
Immutability is fine; however, the Go implementation forces the consumer to have
len(keys) * 8
bytes of memory available to create the filter:
func Populate(keys []uint64) *Xor8
Something like:
func FromIterator(next func() uint64, size int) *Xor8
might make sense as well.
@Maciej
If you cannot afford 8 bytes per entry, then you are not going to be able to build the filter…
(Update: once constructed, the filter will use very little memory. It is only during construction that you need several bytes per entry.)
If this is true, how is 8 per entry better?
A filter (once constructed) uses about 8 bits per entry, not 8 bytes.
How big is the memory overhead during filter construction? I couldn’t get a good idea of it from my limited skim of the paper.
The memory overhead during construction is implementation dependent. It takes several bytes per entry. Maybe as high as 64 bytes per entry.
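As a rough worked example using these numbers: for 10 million keys, the finished xor8 filter occupies about 1.23 × 8 ≈ 9.84 bits per key, roughly 12 MB, whereas construction could temporarily require up to 64 bytes × 10 million ≈ 640 MB, all of which is released once the filter is built.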
Whoa, quite interesting algorithm.
A couple of questions: what is the FPP for a “two segment” filter instead of “three segment”? I suppose it will take more time to build, though, because a successful run of “Algorithm 3” is less likely. The code would be simpler if it followed the algorithm, i.e., without explicit queue segmentation. Does it run significantly faster with queue segmentation? Thank you for your attention.
The FPP is significantly worse if you use only two segments. I don’t know the formula (though it is easy to find) but it is not practical.
There are many strategies that one can use to build the filter, and yes, there is a lot of room for optimization.
I could suggest a “two segment – quad place” scheme:
side := hash & 1
h1 := reduce(hash>>1, blocklength) // reduce(x, n) maps x into [0, n)
h2 := h1 + 1 + side
if h2 >= blocklength { h2 -= blocklength }
h3 := reduce(hash>>33, blocklength)
h4 := h3 + (2 - side)
if h4 >= blocklength { h4 -= blocklength }
h3 += blocklength // the second pair lives in the second segment
h4 += blocklength
When I’ve played with cuckoo hashes, such a 2×2 construction gave results quite comparable to 3×1 (in terms of “average achievable load factor”), and most of the time it takes only two memory fetches (in terms of cache lines).
And since there will be four places, the FPP should be higher. (But the fingerprint should be taken from other bits to be independent of h1 and h3.)
Nope. I’ve achieved only a 1.35 oversize factor, which is quite a bit larger than 1.23 🙁
Another interesting thing: the FPP is close to 0.37-0.4% for any Xor8, whether with 2, 3, or 4 points. Looks like it is always 1/2^k = 1/256. Oops, yeah, it should not be less than that… shame on me.
Very nice!
One small typo: “fall positive”
How do XOR filters compare with Golomb-coded sets in terms of memory usage and query speed?
Please see Table 3 in the paper. Xor filters are much faster. The memory usage is not better with GCS.
Very nice algorithm – so short, and it makes the inspired parts easy to find. Very cool how it starts by finding at least one item that has a cell unencumbered by any other item, and then peels that item out, hopefully unencumbering cells for other items, until, with a little luck, they can all be pulled out. And then they are finalized in reverse order, because that guarantees each of them has an unencumbered hash location that can take on whatever xor value is necessary when it is its turn to store a fingerprint.
I haven’t run a lot of simulations yet, but I was surprised that each random set I tried was solved with the very first seed. The code is there to keep trying with more seeds, but I wonder what it would take for that to be necessary. If there were a need to try many seeds, the algorithm could easily be modified to make the lengths slightly longer each time too. For that matter, for a list that was going to be preserved for a very long time, there could be a variant that tries successively shorter sets.
Maybe add a comment in the code around the splitmix64 function noting that it has a period of 2^64, which is good since you don’t want the hash to create duplicates – which the algorithm neither detects nor handles.
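For reference, a sketch of the standard SplitMix64 step with such a comment (the constants are the usual SplitMix64 ones; this is only an illustration, not necessarily the exact code in the repository):

// splitmix64 advances the state by the 64-bit golden-ratio constant and
// scrambles it. As a generator it has a period of 2^64 and its outputs are a
// permutation of all 64-bit values, so it does not produce duplicate hash
// values within a single period – which matters because the construction
// neither detects nor handles duplicates.
func splitmix64(seed *uint64) uint64 {
	*seed += 0x9E3779B97F4A7C15
	z := *seed
	z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9
	z = (z ^ (z >> 27)) * 0x94D049BB133111EB
	return z ^ (z >> 31)
}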
Timings are very nice too. Just a few cycles for running Contains, regardless of the number of keys. And tens of cycles per key when building until the number of keys got to 10M at which point it was still just about 150ns per key. I could see rebuilding the filter very often when the cost is so low.
Thanks for providing a golang version. Made the algorithm very easy to follow.
It’s also interesting that doubling the size of the fingerprint from a byte to a half word doesn’t increase the number of items the set would hold – it just reduces the chance of a false positive from 1 in 256 to 1 in 64K.
In the third paragraph, the following sentence seems incomplete:
It looks like you meant an “either-or” statement but only covered the case where none of the hash function bits are set. I guess it should be something like: “… either your password might be in the list, or your password is not in the list…”
The golang version is quite concise and easy to port to rust-lang.
I am using the croaring bitmap for my indexing library, which already works well.
A quick comparison between xorfilter and croaring, on a MacBook Pro, gives a 2-3 times CPU-performance improvement over croaring.
Ref: https://github.com/bnclabs/xorfilter
Also did a Rust port from the Go version, and tried to make it a bit more idiomatic – but you beat me to it 😀 https://github.com/Polochon-street/rustxorfilter
In the paper, the membership test algorithm defined in section 3.1 evaluates h_0(x) given an x from U, but in the previous section, h_0 was defined as a function on S, which is only a subset of U. What am I overlooking? 🙂
Good catch.
I’ve found a variant of the Xor8+ algorithm that puts the first two lookups in a single cache line. BitsPerValue is almost the same as with the original Xor8+ variant. (But for the non-plus variant it is larger, since it requires 1.28x fingerprints vs 1.23x fingerprints.)
Idea: put the first two points into the same cache line in a single segment, and the third point in a second segment with rank9.
I’ve tried “2 equal segments” and “first segment twice as large”, and “equal segments” seems to give a lower BitsPerValue.
The changes are demonstrated in the Go variant (without the actual Rank9 construction, just with statistics calculation): https://github.com/FastFilter/xorfilter/pull/11/commits/c7b541fe86297e75073e0d193173feb7a2dd5180
https://github.com/FastFilter/xorfilter/pull/11
Also, https://github.com/FastFilter/xorfilter/pull/10 fixes the Go implementation to be closer to the intended Xor8+ implementation.
The Rust implementation has a failure in release mode due to the following assertion.
let fpp = matches * 100.0 / (falsesize as f64);
println!("false positive rate {}%", fpp);
assert!(fpp < 0.40, "fpp({}) >= 0.40", fpp);
Below is the output.
---- tests::test_basic2 stdout ----
seed 5965087604022038976
bits per entry 9.864 bits
false positive rate 0.4027%
thread 'tests::test_basic2' panicked at 'fpp(0.4027) >= 0.40', src\lib.rs:454:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
Should I relax the “false-positive-rate” to < 0.50?
Ref: https://github.com/bnclabs/xorfilter/pull/7
Thanks,
Have you tried increasing the size of the test? I suspect that you might have a poor estimation of the false-positive rate.
Increased the test size, and the false-positive rate consistently stays below 0.40%. Thanks,
Tried to document the sizing requirements for the Rust implementation of xorfilter here. I hope it is correct. In that case, to index 1 billion keys we may need as much as a 50GB memory footprint? Any suggestions on how to handle DGM (Disk-Greater-than-Memory) scenarios?
I think you are computing an upper bound on the memory usage, not a lower bound.
Yes, I want to cover the worst-case scenario that can lead to OOM or a memory crunch. Even without the working array, the footprint of the finger_print array alone can go beyond 9GB when we try to index 1 billion items.
Using mmap for the xor filter’s memory requirement is one idea. I would like to know of better alternatives.
I will take the discussion over to your issue.
But I think it is important to establish the context. Just storing 1 billion integers in a hash table in C++ or Java can easily use 32 GB or more.
Once you start having a billion of anything in memory, gigabytes of memory are required.
But I take your point.
This is actually where a standard Bloom filter is very good, since you need no extra storage for construction, and you can construct it incrementally. My main use is for tracking which hashes are definitely NOT already in content-addressable storage. With a fixed filter size, the ratio of false positives goes up as you insert hashes, but it’s not critical in an application like this. A gigabyte of bloom filter storage goes a very long way.
That said, it looks like the Xor filter would be great for smaller data sets.
I wonder if, for your use case, do you also need to remove entries at some point (e.g. garbage collection)? For Apache Jackrabbit Oak we use content-addressable storage as well, with garbage collection. We currently don’t use an approximate data structure.
I just read through the paper and was a bit confused by the Assigning Step. Is B supposed to be initialized with zeros or random values at that point? In the Construction Step, it seems to still be uninitialized when assign() is called with it.
I have not checked the paper recently, but it actually does not matter. You can start with a table that has garbage in it, since all that matters is that the XOR be the value you desire, and each time you change just one slot. The construction does not need to be deterministic. You have a lot of freedom.
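To illustrate with a sketch (the names stack, B, h0, h1, h2 and fingerprint are mine, not the paper’s notation): if the peeled keys are stored together with their free slot and processed in reverse order, whatever B currently holds simply gets folded into the free slot:

// B may start with arbitrary contents; stack holds the (key, free slot) pairs
// in peeling order. Processing them in reverse guarantees that none of this
// key's three slots is written again afterwards.
for i := len(stack) - 1; i >= 0; i-- {
	key, free := stack[i].key, stack[i].slot
	// Whatever garbage B[free] currently holds cancels out of the expression,
	// so after this line B[h0(key)] ^ B[h1(key)] ^ B[h2(key)] == fingerprint(key).
	B[free] = fingerprint(key) ^ B[h0(key)] ^ B[h1(key)] ^ B[h2(key)] ^ B[free]
}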
(Having read the paper)
I’m curious, what is the reason for h_0, h_1 and h_2 mapping to non-overlapping intervals of |B|?
Is it simply to make sure h_0(x), h_1(x), h_2(x), have three distinct values? Or is there some property of the random graph that it wouldn’t have if each hash function had the full range of |B| but otherwise avoided duplicates?
(Probably I can find the answer if I follow the Botelho cite).
There is obviously the problem that you are going to get a very small number of collisions if you don’t restrict the hash values, but I do not think that it is the primary concern. My expectation is that you are going to get a higher failure probability without the constraints.
I was thinking of a collision-free hash scheme. E.g. with h_0′(x) in 0..|B|-3, h_1′(x) in 0..|B|-2 and h_2′(x) in 0..|B|-1. h_1′(x) and h_0′(x)+h_1′(x) mod (|B|-1) would be two distinct values in 0..|B|-2. Similarly, adding h_2′(x) to those, mod |B|, gives a hashed triple with no duplicates.
Having now skimmed the Botelho paper, I think their proof depends on the graph being 3-partite. It’s not clear to me, yet, whether the same property (whatever it is that makes the failure probability small) would hold without it being 3-partite, but still collision-free.
And if that were true, I wonder if it would lead to a reduction of the 1.23 factor.
I expect that the answer is negative. You will need a higher (worse) ratio.
But it is not hard to test it out!
Some initial experiments suggest a collision-free triplet hash may be a very very slight improvement over the tripartite hash.
It seems to reduce the average number of attempts needed to find a suitable graph. But the reduction (if it really exists) is so slight as to not be worth the effort, maybe 1%. And it doesn’t extrapolate to a similar reduction in the capacity expansion factor.
Specifically, I ran five experiments, each with 10K trials on 1K keys. In 4 of the 5 the triplet hash reduced avg attempts (e.g. from 1.116 to 1.105). In the fifth experiment it increased avg by about 0.1%.
Obviously that’s not a thorough experiment. For one thing, the reduction is so small that it could be within the variance for 10K trials. Still, getting a “win” on 4 of 5 attempts seems promising.
I then tried to see if it would allow reduction of that 1.23. But I quickly discovered (as you probably already know) how sensitive the process is to changes in that factor. Reducing that will obviously raise the avg number of attempts. My hope was that changing that to, say, 1.22 might make the new hash scheme comparable to the tripartite hash with 1.23. But that small change raises the avg attempts by about 10% (for both hash schemes). At that point I inferred that I might only get a reduction to something like 1.229 at best, and I gave up.
P.S. my earlier description of the collision-free triplet hash was wrong in some details, but probably demonstrates the concept.