Programmers often need to ‘filter out’ data. Suppose that you are given a database of users where only a small percentage are ‘paying customers’ (say 5% or less). You can write an SQL query to check whether a given user is indeed a paying customer, but it might require a round trip to your database engine. Instead, you would like to hold a small ‘filter’ in memory to quickly check whether the given user is a paying customer. When you are pretty sure it might be a paying customer, then, and only then, do you query the database.

Probabilistic filters are ideally suited for such tasks. They return “true” when the key might be found, and “false” when the key definitively cannot be found. A probabilistic filter can be fast and concise.

The conventional probabilistic filter is the Bloom filter. We construct a Bloom filter as follows. We start with an array of bits. Initially, all of the bits are set to 0. When a new value is added to the filter, we map it to several “random” locations in the array of bit. All of the bits at the matching locations are set to 1. The random mapping is done using “hash functions”. A hash function is a function that returns “random-looking” values: it is nevertheless a genuine function. That is, a given hash function always return the same output given the same input.

When querying a Bloom filter, we hash the received value to several locations in the array of bits. If all of the corresponding bits are set to 1, then we have that value might be in the corresponding set. If only just one of these bit values is 0, then we know that the value was not previous added to the set.

A Bloom filter can be remarkably efficient. But can you do better? There are many other probabilistic filters, but let us say that you want to remain close to Bloom filters.

One way to improve on the Bloom filter is to avoid mapping your values all over the array of bits. Instead, you might first map the value to a small block of bits, and then check the bits within a limited range. It can make processing much faster because you do not need multiple data loads in an array. This approach is known under the name of “blocked Bloom filter”. It is much less popular than conventional Bloom filters. One reason why it might be less popular is that blocked Bloom filters require more storage.

I suspect that another reason people might prefer Bloom filters to alternatives such as blocked Bloom filters has to do with implementation. Bloom filters are mature and it is easy to pick up a library. E.g., there is a mature Bloom filter library in Go.

What is the simplest blocked Bloom filter you could do? Instead of doing something fanciful, let us say that the “block” is just a single machine word. On current hardware, a machine word span 64 bits. You have wider registers for SIMD operations, but let us keep things simple. I am going to take my values and map them to a 64-bit word, and I will set a number of bits to 1 within this word, and only within this word. Such a word might be called a fingerprint. Thus, at query time, I will only need to check a single word and compare it with the fingerprint (a bitwise AND followed by a compare). It should be really fast and easy to implement.

Probabilistic filters, as their name implies, have a small probability of getting it wrong: they can claim that a value is in the set while it is not. We call this error margin the “false-positive rate”. You might think that you want this error to be as small as possible, but there is a tradeoff. For a false-positive rate of 1/2* ^{p}*, you need at least

*p*bits per entries. Conventional Bloom filters have an overhead of 44% in the best case scenario, so you actually need 1.44

*p*bits per entries.

A reasonable choice might be to pick a false-positive rate of 1%. To achieve such a low rate, you need at least 10 bits per entry with a conventional Bloom filter.

It turns out that I can achieve roughly the same false-positive rate while using 2 extra bits per entry for a total of 12 bits per entry. Assume that you take your input values and hash them to a random-looking 64-bit outputs. From such 64-bit ouputs, you can select a location and generate a 64-bit word with 5 bits set. To do so, I can just select 5 bits locations in [0,64), using 6 of the input bits per location. I can create the word using a function such as this one…

std::function<uint64_t(uint64_t)> fingerprint5 = [](uint64_t x) { return (uint64_t(1) << (x&63)) | (uint64_t(1) << ((x>>6)&63)) | (uint64_t(1) << ((x>>12)&63)) | (uint64_t(1) << ((x>>18)&63)) | (uint64_t(1) << ((x>>24)&63)); };

Though it is in C++, the concept is portable to most languages. You might be concerned that such code could be inefficient. Yet the LLVM clang compiler has no trouble producing code that looks efficient (x64 assembly):

shr rcx, 6 mov eax, 1 shl rax, cl bts rax, rdx mov rcx, rdx shr rcx, 12 bts rax, rcx mov rcx, rdx shr rcx, 18 bts rax, rcx shr rdx, 24 bts rax, rdx

Other compilers could possibly be less efficient. Checking for a match is as simple as the following pseudocode:

uint64_t print = fingerprint(key); if( (buffer[location(key,buffer.size())] & print) == print ) { // match }

I did not write a benchmark yet, but my expectation is that such a word-based Bloom filter would provide excellent performance.

I have published prototypical code on GitHub. In the future, I will provide some benchmarks. There is also a natural extension of this idea where instead of checking a single word, you pick two or more words.

**Important**: I do not claim that this is a “new” idea. I merely feel that it is an under-appreciated one.

Daniel Lemire, "Word-aligned Bloom filters," in *Daniel Lemire's blog*, October 3, 2021.

Different approaches are studied in

“Fast Bloom Filters and Their Generalization” by Y Qiao, et al. (the one you e described here is referred to as “bloom-1” in the paper)

Fwiw I’ve implemented bloom-1 as well here

https://hackage.haskell.org/package/unagi-bloomfilter

Here is what I think is the full reference:

Qiao, Yan, Tao Li, and Shigang Chen. “Fast Bloom filters and their generalization.” IEEE Transactions on Parallel and Distributed Systems 25.1 (2013): 93-103.

I appreciate very much the references, both to the paper and to the software. Please note that, as was clear in the post, I did not claim that this was original. If you’d like to elaborate further on your experience, please do so.

To keep your false positive rate at 1%, aren’t you limited in the number of members of the population you can map to each 64-bit word? As in, one shouldn’t map more than 100 members into each 64-bit word? Something like that?

Would be nice to clarify the scalability per word here.

The size of the backing array must grow. In my case, I am allocating 12 bits per entry, so if you have 100 entries, you need 1200 bits or about 19 64-bit words. The more distinct entries you have, the more memory you need to allocate.

It works this way with conventional Bloom filters. You need an array that is large enough if you are going to accommodate your input size.

So the scalability is simple. It is linear. 12 bits per entry. You give me the number of entries, I give you the number of words you need. It is rather easy, something like 12 * number bits. It is no different from Bloom filters.

My intuition would be that such small partitions would lead to unacceptable variance. Do you have any analytical results on variance or tail bounds for the FPP?

There’s an estimate here: https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf

Basically they take the lots of little Bloom filters and factor in the variance in populations between the buckets.

Thanks Kendall.

Thanks for the reference. Indeed, section 3 of this paper anticipates all the ideas in this blog post, I think (in particular, 3.1, which describes picking a random k-subset from the B bits of a block, using a lookup table with a single hash function as input). But they don’t seem to have numbers for block sizes other than 512 (64-byte cache line) in the paper. It should be straightforward to compute bounds on FPP for B=64 from equations 3) and 4) in the paper, though.

One thought I had here was to not word-align them and use overlapping buckets, either at byte or even bit granularity. Loads and cache lines would be unaligned of course, but still few in number.

It would create more buckets but possibly even out the variations between them.

It is worth investigating. My concern is that it makes implementation more difficult. E.g., it might be ugly in Java. So you’d want important benefits.

Pretty much the original Bloom filter strategy, no? (i.e. overlapping the bitmaps for all k hash functions). What’s now called the “partitioned” approach (distinct from the “blocked” approach because it separates the bitmaps for each hash function) is more obvious than Bloom’s approach, I think, but performs a bit worse in theory and a bit better in practice (if you believe “A Case for Partitioned Bloom Filters”).

It’s a restriction on the distance between hashes for the same key — it’s a little more work to analyze (basic Bloom filters are easier due to the independence of each bit position).

I tried this with split block Bloom filters and byte-level granularity with blocks of size 256 and lanes of size 32 (that is, setting 1 bit in each 32-bit lane inside an 8-lane block). The difference in false positive probability was negligible – less than switching to 512-bit blocks and 64-bit lanes.

Interesting — thanks for doing that. I suspect it’s equivalent to the original blocked approach.

More than word-aligned buckets, wouldn’t it be a good idea to try with cacheline-aligned buckets? Because the buckets are cacheline-aligned, all of the bit tests would hit the same cacheline. And, as long as the bit tests have no dependency with each other, the processor should still be able to execute them in parallel, hiding the latency of loading multiple words from the cache.

The benefit being that, as each bucket would be 8x or 16x times the size of word, false positives ratio would be lower for a given bloom filter size.

Checked later on https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf and it’s precisely what I was referring to.

Yes a cache-line approach works and can be implemented with SIMD instructions. In the GitHub org I link to, there are such implementations.

Note that it is not a good model to assume that processing a whole cache line is the same as processing a single word. It is more complicated.

I’d definitely recommend over the classic bloom filter design for most applications where performance matters.

We had a good experience using blocked bloom filters (256-bit blocks) for query processing in Apache Impala – any speedup in bloom filter query time was a big win. They also got ported to Apache Kudu – there’s a battle tested implementation with AVX2 support in that codebase.

The newer bloom filter support in Apache Parquet also uses a similar design – https://github.com/apache/parquet-format/blob/master/BloomFilter.md – which is a big win given the amount of data stored in that format (albeit most without bloom filter indices at this point).

That last link is very interesting: it’s actually a hybrid of the “blocking” and “partitioning” approaches (blocks are 32 bytes, partitions are 32 bits). It would be interesting to see analytical or simulation results exploring the parameter space of this hybrid approach.

Hi Tobin! I did some benchmarking, including false positive probability, here: https://arxiv.org/abs/2101.01719

https://github.com/kunzjacq/pybloom

is a python project able to compute the false positive probability of a structure with blocked bloom filters as described in the paper cited above (https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf ). It can also optimize a design based on constraints on the false positive probability or on the storage used.

(full disclosure: I a am the author of said project)

FYI the Java blocked Bloom filter I have written uses a very similar method: two words, and in each word two bits (that might be the same): https://github.com/FastFilter/fastfilter_java/blob/master/fastfilter/src/main/java/org/fastfilter/bloom/BlockedBloom.java#L60

(I’m not claiming this is new, but I didn’t copy it.)

Hm there are some copy & paste problems in the code I posted above… Probably the editor saw some XML tags.

I fixed your comment.

I think there is some paper out there about just storing a hash signature in a hash table, no full key and no value.

If the hash signature was seen before of course you get a positive. It is also possible to get a false positive (collision.)

Maybe it needs too many bits.

Anyway an insert only Robin Hood hash table sounds ideal for that, and easy to implement.

Delete operations for the Robin Hood hash table require maintaining a histogram of displacements and the unwind operation is a little tricky.

Yes, exactly. Cuckoo filters store just a signature, as do quotient filters. The former use cuckoo hashing, while the latter use Robin Hood linear probing.

I think quotient filters use the Cleary algorithm (bidirectional linear probing combined with Knuth’s “quotienting” trick), but they’re still an “ordered hash table” like Robin Hood.

Yeah there are many fingerprint table schemes around, which are theoretically superior to Bloom filters (by a factor of 1/ln 2). However, you can also store just hash codes without losing any information, if you’re hashing integers! Just apply a permutation (aka block cipher) instead of a hash function. Instead of block ciphers, I prefer to use a high-quality mixing function like the Murmur3 finalizer. You can see this approach here: https://github.com/senderista/hashtable-benchmarks.

Thanks for sharing this insightful article.