Fast and concise probabilistic filters in Python

Sometimes you need to filter out or filter in data quickly. Suppose that your employer maintains a list of forbidden passwords or URLs or words. You may store them in a relational database and query them as needed. Unfortunately, this process can be slow and inefficient.

A better approach might be to use a probabilistic filter. A probabilistic filter is a sort of ‘approximate set’. You can ask it whether a key is present in the set, and if it is present, then you will always get ‘true’ (the correct answer). However, when the key is not present, you may still get ‘true’, although with a low probability. So the probabilistic filter is sometimes wrong. Why would you accept a data structure that is sometimes wrong? Because it can be several times smaller and faster than querying directly the actual set.

The best known probabilistic filter is the Bloom filter, but there are many others. For example, we recently presented the binary fuse filters which are smaller and faster than Bloom filters, for large and immutable sets.

The pyxorfilter module in Python is an implementation of the binary fuse filters. It provides support for several filter types but both Fuse8 and Fuse16 are interesting if you have fairly large sets. They provide respectively a 0.39% and 0.0015% false probability rate. So a Fuse16 filter is almost nearly correct. Why would you prefer Fuse8? Because it uses half the memory.

We can construct a probabilistic filter in Python like so with the pyxorfilter filter:

from pyxorfilter import Fuse8
data = [uuid.uuid4() for i in range(2000000)]

filter = Fuse8(len(data))
filter.populate(data)

Once it is done, you can be certain that all the content of your initial set is in the filter:

for d in data:
  assert filter.contains(d)

You can save the filter on disk and check how much memory is used…

f = open(filename, 'wb')
f.write(filter.serialize())
f.close()
print(os.path.getsize(filename)*8.0/len(data))

If your set is large enough (say 1000,000 elements), you will find that the memory usage is about 9 bits per entry. It grows a bit larger per entry as the set gets smaller. For smaller sets, the pyxorfilter module offers an alternative (Xor8) that can be slightly more efficient in these cases.

How do you know if you can trust the filter? Just query random inputs (highly likely not to be present) and see how many falsely appear to be in the set:

# estimate false positive rate
N = 1000000
count = 0
for i in range(N):
count += filter.contains(uuid.uuid4())
fpp = count/N*100.0

As I already implied, if you replace Fuse8 by Fuse16, then the memory usage per element goes up to about 18 bits, but the false positive rate is far lower: 0.00200%.

I produced a small benchmark. On my laptop, I get that you get over 1 million queries per second (each time checking the presence of a string). On an Intel-based server, I get a lower number, so about half a million per second.

For binary fuse filters, it does not matter whether the element is in the set or not as far as performance goes, so I use random inputs. When using a Bloom filter (say), you would typically get worse performance when the elements are in the set.

The pyxorfilter  module was created Amey Narkhede. It still early days. I expect you can install pyxorfilter the usual way (pip install pyxorfilter) under x64 Linux. Unfortunately, Windows and other platfomrs, there are issues getting the module installed (see issue 10) since binary modules are not available for all platforms. The pip installer should try to install it from the source code. If you are a Python hacker, you can build it from source relatively easily:

git clone --recurse-submodules https://github.com/glitzflitz/pyxorfilter
cd pyxorfilter
python setup.py build_ext
python setup.py install

I am sure Amey would appreciate it if experienced Python hackers could help resolve the few remaining issues.

There are functionalities that pyxorfilter misses. Currently, you can save the filter to disk and recover it later. Sadly, to use it, you need to load the whole filter in memory. That is not needed. It might be more suitable to use memory file mapping or even other lower-level input-output operations.

What if you do not care about Python? You can use the xor and binary fuse filters in Go, C or C++, Zig, Rust, etc. I just love working in Python when I can.

Daniel Lemire, "Fast and concise probabilistic filters in Python," in Daniel Lemire's blog, March 31, 2024.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.