probabilistic database lookup
Series: [blog]
Assume a database DB that is slow to query for a value x.
It reduces pressure on the DB if you can avoid a query.
But how can you know that the DB will find nothing for x without actually querying it?
Add a preprocessing step to handle at least the majority of queries that will not get a match in the database.
the oracle
Aside from syntax and domain-specific checks (don’t forget them!), a probabilistic approach can help. The task is to create an oracle which will answer either

- x is not in the database
- x might be in the database
In the latter case you still need to query, but with preprocessing and a fast oracle you can save quite some time on average. The disadvantage: the oracle runtime is now added to the runtime of each query that will actually find something. (Exception: if you can interrupt a running database query in a time-effective fashion, you can start the oracle and the query at the same time and stop the query once the oracle confirms a no-match.)
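A minimal sketch of that flow; `oracle.might_contain` and `db.query` are placeholder names for illustration, not a concrete API:

```python
def lookup(x, oracle, db):
    # The oracle only touches a small, fixed amount of memory,
    # so this check is cheap compared to a database round trip.
    if not oracle.might_contain(x):
        return None  # "not in the database" is a definitive answer
    # "might be in the database": pay the oracle overhead plus the
    # full query cost; false positives of the oracle also land here.
    return db.query(x)
```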
Note: This is equivalent to probabilistic statements about sets. You can combine oracles to model (probabilistic!) set operations.
implementation
To keep the oracle runtime short, the computation should happen without touching too much memory and causing memory cache misses. (Memory controllers tend to be quite slow compared to the logic unit.) Thus working on a fixed (and configurable) amount of memory helps. The oracle ‘query’ runtime should not depend on how much is actually stored in the database. With hash functions you can now create a Swiss cheese model for the oracle.
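A minimal sketch of such a fixed-memory oracle, using a single hash function over a bit array (the class name and sizes are made up for illustration):

```python
import hashlib

class TinyOracle:
    """Bit array of m slots; 'no' is definitive, 'yes' may be a false positive."""

    def __init__(self, m=1 << 20):
        self.m = m
        self.bits = bytearray(m // 8)  # fixed memory, independent of DB size

    def _slot(self, x: str) -> int:
        # Hash x uniformly onto one of the m slots.
        digest = hashlib.blake2b(x.encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def insert(self, x: str) -> None:
        i = self._slot(x)
        self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, x: str) -> bool:
        i = self._slot(x)
        return bool(self.bits[i // 8] & (1 << (i % 8)))
```

Every set bit is a hole in the Swiss cheese: unrelated values hashing to the same slot get waved through to the database. The filters below refine this idea with better memory/error trade-offs.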
There are well-known approaches for such oracles:
- bloom filters use multiple (uniformly distributed, independent) hash functions which each map to a unit vector in a binary vector space (see the sketch after this list)
  - easy to join multiple input sets (binary OR) if based on the same set of hash functions and the same vector space dimension; this allows parallelization of the creation
  - inserts increase the false positive rate
- cuckoo filters work with a list of hash tables and shift matches out
  - allow direct deletion of x without the need to rebuild the internal oracle state from scratch
  - the swaps between the hash tables decrease insertion speed as the number of values in the database grows
- Xor filters are more memory efficient and don’t require a hash function generator
  - work with a fixed number of hash functions: 3 random hash functions plus the fingerprinting function (which is a hash function itself) are enough
  - no updates of the oracle at runtime; building the oracle is a probabilistic process itself (peeling algorithm)
- binary fuse filters work as an oracle just like a Xor filter
  - restricting the random hash functions to a window allows the peeling algorithm to create a more compact internal state in shorter construction time
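Extending the single-hash sketch above to k hash functions gives the bloom filter from the first bullet. A sketch, deriving the k index positions from one digest via double hashing (one common choice, not the only one), including the binary OR join:

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _indexes(self, x: str):
        # Two 64-bit halves of one digest combined by double hashing
        # approximate k independent, uniformly distributed hash functions.
        d = hashlib.blake2b(x.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1  # force an odd step
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def insert(self, x: str) -> None:
        for i in self._indexes(x):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, x: str) -> bool:
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(x))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        # Binary OR join: only valid for the same hash setup and dimension m.
        assert (self.m, self.k) == (other.m, other.k)
        joined = BloomFilter(self.m, self.k)
        joined.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))
        return joined
```

Two workers can build filters over disjoint parts of the database in parallel and merge them with `union`; every `insert` flips more bits and thus raises the false positive rate.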