probabilistic database lookup

Series: [blog]

Assuming a database DB is slow to query for x. It reduces pressure on the DB if you can avoid a query. But how to know the DB will find nothing for x without a query to the database?

Add a preprocessing step, to handle at least the majority of queries which will not get a match in the database.

the oracle

Aside from syntax and domain specific checks (don’t forget them!), a probabilistic approach can help. The task is to create an oracle, which will either answer

In the later case you still need to query, but with preprocessing and a fast oracle, you can safe quite some time in average. The disadvantage is, the oracle runtime is now added to the runtime for each query which will actually find something. (Exception: If you can interrupt a started database query in a time effective fashion. Then you can start oracle and query at the same time and stop the query if the oracle confirms a no-match.)

Note: This is equivalent to probabilistic statements about sets. You can combine oracles to model (probabilistic!) set operations.

implementation

In order to have a short runtime for the oracle, the computation should happen without touching to much memory and causing memory cache misses. (Memory controllers tend to be quite slow compared to the logic unit.) Thus working on a fixed (and configurable) amount of memory helps. The oracle ‘query’ runtime should not depend on the amount of what is actually stored in the database. With hash functions you can now create a Swiss cheese model for our oracle.

There are well known approaches for such oracles