📝 @Arushi Somani May 24, 2024 9:31 PM (PDT)

<aside> 💡 Repo: https://github.com/somaniarushi/bloom

</aside>

Concept

I have seen Bloom filters used in a bunch of cases recently. Most relevantly, in a large-scale distributed web crawler, a Bloom filter is used to probabilistically check whether a value is already in a set.

While it’s completely fine to black box them, it’s important to me to understand a bunch of parts of the stack that I use at a slightly deeper level!

The core principle behind a Bloom filter is matching on hashes of the data instead of the data itself. The cost of this is some false positives, caused by hash collisions. However, a Bloom filter never produces a false negative.

As the number of elements in the Bloom filter grows, the false positive rate increases. As the size of the bit array increases, the false positive rate decreases, at the cost of more memory.

Implementation

Here’s the idea in a simple way. Imagine you have a bit array of length $N$, initially all zeros, and $k$ hash functions. To add a key, you compute each of the $k$ hashes modulo $N$ and set the corresponding buckets to 1. To check a key, you compute the same $k$ buckets; if all of them are set to 1, we say the key is probably in the set.

As stated before, it could be that all of those buckets were set to 1 by other keys, even though the key itself was never added, which creates a false positive.

It’s worth noting that Bloom filters are grow-only structures: you cannot remove a key once you have added it.
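
To make this concrete, here’s a minimal sketch of the idea in Python (my own illustration, not the code from the repo linked above). It fakes $k$ independent hash functions by salting a single SHA-256 hash with the function index, which is one common trick.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array of length N plus k salted hash functions."""

    def __init__(self, size: int, num_hashes: int):
        self.size = size              # N: length of the bit array
        self.num_hashes = num_hashes  # k: number of hash functions
        self.bits = [0] * size

    def _buckets(self, key: str):
        # Simulate k independent hash functions by salting one hash
        # with the index i, then reducing modulo the array size.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str) -> None:
        for bucket in self._buckets(key):
            self.bits[bucket] = 1

    def __contains__(self, key: str) -> bool:
        # True means "probably in the set"; False is always correct.
        return all(self.bits[bucket] for bucket in self._buckets(key))


bf = BloomFilter(size=1024, num_hashes=3)
bf.add("https://example.com")
print("https://example.com" in bf)    # True
print("https://not-added.com" in bf)  # almost certainly False (could be a false positive)
```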

Mathematical Limits

The probability of a false positive is

$$ P(\text{false pos}) = ( 1 - (1 - \frac{1}{m})^{kn})^k $$

where $k$ = number of hash functions, $n$ = expected number of elements inserted into the filter, and $m$ = size of the bit array in bits.
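
As a quick sanity check, here is that formula evaluated for some illustrative parameters (the numbers are arbitrary examples, not values from the repo):

```python
def false_positive_rate(m: int, n: int, k: int) -> float:
    # P(false pos) = (1 - (1 - 1/m)^(k*n))^k
    return (1 - (1 - 1 / m) ** (k * n)) ** k


# m = 1024-bit array, n = 100 inserted keys, k = 3 hash functions
print(f"{false_positive_rate(m=1024, n=100, k=3):.4f}")  # ~0.0164, i.e. about a 1.6% false positive rate
```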
