View on GitHub

samrat.github.io

Understanding Bloom Filter

Landed into this amazing blog post where the author explains about crawling a quarter billion webpage in 40 hours

In this post author used Bloom Filter to capture if the web page is already crawled.

What is a bloom filter?

How does bloom filter works


Working

Given a word, it goes through the sequence of hash functions, set the bloom filter bit based on the output of the hash

Example:

We set 2,4 and 7th bits in the bloom filter

Querying


Working1

In case we need to check if the word exists we can do the same set of functions again, if we get all the bits as 1 then the word may exist else it does not exist

There are chances of false positives, bloom filter can return a word that exists even though the word might not be present. But it never returns true negatives where it returns word that does not exist but it exist. This is because bloom filter properties like fixed size and word can never be deleted once recorded in the bloom filter.

Generally, there will be a second layer to check if the word exists if the bloom filter says exists.