View on GitHub

Samrat

Understanding Bloom Filter

I recently came across an fascinating blog post where the author explains how they crawled a quarter billion webpages in just 40 hours. One of the key techniques they used was a Bloom filter to efficiently track which webpages had already been crawled.

In this post author used Bloom Filter to capture if the web page is already crawled.

What is a bloom filter?

How does bloom filter works


Working

When inserting an element (e.g., a word or URL) into a Bloom filter:

The element is passed through multiple hash functions. Each hash function outputs a position in the filter’s bit array. The bits at these positions are set to 1.

Example:

We would set the 2nd, 4th, and 7th bits in the Bloom filter to 1.

Querying


Working1

To check if an element exists in the set:

Pass the element through the same hash functions. Check if all the corresponding bits in the filter are set to 1. If all bits are 1, the element may exist in the set. If any bit is 0, the element definitely does not exist.

Limitations

Time complexity

Operation Complexity
Insertion O(k)
Find O(k)

k = Number of Hash function