First of all lets understand what is a Bloom Filter and how does it work and what are its usages.
Bloom Filter :
Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.
The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set
For example, checking availability of username. In this case its a set membership problem, where the set is the list of all registered username. The price we pay for efficiency is that it is probabilistic in nature. That means, there might be some False Positive results. False positive means, it might tell that given username is already taken but actually it’s not.
here in this scenario we are searching for the user name Mary
in the bloom filter. So from the picture we can see a bloom filter is a combination of a hash function and a bit array or list. Where after hashing the user name will give us a number in which we will set the bit to true.
So if we use different hash functions and then update the bit array with 1 where the index of the array would be number provided by the hash function. Then if we search the name again with those hash functions and check the index’s again and if we find 1 there then we can plausibly say that, that name exist in the bloom filter.
So, in a nutshell:
- If we search for a value and see any of the hashed indexes for this value is ‘0’ then, the value is definitely not on the list.
- If all of the hashed indexes is ‘1’ then ‘maybe’ the searched value is on the list.
Properties :
- unlike a standard hash table, a Bloom filter of a fixed size can represent a set with large number of elements.
- False positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.
So let us discuss what is false positive in the case of getting information from a bloom filter.
The question is why we said “Probabilistic”, why this uncertainty? Let’s understand this with an example. Suppose we want to check whether “Mohibul” is present or not. We’ll calculate hashes using h1, h2 and h3 functions
|
|
If we check the bit array, bits at these indices are set to 1.
Now lets check for another name “Mary”
|
|
but we know that “Mary” was never added to the filter. Bit at index 2
and 5
was set when we added “Mohibul” . And if another name made 1 bit true then this bloom filter will say that “Mary” exists in the filter but in real we didnt add Mary.
This behavior causes false positives.
- Bloom filters never generate false negative result, i.e., telling you that a username doesn’t exist when it actually exists.
- Deleting elements from filter is not possible. Because, if we delete a single element by clearing bits at indices generated by k hash functions, it might cause deletion of few other elements. Example – if we delete “Mohibul” (in given example) by clearing bit at 4, 5 and 2, we might end up deleting “Mary” also if its present Because bit at index 2 and 5 becomes 0 and bloom filter claims that “ Mary ” is not present.
Let us make a bloom filter ourselves.
Before doing that let us ask some question that will help us better design and understand
- What are these hash functions?
- How big should I make the bloom filters?
- How many hash functions should i use?
- Benefits of bloom filter?
- Where can I use them?
What is Hash functions
The hash functions used in a Bloom filter should be independent and uniformly distributed. They should also be as fast as possible (cryptographic hashes such as sha1
, though widely used thus are not very good choices).
Examples of fast, simple hashes that are independent enough includes murmur, xxHash, Fowler–Noll–Vo hash function and many others
How big should I make the bloom filters
It’s a nice property of Bloom filters that you can modify the false positive rate of your filter. A larger filter will have less false positives, and a smaller one more.
Your false positive rate (P) will be approximately
(1-e^((-kn)/m)))^k
so you can just plug the number n of elements you expect to insert, and try various values of k and m to configure your filter for your application.
Here
m - is the size of the bit array
n - is the number of elements
k - is the number hash funtions
This leads to an obvious question
How many hash functions should I use?
The more hash functions you have, the slower your bloom filter, and the quicker it fills up. If you have too few, however, you may suffer too many false positives.
Since you have to pick k when you create the filter, you’ll have to ballpark what range you expect n to be in. Once you have that, you still have to choose a potential m (the number of bits) and k (the number of hash functions).
So now we can design a process to build a bloom filter
- Choose a approx value for n
- Choose a value for m
- Calculate the optimal value of k
- Calculate the error rate for our chosen values of n, m, and k. If it’s unacceptable, return to step 2 and change m; otherwise we’re done.
Now for the code we will use Java’s BitSet API
|
|
here Function<T,Integer> hashFunctions
supplied by the client. And BitSet exactly works like a bit array.
Now we have tried our custom BloomFilter Google guava has a implementation of BloomFilter which can be created like
|
|
Benifits of Bloom Filter
Given a Bloom filter with m bits and k hashing functions, both insertion and membership testing are O(k). That is, each time you want to add an element to the set or check set membership, you just need to run the element through the k hash functions and add it to the set or check those bits.
The space advantages are more difficult to sum up; again it depends on the error rate you’re willing to tolerate. It also depends on the potential range of the elements to be inserted; if it is very limited, a deterministic bit vector can do better. If you can’t even ballpark estimate the number of elements to be inserted, you may be better off with a hash table or a scalable Bloom filter.
Where Can I Use It
Bloom filter is all about testing Membership in a set. The classic example of using bloom filters is to reduce expensive disk (or network) lookups for non-existent keys. As we can see that bloom filters can search for a key in O(k) constant time, where k is the number of hash functions, it will be very fast to test non-existence of a key.
If the element is not in the bloom filter, then we know for sure we don’t need to perform the expensive lookup. On the other hand, if it is in the bloom filter, we perform the lookup, and we can expect it to fail some proportion of the time (the false positive rate).
For some more concrete examples:
- You’ve seen in our given example that we could’ve use it to warn the user for weak passwords.
- You can use bloom filter to prevent your users from accessing malicious sites.
- Instead of making a query to an SQL database to check if a user with a certain email exists, you could first use a bloom filter for an inexpensive lookup check. If the email doesn’t exist, great! If it does exist, you might have to make an extra query to the database. You can do the same to search for if a ‘Username is already taken’.
- You can keep a bloom filter based on the IP address of the visitors to your website to check if a user to your website is a ‘returning user’ or a ‘new user’. Some false positive value for ‘returning user’ won’t hurt you, right?
- You can also make a Spell-checker by using bloom filter to track the dictionary words.
- Want to know how Medium used bloom filter to decide if a user already read post? Read this mind-blowing, freaking awesome article about it.
in this wikipidia link there are multiple other usage of bloom filter.