Suppose you are creating an account on Twitch or any other website, you will immediately think of a cool username like noobmaster69🙃 and when you submit the form you immediately get User Name Already Taken😡 you will try out other cool names or even your own name but you get the same Err Message, Quite Frustrating right😂.
But have ever thought about how this works behind the scenes?🤔
Social media platforms like twitch/youtube have millions of users but still, these companies quickly check the availability of usernames by searching millions of usernames. hearing searching makes you think about linear & binary search Are they using these that we learned in high school? Nop Something far better
Suppose I have an SQL database with 10 million users, with a Unique key on the usernames column. To check if the username is taken or not I will Write the following Query
SELECT username FROM users WHERE username = $1;
It will go through 10million rows and check if the username exists or not, it's Slow right😅 . That's where Bloom Filter comes in.
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is present on a set or not. It's space-efficient and fast but it has a drawback it is probabilistic in nature and there might be false-positive(depends on the number of elements, hash function, and size ) results that are ok for storing usernames.
- False Positive increase as we add elements
- You can't delete elements from the bloom filter
- No false-negative results
Bloom Filter is a bit array of m bits, all set to zero
We pass the element to k no. of the hash function which returns a value and we divide it by the modules of the size of the bit array.
Eg: Suppose we need to add the username noobmaster69
h1("noobmaster69") % 10 = 2
h2("noobmaster69") % 10 = 9
h3("noobmaster69") % 10 = 7
h1 , h2 & h3 - Are 3 different hash function, value 10 is the size of the bit array
For now, we pass the username into 3 different hash functions which return 3 different values. Now we set bits at indices 2,9 & 7 to 1
We hash the value with 3 different hash function
h1("noobmaster69") % 10 = 2
h2("noobmaster69") % 10 = 9
h3("noobmaster69") % 10 = 7
We get indices at 2,9,7 and we check if all the indices are set to 1 in the bit array, If not element is certainly not added to the bloom filter. If all the bits are set then we can say "noobmaster69" is probably present in the bloom filter.
Now we add another username "Doctor Strange"🧙♂️ to our list
h1("Doctor Strange") % 10 = 1
h2("Doctor Strange") % 10 = 6
h3("Doctor Strange") % 10 = 10
Set bits at indices 1,5 & 10 to 1
Now we need to check if the username "wizard" is added or not in the bit array
h1("wizard") % 10 = 7
h2("wizard") % 10 = 10
h3("wizard") % 10 = 2
Now if we check the indices at 7,10,2 it's all set to 1🤷♂️, why is that happening we never added "wizard" to the bit array this is the drawback of bloom filter we get false-positive results. so how do we prevent false-positive results we can't but we can control the probability of getting a false positive by controlling the size of the bit array and the number of the hash function.
- Increase In the size of the bit array results in fewer false positives.
- Increase in the number of the hash functions results in a decrease in the probability of the false-positive results.
Check this website - Online Bloom Filter Calculator.
Data structures like hashmaps, arrays, or linked lists store the element itself which is not efficient. in hashmap, We have to store the string itself with key-value pair, what if we have 10million usernames it will take quite some space.
In the bloom filter, we don't store the data itself. we pass the string to a hash function and flip 0 to 1 in the bit array this would reduce the space of the bloom filter. we don't use a bloom filter to test if an element is present, you use it to test whether it's certainly not present, since it guarantees no false negatives. where is the Bloom filter used?🤔
- Username is taken or not
- Weak password detection (github.com/leoantony72/weak-password-detector)
- Blocking Ip
- Chrome uses a bloom filter to identify malicious URLs
Redis is an open-source in-memory database used as a cache mainly but recently Redis is being used as a primary database because it's blazingly fast✈.
Redis has a bloom filter that does most of the work we only have to insert data into Redis. So let's check it out
We will be using Docker to set up Redis🚀
docker run -p 6379:6379 --name redis-redisbloom redislabs/rebloom:latest
Our Redis instance will be running in port 6379, lets access it by Redis-CLI
docker exec -it redis-redisbloom bash -c redis-cli
We will use BF.RESERVE to make a bloom filter, it will take three arguments
BF.RESERVE usernames 0.00001 10000
- Key(usernames) - Name of the bloom filter
- err rate(0.00001) - The desired probability for false positives.
- capacity(10000) - The number of entries intended to be added to the filter.
BF.ADD keyword for adding elements to the bloom filter which takes two arguments
BF.ADD usernames noobmaster69
- key - usernames
- element - "noobmaster69"
This will add the username "noobmaster69" to the bloom filter
BF.EXISTS to check if the element exists or not, this takes two argument
BF.EXISTS usernames noobmaster69
- key - usernames
- element - noobmaster69
If the element exists it will return 1 and if it doesn't it will return 0
BF. INFO to get info on bloom filter
BF.INFO usernames
It will return these informations
I made a weak password detection system with Redis bloom filter, So how does this work ?🤔
I took 1000 most commonly used passwords from weak password github
Passed these passwords to the Redis Bloom filter and when users submit a form with a password like "password123" we will alert the user to use a strong password thus preventing weak passwords.
Check out my GitHub repo for the weak passwords detection: github.com/leoantony72/weak-password-detector
To check if the username is taken or not: github.com/leoantony72/redis-bloom-filter
That concludes this, have a great day😁