leoantony72/bloom filter.md

## bloom filter.md

      
    Raw
  

              bloom filter.md
            
          
    Suppose you are creating an account on Twitch or any other website, you will immediately think of a cool username like noobmaster69🙃 and when you submit the form you immediately get User Name Already Taken😡 you will try out other cool names or even your own name but you get the same Err Message, Quite Frustrating right😂.
But have ever thought about how this works behind the scenes?🤔
Social media platforms like twitch/youtube have millions of users but still, these companies quickly check the availability of usernames by searching millions of usernames. hearing searching makes you think about linear & binary search Are they using these that we learned in high school? Nop Something far better
Suppose I have an SQL database with 10 million users, with a Unique key on the usernames column. To check if the username is taken or not I will Write the following Query
SELECT username FROM users WHERE username = $1;

It will go through 10million rows and check if the username exists or not, it's Slow right😅 . That's where Bloom Filter comes in.
Bloom Filter

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is present on a set or not. It's space-efficient and fast but it has a drawback it is probabilistic in nature and there might be false-positive(depends on the number of elements, hash function, and size ) results that are ok for storing usernames.

False Positive increase as we add elements
You can't delete elements from the bloom filter
No false-negative results

How does Bloom Filter work ?🤔


Bloom Filter is a bit array of m bits, all set to zero
How do we add elements to Bloom Filter?🤔

We pass the element to k no. of the hash function which returns a value and we divide it by the modules of the size of the bit array.
Eg: Suppose we need to add the username noobmaster69
h1("noobmaster69") % 10 = 2
h2("noobmaster69") % 10 = 9
h3("noobmaster69") % 10 = 7

h1 , h2 & h3 - Are 3 different hash function, value 10 is the size of the bit array
For now, we pass the username into 3 different hash functions which return 3 different values. Now we set bits at indices 2,9 & 7 to 1

How do we check if the element exists in the bloom filter?🤔

We hash the value with 3 different hash function
h1("noobmaster69") % 10 = 2
h2("noobmaster69") % 10 = 9
h3("noobmaster69") % 10 = 7

We get indices at 2,9,7 and we check if all the indices are set to 1 in the bit array, If not element is certainly not added to the bloom filter. If all the bits are set then we can say "noobmaster69" is probably present in the bloom filter.
Now we add another username "Doctor Strange"🧙‍♂️ to our list
h1("Doctor Strange") % 10 = 1
h2("Doctor Strange") % 10 = 6
h3("Doctor Strange") % 10 = 10

Set bits at indices 1,5 & 10 to 1

Now we need to check if the username "wizard" is added or not in the bit array
h1("wizard") % 10 = 7
h2("wizard") % 10 = 10
h3("wizard") % 10 = 2


Now if we check the indices at 7,10,2 it's all set to 1🤷‍♂️, why is that happening we never added "wizard" to the bit array this is the drawback of bloom filter we get false-positive results. so how do we prevent false-positive results we can't but we can control the probability of getting a false positive by controlling the size of the bit array and the number of the hash function.

Increase In the size of the bit array results in fewer false positives.
Increase in the number of the hash functions results in a decrease in the probability of the false-positive results.

Check this website - Online Bloom Filter Calculator.
Is Bloom Filter Space Efficient?🤔

Data structures like hashmaps, arrays, or linked lists store the element itself which is not efficient. in hashmap, We have to store the string itself with key-value pair, what if we have 10million usernames it will take quite some space.
In the bloom filter, we don't store the data itself. we pass the string to a hash function and flip 0 to 1 in the bit array this would reduce the space of the bloom filter. we don't use a bloom filter to test if an element is present, you use it to test whether it's certainly not present, since it guarantees no false negatives.
where is the Bloom filter used?🤔

Username is taken or not
Weak password detection (github.com/leoantony72/weak-password-detector)
Blocking Ip
Chrome uses a bloom filter to identify malicious URLs

Redis Bloom Filter

Redis is an open-source in-memory database used as a cache mainly but recently Redis is being used as a primary database because it's blazingly fast✈.
Redis has a bloom filter that does most of the work we only have to insert data into Redis. So let's check it out
We will be using Docker to set up Redis🚀
docker run -p 6379:6379 --name redis-redisbloom redislabs/rebloom:latest

Our Redis instance will be running in port 6379, lets access it by Redis-CLI
docker exec -it redis-redisbloom bash -c redis-cli

We will use BF.RESERVE to make a bloom filter, it will take three arguments
BF.RESERVE usernames 0.00001 10000


Key(usernames) - Name of the bloom filter
err rate(0.00001) - The desired probability for false positives.
capacity(10000) - The number of entries intended to be added to the filter.

Add elements to the bloom filter🐱‍💻

BF.ADD keyword for adding elements to the bloom filter which takes two arguments
BF.ADD usernames noobmaster69


key - usernames
element - "noobmaster69"

This will add the username "noobmaster69" to the bloom filter
Find out whether an item exists in the bloom filter🧙‍♂️

BF.EXISTS to check if the element exists or not, this takes two argument
BF.EXISTS usernames noobmaster69


key - usernames
element - noobmaster69

If the element exists it will return 1 and if it doesn't it will return 0
Get info about bloom filter

BF. INFO to get info on bloom filter
BF.INFO usernames

It will return these informations

Weak password detection🕵️‍♀️

I made a weak password detection system with Redis bloom filter, So how does this work ?🤔
I took 1000 most commonly used passwords from weak password github
Passed these passwords to the Redis Bloom filter and when users submit a form with a password like "password123" we will alert the user to use a strong password thus preventing weak passwords.
Check out my GitHub repo for the weak passwords detection: github.com/leoantony72/weak-password-detector
To check if the username is taken or not: github.com/leoantony72/redis-bloom-filter
That concludes this, have a great day😁