An interview problem that I've gotten fairly often is, "Given a stream of elements, how do you get the median, or average, or sum of the elements in the stream?"
I've thought about this problem a lot, and my naive implementation was to put the elements in a hashmap (a dictionary in Python) and then pass over the hashmap with whatever aggregation function you need.
For example,
import typing
import sys
from collections import defaultdict
stream: list[int] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
medium_stream: list[int] = [i for i in range(1000)]
large_stream: list[int] = [i for i in range(1000000)]

aggregate = defaultdict(int)
for element in stream:  # swap in medium_stream or large_stream for the other runs
    aggregate[element] += 1
# Small stream
print(aggregate)
print("Dict size:", sys.getsizeof(aggregate))
defaultdict(<class 'int'>, {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1})
Dict size: 360
# Medium stream
print("Dict size:", sys.getsizeof(aggregate))
Dict size: 36960
# Large stream
print("Dict size:", sys.getsizeof(aggregate))
Dict size: 41943136 (~41 MB)
However, both the time to populate the dictionary and the memory it occupies scale linearly, O(n), with the number of distinct elements you add.
What do we do when we have a stream with more than a million elements, as we usually do when working with large amounts of log data?
I recently came across a great post on ways to go beyond the naive implementation, using the algorithm from the paper "Distinct Elements in Streams" and comparing and contrasting it with HyperLogLog.
The first example, which uses probabilities to shrink the problem space, is clearly walked through in that blog post.
But HyperLogLog was not clear to me, so I wanted to walk through it.
In a large stream, we can't afford to keep track of every single number. But what if there were a way to look at a simpler representation of each number and work out the probability that part of that representation has already appeared in the dataset?
Our assumption for this: we expect the values we're looking at to be uniformly distributed and random.
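Real-world values rarely come uniformly distributed, but hashing them gets us close enough. Here's a minimal sketch, assuming we map each element to a 64-bit integer by truncating a SHA-256 digest (hash64 is just an illustrative name for this post, not a standard function):

import hashlib

def hash64(value) -> int:
    # Map an arbitrary stream element to a roughly uniform 64-bit integer.
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")

# Even sequential inputs produce hash values that look uniformly random.
print([hash64(i) for i in range(3)])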
One such simpler representation is binary. We can look at a number's binary representation, for example:
>>> bin(1)
'0b1'  # 1
>>> bin(100)
'0b1100100'  # 1100100
And so on. If we can break each number down into its individual component bits, we can more easily check for patterns in those bits.
So, given a random stream of integers viewed as fixed-width binary, 50% of the representations will start with 1, 25% will start with 01, 12.5% will start with 001, and so on. The probability of a number starting with a longer, more specific pattern gets smaller and smaller. So if the rarest pattern you've seen is 001, which occurs for only 12.5% of numbers, there's a good chance the cardinality, aka the number of distinct elements in the stream, is about 8, because 12.5% represents 1/8 of the set.
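To sanity-check those percentages, here's a quick simulation sketch, assuming uniform random 32-bit values viewed as fixed-width bit strings (prefix_fraction is just an illustrative name):

import random

def prefix_fraction(prefix: str, trials: int = 100_000, bits: int = 32) -> float:
    # Fraction of uniform random fixed-width bit strings that start with the prefix.
    hits = 0
    for _ in range(trials):
        value = random.getrandbits(bits)
        binary = format(value, f"0{bits}b")  # fixed width keeps leading zeros
        if binary.startswith(prefix):
            hits += 1
    return hits / trials

print(prefix_fraction("1"))    # ~0.5
print(prefix_fraction("01"))   # ~0.25
print(prefix_fraction("001"))  # ~0.125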
To see a run of x leading zeros, you need roughly 2^x distinct values. So you only track the maximum number of leading zeros you came across: if the maximum is 3, then 2^3 = 8, and you probably came across about 8 distinct requests. The more requests you see, the lower the variance. The space requirement is negligible because we no longer need O(n) storage, just the longest run of zeros seen so far, and for a 64-bit value that run is at most 64.
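Here's a minimal sketch of that single-counter estimator, reusing the hypothetical hash64 helper from the earlier sketch; the estimate is simply 2 raised to the longest run of leading zeros observed:

def leading_zeros(value: int, bits: int = 64) -> int:
    # Count leading zeros in the fixed-width binary form of value.
    binary = format(value, f"0{bits}b")
    return len(binary) - len(binary.lstrip("0"))

def estimate_cardinality(stream) -> int:
    # Remember only the longest run of leading zeros seen across all hashed elements.
    max_zeros = 0
    for element in stream:
        max_zeros = max(max_zeros, leading_zeros(hash64(element)))
    return 2 ** max_zeros

print(estimate_cardinality(range(1000000)))  # a rough, high-variance estimate of ~1M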
Improvements: "whenever there is some problem in computer science, we hash"
- Reduce the variance: hash each number, use the first few bits of the hash as a bucket address, track the maximum run of leading zeros seen within each bucket, and then average the per-bucket maxima into a single estimate (sketched after this list).
- You can also use k hash functions instead, or take the first couple of bits of a single hash and use them to route each value to a bucket.
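Here's a rough sketch of that bucketing idea, again reusing the hash64 and leading_zeros helpers from the earlier sketches. Note that real HyperLogLog applies a harmonic mean and a bias-correction constant; this version just averages the per-bucket maxima as described above:

def bucketed_estimate(stream, bucket_bits: int = 6) -> float:
    # First bucket_bits bits of the hash pick a bucket; the rest feed the zero counter.
    num_buckets = 2 ** bucket_bits
    max_zeros = [0] * num_buckets
    for element in stream:
        h = hash64(element)
        bucket = h >> (64 - bucket_bits)              # bucket address from the leading bits
        rest = h & ((1 << (64 - bucket_bits)) - 1)    # remaining bits
        max_zeros[bucket] = max(max_zeros[bucket],
                                leading_zeros(rest, bits=64 - bucket_bits))
    avg_zeros = sum(max_zeros) / num_buckets          # plain average of per-bucket maxima
    return num_buckets * (2 ** avg_zeros)

print(round(bucketed_estimate(range(1000000))))  # lower variance, though biased high without the usual correction constant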