Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
A lab for HyperLogLogs using Redis & Python, created with Ipython notebook

Redis is an in-memory network key/value store and nltk is the python natural language toolkit. We'll be using these libraries later for the lab.

    import redis
    import nltk

Setup our redis connection. It's on the VM, so local is fine.

    pool = redis.ConnectionPool(host="127.0.0.1", port=6379, db=0)
    r = redis.Redis(connection_pool=pool)

We ought to test that this is working properly. Let's set the 'breakfast' key to 'spam' and see if Redis will tell us what's for breakfast.

    r.set('breakfast', 'spam')
    r.get('breakfast')




    'spam'

Great! So now let's get weird. We brought the NLTK in earlier, and it comes with digitized book texts. Let's see what we have at our disposal.

    from nltk.book import *
    nltk.corpus.gutenberg.fileids()

    *** Introductory Examples for the NLTK Book ***
    Loading text1, ..., text9 and sent1, ..., sent9
    Type the name of the text or sentence to view it.
    Type: 'texts()' or 'sents()' to list the materials.
    text1: Moby Dick by Herman Melville 1851
    text2: Sense and Sensibility by Jane Austen 1811
    text3: The Book of Genesis
    text4: Inaugural Address Corpus
    text5: Chat Corpus
    text6: Monty Python and the Holy Grail
    text7: Wall Street Journal
    text8: Personals Corpus
    text9: The Man Who Was Thursday by G . K . Chesterton 1908





    [u'austen-emma.txt',
     u'austen-persuasion.txt',
     u'austen-sense.txt',
     u'bible-kjv.txt',
     u'blake-poems.txt',
     u'bryant-stories.txt',
     u'burgess-busterbrown.txt',
     u'carroll-alice.txt',
     u'chesterton-ball.txt',
     u'chesterton-brown.txt',
     u'chesterton-thursday.txt',
     u'edgeworth-parents.txt',
     u'melville-moby_dick.txt',
     u'milton-paradise.txt',
     u'shakespeare-caesar.txt',
     u'shakespeare-hamlet.txt',
     u'shakespeare-macbeth.txt',
     u'whitman-leaves.txt']

Three Shakespeare works... that sounds interesting. The NLTK is nice because it has some fast utilities that are commonly used. We can just get a quick ordered list of the words as they appear in each play and count them.

    caesar = nltk.corpus.gutenberg.words('shakespeare-caesar.txt')
    print "Words in Julias Caesar:", len(caesar)
    hamlet = nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')
    print "Words in Hamlet:", len(hamlet)
    macbeth = nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')
    print "Words in MacBeth:", len(macbeth)

    Words in Julias Caesar: 25833
    Words in Hamlet: 37360
    Words in MacBeth: 23140

Let's bring it back to HyperLogLogs now. Remember that Redis connection? Let's test the HyperLogLog type. If we pass it mostly "spam" and one "eggs" it should count 2 items.

    r.pfadd('test', 'spam', 'spam', 'spam', 'spam', 'eggs')
    r.pfcount('test')




    2

Back to Shakespeare. Let's see how fast HyperLogLogs are (on a VM, mind you) and how much space they might take up if we hit them with a moderate (but still very small) amount of data. We'll tell Redis to clear the keys first so you don't suspect me of cheating the times, but remember that HyperLogLogs are idempotent, so this is only necessary to ensure my setup time isn't skipped. HyperLogLogs in the context of natural language will tell us the cardinality of unique words for each set we count. We know how many total words are in each play, so let's see how abundant Shakespeare's vocabulary is in each play.

    r.delete('caesar')
    r.delete('hamlet')
    r.delete('macbeth')
    
    from time import time
    t0 = time()
    
    r.pfadd('caesar', *caesar)
    r.pfadd('hamlet', *hamlet)
    r.pfadd('macbeth', *macbeth)
    print "Unique words in Julias Caesar:", r.pfcount('caesar')
    print "Unique words in Hamlet:", r.pfcount('hamlet')
    print "Uniqe words in MacBeth:", r.pfcount('macbeth')
    
    t1 = time()
    
    print "It took", t1-t0, "seconds to count this."
    print "Julias Caesar is using", len(r.dump('caesar')), "bytes in memory, but is", len(r.get('caesar')), "bytes"
    print "Hamlet is using", len(r.dump('hamlet')), "bytes in memory, but is", len(r.get('hamlet')), "bytes"
    print "MacBeth is using", len(r.dump('macbeth')), "bytes in memory, but is", len(r.get('macbeth')), "bytes"


    Unique words in Julias Caesar: 3546
    Unique words in Hamlet: 5382
    Uniqe words in MacBeth: 3998
    It took 0.69383597374 seconds to count this.
    Julias Caesar is using 6086 bytes in memory, but is 12304 bytes
    Hamlet is using 7158 bytes in memory, but is 12304 bytes
    MacBeth is using 6342 bytes in memory, but is 12304 bytes

There's our answer and the response time was pretty quick for a VM. The size of HyperLogLogs is always the same in Redis. They are always 12kB. Redis compresses everything when it stores it in memory, but the compressed size is only about 7kB. This maximum size means that HyperLogLogs have an uperlimit of cardinality, at around 2^64 unique items. There is no limit to observed items, however.

So let's check out another neat feature of HyperLogLogs -- we can combine them very quickly! So if we wanted to know how many unique words there are between all 3 plays, we can simply merge the 3 previous HyperLogLogs into 1. Again, we'll clear the key so you don't think I'm cheating the times, but merging is also idempotent, so this doesn't matter.

    r.delete('shakespeare')
    
    t0 = time()
    
    r.pfmerge('shakespeare', 'caesar', 'hamlet', 'macbeth')
    print "Unique words in Julias Caesar, Hamlet and Macbeth:", r.pfcount('shakespeare')
    
    t1 = time()
    
    print "It took", t1-t0, "seconds to merge all 3 HyperLogLogs into 1"

    Unique words in Julias Caesar, Hamlet and Macbeth: 8905
    It took 0.000649929046631 seconds to merge all 3 HyperLogLogs into 1

That was pretty quick, but it's not really a use case you might need in production. HyperLogLogs are inexact because they're probabilistic, for something operating on the size of under 100k entries, it might actually be an overkill.

    import sys
    
    t0 = time()
    
    unique_caesar = set(caesar)
    unique_hamlet = set(hamlet)
    unique_macbeth = set(macbeth)
    
    print "Unique words in Julia Caesar, counted:", len(unique_caesar)
    print "Unique words in Hamlet, counted:", len(unique_hamlet)
    print "Unique words in Macbeth, counted:", len(unique_macbeth)
    
    t1 = time()
    
    print "It took python", t1-t0, "seconds to count unique elements."
    print "Julias Caesar is using", sys.getsizeof(unique_caesar), "bytes"
    print "Hamlet is using", sys.getsizeof(unique_hamlet), "bytes"
    print "MacBeth is using", sys.getsizeof(unique_macbeth), "bytes"

    Unique words in Julia Caesar, counted: 3560
    Unique words in Hamlet, counted: 5447
    Unique words in Macbeth, counted: 4017
    It took python 0.323287963867 seconds to count unique elements.
    Julias Caesar is using 131304 bytes
    Hamlet is using 131304 bytes
    MacBeth is using 131304 bytes

So Python takes half the amount of time that a HyperLogLog, but each set is 131kB, nearly 10x the size of the Redis HyperLogLogs. Redis is a networked datastore though, and Python is communicating with it over compressed TCP, so this may not be a fair comparison, even if they're both on the same host. Let's check Redis' set type as well, because I like Redis.

    r.delete('caesar')
    r.delete('hamlet')
    r.delete('macbeth')
    
    from time import time
    import sys
    
    t0 = time()
    
    r.sadd('caesar', *caesar)
    r.sadd('hamlet', *hamlet)
    r.sadd('macbeth', *macbeth)
    print "Unique words in Julias Caesar:", r.scard('caesar')
    print "Unique words in Hamlet:", r.scard('hamlet')
    print "Uniqe words in MacBeth:", r.scard('macbeth')
    
    t1 = time()
    
    print "It took", t1-t0, "seconds to count this."
    print "Julias Caesar is using", len(r.dump('caesar')), "bytes"
    print "Hamlet is using", len(r.dump('hamlet')), "bytes"
    print "MacBeth is using", len(r.dump('macbeth')), "bytes"

    Unique words in Julias Caesar: 3560
    Unique words in Hamlet: 5447
    Uniqe words in MacBeth: 4017
    It took 0.776929140091 seconds to count this.
    Julias Caesar is using 24561 bytes
    Hamlet is using 39142 bytes
    MacBeth is using 27965 bytes

It seems like a comparable speed between HyperLogLog and a set within the Redis- space, nothing notable at this scale, at least. Redis is very heavily optimized C code, while Python is thick OO-esque C code. It's probable that the speed advantage that Python has shown with sets vs. Redis is due to Python doing it all in memory, rather than communicating over localhost-TCP, like Redis is.

Regardless, What is important to note is that a complete record of all of the values -- a fully functional, parsable set -- is only twice the size at this scale. It's a nifty demo to show what HyperLogLogs are and how they could be useful, but it doesn't really demonstrate their value because storing this many values isn't all that difficult, and HyperLogLogs give you less functionality than a set. But the realm of their use case isn't too ethereal that we can't visualize it. HyperLogLogs may be an overkill for Shakespeare, but let's overkill HyperLogLogs with a bigger problem.

Warning - Here there be dragons: The above code blocks have been very reasonable and returned answers in under a second for me in a VM. Some of the code blocks below may intentionally take a long time (long iteration) and it might be wise to not run them, but instead take my word on the results.

So here, we'll simply add a million distinct (and sequential) items to the list with a random amount of duplicates on each item.

    from numpy import random
    r.delete('test')
    
    for x in xrange(1000000):
        for dupe in xrange(random.randint(1, 10)):
            r.pfadd('test', str(x))
        
    r.pfcount('test')





    1009838

So you can see that the count is off by a small margin, under 1%. But with 1,000,000 unique items...

    len(r.get('test'))




    12304

The HyperLogLog is still only 12kB in memory, which is pretty nice if you're looking to record a lot of information for different points in a service process. So finally, let's do a practical example. Let's observe multiple-URL endpoints for a few features.

  • Unique IP addresses accessing an endpoint
  • Unique responses from the endpoint

So we'll define a list of items, 4 function controllers and map them to URLs.

    items = [
        'apple',
        'banana',
        'cucumber',
        'dogfood',
        'eggplant',
        'flowers',
        'goat cheese',
        'hummus',
        'iceberg lettuce',
        'jalepenos'
    ]
    
    
    def shopping_list_controller(req_url, req_ip):
        r.pfadd('ip-' + req_url, req_ip)
        item_list = list()
        
        # Shopping lists are only 4-9 items here
        for x in xrange(random.randint(4, 10)):
            item_list.append(random.choice(items))
        
        # Ghetto hash function
        item_list.sort()
        
        r.pfadd('list-' + req_url, str(item_list))
        r.pfadd('ip-list-' + req_url, str(item_list) + req_ip)
    
        # Don't even respond to the service call. YOLO
        
    # Don't tell anyone... but all of my fake services do the same thing
    recommended_items_controller = shopping_list_controller
    cheaper_alternatives_controller = shopping_list_controller
    recipe_items_controller = shopping_list_controller
    
    routes = {
        '/list/': shopping_list_controller,
        '/recomended/': recommended_items_controller,
        '/cheapalts/': cheaper_alternatives_controller,
        '/recipe/': recipe_items_controller
    }

So now we have 4 service URLS which return some hashable list of items. You can tell from the definition above that we don't care much about how the items are chosen or that we actually get them to a requester. So let's simulate hitting these services from random IPs to see HyperLogLogs do their thing. Oh, did I mention, we don't use IPv6 or even IPv4 here, we use IPv2. (This is just a joke. We're only using 2 octets to reduce the number of possible users down to something that actually has a chance of reasonably overlapping in only 100,000 random hits.)

    for url in routes.keys():
        r.delete('ip-' + url)
        r.delete('list-' + url)
        r.delete('ip-list-' + url)
    
    def rand_ip():
        return str(random.randint(0,256)) + "." + str(random.randint(0,256))
    
    for x in xrange(100000):
        URL = random.choice(routes.keys())
        routes[URL](URL, rand_ip()) 

Whew. That was a nice simulation of a few minutes worth of fake service use, so let's see what attaching a HyperLogLog to the controller has actually bought us. We captured unique ip, unique list, and unique ip+list combo per url endpoint. Since each HyperLogLog is 12kB, we used 4x3x12kB worth of memory, or just under 144kB to record some of the following information:

    for url in routes.keys():
        print "Unique IPs on", url, ":", r.pfcount('ip-' + url)
        print "Unique lists on", url, ":", r.pfcount('list-' + url)
        print "Unique IP/list on", url, ":", r.pfcount('ip-list-' + url)
    
    
    # On the fly, merge IP recordings into one new sitewide metric
    r.delete("ip")
    r.pfmerge("ip", *["ip-" + key for key in routes.keys()])
    print "Unique IPs, sitewide:", r.pfcount("ip")
    
    # On the fly, merge list recordings into one new sitewide metric
    r.delete("list")
    r.pfmerge("list", *["list-" + key for key in routes.keys()])
    print "Unique lists, sitewide", r.pfcount("list")
    
    # On the fly, merge IP/list recordings into onew new sitewide metric
    r.delete("ip-list")
    r.pfmerge("ip-list", *["ip-list-" + key for key in routes.keys()])
    print "Unique IP/lists, sitewide", r.pfcount("ip-list")

    Unique IPs on /list/ : 20843
    Unique lists on /list/ : 14859
    Unique IP/list on /list/ : 24836
    Unique IPs on /recomended/ : 20744
    Unique lists on /recomended/ : 14956
    Unique IP/list on /recomended/ : 25317
    Unique IPs on /cheapalts/ : 21008
    Unique lists on /cheapalts/ : 15243
    Unique IP/list on /cheapalts/ : 24978
    Unique IPs on /recipe/ : 20459
    Unique lists on /recipe/ : 14911
    Unique IP/list on /recipe/ : 24855
    Unique IPs, sitewide: 51058
    Unique lists, sitewide 35024
    Unique IP/lists, sitewide 99599

Now perhaps expand your mind even further. On the fly merges take a miniscule amount of time and the counters are only 12kB each. If you treat each counter as an atomic observation that can be chained, how far down is appropriate? With 4 services making 3 observations each, let's say the atomic size for observation is a day because each is hit 1 million times. You can programatically set the key that you're using within redis to contain the date. You can set a time to live on the key (how long until Redis cleans it up) for your observation window and now you have a log of the entropy. The size in RAM that this log takes up is total observations multiplied by 12kB multiplied by the number of observations within your window. Say our time to live is 2 months and we make one HyperLogLog per day for 3 observations on 4 service URLS. You have:

    print 2*30*3*4*12, "Kilobytes"

    8640 Kilobytes

Mind you, Redis is extremely powerful, so 8.6MB with an atomic size of 1 day and 60 days of backup means you can combine and run metrics on those days. But perhaps you have a decreasing atomic size. You keep HyperLogLogs for every 10 minutes for the past 24 hours, but after 24 hours combine them into an hourly HLL, then after 60 days combine them into a daily HLL, and so on. Really you can do a ton of things with Redis, and one of the under appreciated ones is tracking entropy of a set (unique cardinality/ cardinality). Use HyperLogLogs. Good day.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.