DavidJFelix/HyperLogLogs_lab.md

## HyperLogLogs_lab.md

      
    Raw
  

              HyperLogLogs_lab.md
            
          
    Redis is an in-memory network key/value store and nltk is the python natural
language toolkit. We'll be using these libraries later for the lab.
    import redis
    import nltk
Setup our redis connection. It's on the VM, so local is fine.
    pool = redis.ConnectionPool(host="127.0.0.1", port=6379, db=0)
    r = redis.Redis(connection_pool=pool)
We ought to test that this is working properly. Let's set the 'breakfast' key to
'spam' and see if Redis will tell us what's for breakfast.
    r.set('breakfast', 'spam')
    r.get('breakfast')


    'spam'
Great! So now let's get weird. We brought the NLTK in earlier, and it comes with
digitized book texts. Let's see what we have at our disposal.
    from nltk.book import *
    nltk.corpus.gutenberg.fileids()

    *** Introductory Examples for the NLTK Book ***
    Loading text1, ..., text9 and sent1, ..., sent9
    Type the name of the text or sentence to view it.
    Type: 'texts()' or 'sents()' to list the materials.
    text1: Moby Dick by Herman Melville 1851
    text2: Sense and Sensibility by Jane Austen 1811
    text3: The Book of Genesis
    text4: Inaugural Address Corpus
    text5: Chat Corpus
    text6: Monty Python and the Holy Grail
    text7: Wall Street Journal
    text8: Personals Corpus
    text9: The Man Who Was Thursday by G . K . Chesterton 1908


    [u'austen-emma.txt',
     u'austen-persuasion.txt',
     u'austen-sense.txt',
     u'bible-kjv.txt',
     u'blake-poems.txt',
     u'bryant-stories.txt',
     u'burgess-busterbrown.txt',
     u'carroll-alice.txt',
     u'chesterton-ball.txt',
     u'chesterton-brown.txt',
     u'chesterton-thursday.txt',
     u'edgeworth-parents.txt',
     u'melville-moby_dick.txt',
     u'milton-paradise.txt',
     u'shakespeare-caesar.txt',
     u'shakespeare-hamlet.txt',
     u'shakespeare-macbeth.txt',
     u'whitman-leaves.txt']
Three Shakespeare works... that sounds interesting. The NLTK is nice because it
has some fast utilities that are commonly used. We can just get a quick ordered
list of the words as they appear in each play and count them.
    caesar = nltk.corpus.gutenberg.words('shakespeare-caesar.txt')
    print "Words in Julias Caesar:", len(caesar)
    hamlet = nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')
    print "Words in Hamlet:", len(hamlet)
    macbeth = nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')
    print "Words in MacBeth:", len(macbeth)

    Words in Julias Caesar: 25833
    Words in Hamlet: 37360
    Words in MacBeth: 23140
Let's bring it back to HyperLogLogs now. Remember that Redis connection? Let's
test the HyperLogLog type. If we pass it mostly "spam" and one "eggs" it should
count 2 items.
    r.pfadd('test', 'spam', 'spam', 'spam', 'spam', 'eggs')
    r.pfcount('test')


    2
Back to Shakespeare. Let's see how fast HyperLogLogs are (on a VM, mind you) and
how much space they might take up if we hit them with a moderate (but still very
small) amount of data. We'll tell Redis to clear the keys first so you don't
suspect me of cheating the times, but remember that HyperLogLogs are idempotent,
so this is only necessary to ensure my setup time isn't skipped. HyperLogLogs in
the context of natural language will tell us the cardinality of unique words for
each set we count. We know how many total words are in each play, so let's see
how abundant Shakespeare's vocabulary is in each play.
    r.delete('caesar')
    r.delete('hamlet')
    r.delete('macbeth')
    
    from time import time
    t0 = time()
    
    r.pfadd('caesar', *caesar)
    r.pfadd('hamlet', *hamlet)
    r.pfadd('macbeth', *macbeth)
    print "Unique words in Julias Caesar:", r.pfcount('caesar')
    print "Unique words in Hamlet:", r.pfcount('hamlet')
    print "Uniqe words in MacBeth:", r.pfcount('macbeth')
    
    t1 = time()
    
    print "It took", t1-t0, "seconds to count this."
    print "Julias Caesar is using", len(r.dump('caesar')), "bytes in memory, but is", len(r.get('caesar')), "bytes"
    print "Hamlet is using", len(r.dump('hamlet')), "bytes in memory, but is", len(r.get('hamlet')), "bytes"
    print "MacBeth is using", len(r.dump('macbeth')), "bytes in memory, but is", len(r.get('macbeth')), "bytes"


    Unique words in Julias Caesar: 3546
    Unique words in Hamlet: 5382
    Uniqe words in MacBeth: 3998
    It took 0.69383597374 seconds to count this.
    Julias Caesar is using 6086 bytes in memory, but is 12304 bytes
    Hamlet is using 7158 bytes in memory, but is 12304 bytes
    MacBeth is using 6342 bytes in memory, but is 12304 bytes
There's our answer and the response time was pretty quick for a VM. The size of
HyperLogLogs is always the same in Redis. They are always 12kB. Redis compresses
everything when it stores it in memory, but the compressed size is only about
7kB. This maximum size means that HyperLogLogs have an uperlimit of cardinality,
at around 2^64 unique items. There is no limit to observed items, however.
So let's check out another neat feature of HyperLogLogs -- we can combine them
very quickly! So if we wanted to know how many unique words there are between
all 3 plays, we can simply merge the 3 previous HyperLogLogs into 1. Again,
we'll clear the key so you don't think I'm cheating the times, but merging is
also idempotent, so this doesn't matter.
    r.delete('shakespeare')
    
    t0 = time()
    
    r.pfmerge('shakespeare', 'caesar', 'hamlet', 'macbeth')
    print "Unique words in Julias Caesar, Hamlet and Macbeth:", r.pfcount('shakespeare')
    
    t1 = time()
    
    print "It took", t1-t0, "seconds to merge all 3 HyperLogLogs into 1"

    Unique words in Julias Caesar, Hamlet and Macbeth: 8905
    It took 0.000649929046631 seconds to merge all 3 HyperLogLogs into 1
That was pretty quick, but it's not really a use case you might need in
production. HyperLogLogs are inexact because they're probabilistic, for
something operating on the size of under 100k entries, it might actually be an
overkill.
    import sys
    
    t0 = time()
    
    unique_caesar = set(caesar)
    unique_hamlet = set(hamlet)
    unique_macbeth = set(macbeth)
    
    print "Unique words in Julia Caesar, counted:", len(unique_caesar)
    print "Unique words in Hamlet, counted:", len(unique_hamlet)
    print "Unique words in Macbeth, counted:", len(unique_macbeth)
    
    t1 = time()
    
    print "It took python", t1-t0, "seconds to count unique elements."
    print "Julias Caesar is using", sys.getsizeof(unique_caesar), "bytes"
    print "Hamlet is using", sys.getsizeof(unique_hamlet), "bytes"
    print "MacBeth is using", sys.getsizeof(unique_macbeth), "bytes"

    Unique words in Julia Caesar, counted: 3560
    Unique words in Hamlet, counted: 5447
    Unique words in Macbeth, counted: 4017
    It took python 0.323287963867 seconds to count unique elements.
    Julias Caesar is using 131304 bytes
    Hamlet is using 131304 bytes
    MacBeth is using 131304 bytes
So Python takes half the amount of time that a HyperLogLog, but each set is
131kB, nearly 10x the size of the Redis HyperLogLogs. Redis is a networked
datastore though, and Python is communicating with it over compressed TCP, so
this may not be a fair comparison, even if they're both on the same host. Let's
check Redis' set type as well, because I like Redis.
    r.delete('caesar')
    r.delete('hamlet')
    r.delete('macbeth')
    
    from time import time
    import sys
    
    t0 = time()
    
    r.sadd('caesar', *caesar)
    r.sadd('hamlet', *hamlet)
    r.sadd('macbeth', *macbeth)
    print "Unique words in Julias Caesar:", r.scard('caesar')
    print "Unique words in Hamlet:", r.scard('hamlet')
    print "Uniqe words in MacBeth:", r.scard('macbeth')
    
    t1 = time()
    
    print "It took", t1-t0, "seconds to count this."
    print "Julias Caesar is using", len(r.dump('caesar')), "bytes"
    print "Hamlet is using", len(r.dump('hamlet')), "bytes"
    print "MacBeth is using", len(r.dump('macbeth')), "bytes"

    Unique words in Julias Caesar: 3560
    Unique words in Hamlet: 5447
    Uniqe words in MacBeth: 4017
    It took 0.776929140091 seconds to count this.
    Julias Caesar is using 24561 bytes
    Hamlet is using 39142 bytes
    MacBeth is using 27965 bytes
It seems like a comparable speed between HyperLogLog and a set within the Redis-
space, nothing notable at this scale, at least. Redis is very heavily optimized
C code, while Python is thick OO-esque C code. It's probable that the speed
advantage that Python has shown with sets vs. Redis is due to Python doing it
all in memory, rather than communicating over localhost-TCP, like Redis is.
Regardless, What is important to note is that a complete record of all of the
values -- a fully functional, parsable set -- is only twice the size at this
scale. It's a nifty demo to show what HyperLogLogs are and how they could be
useful, but it doesn't really demonstrate their value because storing this many
values isn't all that difficult, and HyperLogLogs give you less functionality
than a set. But the realm of their use case isn't too ethereal that we can't
visualize it. HyperLogLogs may be an overkill for Shakespeare, but let's
overkill HyperLogLogs with a bigger problem.
Warning - Here there be dragons: The above code blocks have been very
reasonable and returned answers in under a second for me in a VM. Some of the
code blocks below may intentionally take a long time (long iteration) and it
might be wise to not run them, but instead take my word on the results.
So here, we'll simply add a million distinct (and sequential) items to the list
with a random amount of duplicates on each item.
    from numpy import random
    r.delete('test')
    
    for x in xrange(1000000):
        for dupe in xrange(random.randint(1, 10)):
            r.pfadd('test', str(x))
        
    r.pfcount('test')


    1009838
So you can see that the count is off by a small margin, under 1%. But with
1,000,000 unique items...
    len(r.get('test'))


    12304
The HyperLogLog is still only 12kB in memory, which is pretty nice if you're
looking to record a lot of information for different points in a service
process. So finally, let's do a practical example. Let's observe multiple-URL
endpoints for a few features.

Unique IP addresses accessing an endpoint
Unique responses from the endpoint

So we'll define a list of items, 4 function controllers and map them to URLs.
    items = [
        'apple',
        'banana',
        'cucumber',
        'dogfood',
        'eggplant',
        'flowers',
        'goat cheese',
        'hummus',
        'iceberg lettuce',
        'jalepenos'
    ]
    
    
    def shopping_list_controller(req_url, req_ip):
        r.pfadd('ip-' + req_url, req_ip)
        item_list = list()
        
        # Shopping lists are only 4-9 items here
        for x in xrange(random.randint(4, 10)):
            item_list.append(random.choice(items))
        
        # Ghetto hash function
        item_list.sort()
        
        r.pfadd('list-' + req_url, str(item_list))
        r.pfadd('ip-list-' + req_url, str(item_list) + req_ip)
    
        # Don't even respond to the service call. YOLO
        
    # Don't tell anyone... but all of my fake services do the same thing
    recommended_items_controller = shopping_list_controller
    cheaper_alternatives_controller = shopping_list_controller
    recipe_items_controller = shopping_list_controller
    
    routes = {
        '/list/': shopping_list_controller,
        '/recomended/': recommended_items_controller,
        '/cheapalts/': cheaper_alternatives_controller,
        '/recipe/': recipe_items_controller
    }
So now we have 4 service URLS which return some hashable list of items. You can
tell from the definition above that we don't care much about how the items are
chosen or that we actually get them to a requester. So let's simulate hitting
these services from random IPs to see HyperLogLogs do their thing. Oh, did I
mention, we don't use IPv6 or even IPv4 here, we use IPv2. (This is just a joke.
We're only using 2 octets to reduce the number of possible users down to
something that actually has a chance of reasonably overlapping in only 100,000
random hits.)
    for url in routes.keys():
        r.delete('ip-' + url)
        r.delete('list-' + url)
        r.delete('ip-list-' + url)
    
    def rand_ip():
        return str(random.randint(0,256)) + "." + str(random.randint(0,256))
    
    for x in xrange(100000):
        URL = random.choice(routes.keys())
        routes[URL](URL, rand_ip()) 
Whew. That was a nice simulation of a few minutes worth of fake service use, so
let's see what attaching a HyperLogLog to the controller has actually bought us.
We captured unique ip, unique list, and unique ip+list combo per url endpoint.
Since each HyperLogLog is 12kB, we used 4x3x12kB worth of memory, or just under
144kB to record some of the following information:
    for url in routes.keys():
        print "Unique IPs on", url, ":", r.pfcount('ip-' + url)
        print "Unique lists on", url, ":", r.pfcount('list-' + url)
        print "Unique IP/list on", url, ":", r.pfcount('ip-list-' + url)
    
    
    # On the fly, merge IP recordings into one new sitewide metric
    r.delete("ip")
    r.pfmerge("ip", *["ip-" + key for key in routes.keys()])
    print "Unique IPs, sitewide:", r.pfcount("ip")
    
    # On the fly, merge list recordings into one new sitewide metric
    r.delete("list")
    r.pfmerge("list", *["list-" + key for key in routes.keys()])
    print "Unique lists, sitewide", r.pfcount("list")
    
    # On the fly, merge IP/list recordings into onew new sitewide metric
    r.delete("ip-list")
    r.pfmerge("ip-list", *["ip-list-" + key for key in routes.keys()])
    print "Unique IP/lists, sitewide", r.pfcount("ip-list")

    Unique IPs on /list/ : 20843
    Unique lists on /list/ : 14859
    Unique IP/list on /list/ : 24836
    Unique IPs on /recomended/ : 20744
    Unique lists on /recomended/ : 14956
    Unique IP/list on /recomended/ : 25317
    Unique IPs on /cheapalts/ : 21008
    Unique lists on /cheapalts/ : 15243
    Unique IP/list on /cheapalts/ : 24978
    Unique IPs on /recipe/ : 20459
    Unique lists on /recipe/ : 14911
    Unique IP/list on /recipe/ : 24855
    Unique IPs, sitewide: 51058
    Unique lists, sitewide 35024
    Unique IP/lists, sitewide 99599
Now perhaps expand your mind even further. On the fly merges take a miniscule
amount of time and the counters are only 12kB each. If you treat each counter as
an atomic observation that can be chained, how far down is appropriate? With 4
services making 3 observations each, let's say the atomic size for observation
is a day because each is hit 1 million times. You can programatically set the
key that you're using within redis to contain the date. You can set a time to
live on the key (how long until Redis cleans it up) for your observation window
and now you have a log of the entropy. The size in RAM that this log takes up is
total observations multiplied by 12kB multiplied by the number of observations
within your window. Say our time to live is 2 months and we make one HyperLogLog
per day for 3 observations on 4 service URLS. You have:
    print 2*30*3*4*12, "Kilobytes"

    8640 Kilobytes
Mind you, Redis is extremely powerful, so 8.6MB with an atomic size of 1 day and
60 days of backup means you can combine and run metrics on those days. But
perhaps you have a decreasing atomic size. You keep HyperLogLogs for every 10
minutes for the past 24 hours, but after 24 hours combine them into an hourly
HLL, then after 60 days combine them into a daily HLL, and so on. Really you can
do a ton of things with Redis, and one of the under appreciated ones is tracking
entropy of a set (unique cardinality/ cardinality). Use HyperLogLogs. Good day.