mafrosis/rethinkdb_boomerang.md

## rethinkdb_boomerang.md

      
    Raw
  

              rethinkdb_boomerang.md
            
          
    RethinkDB with Boomerang Data

This guide will get you up and running with a Boomerang dataset, and subsequently run through a couple of small queries using RethinkDB.
Get A Dataset

Create a working directory:
mkdir boomerang-rethink
cd boomerang-rethink

Install Sesame:
sudo pip install --pre sesame

Request the Sesame key from Mafro. You will be supplied with a link to a Google doc, you can then copy the content of that file into a text file on your computer, or select File>Download As>Text.
Assuming the file you acquire from Google docs is called sesame-rethinkdb.key.txt, run these instructions to grab & decrypt the DB dump:
curl -O https://s3-ap-southeast-2.amazonaws.com/boomerang-rethinkdb-backups/rethinkdb_goodlifehealthclubs_com_au_dump_2014-05-14T02_03_02.encrypted
sesame decrypt -k sesame-rethinkdb.key.txt rethinkdb_goodlifehealthclubs_com_au_2014-05-14T02:03:02.encrypted

Now import the GLHC Boomerang dataset into RethinkDB:
rethinkdb restore rethinkdb_goodlifehealthclubs_com_au_dump_2014-05-14T02:03:02.tar.gz

Create Indexes on the Boomerang Dataset

NB: this part will change in the near future
Create a new script initdb.py, with the following content:
#! /usr/bin/env python

client_ref = "goodlifehealthclubs_com_au"

import rethinkdb as r
import datetime

try:
    # create a backup DB
    conn = r.connect('localhost', 28015)
    r.db_create(client_ref).run(conn)
    r.db(client_ref).table_create('backups').run(conn)
    r.db(client_ref).table_create('products').run(conn)
    conn.close()
    print "Created DB {}".format(client_ref)
except:
    pass

# connect to the new DB
conn = r.connect('localhost', 28015, db=client_ref)

# index products: used to easily lookup carts by the product keys they contain
if 'products' not in r.table('backups').index_list().run(conn):
    start = datetime.datetime.now()
    r.table('backups').index_create('products', multi=True).run(conn)
    r.table('backups').index_wait('products').run(conn)
    print "index 'products' created in {}".format(datetime.datetime.now()-start)

# index email address: like the remarkets bucket
if 'email_address' not in r.table('backups').index_list().run(conn):
    start = datetime.datetime.now()
    r.table('backups').index_create('email_address', lambda x: x['user']['EmailAddress']).run(conn)
    r.table('backups').index_wait('email_address').run(conn)
    print "index 'email_address' created in {}".format(datetime.datetime.now()-start)
# index user agent: helpful for debugging bots etc

if 'user_agent' not in r.table('backups').index_list().run(conn):
    start = datetime.datetime.now()
    r.table('backups').index_create('user_agent', lambda x: x['cart']['user_agent'], multi=True).run(conn)
    r.table('backups').index_wait('user_agent').run(conn)
    print "index 'user_agent' created in {}".format(datetime.datetime.now()-start)

print "All indexes present"
conn.close()

Mark the script as executable and run it:
chmod u+x initdb.py
./initdb.py

Query the Boomerang Dataset

Create a file containing the following and the run it with python query.py. Work through each step in the script and understand what's happening:
#! /usr/bin/env python

import json

import rethinkdb as r
r.connect('localhost', 28015, db='goodlifehealthclubs_com_au').repl()


# print out a easy-to-read representation of a single object
def print_obj(obj):
    if 'last_updated' in obj and isinstance(obj['last_updated'], basestring) is False:
        obj['last_updated'] = obj['last_updated'].isoformat()
    print json.dumps(obj, indent=2)


print 'Total number of records'
print r.table('backups').count().run()
print '--'

# load first object which was remarketed, and print it
obj = list(r.table('backups').filter(r.row['cart']['remarketing']).limit(1).run())[0]
print_obj(obj)
print '--'

# how many carts have been remarketed
print 'Total carts remarketed'
print r.table('backups').filter(r.row['cart']['remarketing']).count().run()
print '--'

# how many remarketed carts had their survey responded to
print 'Total survey responses'
surveys = list(
    r.table('backups').filter(r.row['cart']['remarketing'][0]['survey_response']).run()
)
print len(surveys)
print '--'

for s in surveys:
    print s['cart']['last_updated']

Further Reading


General DB and infrastructure questions about RethinkDB: http://rethinkdb.com/faq/
Working examples of Python code to query RethinkDB: http://rethinkdb.com/docs/cookbook/python/
Translate your SQL skillz into ReQL: http://rethinkdb.com/docs/sql-to-reql/