{
"metadata": {
"name": "",
"signature": "sha256:91e6f8f9ddaedb05bd6bb6cd8e56b360a03d739093eeac287849bf35cbb6e352"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Blaze, Bitcoin, and Bodacious Backends\n",
"\n",
"\n",
"**tl;dr: We've redesigned Blaze and need feedback. \n",
"We show off the three components of blaze with with simple examples from Bitcoin. \n",
"We ask how this might have value.**\n",
"\n",
"*This blogpost is not for public consumption. If you've found it then great!\n",
"Please read and give feedback, but please don't tweet or repost.*\n",
"\n",
"Over the last two months the Blaze team has worked hard to redesign Blaze.\n",
"During this time I've had some really useful conversations with people within Continuum\n",
"but outside\n",
"the core development team (namely Ben Zaitlen, Ely Spears, and Hugo Shi) and their \n",
"experience has helped us to reprioritize and rethink our approach. Blaze is at a point\n",
"where I'd like to give you the demo I've given them. Hopefully this generates useful \n",
"feedback to direct the future of Blaze.\n",
"\n",
"The question I have for you is the following:\n",
"\n",
"*I think that what we have is cool and interesting, how can we also make it\n",
"more immediately useful to a broad audience?*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bitcoin - our example dataset\n",
"\n",
"We'll demo blaze with a simple dataset about transactions in Bitcoin, our favorite crypto-currency. Here is the top of a 1.6 GB csv file holding Bitcoin transactions. The columns are as follows:\n",
"\n",
"* Transaction ID\n",
"* Sender ID\n",
"* Recipient ID\n",
"* Datetime of transaction, e.g. `20130410142250` -> 2013-04-10 2:22:50 pm\n",
"* Number of bitcoins sent\n",
"\n",
"```\n",
"$ head user_edges.txt\n",
"1,2,2,20130410142250,24.375\n",
"1,2,782477,20130410142250,0.7709\n",
"2,620423,4571210,20111227114312,614.17495129\n",
"2,620423,3,20111227114312,128.0405196\n",
"3,3,782479,20130410142250,47.1405196\n",
"3,3,4,20130410142250,150.0\n",
"4,39337,39337,20120617120202,0.31081764\n",
"4,39337,3,20120617120202,69.1\n",
"5,2071196,2070358,20130304143805,61.60235182\n",
"5,2071196,5,20130304143805,100.0\n",
"```\n",
"\n",
"The astute reader might ask questions like the following:\n",
"\n",
"* \"Wait, isn't bitcoin anonymous? how are there User IDs?\"\n",
"* \"Why are there two transaction IDs for each transaction?\"\n",
"\n",
"Those have interesting answers which I'll discuss with you if you ask.\n",
"For now, on to Blaze!"
]
},
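{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, the compact timestamp encoding in the fourth column is plain `%Y%m%d%H%M%S`. A minimal sketch of decoding it by hand (below we'll instead hand Blaze a hint and let it do this for us):\n",
"\n",
"```\n",
"from datetime import datetime\n",
"\n",
"# '20130410142250' -> datetime.datetime(2013, 4, 10, 14, 22, 50)\n",
"datetime.strptime('20130410142250', '%Y%m%d%H%M%S')\n",
"```"
]
},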
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Blaze, a story of three parts\n",
"\n",
"Reorganized Blaze is split into three core parts. We'll talk about each one\n",
"separately:\n",
"\n",
"1. `blaze.data` - Provides a uniform indexable view into disparate\n",
" data sources. This piece is the most mature and is ready for internal use.\n",
"2. `blaze.expr` - A symbolic expression of DataFrame-like computations. (Think SymPy or Theano for Pandas/SQL)\n",
"3. `blaze.compute` - An interpreter to various of computational backends.\n",
"\n",
"When we're ready we'll build a fourth piece\n",
"\n",
"* `blaze.interface` - Usable interfaces for data scientists. This will\n",
" include an interactive DataFrame, but could include other interfaces like \n",
" an SQL parser or Datalog engine."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `Blaze.data` - uniform access to disparate data\n",
"\n",
"Blaze operates on common storage formats including CSV, JSON,\n",
"HDF5, and SQL. For each format we offer the following functionality\n",
"\n",
"* Insert/pull off data in Python format\n",
"* Insert/pull off data in Binary/DyND/NumPy format \n",
"* Fancy indexing\n",
"\n",
"The interface is the same even when the backend is different. Lets see this in action..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example - Basic Parsing and Type Coercion\n",
"\n",
"We open up a `csv` file, tweak column names and types, and show of basic indexing"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"filename = '/home/mrocklin/data/bitcoin/data-code/user_edges.txt'"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from blaze.data.csv import CSV\n",
"dd = CSV(filename)\n",
"dd.schema"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
"dshape(\"{ _0 : int64, _1 : int64, _2 : int64, _3 : int64, _4 : float64 }\")"
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"dd.py[4]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
"(3L, 3L, 782479L, 20130410142250L, 47.1405196)"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the column names didn't come out (there weren't any in the csv file) and that the timestamp came through as an integer, not a datetime. We provide information on creation to better interpret the underlying data."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"csv = CSV(filename, \n",
" columns=['transaction', 'sender', 'recipient', 'timestamp', 'value'],\n",
" hints={'timestamp': 'datetime'})\n",
"csv.schema"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"dshape(\"{ transaction : int64, sender : int64, recipient : int64, timestamp : datetime, value : float64 }\")"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"csv.py[4]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
"(3L, 3L, 782479L, datetime.datetime(2013, 4, 10, 14, 22, 50), 47.1405196)"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example - Fancy Indexing\n",
"\n",
"We can index into the CSV file with standard fancy indexing semantics."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"list(csv.py[:4, ['transaction', 'sender', 'value']])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
"[(1L, 2L, 24.375),\n",
" (1L, 2L, 0.7709),\n",
" (2L, 620423L, 614.17495129),\n",
" (2L, 620423L, 128.0405196)]"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also get DyND arrays back if the user prefers a binary format. These can be easily turned into NumPy arrays."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"csv.dynd[:4, ['transaction', 'sender', 'value']]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"nd.array([[1, 2, 24.375], [1, 2, 0.7709], [2, 620423, 614.175], [2, 620423, 128.041]], type=\"var * {transaction : int64, sender : int64, value : float64}\")"
]
}
],
"prompt_number": 7
},
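{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of that conversion, assuming DyND's `nd.as_numpy` helper (the `allow_copy` flag is our guess at what's needed when the data must be copied):\n",
"\n",
"```\n",
"from dynd import nd\n",
"\n",
"x = csv.dynd[:4, ['transaction', 'sender', 'value']]\n",
"arr = nd.as_numpy(x, allow_copy=True)  # a structured NumPy array\n",
"```"
]
},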
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We only parse what we need"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"timeit list(csv.py[:1000, :])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"100 loops, best of 3: 2.96 ms per loop\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"timeit list(csv.py[:1000, ['transaction', 'sender']])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000 loops, best of 3: 1.48 ms per loop\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python indexing streams on the first column, supporting streaming workflows"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"csv.py[:]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
"<itertools.chain at 0x2fbc6d0>"
]
}
],
"prompt_number": 10
},
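{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `csv.py[:]` is just an iterator we can consume it lazily, without pulling the whole 1.6 GB into memory. A small sketch:\n",
"\n",
"```\n",
"from itertools import islice\n",
"\n",
"rows = csv.py[:]                      # lazy iterator over rows\n",
"first_three = list(islice(rows, 3))   # pulls only three rows off the stream\n",
"```"
]
},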
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multiple Backends - Using SQLAlchemy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That exact same interface works for a number of different backends. \n",
"\n",
"Here we create a SQLite database on disk with the same schema as our `csv` file."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from blaze.data.sql import SQL\n",
"sql = SQL('sqlite:///demo.db', 'btc', schema=csv.schema)\n",
"sql.columns"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
"['transaction', 'sender', 'recipient', 'timestamp', 'value']"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We migrate data from the CSV file to the SQL database. It's 1.6 GB and runs in ~20 minutes. \n",
"We optimized this until runtimes were mostly dominated by the SQLite backend. \n",
"Probably this could be improved by someone with the right experience."
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"# time sql.extend(csv)\n",
"# CPU times: user 12min 18s, sys: 22.6 s, total: 12min 41s\n",
"# Wall time: 25min 50s"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We accesss both data sources identically. The results we get back are identical."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"list(csv.py[:4, ['transaction', 'sender', 'value']])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
"text": [
"[(1L, 2L, 24.375),\n",
" (1L, 2L, 0.7709),\n",
" (2L, 620423L, 614.17495129),\n",
" (2L, 620423L, 128.0405196)]"
]
}
],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"list(sql.py[:4, ['transaction', 'sender', 'value']])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"[(1L, 2L, 24.375),\n",
" (1L, 2L, 0.7709),\n",
" (2L, 620423L, 614.17495129),\n",
" (2L, 620423L, 128.0405196)]"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Only the performance characteristics change"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%timeit list(csv.py[:100000, ['transaction', 'sender', 'value']])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1 loops, best of 3: 202 ms per loop\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%timeit list(sql.py[:100000, ['transaction', 'sender', 'value']])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1 loops, best of 3: 320 ms per loop\n"
]
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# `blaze.expr` - Creating Computations\n",
"\n",
"We design a simple split-apply-combine computation. Who are the big spenders out there?\n",
"\n",
"We create this computation abstractly. Think of this like a mathematical SymPy expression. \n",
"No work is being done, we're just describing the result that we want."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from blaze.expr.table import *\n",
"t = TableSymbol(csv.schema) # an abstract computation\n",
"\n",
"# Top five spenders. Sum the values of each sender, sort and then take the top five.\n",
"big_spenders = (By(t, t['sender'], t['value'].sum())\n",
" .sort('value', ascending=False)\n",
" .head(5))\n",
"big_spenders.dshape"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": [
"dshape(\"5 * { sender : int64, value : float64 }\")"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `blaze.compute` - Executing Computations\n",
"\n",
"We now have two pieces\n",
"\n",
"1. We've used `blaze.data` to access the same data from two different sources, `csv` and `sql`.\n",
"2. We've used `blaze.expr` to define an abstract computation, `big_spenders`\n",
"\n",
"We now use `blaze.compute` to execute the computation on the data. We'll do this both with a Python streaming backend and by driving SQLite."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from blaze.compute.python import compute\n",
"%time list(compute(big_spenders, csv))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"CPU times: user 2min 34s, sys: 1.12 s, total: 2min 35s\n",
"Wall time: 2min 35s\n"
]
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": [
"[(11L, 52461821.94165766),\n",
" (1374L, 23394277.034151807),\n",
" (25L, 13178095.975724494),\n",
" (29L, 5330179.983046564),\n",
" (12564L, 3669712.399824968)]"
]
}
],
"prompt_number": 19
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from blaze.compute.sql import compute\n",
"%time list(compute(big_spenders, sql))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Hello\n",
"CPU times: user 58.4 s, sys: 5.47 s, total: 1min 3s"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Wall time: 1min 6s\n"
]
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 20,
"text": [
"[(11, 52461821.94165766),\n",
" (1374, 23394277.034151807),\n",
" (25, 13178095.975724494),\n",
" (29, 5330179.983046564),\n",
" (12564, 3669712.399824968)]"
]
}
],
"prompt_number": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pandas as baseline\n",
"\n",
"I have difficulty fitting this dataset into my personal machine's memory, but it fits just fine in my new work machine (thanks Troy!) \n",
"\n",
"Lets throw this into pandas and see how it performs."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Get data out from disk\n",
"%time data = list(csv.py[:])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"CPU times: user 5min 10s, sys: 2.62 s, total: 5min 13s\n",
"Wall time: 5min 13s\n"
]
}
],
"prompt_number": 21
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Load it into a DataFrame\n",
"from pandas import DataFrame\n",
"%time df = DataFrame(data, columns=csv.columns)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"CPU times: user 32.2 s, sys: 4.5 s, total: 36.7 s\n",
"Wall time: 39.3 s\n"
]
}
],
"prompt_number": 22
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Execute split-apply-combine operation in Pandas\n",
"from blaze.compute.pandas import compute\n",
"%time compute(big_spenders, df)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"CPU times: user 4.91 s, sys: 2.23 s, total: 7.14 s\n",
"Wall time: 8.75 s\n"
]
},
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sender</th>\n",
" <th>value</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>10 </th>\n",
" <td> 11</td>\n",
" <td> 52461821.941658</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1373 </th>\n",
" <td> 1374</td>\n",
" <td> 23394277.034152</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24 </th>\n",
" <td> 25</td>\n",
" <td> 13178095.975724</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28 </th>\n",
" <td> 29</td>\n",
" <td> 5330179.983047</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12563</th>\n",
" <td> 12564</td>\n",
" <td> 3669712.399825</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows \u00d7 2 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 23,
"text": [
" sender value\n",
"10 11 52461821.941658\n",
"1373 1374 23394277.034152\n",
"24 25 13178095.975724\n",
"28 29 5330179.983047\n",
"12563 12564 3669712.399825\n",
"\n",
"[5 rows x 2 columns]"
]
}
],
"prompt_number": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For fair comparison lets run the Python streaming backend on the in-memory data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%time list(compute(big_spenders, data)) "
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"CPU times: user 45.1 s, sys: 1.52 s, total: 46.6 s\n",
"Wall time: 56.7 s\n"
]
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 24,
"text": [
"[(11L, 52461821.94165766),\n",
" (1374L, 23394277.034151807),\n",
" (25L, 13178095.975724494),\n",
" (29L, 5330179.983046564),\n",
" (12564L, 3669712.399824968)]"
]
}
],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas remains king for in-memory data analytics. (although our Python streaming backend hasn't yet been optimized)"
]
},
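{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a hand-written Pandas version of the `big_spenders` expression is a one-liner (a sketch, using `Series.nlargest` from a recent Pandas):\n",
"\n",
"```\n",
"top5 = df.groupby('sender')['value'].sum().nlargest(5).reset_index()\n",
"```"
]
},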
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Discussion\n",
"\n",
"How can this abstract approach be made useful? While neat it's not clear that Blaze offers anything on top of Pandas for in-memory analytics.\n",
"\n",
"My thoughts on the potential value of Blaze:\n",
"\n",
"1. The uniform data interface seems generally useful. I've\n",
" chatted with Hugo about using this behind Bokeh's server examples.\n",
" `blaze.data` is a decent plumber.\n",
"2. The dataframe-like syntax might help data scientists who need to use\n",
" systems like SQL or Spark, but who are more comfortable with DataFrame syntax\n",
"3. The symbolic expressions aid portability. Write one expression,\n",
" compute on many backends. This promotes development scalability.\n",
" You can test against a small dataset and the Python backend but execute against\n",
" HDFS and Spark\n",
"4. If popular, this common interface would help new backends rapildy gain a\n",
" trained userbase. I'll claim that writing hooks into Blaze is not-too-hard.\n",
"5. A clever team could do comparative computation, benchmarking different\n",
" backends against each other and then using the right tool for the job.\n",
"\n",
"Most of these seem interesting, few solve pressing concerns. What would make\n",
"Blaze more immediately useful for you?"
]
},
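{
"cell_type": "markdown",
"metadata": {},
"source": [
"To give a taste of point 5, comparative benchmarking falls out of the design almost for free because the same expression runs against every backend. A rough sketch reusing the pieces above:\n",
"\n",
"```\n",
"import time\n",
"from blaze.compute.python import compute as compute_python\n",
"from blaze.compute.sql import compute as compute_sql\n",
"\n",
"for name, fn, source in [('streaming Python', compute_python, csv),\n",
"                         ('SQLite', compute_sql, sql)]:\n",
"    start = time.time()\n",
"    list(fn(big_spenders, source))\n",
"    print('%s: %.1f seconds' % (name, time.time() - start))\n",
"```"
]
},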
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What can `blaze.data` do for you today?\n",
"\n",
"We don't yet recommend using `blaze.expr/compute`; these are still in\n",
"heavy flux. The `blaze.data` module is relatively stable however. It provides\n",
"the following:\n",
"\n",
"1. A uniform interface over disparate data\n",
"2. Datatype discovery that mostly works\n",
"3. Trivial migration between data formats\n",
"4. Parsing times that aren't terrible.\n",
"\n",
"It'll probably break on you, but we're pretty responsive these days."
]
},
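{
"cell_type": "markdown",
"metadata": {},
"source": [
"Point 3, migration, is just the `extend` pattern we used earlier: build a descriptor for the target and feed it another descriptor. A sketch reusing the `csv` descriptor from above (the database and table names here are made up):\n",
"\n",
"```\n",
"from blaze.data.sql import SQL\n",
"\n",
"target = SQL('sqlite:///copy.db', 'btc_copy', schema=csv.schema)\n",
"target.extend(csv)  # streams rows from the CSV into SQLite\n",
"```"
]
},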
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thanks to binstar-build (and Sean Ross-Ross) you can get a fully operational development Blaze build with \n",
"`conda install -c mwiebe -c mrocklin blaze`. It tracks our `reorg` branch."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Current state\n",
"\n",
"`blaze.data` - We support the following data formats:\n",
"\n",
"* CSV\n",
"* JSON\n",
"* HDF5\n",
"* SQL\n",
"\n",
"`blaze.expr` - We think about Table/DataFrame computations:\n",
"\n",
"* Mathematical operations\n",
"* Reductions\n",
"* Split-apply-combine\n",
"* Join\n",
"\n",
"`blaze.compute` - We're working on the following computational backends. They\n",
"are ordered in terms of maturity:\n",
"\n",
"* Streaming Python\n",
"* Pandas\n",
"* SQL\n",
"* PySpark\n",
"* DyND out-of-core\n"
]
}
],
"metadata": {}
}
]
}