# Blaze, Bitcoin, and Bodacious Backends
**tl;dr: We've redesigned Blaze and need feedback.
We show off the three components of Blaze with simple examples from Bitcoin.
We ask how this might have value.**
*This blogpost is not for public consumption. If you've found it then great!
Please read and give feedback, but please don't tweet or repost.*
Over the last two months the Blaze team has worked hard to redesign Blaze.
During this time I've had some really useful conversations with people within Continuum
but outside the core development team (namely Ben Zaitlen, Ely Spears, and Hugo Shi), and their
experience has helped us to reprioritize and rethink our approach. Blaze is at a point
where I'd like to give you the demo I've given them. Hopefully this generates useful
feedback to direct the future of Blaze.
The question I have for you is the following:
*I think that what we have is cool and interesting. How can we also make it
more immediately useful to a broad audience?*
### Bitcoin - our example dataset
We'll demo Blaze with a simple dataset about transactions in Bitcoin, our favorite crypto-currency. Here is the top of a 1.6 GB CSV file holding Bitcoin transactions. The columns are as follows:
* Transaction ID
* Sender ID
* Recipient ID
* Datetime of transaction, e.g. `20130410142250` -> 2013-04-10 2:22:50 pm
* Number of bitcoins sent
$ head user_edges.txt
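The datetime column is stored as a compact `YYYYMMDDHHMMSS` string. As a minimal sketch of how such a value can be decoded with the Python standard library (the helper name `parse_timestamp` is made up for illustration and is not part of Blaze):

```python
from datetime import datetime


def parse_timestamp(raw: str) -> datetime:
    """Parse a compact YYYYMMDDHHMMSS string into a datetime (hypothetical helper)."""
    return datetime.strptime(raw, "%Y%m%d%H%M%S")


# The example value from the column list above
parsed = parse_timestamp("20130410142250")
print(parsed.isoformat(sep=" "))  # 2013-04-10 14:22:50
```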
The astute reader might ask questions like the following:
* "Wait, isn't Bitcoin anonymous? How are there User IDs?"
* "Why are there two transaction IDs for each transaction?"
Those have interesting answers which I'll discuss with you if you ask.
For now, on to Blaze!
### Blaze, a story of three parts
Reorganized Blaze is split into three core parts. We'll talk about each one:
1. `` - Provides a uniform indexable view into disparate
data sources. This piece is the most mature and is ready for internal use.
2. `blaze.expr` - A symbolic expression of DataFrame-like computations. (Think SymPy or Theano for Pandas/SQL)
3. `blaze.compute` - An interpreter that targets a variety of computational backends.
When we're ready we'll build a fourth piece:
* `blaze.interface` - Usable interfaces for data scientists. This will
include an interactive DataFrame, but could include other interfaces like
an SQL parser or Datalog engine.
### `` - uniform access to disparate data
Blaze operates on common storage formats including CSV, JSON,
HDF5, and SQL. For each format we offer the following functionality:
* Insert/pull off data in Python format
* Insert/pull off data in Binary/DyND/NumPy format
* Fancy indexing
The interface is the same even when the backend is different. Let's see this in action...
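To make the idea of "same interface, different backend" concrete, here is a rough standard-library sketch. It is not Blaze's API: the classes `Source`, `CSVSource`, and `JSONLinesSource`, their constructor arguments, and the sample rows are all invented for illustration. Both sources answer the same `source[rows, column]` index.

```python
import csv
import io
import json
from itertools import islice


class Source:
    """Hypothetical base class: subclasses yield rows as tuples."""

    columns = []

    def rows(self):
        raise NotImplementedError

    def __getitem__(self, key):
        # key is (row slice, column name); only non-negative slices are handled
        row_slice, column = key
        j = self.columns.index(column)
        selected = islice(self.rows(), row_slice.start, row_slice.stop, row_slice.step)
        return [row[j] for row in selected]


class CSVSource(Source):
    def __init__(self, text, columns, types):
        self.text, self.columns, self.types = text, columns, types

    def rows(self):
        # CSV fields arrive as strings, so coerce each one to its declared type
        for row in csv.reader(io.StringIO(self.text)):
            yield tuple(t(v) for t, v in zip(self.types, row))


class JSONLinesSource(Source):
    def __init__(self, text, columns):
        self.text, self.columns = text, columns

    def rows(self):
        # One JSON object per line; values are already typed
        for line in self.text.splitlines():
            if line.strip():
                record = json.loads(line)
                yield tuple(record[c] for c in self.columns)


# Invented sample data, stored in two formats
columns = ["transaction", "sender", "recipient", "amount"]
data = [(1, 2, 3, 0.5), (2, 3, 4, 1.25), (3, 2, 4, 2.0)]
csv_text = "\n".join(",".join(str(v) for v in row) for row in data)
json_text = "\n".join(json.dumps(dict(zip(columns, row))) for row in data)

csv_src = CSVSource(csv_text, columns, [int, int, int, float])
json_src = JSONLinesSource(json_text, columns)

# The same index expression works against either backend
print(csv_src[0:2, "amount"])
print(json_src[0:2, "amount"])
```

The point of the sketch is only the shape of the design: format-specific code lives in `rows()`, while indexing is written once.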
### Example - Basic Parsing and Type Coercion
We open up a `csv` file, tweak column names and types, and show off basic indexing.
# Discussion
How can this abstract approach be made useful? While neat, it's not clear that Blaze offers anything on top of Pandas for in-memory analytics.
My thoughts on the potential value of Blaze:
1. The uniform data interface seems generally useful. I've
chatted with Hugo about using this behind Bokeh's server examples.
`` is a decent plumber.
2. The dataframe-like syntax might help data scientists who need to use
systems like SQL or Spark, but who are more comfortable with DataFrame syntax.
3. The symbolic expressions aid portability. Write one expression,
compute on many backends. This promotes development scalability.
You can test against a small dataset with the Python backend, then execute against
HDFS and Spark.
4. If popular, this common interface would help new backends rapidly gain a
trained userbase. I'll claim that writing hooks into Blaze is not-too-hard.
5. A clever team could do comparative computation, benchmarking different
backends against each other and then using the right tool for the job.
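The "write one expression, compute on many backends" idea in point 3 can be sketched in a few lines. This is a toy under stated assumptions, not `blaze.expr` or `blaze.compute`: the classes `Table`, `Column`, `Gt`, `Sum` and the functions `compute_python` and `compute_sql` are invented here, as are the sample rows. One symbolic expression is interpreted once over plain Python tuples and once as a SQL query against SQLite.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    columns: tuple


@dataclass(frozen=True)
class Column:
    table: Table
    name: str

    def __gt__(self, value):
        # Build a symbolic predicate instead of comparing anything now
        return Gt(self, value)


@dataclass(frozen=True)
class Gt:
    column: Column
    value: float


@dataclass(frozen=True)
class Sum:
    column: Column
    where: Gt


def compute_python(expr, rows):
    """Interpret the expression over an iterable of tuples."""
    cols = expr.column.table.columns
    target = cols.index(expr.column.name)
    test = cols.index(expr.where.column.name)
    return sum(r[target] for r in rows if r[test] > expr.where.value)


def compute_sql(expr, conn):
    """Translate the same expression to SQL and run it on a DB-API connection."""
    query = "SELECT SUM({}) FROM {} WHERE {} > ?".format(
        expr.column.name, expr.column.table.name, expr.where.column.name
    )
    return conn.execute(query, (expr.where.value,)).fetchone()[0]


# One expression: total amount over transactions larger than 1.0
t = Table("transactions", ("sender", "recipient", "amount"))
amount = Column(t, "amount")
expr = Sum(amount, where=amount > 1.0)

# Invented sample rows, loaded into both backends
rows = [(1, 2, 0.5), (2, 3, 1.25), (3, 2, 2.0)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (sender INTEGER, recipient INTEGER, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

print(compute_python(expr, rows))  # 3.25
print(compute_sql(expr, conn))     # 3.25
```

Testing against the small in-memory backend and then running the unchanged expression elsewhere is the development workflow point 3 describes.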
Most of these seem interesting, but few solve pressing concerns. What would make
Blaze more immediately useful for you?
### What can `` do for you today?
We don't yet recommend using `blaze.expr/compute`; these are still in
heavy flux. The `` module is relatively stable, however. It provides
the following:
1. A uniform interface over disparate data
2. Datatype discovery that mostly works
3. Trivial migration between data formats
4. Parsing times that aren't terrible
It'll probably break on you, but we're pretty responsive these days.
Thanks to binstar-build (and Sean Ross-Ross) you can get a fully operational development Blaze build with
`conda install -c mwiebe -c mrocklin blaze`. It tracks our `reorg` branch.
### Current state
`` - We support the following data formats:
* CSV
* JSON
* HDF5
* SQL
`blaze.expr` - We think about Table/DataFrame computations:
* Mathematical operations
* Reductions
* Split-apply-combine
* Join
`blaze.compute` - We're working on the following computational backends. They
are ordered in terms of maturity:
* Streaming Python
* Pandas
* SQL
* PySpark
* DyND out-of-core
