A few notes about Wes:
MIT math, 2007; worked at AQR 2007-2010, where he created pandas in 2008; later dropped out of a stats PhD program :)
Problem: Python Scalability
"The larger the data, the less sophisticated the analysis"
NumPy is ~10 years old
Now everyone needs to be able to use JSON for passing data around
pandas is mostly for flat data
pandas is sort of its own DSL
--> add a figure here to show the decoupling (he said he would share his slides)
Hadoop systems can now handle nested/complex data structures (like JSON)
ultimately we want better ways to analyze JSON without having to worry about flattening
hard part is to infer the schema
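A minimal sketch of why schema inference from JSON is the hard part, using only the stdlib (the `infer_schema` helper is hypothetical, for illustration; real systems must also handle nesting, nulls, and promotion rules):

```python
import json

def infer_schema(records):
    """Naively map each field to a Python type name across JSON records.
    Hypothetical helper: fields can be missing or conflict across records,
    which is exactly what makes inference hard in practice."""
    schema = {}
    for rec in records:
        for key, value in rec.items():
            tname = type(value).__name__
            # flag conflicts instead of silently picking one type
            if key in schema and schema[key] != tname:
                schema[key] = "conflict"
            else:
                schema.setdefault(key, tname)
    return schema

lines = ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b", "score": 3.5}']
records = [json.loads(line) for line in lines]
print(infer_schema(records))  # {'id': 'int', 'name': 'str', 'score': 'float'}
```

Note that `score` only appears in one record — a real engine has to decide whether that field is optional, nullable, or an error.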
Ibis:
Open source, available with Apache license
"python on hadoop"
idea is to remove SQL coding from workflows (!) and make it feel more like pandas
The Ibis DSL will provide pandas-like composition of SELECT statements
can do exists and sub-queries
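To make the "exists and sub-queries" point concrete, here is the kind of correlated EXISTS sub-query an Ibis expression can compile down to, shown in raw SQL against stdlib SQLite (table and column names are made up for the example):

```python
import sqlite3

# In-memory SQLite database with two hypothetical tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (1, 9.99);
""")

# A correlated EXISTS sub-query: customers that have at least one order.
rows = con.execute("""
    SELECT name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall()
print(rows)  # [('ann',)]
```

The point of the DSL is that you compose this from pandas-like expressions instead of writing the SQL string yourself.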
- currently supports Impala and SQLite
- he's looking for contributors to add other engines; e.g. Postgres, Redshift, and Vertica are all high priority (they share a common SQL dialect)
- a lot of the work is just mapping function names, since the dialects all differ (+ test coverage)
http://docs.ibis-project.org/sql.html
the Ibis blog includes both development info and example use cases
ex. calling .execute() on an Ibis table expression --> pandas DataFrame
repr(expr) (its __repr__) shows the SQL it ran
nunique() is equivalent to COUNT DISTINCT
mutate() lets you add columns to a table
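The SQL equivalents of those two operations can be sketched directly in stdlib SQLite (a toy table, just to show what `nunique()` and `mutate()` correspond to on the SQL side):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (city TEXT, pop INTEGER);
    INSERT INTO t VALUES ('nyc', 8), ('sf', 1), ('nyc', 8);
""")

# nunique() on a column corresponds to COUNT(DISTINCT ...):
(n,) = con.execute("SELECT COUNT(DISTINCT city) FROM t").fetchone()
print(n)  # 2

# mutate() adds a derived column, like SELECT *, <expr> AS <name>:
rows = con.execute("SELECT *, pop * 2 AS pop_doubled FROM t").fetchall()
print(rows[0])  # ('nyc', 8, 16)
```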
Performance concerns:
- serialization overhead
- scalar vs. vectorized
- remote procedure calls (RPCs)
- optimized Cython is >150x faster than pure Python on an array of ~2000 elements; NumPy alone is >50x faster
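The scalar-vs-vectorized gap is easy to demonstrate; the speedup figures above are the speaker's, this sketch just shows the two styles computing the same result (sum of squares over ~2000 elements):

```python
import numpy as np

data = np.arange(2000, dtype=np.float64)  # ~2000 elements, as in the talk

# scalar path: one interpreted Python operation per element
scalar_sum = 0.0
for x in data:
    scalar_sum += x * x

# vectorized path: a single NumPy call over the whole array
vector_sum = float(np.dot(data, data))

print(scalar_sum == vector_sum)  # True
```

Timing the two loops (e.g. with `timeit`) is what produces the order-of-magnitude differences he quotes.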
- there is currently no standard data format for in-memory materialized data files/RPC
- he's working on a project to create a standard in-memory columnar format, which needs a name (IMC?) before it's released; the idea is a C/C++ implementation for use in Python/R/Julia
- Want to be able to share data without serialization
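The "no serialization" goal can be illustrated in miniature with Python's buffer protocol — a sketch of the idea, not the actual format he described:

```python
import array

# A contiguous numeric buffer (stand-in for one column of an
# in-memory columnar format).
col = array.array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview exposes the same bytes with no copy: a consumer can read
# or slice the data without any serialize/deserialize step.
view = memoryview(col)
half = view[2:]          # zero-copy slice
col[2] = 99.0            # writes go through the original buffer...
print(half[0])           # 99.0 -- ...and are visible through the view
```

A shared standard layout would extend this zero-copy property across processes and across Python/R/Julia, which is exactly what per-system serialization formats prevent today.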