A few notes about Wes:
MIT math, 2007; worked at AQR 2007-2010, where he created pandas in 2008; later dropped out of a stats PhD program :)
Problem: Python Scalability
"The larger the data, the less sophisticated the analysis"
NumPy is ~10 years old
Now everyone needs to be able to use JSON for passing data around
pandas is mostly for flat data
pandas is sort of its own DSL
--> add a figure here to show the decoupling (he said he would share his slides)
Hadoop systems can now handle nested/complex data structures (like JSON)
ultimately we want better ways to analyze JSON without having to worry about flattening
hard part is to infer the schema
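A minimal sketch of why schema inference from JSON is the hard part, using only the stdlib (the `infer_schema` helper is hypothetical, for illustration; real systems must also handle nesting, nulls, and promotion rules):

```python
import json

def infer_schema(records):
    """Naively map each field to a Python type name across JSON records.
    Hypothetical helper: fields can be missing or conflict across records,
    which is exactly what makes inference hard in practice."""
    schema = {}
    for rec in records:
        for key, value in rec.items():
            tname = type(value).__name__
            # flag conflicts instead of silently picking one type
            if key in schema and schema[key] != tname:
                schema[key] = "conflict"
            else:
                schema.setdefault(key, tname)
    return schema

lines = ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b", "score": 3.5}']
records = [json.loads(line) for line in lines]
print(infer_schema(records))  # {'id': 'int', 'name': 'str', 'score': 'float'}
```

Note that `score` only appears in one record — a real engine has to decide whether that field is optional, nullable, or an error.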
Ibis:
Open source, available with Apache license
"python on hadoop"
idea is to remove SQL coding from workflows (!) and make it feel more like pandas
The Ibis DSL will provide pandas-like composition of SELECT statements
can do exists and sub-queries
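To make the "exists and sub-queries" point concrete, here is the kind of correlated EXISTS sub-query an Ibis expression can compile down to, shown in raw SQL against stdlib SQLite (table and column names are made up for the example):

```python
import sqlite3

# In-memory SQLite database with two hypothetical tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (1, 9.99);
""")

# A correlated EXISTS sub-query: customers that have at least one order.
rows = con.execute("""
    SELECT name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall()
print(rows)  # [('ann',)]
```

The point of the DSL is that you compose this from pandas-like expressions instead of writing the SQL string yourself.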
- currently supports Impala and SQLite
- he's looking for contributors to add other engines; e.g. Postgres, Redshift, and Vertica are all high priority (they share a common SQL dialect)
- a lot of the work is just mapping function names, since the dialects all differ (+ test coverage)
http://docs.ibis-project.org/sql.html
the Ibis blog includes both development info and example use cases
ex. calling .execute() on an Ibis table expression --> pandas DataFrame
repr(expr) (its __repr__) shows the SQL it ran
nunique() is equivalent to COUNT DISTINCT
mutate() lets you add columns to a table
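The SQL equivalents of those two operations can be sketched directly in stdlib SQLite (a toy table, just to show what `nunique()` and `mutate()` correspond to on the SQL side):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (city TEXT, pop INTEGER);
    INSERT INTO t VALUES ('nyc', 8), ('sf', 1), ('nyc', 8);
""")

# nunique() on a column corresponds to COUNT(DISTINCT ...):
(n,) = con.execute("SELECT COUNT(DISTINCT city) FROM t").fetchone()
print(n)  # 2

# mutate() adds a derived column, like SELECT *, <expr> AS <name>:
rows = con.execute("SELECT *, pop * 2 AS pop_doubled FROM t").fetchall()
print(rows[0])  # ('nyc', 8, 16)
```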
Performance concerns:
- serialization overhead
- scalar vs. vectorized
- remote procedure calls (RPCs)
- optimized Cython is >150x faster than pure Python on an array of ~2000 elements; NumPy alone is >50x faster
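The scalar-vs-vectorized gap is easy to demonstrate; the speedup figures above are the speaker's, this sketch just shows the two styles computing the same result (sum of squares over ~2000 elements):

```python
import numpy as np

data = np.arange(2000, dtype=np.float64)  # ~2000 elements, as in the talk

# scalar path: one interpreted Python operation per element
scalar_sum = 0.0
for x in data:
    scalar_sum += x * x

# vectorized path: a single NumPy call over the whole array
vector_sum = float(np.dot(data, data))

print(scalar_sum == vector_sum)  # True
```

Timing the two loops (e.g. with `timeit`) is what produces the order-of-magnitude differences he quotes.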
- there is currently no standard data format for in-memory materialized data files/RPC
- he's working on a project to create a standard in-memory columnar format, which needs a name (IMC?) before it's released; the idea is a C/C++ implementation for use in Python/R/Julia
- Want to be able to share data without serialization
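The "no serialization" goal can be illustrated in miniature with Python's buffer protocol — a sketch of the idea, not the actual format he described:

```python
import array

# A contiguous numeric buffer (stand-in for one column of an
# in-memory columnar format).
col = array.array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview exposes the same bytes with no copy: a consumer can read
# or slice the data without any serialize/deserialize step.
view = memoryview(col)
half = view[2:]          # zero-copy slice
col[2] = 99.0            # writes go through the original buffer...
print(half[0])           # 99.0 -- ...and are visible through the view
```

A shared standard layout would extend this zero-copy property across processes and across Python/R/Julia, which is exactly what per-system serialization formats prevent today.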