
@szeitlin
Last active January 6, 2016 21:15
notes from Wes McKinney's talk at LinkedIn, October 22, 2015

A few notes about Wes:

  • MIT, math, 2007
  • AQR, 2007-2010
  • created pandas in 2008
  • dropped out of a stats PhD program :)

Problem: Python Scalability

"The larger the data, the less sophisticated the analysis"

NumPy is ~10 years old

Now everyone needs to be able to use JSON for passing data around

pandas is mostly for flat data

pandas is sort of its own DSL

--> add a figure here to show the decoupling (he said he would share his slides)

Hadoop systems can now handle nested/complex data structures (like JSON)

ultimately we want better ways to analyze JSON without having to worry about flattening

hard part is to infer the schema
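To make the flattening and schema-inference points concrete, here is a minimal sketch using only the Python standard library. The `flatten` and `infer_schema` helpers and the sample records are made up for illustration; they were not shown in the talk, and real schema inference also has to deal with lists of records, conflicting types, and much larger data.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one dict with dotted keys (lists stay as values)."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def infer_schema(records):
    """Map each flattened column name to the set of Python type names seen for it."""
    schema = {}
    for record in records:
        for name, value in flatten(record).items():
            schema.setdefault(name, set()).add(type(value).__name__)
    return schema


# Hypothetical sample data: fields are missing or null in some records.
raw = """[
  {"user": {"id": 1, "name": "a"}, "score": 3.5},
  {"user": {"id": 2}, "score": null, "tags": ["x", "y"]}
]"""
records = json.loads(raw)
flat_rows = [flatten(r) for r in records]
schema = infer_schema(records)
```

Even in this toy case the schema is only known after scanning every record: `user.name` and `tags` each appear in one record only, and `score` is sometimes a float and sometimes null.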

Ibis:

Open source, available with Apache license

"python on hadoop"

idea is to remove SQL coding from workflows (!) and make it feel more like pandas

The Ibis DSL will provide pandas-like composition of SELECT statements

can do exists and sub-queries
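A rough sketch of what "pandas-like composition of SELECT statements" could look like. This is NOT the Ibis API: the `Query` class and its methods are hypothetical, written only to show the idea of building a query from small reusable pieces and compiling it to SQL at the end. It runs against an in-memory SQLite database from the standard library.

```python
import sqlite3


class Query:
    """Hypothetical immutable SELECT builder (illustration only, not Ibis)."""

    def __init__(self, table, columns=("*",), filters=()):
        self.table = table
        self.columns = tuple(columns)
        self.filters = tuple(filters)

    def select(self, *columns):
        # Return a new query with a different projection.
        return Query(self.table, columns, self.filters)

    def where(self, condition):
        # Return a new query with one more filter ANDed on.
        return Query(self.table, self.columns, self.filters + (condition,))

    def compile(self):
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({f})" for f in self.filters)
        return sql

    def execute(self, conn):
        return conn.execute(self.compile()).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, kind TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "buy", 10.0), (1, "view", 0.0), (2, "buy", 25.0)],
)

base = Query("events")
buys = base.where("kind = 'buy'")  # reusable intermediate expression
big_buys = buys.where("amount > 20").select("user_id", "amount")
sql = big_buys.compile()
result = big_buys.execute(conn)
```

Each step returns a new object, so `base` and `buys` can be reused in other queries. The conditions here are raw SQL strings, which keeps the sketch short but is unsafe for untrusted input; a real DSL would build them from typed expressions.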

  • currently supports Impala and SQLite
  • he's looking for contributors to add other engines; Postgres, Redshift, and Vertica are all high-priority (they share a common SQL dialect)
  • a lot of the work is mapping function names, because all the dialects are different (+ test coverage)
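The function-name mapping work might look something like the sketch below. The table layout, the `render_call` helper, and the generic operation names are hypothetical, not taken from Ibis; the two operations were picked because their names really do differ between engines, but check each engine's documentation before relying on them.

```python
# Generic operation name -> function name in each engine (illustrative subset).
FUNCTION_NAMES = {
    "sqlite": {"random": "random", "if_null": "ifnull"},
    "postgres": {"random": "random", "if_null": "coalesce"},
    "impala": {"random": "rand", "if_null": "ifnull"},
}


def render_call(dialect, operation, *args):
    """Render a function call as SQL text for the given dialect."""
    try:
        name = FUNCTION_NAMES[dialect][operation]
    except KeyError:
        raise NotImplementedError(
            f"{operation!r} is not mapped for {dialect!r}"
        ) from None
    return f"{name}({', '.join(args)})"


# The "test coverage" part: every dialect should map every operation.
all_operations = set().union(*(ops.keys() for ops in FUNCTION_NAMES.values()))
missing = {
    dialect: sorted(all_operations - ops.keys())
    for dialect, ops in FUNCTION_NAMES.items()
    if all_operations - ops.keys()
}
```

`missing` is empty when every engine covers every operation, which is the kind of check that makes adding a new backend mostly mechanical.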

http://docs.ibis-project.org/sql.html

the Ibis blog includes both development info and example use cases

ex. Ibis datatable.execute() --> pandas dataframe

expr.__repr__() gives you the SQL it ran

nunique() is equivalent to COUNT DISTINCT

mutate() lets you add columns to a table
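For reference, the SQL that these two operations correspond to can be checked with the standard library's `sqlite3` module. This does not use Ibis; the `orders` table and its columns are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, price REAL, qty INTEGER)")
orders = [("ann", 2.0, 3), ("bob", 5.0, 1), ("ann", 1.5, 4)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)

# nunique() on a column corresponds to COUNT(DISTINCT column).
(distinct_customers,) = conn.execute(
    "SELECT COUNT(DISTINCT customer) FROM orders"
).fetchone()

# Adding a column, mutate()-style, corresponds to selecting one more expression.
with_total = conn.execute(
    "SELECT customer, price, qty, price * qty AS total "
    "FROM orders ORDER BY rowid"
).fetchall()
```

`distinct_customers` matches `len({row[0] for row in orders})`, and each row in `with_total` carries the extra `total` value computed by the database rather than in Python.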

Performance concerns:

  • serialization overhead
  • scalar vs. vectorized
  • remote procedure calls (RPCs)

optimized cpython is >150x faster than pure Python on an array size of ~2000

NumPy alone is >50x faster
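A small way to see the scalar vs. vectorized gap using only the standard library. The built-in `sum` stands in here for a compiled or vectorized routine (that substitution is an assumption of this sketch, not something from the talk), so the ratio will not match the figures above and will vary by machine.

```python
import timeit

values = list(range(2000))  # same order of magnitude as the array size above


def scalar_sum(xs):
    total = 0
    for x in xs:  # the interpreter handles every element individually
        total += x
    return total


def bulk_sum(xs):
    return sum(xs)  # one call; the loop itself runs in C


scalar_time = timeit.timeit(lambda: scalar_sum(values), number=2000)
bulk_time = timeit.timeit(lambda: bulk_sum(values), number=2000)
print(f"scalar loop: {scalar_time:.4f}s, built-in sum: {bulk_time:.4f}s")
```

Both functions return the same result; the only difference is whether the per-element work happens in interpreted Python or inside a single call to compiled code.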

  • there is currently no standard data format for in-memory materialized data files/RPC
  • He's working on a project to create a standard in-memory columnar format, which needs a name (IMC?) before it's released. The idea is that it's a C/C++ implementation for use in Python/R/Julia
  • Want to be able to share data without serialization
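As a toy illustration of sharing data without serialization, the sketch below contrasts a zero-copy view with a pickle round trip, using only the standard library. It is not the columnar format mentioned above; `array` and `memoryview` just show the underlying idea of several consumers reading one contiguous buffer.

```python
import pickle
from array import array

# A "column" stored as one contiguous buffer of C doubles.
prices = array("d", [1.0, 2.0, 3.0, 4.0])

# Zero-copy: a memoryview exposes the same bytes through the buffer protocol.
view = memoryview(prices)
window = view[1:3]  # slicing a memoryview does not copy either
window[0] = 20.0  # writes through to the original array

# Serialization: a pickle round trip produces an independent copy.
copied = pickle.loads(pickle.dumps(prices))
copied[0] = -1.0  # does not affect the original
```

After this runs, `prices` reflects the write made through `window` but not the write made to `copied`. The copy is what serialization costs: every byte is encoded and decoded again, which a shared in-memory layout avoids.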