@ezyang
Created April 6, 2022 16:53

How to record data from Python fast, if pickle is too slow

  • JSON xxxxxx
    • jq is pretty fast
    • Serde for Rust level
    • NB: use JSON Lines
  • CSV xxxx
  • use pickle anyway xxx
    • plus compression
    • python-pickle in Haskell
    • stream your processing so you don’t load it all into memory
  • SQLite xxx
    • use batched inserts, journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY, and a large mmap_size
  • HDF5 xxx
    • but it’s tabular
  • Arrow xxx
  • Parquet xx
    • but it’s tabular
  • SSTables xx
    • for minimizing disk size
  • DuckDB x
  • pandas x
  • Vaex x
  • Zarr x
  • NumPy x
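
The SQLite settings listed above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not a tuned setup: the `records` table, its columns, the helper names `open_fast_db` and `record_batch`, and the 256 MiB `mmap_size` are all assumptions made for the example.

```python
import sqlite3


def open_fast_db(path):
    """Open a SQLite database with write-friendly pragmas (hypothetical helper)."""
    conn = sqlite3.connect(path)
    # Write-ahead logging: appends to a log instead of rewriting pages in place.
    conn.execute("PRAGMA journal_mode=WAL")
    # Fewer fsyncs than FULL; usually a reasonable trade-off together with WAL.
    conn.execute("PRAGMA synchronous=NORMAL")
    # Keep temporary tables and indices in memory.
    conn.execute("PRAGMA temp_store=MEMORY")
    # Memory-map up to 256 MiB of the database file (arbitrary example size).
    conn.execute("PRAGMA mmap_size=268435456")
    # Example schema, assumed for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (step INTEGER, name TEXT, value REAL)"
    )
    return conn


def record_batch(conn, rows):
    """Insert many rows inside a single transaction (hypothetical helper)."""
    with conn:  # commits once at the end of the block, rolls back on error
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
```

Buffering rows in a Python list and calling `record_batch` every few thousand rows keeps the number of transactions small, which is usually where most of the insert time goes.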

Side note: consider using compression!
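
As one way to combine the JSON Lines, compression, and streaming suggestions, here is a small sketch that writes one JSON object per line into a gzip file and reads it back lazily. The function names `write_jsonl` and `read_jsonl` are made up for this example, and gzip is just one stdlib choice of codec.

```python
import gzip
import json


def write_jsonl(path, records):
    """Write an iterable of JSON-serializable records, one per line, gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec))
            f.write("\n")


def read_jsonl(path):
    """Yield records one at a time so the whole file is never held in memory."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)
```

Because `read_jsonl` is a generator, downstream processing can consume records as they are decompressed, for example `sum(r["loss"] for r in read_jsonl("log.jsonl.gz"))` with a hypothetical `loss` field.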
