
When to use a specific tool

A brief list of scenarios that call for specific tools.

If you have (larger-than-memory) petabytes of JSON/XML/CSV files, a simple workflow, and a thousand-node cluster

If you have (larger-than-memory) 10s-1000s of gigabytes of binary or numeric data (e.g., HDF5, netCDF4, CSV.gz), complex algorithms, and a (single) large multi-core workstation

If you have (larger-than-memory or not) less than a terabyte of content and one writer at a time; if you need local (on-disk) data storage (permanent or temporary) for an individual application or device; if you need to query/analyze a large dataset of text files, CSV/XML, off-memory; or if you want to stick to the standard library (built into Python)
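The standard-library option this entry describes matches Python's built-in sqlite3 module. A minimal sketch of single-writer, on-disk storage (the table and data are illustrative; ":memory:" keeps the example self-contained, and a real application would pass a file path instead):

```python
# Sketch: local data storage with Python's built-in sqlite3 module,
# one writer at a time. ":memory:" makes the example self-contained;
# pass a file path for permanent on-disk storage.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [("a",), ("b",), ("c",)],
)
conn.commit()

count, = conn.execute("SELECT COUNT(*) FROM events").fetchone()
print(count)  # 3
conn.close()
```

Because the database is a single file managed in-process, there is no server to run, which is exactly why it fits individual applications and devices but not multiple concurrent writers.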

MongoDB, PostgreSQL
If you have (larger-than-memory) a terabyte or less of JSON/XML/CSV, if you have multiple writers at a time, or if you need/want a client/server scheme

If you have lots of data coming in very quickly (from different locations), of heterogeneous types (schema-less), many terabytes or petabytes in size; if you need multiple servers/a distributed system (with room to expand in the future); if you need constant availability (fault tolerance); and yet want to keep things simple

If your code will be deployed by others and you will distribute a package with the optimized extensions; if you need to accelerate code that uses advanced Python features (e.g., lists, dicts, recursion, array allocation); if you need to call C directly; or if your function operates on a pre-defined (fixed) number of dimensions
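The source does not name the tool here; Cython is one common choice that fits this description (ahead-of-time compiled extensions you ship inside a package, static types, direct C access). A sketch, with an illustrative module and function, of a typed function over a fixed (1-D) layout:

```cython
# mean1d.pyx -- illustrative Cython sketch: static C types and a
# fixed-dimension (1-D) memoryview, compiled ahead of time into an
# extension module that a package can distribute.
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def mean1d(double[:] x):
    cdef Py_ssize_t i, n = x.shape[0]
    cdef double total = 0.0
    for i in range(n):
        total += x[i]
    return total / n
```

The .pyx file is compiled at build time (e.g., via setuptools with `cythonize`), so end users install a ready-made binary extension and never need the compiler toolchain themselves.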

If you don’t need to distribute your code beyond your computer or your team (especially if you use Conda); if you need to accelerate code that uses scalars or (N-dimensional) arrays; or if you want to write functions that work (automatically) on N-dimensional arrays
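This description matches just-in-time compilers such as Numba (an assumption; the source does not name one), which compile plain Python numeric functions at call time, so there is no extension module to build or distribute. A sketch with a pure-Python fallback so it runs even where the JIT is not installed; the function itself is illustrative:

```python
# Sketch: JIT-accelerating a scalar/array loop. The decorator is
# applied only if Numba is available; otherwise the plain Python
# function is used unchanged (same results, no speedup).
try:
    from numba import njit  # optional dependency (e.g., via Conda)
except ImportError:
    def njit(func):
        return func  # no-op fallback: run as ordinary Python

@njit
def dot(a, b):
    # Explicit loop: slow in pure Python, fast once JIT-compiled.
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

print(dot([1.0, 2.0], [3.0, 4.0]))  # 11.0
```

Because compilation happens on the user's machine at first call, this approach trades distribution convenience (no build step, but a runtime dependency) against the packaged-extension workflow described in the previous entry.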
