When to use a specific tool
A brief list of scenarios that call for a specific tool.
If you have (larger-than-memory) petabytes of JSON/XML/CSV files, a simple workflow, and a thousand-node cluster
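The scenario above points at a cluster framework such as Spark or Hadoop (those names are my assumption; the source does not name the tool). The "simple workflow" typically means a map/filter/reduce pipeline over records; a single-machine, stdlib-only sketch of that shape:

```python
import json
from functools import reduce

# Hypothetical records, standing in for petabytes of JSON files
# spread across a cluster.
lines = [json.dumps({"user": u, "bytes": b})
         for u, b in [("a", 10), ("b", 20), ("a", 5)]]

# Map -> filter -> reduce: the shape of a simple cluster workflow,
# which frameworks parallelize across thousands of nodes.
records = map(json.loads, lines)
big = filter(lambda r: r["bytes"] >= 10, records)
total = reduce(lambda acc, r: acc + r["bytes"], big, 0)

print(total)  # 30
```

The point of the sketch is the shape, not the scale: each stage is stateless per record, which is what lets a cluster framework split the work.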
If you have (larger-than-memory) tens to thousands of gigabytes of binary or numeric data (e.g., HDF5, netCDF4, CSV.gz), complex algorithms, and a single large multi-core workstation
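The scenario above is the classic out-of-core, single-machine case (a tool such as Dask is a common choice here, though that name is my assumption). A minimal stdlib sketch of the underlying idea, streaming data in fixed-size chunks so memory use stays bounded regardless of file size:

```python
import csv
import io

# Hypothetical in-memory stand-in for a larger-than-memory CSV file.
big_csv = io.StringIO("x\n" + "\n".join(str(i) for i in range(10_000)))

total = 0
chunk = []
for row in csv.DictReader(big_csv):
    chunk.append(int(row["x"]))
    if len(chunk) == 1_000:   # process one bounded chunk at a time
        total += sum(chunk)
        chunk.clear()
total += sum(chunk)           # leftover partial chunk

print(total)  # 49995000, the sum of 0..9999
```

Dedicated tools add what the sketch lacks: chunking across many files, scheduling chunks onto all cores of the workstation, and spilling intermediates to disk.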
If you have (larger-than-memory or not) less than a terabyte of content and one writer at a time; if you need local (on-disk) data storage (permanent or temporary) for an individual application or device; if you need to query/analyze a large dataset of text files such as CSV/XML (off-memory); if you want to stick to the standard library (it is built into Python)
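The built-in option referred to above is presumably the sqlite3 module from the Python standard library. A minimal sketch of on-disk storage plus declarative querying (the table and column names are illustrative; `:memory:` keeps the sketch self-contained, whereas a real application would pass a file path for persistent storage):

```python
import sqlite3

# Local data storage for an individual application: one writer at a
# time, but queries/aggregations are cheap and declarative.
conn = sqlite3.connect(":memory:")  # use a file path for on-disk persistence
conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 5.0)],
)

rows = conn.execute(
    "SELECT sensor, AVG(value) FROM measurements "
    "GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)  # [('a', 2.0), ('b', 5.0)]
```

For the CSV/XML analysis case, rows parsed from text files can be bulk-loaded with `executemany` and then queried off-memory exactly as above.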
If you have lots of data coming in very quickly (from different locations), of heterogeneous types (schema-less), and many terabytes or petabytes in size; if you need multiple servers / a distributed system (with the potential to expand in the future); if you need constant availability (fault tolerance) while keeping operations simple
If your code will be deployed by others and you will distribute a package with the optimized extensions; if you need to accelerate code that uses advanced Python features (e.g., list, dict, recursion, array allocation); if you need to directly call C; if your function operates on a pre-defined (fixed) number of dimensions
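The tool usually meant by "distributing a package with the optimized extensions" is Cython, though that name is my assumption. For the "directly call C" part specifically, the standard library's ctypes is the most lightweight route; a minimal sketch calling `sqrt` from the C math library (library lookup is platform-dependent):

```python
import ctypes
import ctypes.util

# Locate and load the C math library; the resolved path varies by OS.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature explicitly: double sqrt(double).
# Without this, ctypes would default to int arguments/return values.
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))  # 1.4142135623730951
```

ctypes trades speed for zero build step; a compiled extension (the Cython route) is the choice when the call overhead itself must be optimized away and you are prepared to ship built wheels.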
If you don’t need to distribute your code beyond your computer or your team (especially if you use Conda); if you need to accelerate code that uses scalars or (N-dimensional) arrays; if you want to write functions that work (automatically) on N-dimensional arrays
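The tool hinted at above is likely a JIT compiler such as Numba (my assumption), whose decorators turn a scalar function into one that runs on whole N-dimensional arrays. Since such a compiler may not be installed, here is a stdlib-only sketch of just the "write a scalar function once, apply it at any depth" idea, using nested lists in place of arrays:

```python
def vectorize(scalar_fn):
    # Hypothetical toy decorator: apply a scalar function elementwise
    # to arbitrarily nested lists. A real JIT decorator additionally
    # compiles the scalar body to machine code.
    def wrapper(x):
        if isinstance(x, list):
            return [wrapper(item) for item in x]
        return scalar_fn(x)
    return wrapper

@vectorize
def square(x):
    # Written purely in terms of one scalar value.
    return x * x

print(square(3))                  # 9
print(square([[1, 2], [3, 4]]))   # [[1, 4], [9, 16]]
```

The appeal of this style is that the accelerated function's author never writes the loops: the decorator supplies the N-dimensional traversal.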