One of the simplest configuration approaches in Python is to just use Python files,
giving you the full power of Python - the least hassle approach in a trusted environment.
However, importing config modules can be problematic in interactive environments.
For example, when using Jupyter notebooks organised into sub-folders,
we want to access a common config file in the overall project root.
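One way to reach a shared config file from a notebook in a sub-folder is to walk up the directory tree until the file is found, then import it by path. This is only a minimal sketch: the file name config.py, the module name project_config and both helper functions are assumptions made for illustration.

```python
import importlib.util
from pathlib import Path


def find_project_root(start=".", marker="config.py"):
    """Walk upwards from `start` until a folder containing `marker` is found."""
    start = Path(start).resolve()
    for folder in [start, *start.parents]:
        if (folder / marker).is_file():
            return folder
    raise FileNotFoundError(f"no {marker!r} found in {start} or any parent folder")


def load_config(start=".", marker="config.py"):
    """Import the config file found at or above `start` as a module object."""
    path = find_project_root(start, marker) / marker
    # "project_config" is a hypothetical module name used only for this sketch.
    spec = importlib.util.spec_from_file_location("project_config", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

In a notebook, calling load_config() with no arguments starts the search from the current working directory, which is usually the folder the notebook lives in.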
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels
from numba import jit
@jit(nogil=True, parallel=True)
def euclidean_distances_numba(X, Y=None, Y_norm_squared=None):
    # disable checks
    XX_ = (X * X).sum(axis=1)
# # we need a reference to the snippets package
# snippetsPackage = require(atom.packages.getLoadedPackage('autocomplete-snippets').path)
# # we need a reference to the original method we'll monkey patch
# __oldGetSnippets = snippetsPackage.getSnippets
# snippetsPackage.getSnippets = (editor) ->
# snippets =, editor)
# # we're only concerned by ruby files
import datetime
import shutil
import tempfile
import tarfile
from collections import namedtuple
from pathlib import Path
from enum import IntEnum
import numpy as np

This small subclass of the Pandas sqlalchemy-based SQL support for reading/storing tables uses the Postgres-specific "COPY FROM" method to insert large amounts of data into the database. It is much faster than using INSERT. To achieve this, the table is created in the normal way using sqlalchemy but no data is inserted. Instead the data is saved to a temporary CSV file (using Pandas' mature CSV support) then read back into Postgres using psycopg2's support for COPY FROM STDIN.
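The core of the COPY FROM STDIN step can be sketched without the Pandas subclass itself. The helper below is an illustration, not the code from the post: copy_rows is a hypothetical name, it takes plain row tuples rather than a DataFrame, and it uses an in-memory buffer instead of a temporary CSV file. It assumes cursor is a psycopg2 cursor, whose copy_expert method streams a file-like object to the server.

```python
import csv
import io


def copy_rows(cursor, table, columns, rows):
    """Bulk-load `rows` into `table` using COPY ... FROM STDIN instead of INSERT."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)  # serialise the rows as CSV in memory
    buf.seek(0)                      # rewind so the whole buffer is streamed
    # Table and column names are interpolated directly, so they must come
    # from a trusted source (this sketch does no quoting or validation).
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table, ", ".join(columns)
    )
    cursor.copy_expert(sql, buf)
```

With a DataFrame, the buffer could instead be filled with df.to_csv(buf, index=False, header=False) before the same copy_expert call.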


Easy parallel python with concurrent.futures

As of version 3.2, Python includes the very promising concurrent.futures module, with elegant context managers for running tasks concurrently. Thanks to the simple and consistent interface you can use both threads and processes with minimal effort.

For most CPU bound tasks - anything that is heavy number crunching - you want your program to use all the CPUs in your PC. The simplest way to get a CPU bound task to run in parallel is to use the ProcessPoolExecutor, which will create enough sub-processes to keep all your CPUs busy.

We use the context manager thusly:

with concurrent.futures.ProcessPoolExecutor() as executor:
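A complete, minimal version of this pattern might look like the sketch below. The names sum_of_squares and run_parallel are made up for illustration; the executor class is a parameter so that the same code can run with ThreadPoolExecutor or ProcessPoolExecutor.

```python
import concurrent.futures


def sum_of_squares(n):
    """A stand-in for a CPU-bound task."""
    return sum(i * i for i in range(n))


def run_parallel(func, items, executor_cls=concurrent.futures.ProcessPoolExecutor):
    """Apply `func` to every item using a pool; results keep the input order."""
    with executor_cls() as executor:
        # The pool is shut down automatically when the with-block exits.
        return list(executor.map(func, items))


if __name__ == "__main__":
    # The guard matters with processes: workers may re-import this module.
    print(run_parallel(sum_of_squares, [10_000, 20_000, 30_000]))
```

Passing executor_cls=concurrent.futures.ThreadPoolExecutor switches the same call to threads, which tends to suit I/O bound work better than number crunching.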
Collection of query wrappers / abstractions to both facilitate data
retrieval and to reduce dependency on DB-specific API.
from pandas.core.api import DataFrame
def _safe_fetch(cur):
    result = cur.fetchall()

A "virtualenv activate" for Anaconda environments

I've been using the Anaconda python package recently and found it to be a good way to get all the complex compiled libs you need for a scientific python environment. Even better, their conda tool lets you create environments much like virtualenv, but without having to re-compile stuff like numpy, which gets old very quickly with virtualenv and can be a nightmare to get correctly set up on OSX.

The only thing missing was an easy way to switch environments - their docs suggest running python executables from the install folder, which I find a bit of a pain. Coincidentally I came across this article - Virtualenv's bin/activate is Doing It Wrong - which describes a simple way to launch a sub-shell with certain environment variables set. Now simple was the key word for me since my bash-fu isn't very strong, but I managed to come up with the script below. Put this in a text file called conda-work