Skip to content

Instantly share code, notes, and snippets.

@mrocklin
Created January 1, 2017 18:06
Show Gist options
  • Save mrocklin/738d5b9fd729d13098e1a3b9a02c7e82 to your computer and use it in GitHub Desktop.
Save mrocklin/738d5b9fd729d13098e1a3b9a02c7e82 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Neuro Imaging Data\n",
"\n",
"Email from last night from colleague at the NIH:\n",
"\n",
"Electron microscopy is probably generating the biggest ndarray datasets in the field - terabytes regularly. Neuroscience need EM to see connections between neurons because the critical features of neural synapses (connections) are below the diffraction limit of light microscopes. The hard part is machine vision on the data to follow small neuron parts from one slice to the next.\n",
"This type of research has been called \"connectomics\". \n",
"\n",
"This data is from drosophila:\n",
"http://emdata.janelia.org/\n",
"\n",
"Here is an example 2d slice of the data. If you take the below URL and change the last number you can get about 5000 different images.\n",
"http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_5000\n",
"\n",
"The data is a bit sparse (many black pixels) but it's around a 25GB dataset if you pull down all the slices. And it's a 3d ndarray.*\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Quick inpsection with Scikit Image and Requests\n",
"\n",
"We see that some images are strictly black, others show an oval of activity."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import skimage.io\n",
"import requests\n",
"%matplotlib inline\n",
"\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sample = a = skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % 5000)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"a = skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % 0)\n",
"b = skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % 2000)\n",
"c = skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % 7000)\n",
"d = skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % 10000)\n",
"\n",
"fig, axarr = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 8))\n",
"axarr[0, 0].imshow(a, cmap='gray')\n",
"axarr[0, 1].imshow(b, cmap='gray')\n",
"axarr[1, 0].imshow(c, cmap='gray')\n",
"axarr[1, 1].imshow(d, cmap='gray')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use Dask.array to inspect all images in parallel"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import dask.array as da\n",
"from dask import delayed\n",
"\n",
"imread = delayed(skimage.io.imread, pure=True)\n",
"\n",
"urls = ['http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % i\n",
" for i in range(10000)]\n",
"\n",
"values = [imread(url) for url in urls] # lazy evaluation\n",
"arrays = [da.from_delayed(value, \n",
" dtype=sample.dtype, \n",
" shape=sample.shape)\n",
" for value in values]\n",
"\n",
"stack = da.stack(arrays, axis=0)\n",
"stack"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Collect images together for greater efficiency\n",
"\n",
"Having many small chunks can be inefficient, especially when doing aggregations like mean."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"stack = stack.rechunk((20, 2000, 2000))\n",
"stack"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Connect to Cluster"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from dask.distributed import Client, progress\n",
"client = Client('localhost:8786')\n",
"client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Persist dataset in Distributed RAM for fast access\n",
"\n",
"We only take a subset of the dataset (bandwidth out of the data source is low)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"stack = client.persist(stack)\n",
"progress(stack)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"x = client.compute(stack.mean(axis=0))\n",
"progress(x)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"plt.figure(figsize=(8, 8))\n",
"skimage.io.imshow(x.result())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"plt.figure(figsize=(10, 4))\n",
"plt.plot(stack.mean(axis=[1, 2]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Recenter Images with Numba\n",
"\n",
"Visually the dominant trait is the movement of the data across the field of the microscope. Lets recenter the data.\n",
"\n",
"For fun, we'll use Numba to compute the image centroid and then slice away a 1000x1000 chunk."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from numba import jit\n",
"\n",
"@jit(nogil=True)\n",
"def centroid(im):\n",
" n, m = im.shape\n",
" total_x = 0\n",
" total_y = 0\n",
" total = 0\n",
" for i in range(n):\n",
" for j in range(m):\n",
" total += im[i, j]\n",
" total_x += i * im[i, j]\n",
" total_y += j * im[i, j]\n",
" \n",
" if total > 0:\n",
" total_x /= total\n",
" total_y /= total\n",
" return total_x, total_y\n",
" \n",
"centroid(sample)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%time centroid(sample)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def recenter(im):\n",
" x, y = centroid(im.squeeze())\n",
" x, y = int(x), int(y)\n",
" if x < 500:\n",
" x = 500\n",
" if y < 500:\n",
" y = 500\n",
" if x > 1500:\n",
" x = 1500\n",
" if y > 1500:\n",
" y = 1500\n",
" \n",
" return im[..., x-500:x+500, y-500:y+500]\n",
"\n",
"plt.figure(figsize=(8, 8))\n",
"skimage.io.imshow(recenter(sample))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"def recenter_block(block):\n",
" \"\"\" Recenter a short stack of images \"\"\"\n",
" return np.stack([recenter(block[i]) for i in range(block.shape[0])])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"recentered = stack.map_blocks(recenter_block, chunks=(20, 1000, 1000), \n",
" dtype=a.dtype)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"recentered = client.persist(recentered)\n",
"progress(recentered)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"x = client.compute(recentered.mean(axis=0))\n",
"progress(x)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"x = recentered.mean(axis=0)\n",
"x = client.persist(x)\n",
"progress(x)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"plt.figure(figsize=(8, 8))\n",
"skimage.io.imshow(x.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rechunking and Time Series\n",
"\n",
"We rechunk our data by pixel rather than by time. Then we compute a power spectrum for each pixel. This generates a large computation of around 100 000 tasks that is fun to observe. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"recentered"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"rechunked = recentered.rechunk((10000, 10, 10))\n",
"x = da.fft.fft(rechunked, axis=0)\n",
"power = abs(x ** 2).astype('float32')\n",
"\n",
"power = client.persist(power, optimize_graph=False)\n",
"progress(power)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"power"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Power spectrum of dark corner is zero"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"plt.figure(figsize=(12, 4))\n",
"plt.semilogy(1 + power[:, 0, 0].compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Power spectrum of center has lots of activity"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"plt.figure(figsize=(12, 4))\n",
"plt.semilogy(1 + power[:, 500, 500].compute())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
},
"widgets": {
"state": {
"41b128929cec400ba47c8a7b8b6e332a": {
"views": [
{
"cell_index": 28
}
]
},
"4c0041c7059342d889582acbb813e219": {
"views": [
{
"cell_index": 36
}
]
},
"6cff709f7e3a408caa99d4244b487e01": {
"views": [
{
"cell_index": 11
}
]
},
"902bf137cc504cd584ad2ecbe3210ea1": {
"views": [
{
"cell_index": 26
}
]
},
"b2a7fda206194303a42c7099b0ee7e32": {
"views": [
{
"cell_index": 16
}
]
},
"bf63b9ed302d4647917dc6ea1414bf50": {
"views": [
{
"cell_index": 32
}
]
}
},
"version": "1.2.0"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment