
The Setup:

I have MRI data from a group of patients with brain damage (due to stroke, head injury, etc.). For each patient, I have a binary MRI image (a 3D array) in which a value of 1 or 0 indicates the presence or absence of damage at a particular voxel (3D pixel). Each image is ~870k voxels.

The Goal:

Return a list of patients with damage at a given voxel.

Here's the stupid/inelegant way to do this:

voxel_id = 12345 # Index of the voxel I'm interested in
patient_matches = []
for patient in patient_list:
  patient_img_file = '/path/to/images/{}.MRI'.format(patient) # Get the path to the image for this patient.
  img = load_image(patient_img_file) # Load the image as a NumPy ndarray.
  img_flat = img.ravel() # Flatten to a 1D array.
  vox_val = img_flat[voxel_id]
  if vox_val == 1:
    patient_matches.append(patient)

The Original Plan

  1. Load and flatten each image, so I have a 1x870k vector for each patient.
  2. Read these vectors into a DB (table='damage'), with 1 row per patient and 1 column per voxel.
  3. Query: SELECT patient_id FROM damage WHERE vox12345 = 1
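
For reference, steps 1 and 2 boil down to something like the following (a sketch reusing the load_image helper and patient_list from the snippet above; the array name is mine). The stacked array could feed either the DB load or the all-NumPy approach discussed in the comments below.

import numpy as np

# One row per patient, one column per voxel: shape (n_patients, ~870k).
damage = np.stack([load_image('/path/to/images/{}.MRI'.format(p)).ravel()
                   for p in patient_list]).astype(np.uint8)
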
@wiredfool

So, with bit packing, you're going to have 870k/8, or ~100KB per patient? How many patients do you have (tens? hundreds? thousands?)

I think this is tractable just treating it as binary data, either mmapping it or not. 1000 patients is under 1GB, so pop it in memory, do some lookups, and you're there. If you're doing more than a handful of searches, I'd preprocess your flat images, then read them in and run a bunch of searches.
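
A minimal sketch of that preprocess-then-search idea, assuming the load_image helper and patient_list from the snippet above (the file name, the bit-packed layout via np.packbits, and the function name are my own choices):

import numpy as np

# One-time preprocessing: pack each patient's flattened binary image into bits
# (870k/8, roughly 100KB per patient) and save the whole stack to a single file.
packed = np.stack([np.packbits(load_image('/path/to/images/{}.MRI'.format(p)).ravel().astype(bool))
                   for p in patient_list])
np.save('damage_packed.npy', packed)  # shape: (n_patients, ceil(n_voxels / 8))

# Search time: memory-map the packed file and test one bit per patient.
packed = np.load('damage_packed.npy', mmap_mode='r')

def patients_with_damage(voxel_id):
    byte_col = packed[:, voxel_id // 8]          # one byte per patient
    bits = (byte_col >> (7 - voxel_id % 8)) & 1  # packbits is MSB-first within each byte
    return [patient_list[i] for i in np.flatnonzero(bits)]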

You could even throw them into a numpy array, shape = (feature, patients) of bool, and just use index projection to get the vector of matches. That's going to scale well till you run out of memory.

@wiredfool

A quick test in a VM on a slow laptop:

import numpy as np

# Dense boolean storage (1 byte per value, so ~87MB for 870k x 100; not actually bit-packed).
# Note row-major ordering, so extracting one feature is just a contiguous range, not 100 random accesses.
features = np.random.randint(0, 2, (870000, 100), dtype=bool)

def patients(features, foi):
    return [i for i,v in enumerate(features[foi,:]) if v]

>>> patients(features, 12345)
[0, 7, 11, 12, 13, 15, 17, 19, 21, 22, 26, 28, 31, 33, 35, 36, 38, 42, 44, 46, 47, 48, 49, 50, 53, 54, 55, 58, 60, 61, 62, 63, 70, 73, 75, 78, 79, 82, 83, 84, 85, 86, 87, 89, 90, 92, 95, 96, 98]

I'm seeing 100MB process memory usage and 1000 accesses taking 45ms.

Rough guess as to Postgres database size would be (100 patients * 870k voxels * 5% damaged) * (2 * 4 bytes/int + row header) * 2 (for the index), or roughly the same.
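
Spelling that estimate out (a quick sanity check; the 24-byte row header is my assumption, roughly what a Postgres tuple header costs):

rows = 100 * 870000 * 0.05                   # ~4.35 million (patient_id, voxel_id) rows for damaged voxels
for header in (0, 24):                       # with and without per-row header overhead
    total = rows * (2 * 4 + header) * 2      # two 4-byte ints per row, times 2 for the index
    print(header, round(total / 1e6), 'MB')  # -> 70 MB and 278 MB respectively

Either way it lands in the same ballpark as the ~100MB in-memory bool array.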
