@pantaray
Last active October 4, 2022 09:26
Proposed feature addition for ACME
import numpy as np
import h5py
from acme import ParallelMap

# Consider the following function
def arr_func(x, y, z=3):
    return x * (y + z)
# We want to evaluate `arr_func` for three different values of `x` and the same `y`:
xList = [np.array([1, 2, 3]), np.array([-1, -2, -3]), np.array([-1, 2, -3])]
y = np.array([4, 5, 6])
# Firing off `ParallelMap` and collecting results in memory yields
with ParallelMap(arr_func, xList, y, z=0.5, write_worker_results=False) as pmap:
    res = pmap.compute()
>>> res
[array([ 7, 14, 21]), array([ -8, -16, -24]), array([ -9, 18, -27])]
# Now, with `write_worker_results = True` (the default) ACME currently
# creates `n_inputs` HDF5 containers:
with ParallelMap(arr_func, xList, y, z=0.5) as pmap:
    res = pmap.compute()
>>> res
['/cs/home/username/ACME_20220906-1135-448825/arr_func_0.h5',
'/cs/home/username/ACME_20220906-1135-448825/arr_func_1.h5',
'/cs/home/username/ACME_20220906-1135-448825/arr_func_2.h5']
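# For illustration, these per-file results could be read back along the
# following lines (a minimal sketch; the dataset name 'result_0' inside each
# container is an assumption based on the layout shown below):
results = []
for fname in res:  # `res` holds the file paths returned by `pmap.compute()`
    with h5py.File(fname, 'r') as h5f:
        results.append(h5f['result_0'][()])  # load the full array into memory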
# How about ACME, as a new default, (additionally) generating a single HDF5
# container that collects all worker results:
>>> res
'/cs/home/username/ACME_20220906-1135-448825/arr_func.h5'
# Each worker's result can be accessed via a per-worker HDF5 group:
>>> h5f = h5py.File('/cs/home/username/ACME_20220906-1135-448825/arr_func.h5', 'r')
>>> h5f.keys()
<KeysViewHDF5 ['worker_0', 'worker_1', 'worker_2']>
>>> h5f['worker_0'].keys()
<KeysViewHDF5 ['result_0']>
>>> h5f['worker_1'].keys()
<KeysViewHDF5 ['result_0']>
>>> h5f['worker_2'].keys()
<KeysViewHDF5 ['result_0']>
# The actual results can then be accessed directly from the single container
>>> h5f['worker_0']['result_0'][()]
array([ 7, 14, 21])
>>> h5f['worker_1']['result_0'][()]
array([ -8, -16, -24])
>>> h5f['worker_2']['result_0'][()]
array([ -9, 18, -27])
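# With this layout, all results can be gathered from the single container in
# one pass (a small usage sketch; key order is made explicit since plain
# `sorted` would mis-order 'worker_10' and above):
with h5py.File('/cs/home/username/ACME_20220906-1135-448825/arr_func.h5', 'r') as h5f:
    workers = sorted(h5f.keys(), key=lambda key: int(key.split('_')[1]))
    all_res = [h5f[grp]['result_0'][()] for grp in workers]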
# In addition, ACME could (optionally) support an `outShape` keyword to
# collect everything in a single array-like structure instead of storing
# results in separate HDF5 groups/datasets. Specifying `outShape = (None, 3)`
# tells ACME to stack all resulting 3-element arrays along the first dimension:
with ParallelMap(arr_func, xList, y, z=0.5, outShape=(None, 3)) as pmap:
    res = pmap.compute()
>>> h5f = h5py.File('/cs/home/username/ACME_20220906-1135-448825/arr_func.h5', 'r')
>>> h5f['result_0'][()]
array([[  7,  14,  21],
       [ -8, -16, -24],
       [ -9,  18, -27]])
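# Under the hood, the `outShape` mechanics could work by pre-allocating one
# dataset (with `None` replaced by the number of inputs) and letting each
# computation fill its slice - a minimal sketch under assumed names, not
# ACME's actual implementation:
n_inputs, out_shape = len(xList), (None, 3)
shape = tuple(n_inputs if dim is None else dim for dim in out_shape)
with h5py.File('arr_func_stacked.h5', 'w') as h5f:  # hypothetical file name
    dset = h5f.create_dataset('result_0', shape=shape, dtype='float64')
    for k, x in enumerate(xList):
        dset[k, :] = arr_func(x, y, z=0.5)  # computation k writes its slice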
@timnaher

timnaher commented Sep 6, 2022

Haha, yes I agree it's all so confusing... How about calling them computations instead? comp_0, comp_1, etc

Line 32 does not load anything into memory - it just opens a file handle to the container. In the example above, however, the entire dataset is loaded once the [()] operator is used (Line 57, for instance). Still, an arbitrarily large HDF5 dataset can be indexed without overflowing local memory using the usual array slicing (e.g., h5f['result_0'][:, 0] only reads the first column from disk).
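To make the distinction concrete, a small sketch using the container from above:

h5f = h5py.File('arr_func.h5', 'r')  # opens a file handle, loads nothing
dset = h5f['result_0']               # still lazy: just a dataset object
col0 = dset[:, 0]                    # reads only the first column from disk
full = dset[()]                      # [()] pulls the entire dataset into memory
h5f.close()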

Great! Then I think it makes a lot of sense to throw them all in one container! I still like the single_file = True option though, just to be flexible.

I'm excited about the new features, sounds all great!

@pantaray
Author

pantaray commented Sep 7, 2022

Great! Thanks for the quick feedback!

Haha, yes I agree it's all so confusing... How about calling them computations instead? comp_0, comp_1, etc

Sold! comp_* it is :)

@KatharineShapcott

Sorry I'm a bit late to reply here, I really like the idea! comp_* seems like a good naming convention too. Just a small worry: what will happen if not all the computations complete? Will the whole file be corrupted, or only the parts that didn't return?

@pantaray
Author

pantaray commented Oct 4, 2022

Hi Katharine! Thanks for taking a look at this and the feedback!

In the prototype implementation (the results_collection branch in the repo), crash handling is more or less unchanged: if some computations crash, the "symlinks" in the results file pointing to the incomplete calculations simply point at nothing. If your "emergency pickling" mechanism kicks in, the results file is removed (since HDF5 files cannot contain links to pickled data) and we're basically back to the current state (ACME generates a bunch of .pickle and .h5 files).
Only the case single_file=True (i.e., all workers really write to the same file, so the results container really is a single regular file, not a collection of symlinks) might run into trouble: if a worker dies while writing to the file, the file might end up corrupted and unreadable - I've included a warning message for this case.
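For reference, the "symlinks" mentioned above are HDF5 external links; a minimal sketch of the idea (illustrative file/group names, not the actual ACME code):

import h5py

# The collector file only holds external links into the per-computation files
with h5py.File('arr_func.h5', 'w') as h5f:
    for k in range(3):
        h5f[f'comp_{k}'] = h5py.ExternalLink(f'arr_func_{k}.h5', '/result_0')

# If computation 1 crashed and arr_func_1.h5 was never written, the link
# 'comp_1' still exists but resolving it fails with a KeyError:
with h5py.File('arr_func.h5', 'r') as h5f:
    try:
        h5f['comp_1']
    except KeyError:
        print("dangling link: 'comp_1' points at nothing")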
