@rpgoldman
Last active January 7, 2021 16:36
Problems computing a new column value from an existing dask dataframe column
import re

import dask.dataframe as dd
import numpy as np

# Here's what I do to get my background data (sorry, the file is not public)
df = dd.read_csv('r1c5va879uaex_r1c639xp952g4.csv', assume_missing=True)
# Now, in order to add a column, I need to be able to add metadata --
# if I don't, I get mysterious errors about failing to infer types
newmeta = df._meta.copy() # get the original metadata
# add new column to metadata
newmeta.insert(len(newmeta.columns), 'well', 'foo')
# specify the dtype of the new column
newmeta = newmeta.astype({'well': str})
# the function we use to compute the new column values --
# note that efm suggests we use a field splitter instead of
# regular expression matching, since the regex is less efficient
# NOTE: id_re (a pattern whose first group captures the well ID) is
# defined elsewhere and is not shown here
def find_well(x):
    assert isinstance(x, str), f"Trying to find well in non-string value {x}"
    match = re.match(id_re, x)
    if match is None:
        raise ValueError(f"Couldn't find well ID in {x}")
    return match.group(1)

# element-wise wrapper so find_well can be applied to a whole column
find_wellv = np.vectorize(find_well)
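# now actually compute and add the new column, one partition at a time
# (the lambda argument is named `part` so it does not shadow the outer `df`)
df4 = df.map_partitions(lambda part: part.assign(well=find_wellv(part['id'])), meta=newmeta)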
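
The comment above mentions efm's suggestion to use a field splitter rather than a regular expression. The gist does not show the layout of the `id` values, so the following is only a sketch under an assumed layout: the name `find_well_by_split`, the `_` separator, the choice of the last field, and the sample ID `plate1_A01` are all hypothetical stand-ins, not part of the original code.

```python
def find_well_by_split(sample_id, sep="_", field=-1):
    """Return one delimited field of sample_id as the well ID.

    ASSUMPTION: the separator and the field position are hypothetical;
    the real ID layout is not shown in the gist.
    """
    if not isinstance(sample_id, str):
        raise TypeError(f"Trying to find well in non-string value {sample_id!r}")
    fields = sample_id.split(sep)
    # An ID with no separator has no distinct well field to return.
    if len(fields) < 2:
        raise ValueError(f"Couldn't find well ID in {sample_id!r}")
    return fields[field]


# Hypothetical ID, used only to illustrate the call
print(find_well_by_split("plate1_A01"))  # prints "A01"
```

If the real IDs do follow a fixed delimiter, a function like this could be wrapped with `np.vectorize` and passed to `map_partitions` in the same way as `find_well`.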