Dask task graph handling costs on the client
# Example from "Doing Nothing Poorly: Accelerating the Dask Scheduler" workshop
# Dask Summit 2021
# Task graph handling costs on the client
import pickle
import dask
from dask.datasets import timeseries
# Create dask task graph
%time ddf = timeseries().shuffle("id", shuffle="tasks").head(compute=False)
# Wall time: 4.01 s
# Optimize
%time ddf_opt, = dask.optimize(ddf)
# Wall time: 1.3 s
# Serialize
byte_total = 0
for k, v in ddf_opt.__dask_graph__().items():
    byte_total += len(pickle.dumps(k)) + len(pickle.dumps(v))
# Wall time: 731 ms
# Send to the scheduler
dask.utils.format_bytes(byte_total)
# '15.88 MB' (Assume ~587 ms at 100 MB/s; note that 15.88 MB / 100 MB/s works out to roughly 159 ms)
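# A minimal sketch, not part of the workshop example: the serialize and send
# steps above can be reproduced without IPython magics or a Dask install.
# The helper names (pickled_graph_nbytes, estimate_transfer_seconds) and the
# toy graph are hypothetical stand-ins; the 100 MB/s bandwidth is the same
# assumption used in the comment above.

```python
import pickle
import time


def pickled_graph_nbytes(graph):
    """Sum the pickled size of every key and value in a dict-like task graph."""
    return sum(len(pickle.dumps(k)) + len(pickle.dumps(v)) for k, v in graph.items())


def estimate_transfer_seconds(num_bytes, bandwidth_bytes_per_s=100e6):
    """Time to send num_bytes at a fixed bandwidth (assumed 100 MB/s by default)."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth_bytes_per_s must be positive")
    return num_bytes / bandwidth_bytes_per_s


# Toy stand-in for ddf_opt.__dask_graph__(): keys map to (function, args) tuples.
toy_graph = {("x", i): (sum, [i, i + 1]) for i in range(1000)}

# Time the serialization step with a wall-clock timer instead of %time.
start = time.perf_counter()
toy_bytes = pickled_graph_nbytes(toy_graph)
serialize_seconds = time.perf_counter() - start

print(f"serialized {toy_bytes} bytes in {serialize_seconds * 1000:.2f} ms")
print(f"estimated send time: {estimate_transfer_seconds(toy_bytes) * 1000:.3f} ms")

# The same estimate for the 15.88 MB total reported above.
print(f"{estimate_transfer_seconds(15.88e6) * 1000:.0f} ms")
```

# The toy graph is far smaller than the shuffled timeseries graph, so its
# timings are only illustrative; pass the real graph to pickled_graph_nbytes
# to measure the actual workload.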