@elephantum
Last active January 26, 2021 07:56
Dask data loss with `.set_index`
#VERSION=2020.12.0
VERSION=2021.1.1
version: '3'
services:
  scheduler:
    image: daskdev/dask:${VERSION}
    command: dask-scheduler
    volumes:
      - ./test.py:/srv/test.py
  wrk:
    image: daskdev/dask:${VERSION}
    command: dask-worker tcp://scheduler:8786
    deploy:
      replicas: 2
docker-compose up -d
docker-compose exec scheduler python /srv/test.py
(base) ➜ docker-compose exec scheduler python /srv/test.py
10000
10000
5015
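To put the three counts above in perspective, here is a minimal sketch that turns them into "rows lost" figures. The helper name `row_loss` and the labels are hypothetical, introduced only for illustration. The counts are copied from the output above and matched to the three `print` calls in test.py by order.

```python
from typing import Tuple


def row_loss(expected: int, observed: int) -> Tuple[int, float]:
    """Return (rows lost, fraction lost) for a pair of row counts."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    lost = expected - observed
    return lost, lost / expected


# Row counts copied from the output above, in print order.
counts = {
    "no set_index": 10000,
    "set_index('uuid')": 10000,
    "set_index('uuid', shuffle='disk')": 5015,
}

baseline = counts["no set_index"]
for label, n in counts.items():
    lost, frac = row_loss(baseline, n)
    print(f"{label}: {n} rows, {lost} lost ({frac:.2%})")
```

In this run only the `shuffle='disk'` variant drops rows: 4985 of 10000, roughly half of the frame.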
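import uuid

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

# Connect to the scheduler service defined in the compose file.
with Client('scheduler:8786') as client:
    # 10000 rows with unique string keys, split into partitions of 100 rows.
    test_ddf = dd.from_pandas(pd.DataFrame({
        'uuid': [str(uuid.uuid4()) for i in range(10000)],
    }), chunksize=100)

    # All three lengths should be equal; the last one comes back short.
    print(len(test_ddf))
    print(len(test_ddf.set_index('uuid')))
    print(len(test_ddf.set_index('uuid', shuffle='disk')))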