I needed to generate random IDs to partition a Parquet file written from pandas, so that Dask could later read it in parallel; the write itself was the cheap part and didn't need multiple cores. I didn't like any of the answers I found, so I decided to hack this recipe together myself, partly to remind myself I can still work from API docs :)
I think for efficiency you want to do this via numpy.random.randint and then make a column out of it via a pandas.Series, since a Series is just a numpy.ndarray with some dressing added. The pandas docs describe a Series as: "One-dimensional ndarray with axis labels (including time series)."
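As a tiny sketch of that idea (the variable names here are mine, not from the recipe): randint hands back a plain ndarray, and Series just wraps it with an index.

import numpy as np
import pandas as pd

# randint returns a numpy.ndarray of integers in [low, high) -- the high end is exclusive
raw = np.random.randint(low=1, high=6, size=5)
print(type(raw))             # <class 'numpy.ndarray'>

# Wrapping it in a Series adds axis labels (an index) on top of the same values
labeled = pd.Series(raw)
print(type(labeled.values))  # still a numpy.ndarray underneath

With that in mind, here is the full recipe: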
import random
import numpy as np
import pandas as pd
# Seed for reproducibility if you choose; only numpy's seed affects randint below, so the `random` import/seed is optional
random.seed(31337)
np.random.seed(31337)
# ...generate a suitable pandas.DataFrame
df = pd.DataFrame({
"name": ["Russell", "Meriam", "Josh", "Ruth", "Chris", "William"],
"score": [3.0, 4.0, 5.0, 3.1, 3.8, 0.1]
})
# ...generate a random numpy.ndarray of integers to create a pandas.Series (high is exclusive, so these IDs are 1-5)
df["random_id"] = np.random.randint(low=1, high=6, size=len(df))
The result [is reproducible because we set the random seeds]:
      name  score  random_id
0  Russell    3.0          2
1   Meriam    4.0          5
2     Josh    5.0          5
3     Ruth    3.1          3
4    Chris    3.8          1
5  William    0.1          5
One reason to create random IDs is to partition a Parquet file so you can use something more... parallel like a Dask DataFrame to read it concurrently. You might write a file using pandas and do a much more expensive process using Dask across multiple cores or GPUs on one or more machines.
# Add a column of random IDs and partition by it so 16 cores can read the file concurrently
# (randint's high end is exclusive, so low=0, high=16 yields IDs 0-15: 16 partitions)
df["random_id"] = pd.Series(np.random.randint(low=0, high=16, size=len(df)))
# And save it partitioned by random_id
df.to_parquet(
"data/dblp.nodes.partitioned.parquet",
engine="pyarrow",
compression="snappy",
partition_cols=["random_id"],
)
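To close the loop, here's a sketch of reading that partitioned file back concurrently with Dask (this assumes dask[dataframe] and pyarrow are installed; the path matches the write above, and the groupby is just a placeholder for your actual expensive work):

import dask.dataframe as dd

# Each partition directory (random_id=0 ... random_id=15) can be read by a separate worker
ddf = dd.read_parquet(
    "data/dblp.nodes.partitioned.parquet",
    engine="pyarrow",
)

# ...then run the expensive part across cores, e.g. some groupby/aggregation
result = ddf.groupby("random_id")["score"].mean().compute()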
And… I’m spent! Have fun 😁
...and here’s that in an image in case you need it 👍 and to hype the gist.