@rjurney
Last active February 4, 2023 00:58

Add a random ID column to a pandas DataFrame using Numpy

I needed to generate random IDs to partition some data for Dask while writing a Parquet file from pandas, since the write itself was a cheap operation that didn't require multiple cores. I didn't like any of the answers I found, so I decided to hack this recipe together myself and to remind myself I can still work from API docs :)

I think for efficiency you want to do this via numpy.random.randint and then make a column out of the result as a pandas.Series, since a Series is just a numpy.ndarray with some dressing added. As the pandas docs describe a Series:

One-dimensional ndarray with axis labels (including time series).

import random
import numpy as np
import pandas as pd

# Seed for reproducible randomness if you choose; otherwise importing `random` is optional
random.seed(31337)
np.random.seed(31337)

# ...generate a suitable pandas.DataFrame
df = pd.DataFrame({
  "name": ["Russell", "Meriam", "Josh", "Ruth", "Chris", "William"],
  "score": [3.0, 4.0, 5.0, 3.1, 3.8, 0.1]
})

# ...generate a random numpy.ndarray of integers; assigning it creates a pandas.Series column
df["random_id"] = np.random.randint(low=1, high=6, size=len(df))

The result is reproducible because we set the random seeds:

      name  score  random_id
0  Russell    3.0          2
1   Meriam    4.0          5
2     Josh    5.0          5
3     Ruth    3.1          3
4    Chris    3.8          1
5  William    0.1          5
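
The prose above mentions building the column out of a pandas.Series explicitly. Here is a minimal sketch of that equivalent form, assuming the same df as above; note that numpy.random.randint treats high as exclusive, so low=1, high=6 yields IDs 1 through 5.

import numpy as np
import pandas as pd

np.random.seed(31337)

# Explicitly wrap the ndarray in a Series before assignment; assigning the raw
# ndarray (as above) does the same thing, since pandas wraps it for you.
# randint's `high` is exclusive, so low=1, high=6 produces IDs in 1..5.
df["random_id"] = pd.Series(
    np.random.randint(low=1, high=6, size=len(df)),
    index=df.index,
)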

Randomly Partitioning a Parquet File from Pandas

One reason to create random IDs is to partition a Parquet file so something more... parallel, like a Dask DataFrame, can read it concurrently. You might write the file with pandas and then run a much more expensive process on it with Dask across multiple cores or GPUs on one or more machines.

# Add a column of random IDs and partition by it so 16 concurrent cores can read the file
# (randint's `high` is exclusive, so high=17 yields IDs 1-16)
df["random_id"] = pd.Series(np.random.randint(low=1, high=17, size=len(df.index)))

# And save a partitioned copy
df.to_parquet(
    "data/dblp.nodes.partitioned.parquet",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["random_id"],
)
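
To close the loop, here is a sketch of reading the partitioned file back with Dask. It assumes dask[dataframe] and pyarrow are installed and uses the same path as the to_parquet() call above; the groupby is just a stand-in for whatever expensive operation you would run in parallel.

import dask.dataframe as dd

# Dask maps each random_id partition directory to its own task, so up to 16
# workers/cores can read and process the file concurrently
ddf = dd.read_parquet("data/dblp.nodes.partitioned.parquet", engine="pyarrow")

# A stand-in for a much more expensive parallel computation
mean_scores = ddf.groupby("random_id")["score"].mean().compute()
print(mean_scores)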

And… I’m spent! Have fun 😁
