@rjurney
Last active February 4, 2023 00:58

Add a random ID column to a pandas DataFrame using Numpy

I needed to generate random IDs to partition some data for Dask while writing a Parquet file from pandas, since the write itself was a cheap operation that didn't require multiple cores. I didn't like any of the answers I found, so I decided to hack this recipe together myself and to remind myself I can still work from API docs :)

I think for efficiency you want to do this via numpy.random.randint and then make a column out of the result as a pandas.Series, since a Series is just a numpy.ndarray with some dressing added. As the pandas docs describe a Series:

One-dimensional ndarray with axis labels (including time series).

import random
import numpy as np
import pandas as pd

# Seed for reproducible randomness if you choose; otherwise importing `random` is optional
random.seed(31337)
np.random.seed(31337)

# ...generate a suitable pandas.DataFrame
df = pd.DataFrame({
  "name": ["Russell", "Meriam", "Josh", "Ruth", "Chris", "William"],
  "score": [3.0, 4.0, 5.0, 3.1, 3.8, 0.1]
})

# ...generate a random numpy.ndarray of integers; assigning it creates a pandas.Series column
df["random_id"] = np.random.randint(low=1, high=6, size=len(df))

The result is reproducible because we set the random seeds:

      name  score  random_id
0  Russell    3.0          2
1   Meriam    4.0          5
2     Josh    5.0          5
3     Ruth    3.1          3
4    Chris    3.8          1
5  William    0.1          5
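
The prose above mentions building the column out of a pandas.Series explicitly. Here is a minimal sketch of that equivalent form, assuming the same df as above; note that numpy.random.randint treats high as exclusive, so low=1, high=6 yields IDs 1 through 5.

import numpy as np
import pandas as pd

np.random.seed(31337)

# Explicitly wrap the ndarray in a Series before assignment; assigning the raw
# ndarray (as above) does the same thing, since pandas wraps it for you.
# randint's `high` is exclusive, so low=1, high=6 produces IDs in 1..5.
df["random_id"] = pd.Series(
    np.random.randint(low=1, high=6, size=len(df)),
    index=df.index,
)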

Randomly Partitioning a Parquet File from Pandas

One reason to create random IDs is to partition a Parquet file so something more... parallel, like a Dask DataFrame, can read it concurrently. You might write the file with pandas and then run a much more expensive process on it with Dask across multiple cores or GPUs on one or more machines.

# Add a column of random IDs and partition by it so 16 concurrent cores can read the file
# (randint's `high` is exclusive, so high=17 yields IDs 1-16)
df["random_id"] = pd.Series(np.random.randint(low=1, high=17, size=len(df.index)))

# And save a partitioned copy
df.to_parquet(
    "data/dblp.nodes.partitioned.parquet",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["random_id"],
)
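
To close the loop, here is a sketch of reading the partitioned file back with Dask. It assumes dask[dataframe] and pyarrow are installed and uses the same path as the to_parquet() call above; the groupby is just a stand-in for whatever expensive operation you would run in parallel.

import dask.dataframe as dd

# Dask maps each random_id partition directory to its own task, so up to 16
# workers/cores can read and process the file concurrently
ddf = dd.read_parquet("data/dblp.nodes.partitioned.parquet", engine="pyarrow")

# A stand-in for a much more expensive parallel computation
mean_scores = ddf.groupby("random_id")["score"].mean().compute()
print(mean_scores)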

And… I’m spent! Have fun 😁
