Skip to content

Instantly share code, notes, and snippets.

@austinbrian
Created February 16, 2021 15:14
Show Gist options
  • Save austinbrian/4f7176ecfbcc08895af5c8021e39fdb8 to your computer and use it in GitHub Desktop.
Save austinbrian/4f7176ecfbcc08895af5c8021e39fdb8 to your computer and use it in GitHub Desktop.
If you have a set of rows you would like to repeat x number of times, here is a demonstration of how to use np.repeat to achieve it.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Duplicate DataFrame Rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you have a set of rows you would like to repeat `x` number of times, here is a demonstration of how to use `np.repeat` to achieve it. The motivating use case for this is a dataset that has multiple values (e.g., overlapping jurisdictions) inside the same cell, and for analysis purposes it would be better to duplicate those values as identitical entities, but with the multiple values split out over those lines.\n",
"\n",
"This performs significantly faster than a for-loop over the entire index."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ex. 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Start by creating a dataframe and assigning a column that indicates the number of times you will want a row created. A more complex answer later will demonstrate how to generate the multiplier."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a</th>\n",
" <th>b</th>\n",
" <th>c</th>\n",
" <th>d</th>\n",
" <th>mult_times</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.798289</td>\n",
" <td>0.649386</td>\n",
" <td>0.222324</td>\n",
" <td>0.466508</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.763452</td>\n",
" <td>0.959604</td>\n",
" <td>0.572388</td>\n",
" <td>0.774142</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.087007</td>\n",
" <td>0.172473</td>\n",
" <td>0.100578</td>\n",
" <td>0.685924</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.802214</td>\n",
" <td>0.924537</td>\n",
" <td>0.463268</td>\n",
" <td>0.564780</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.776287</td>\n",
" <td>0.160867</td>\n",
" <td>0.502297</td>\n",
" <td>0.325034</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.779431</td>\n",
" <td>0.718153</td>\n",
" <td>0.783243</td>\n",
" <td>0.786609</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0.195464</td>\n",
" <td>0.263158</td>\n",
" <td>0.856902</td>\n",
" <td>0.444856</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>0.182179</td>\n",
" <td>0.671997</td>\n",
" <td>0.158246</td>\n",
" <td>0.827743</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b c d mult_times\n",
"0 0.798289 0.649386 0.222324 0.466508 1\n",
"1 0.763452 0.959604 0.572388 0.774142 2\n",
"2 0.087007 0.172473 0.100578 0.685924 1\n",
"3 0.802214 0.924537 0.463268 0.564780 1\n",
"4 0.776287 0.160867 0.502297 0.325034 4\n",
"5 0.779431 0.718153 0.783243 0.786609 1\n",
"6 0.195464 0.263158 0.856902 0.444856 2\n",
"7 0.182179 0.671997 0.158246 0.827743 0"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.random.rand(8,4), columns=list('abcd'))\n",
"df['mult_times'] = [1,2,1,1,4,1,2,0]\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the execution. Notice it even drops the incidence where there is a `0` multiplier, which is the intended behavior here."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a</th>\n",
" <th>b</th>\n",
" <th>c</th>\n",
" <th>d</th>\n",
" <th>mult_times</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.798289</td>\n",
" <td>0.649386</td>\n",
" <td>0.222324</td>\n",
" <td>0.466508</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.763452</td>\n",
" <td>0.959604</td>\n",
" <td>0.572388</td>\n",
" <td>0.774142</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.763452</td>\n",
" <td>0.959604</td>\n",
" <td>0.572388</td>\n",
" <td>0.774142</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.087007</td>\n",
" <td>0.172473</td>\n",
" <td>0.100578</td>\n",
" <td>0.685924</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.802214</td>\n",
" <td>0.924537</td>\n",
" <td>0.463268</td>\n",
" <td>0.564780</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.776287</td>\n",
" <td>0.160867</td>\n",
" <td>0.502297</td>\n",
" <td>0.325034</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.776287</td>\n",
" <td>0.160867</td>\n",
" <td>0.502297</td>\n",
" <td>0.325034</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.776287</td>\n",
" <td>0.160867</td>\n",
" <td>0.502297</td>\n",
" <td>0.325034</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.776287</td>\n",
" <td>0.160867</td>\n",
" <td>0.502297</td>\n",
" <td>0.325034</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.779431</td>\n",
" <td>0.718153</td>\n",
" <td>0.783243</td>\n",
" <td>0.786609</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0.195464</td>\n",
" <td>0.263158</td>\n",
" <td>0.856902</td>\n",
" <td>0.444856</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0.195464</td>\n",
" <td>0.263158</td>\n",
" <td>0.856902</td>\n",
" <td>0.444856</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b c d mult_times\n",
"0 0.798289 0.649386 0.222324 0.466508 1\n",
"1 0.763452 0.959604 0.572388 0.774142 2\n",
"1 0.763452 0.959604 0.572388 0.774142 2\n",
"2 0.087007 0.172473 0.100578 0.685924 1\n",
"3 0.802214 0.924537 0.463268 0.564780 1\n",
"4 0.776287 0.160867 0.502297 0.325034 4\n",
"4 0.776287 0.160867 0.502297 0.325034 4\n",
"4 0.776287 0.160867 0.502297 0.325034 4\n",
"4 0.776287 0.160867 0.502297 0.325034 4\n",
"5 0.779431 0.718153 0.783243 0.786609 1\n",
"6 0.195464 0.263158 0.856902 0.444856 2\n",
"6 0.195464 0.263158 0.856902 0.444856 2"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[np.repeat(df.index.values, df.mult_times)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ex. 2\n",
"This time using string/categorical and assigning the duplication column with a lambda."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a</th>\n",
" <th>b</th>\n",
" <th>c</th>\n",
" <th>j</th>\n",
" <th>mult_times</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>alpha</td>\n",
" <td>bravo</td>\n",
" <td>kilo</td>\n",
" <td>CIA;FBI</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>gamma</td>\n",
" <td>delta</td>\n",
" <td>zeta</td>\n",
" <td>TSA</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>iota</td>\n",
" <td>kappa</td>\n",
" <td>upsilon</td>\n",
" <td>FDA;DHS;ATF</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>lambda</td>\n",
" <td>lambda</td>\n",
" <td>mu</td>\n",
" <td>ATF</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>tango</td>\n",
" <td>foxtrot</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b c j mult_times\n",
"0 alpha bravo kilo CIA;FBI 2\n",
"1 gamma delta zeta TSA 1\n",
"2 iota kappa upsilon FDA;DHS;ATF 3\n",
"3 lambda lambda mu ATF 1\n",
"4 tango foxtrot NaN NaN 0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(\n",
" [\n",
" ['alpha','bravo','kilo','CIA;FBI'],\n",
" ['gamma','delta','zeta','TSA'],\n",
" ['iota','kappa','upsilon','FDA;DHS;ATF'],\n",
" ['lambda','lambda','mu','ATF'],\n",
" ['tango','foxtrot',np.nan,np.nan]\n",
" ], \n",
" columns=list('abcj'))\n",
"\n",
"# assign lambda to categorize\n",
"df['mult_times'] = df['j'].apply(\n",
" lambda x: len(x.split(';')) if isinstance(x, str) else 0)\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a</th>\n",
" <th>b</th>\n",
" <th>c</th>\n",
" <th>j</th>\n",
" <th>mult_times</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>alpha</td>\n",
" <td>bravo</td>\n",
" <td>kilo</td>\n",
" <td>CIA;FBI</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>alpha</td>\n",
" <td>bravo</td>\n",
" <td>kilo</td>\n",
" <td>CIA;FBI</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>gamma</td>\n",
" <td>delta</td>\n",
" <td>zeta</td>\n",
" <td>TSA</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>iota</td>\n",
" <td>kappa</td>\n",
" <td>upsilon</td>\n",
" <td>FDA;DHS;ATF</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>iota</td>\n",
" <td>kappa</td>\n",
" <td>upsilon</td>\n",
" <td>FDA;DHS;ATF</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>iota</td>\n",
" <td>kappa</td>\n",
" <td>upsilon</td>\n",
" <td>FDA;DHS;ATF</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>lambda</td>\n",
" <td>lambda</td>\n",
" <td>mu</td>\n",
" <td>ATF</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b c j mult_times\n",
"0 alpha bravo kilo CIA;FBI 2\n",
"0 alpha bravo kilo CIA;FBI 2\n",
"1 gamma delta zeta TSA 1\n",
"2 iota kappa upsilon FDA;DHS;ATF 3\n",
"2 iota kappa upsilon FDA;DHS;ATF 3\n",
"2 iota kappa upsilon FDA;DHS;ATF 3\n",
"3 lambda lambda mu ATF 1"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[np.repeat(df.index.values,df.mult_times)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Also possible to just do it all in the `df.loc` call and not worry about having an extra column."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a</th>\n",
" <th>b</th>\n",
" <th>c</th>\n",
" <th>j</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>alpha</td>\n",
" <td>bravo</td>\n",
" <td>kilo</td>\n",
" <td>CIA;FBI</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>alpha</td>\n",
" <td>bravo</td>\n",
" <td>kilo</td>\n",
" <td>CIA;FBI</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>gamma</td>\n",
" <td>delta</td>\n",
" <td>zeta</td>\n",
" <td>TSA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>iota</td>\n",
" <td>kappa</td>\n",
" <td>upsilon</td>\n",
" <td>FDA;DHS;ATF</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>iota</td>\n",
" <td>kappa</td>\n",
" <td>upsilon</td>\n",
" <td>FDA;DHS;ATF</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>iota</td>\n",
" <td>kappa</td>\n",
" <td>upsilon</td>\n",
" <td>FDA;DHS;ATF</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>lambda</td>\n",
" <td>lambda</td>\n",
" <td>mu</td>\n",
" <td>ATF</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b c j\n",
"0 alpha bravo kilo CIA;FBI\n",
"0 alpha bravo kilo CIA;FBI\n",
"1 gamma delta zeta TSA\n",
"2 iota kappa upsilon FDA;DHS;ATF\n",
"2 iota kappa upsilon FDA;DHS;ATF\n",
"2 iota kappa upsilon FDA;DHS;ATF\n",
"3 lambda lambda mu ATF"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(\n",
" [\n",
" ['alpha','bravo','kilo','CIA;FBI'],\n",
" ['gamma','delta','zeta','TSA'],\n",
" ['iota','kappa','upsilon','FDA;DHS;ATF'],\n",
" ['lambda','lambda','mu','ATF'],\n",
" ['tango','foxtrot',np.nan,np.nan]\n",
" ], \n",
" columns=list('abcj'))\n",
"\n",
"df2 = df.loc[\n",
" np.repeat(\n",
" df.index.values,\n",
" df['j'].apply(\n",
" lambda x: len(x.split(';')) if isinstance(x, str) else 0)\n",
" )]\n",
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then you can write a nice little list comprehension that will assign the adjusted column properly."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"adj_j = [\n",
" i for s in \n",
" [x.split(';') for x in df.j.values if isinstance(x, str)] # cut out NaN\n",
" for i in s]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a</th>\n",
" <th>b</th>\n",
" <th>c</th>\n",
" <th>j</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>alpha</td>\n",
" <td>bravo</td>\n",
" <td>kilo</td>\n",
" <td>CIA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>alpha</td>\n",
" <td>bravo</td>\n",
" <td>kilo</td>\n",
" <td>FBI</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>gamma</td>\n",
" <td>delta</td>\n",
" <td>zeta</td>\n",
" <td>TSA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>iota</td>\n",
" <td>kappa</td>\n",
" <td>upsilon</td>\n",
" <td>FDA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>iota</td>\n",
" <td>kappa</td>\n",
" <td>upsilon</td>\n",
" <td>DHS</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>iota</td>\n",
" <td>kappa</td>\n",
" <td>upsilon</td>\n",
" <td>ATF</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>lambda</td>\n",
" <td>lambda</td>\n",
" <td>mu</td>\n",
" <td>ATF</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b c j\n",
"0 alpha bravo kilo CIA\n",
"0 alpha bravo kilo FBI\n",
"1 gamma delta zeta TSA\n",
"2 iota kappa upsilon FDA\n",
"2 iota kappa upsilon DHS\n",
"2 iota kappa upsilon ATF\n",
"3 lambda lambda mu ATF"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2['adj_j'] = adj_j\n",
"df2.drop('j',axis=1).rename({'adj_j':'j'}, axis=1)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "sheriffs",
"language": "python",
"name": "sheriffs"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment