
@gjreda
Last active October 19, 2018 03:45
add grouped cumulative sum column to pandas dataframe
"""
Add a grouped cumulative sum column to a pandas DataFrame.
Adds a new column that holds the cumulative sum of session_minutes within each user_id group.
Desired output:
user_id,day,session_minutes,cumulative_minutes
516530,0,NaN,0
516530,1,0,0
516532,0,5,5
516532,1,33,38
516534,0,32,32
516534,1,104,136
516535,0,5,5
516535,1,0,5
"""
import pandas as pd
from io import StringIO
data = """
user_id,day,session_minutes
516530,0,NaN
516530,1,0
516532,0,5
516532,1,33
516534,0,32
516534,1,104
516535,0,5
516535,1,0
"""
df = pd.read_csv(StringIO(data))
# Sum session_minutes per (user_id, day), then cumulative-sum within each user_id (index level 0).
cumsums = df.groupby(['user_id', 'day']).sum().fillna(0).groupby(level=0).cumsum()
# Give df the same (user_id, day) index so the new column aligns row-for-row.
df.set_index(['user_id', 'day'], inplace=True)
df['cumulative_minutes'] = cumsums['session_minutes']
df.reset_index(inplace=True)
print(df)
# user_id day session_minutes cumulative_minutes
# 0 516530 0 NaN 0
# 1 516530 1 0 0
# 2 516532 0 5 5
# 3 516532 1 33 38
# 4 516534 0 32 32
# 5 516534 1 104 136
# 6 516535 0 5 5
# 7 516535 1 0 5
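
(Not part of the original gist.) If all you need is the running total per user_id, the same column can also be computed with a grouped cumsum() directly, skipping the set_index/reset_index round-trip. A minimal sketch; it assumes each user's rows already appear in day order, as they do in the sample above:

# Alternative sketch (assumption, not from the original gist): fill the NaN
# session_minutes first, then take a cumulative sum within each user_id.
# Rows must be ordered by day within each user for the totals to line up.
df['cumulative_minutes'] = df['session_minutes'].fillna(0).groupby(df['user_id']).cumsum()
print(df)  # should match the expected output above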
earroyo commented Oct 19, 2018

Hello,

Thank you for sharing. I tried your example with the following data:

data="""
PATTERN_NAME | PROD_INJ_DATE | OIL_PROD
_CRCU_1101 | 1/31/1973 | 4752
_CRCU_1101 | 2/28/1973 | 4428
_CRCU_1101 | 3/31/1973 | 5316
_CRCU_1101 | 4/30/1973 | 4776
_CRCU_1101 | 5/31/1973 | 4860
_CRCU_1101 | 6/30/1973 | 4905
_CRCU_1101 | 7/31/1973 | 5208
_CRCU_1101 | 8/31/1973 | 5034
_CRCU_1101 | 9/30/1973 | 3240
_CRCU_1103 | 1/31/1973 | 372
_CRCU_1103 | 2/28/1973 | 345
_CRCU_1103 | 3/31/1973 | 397
_CRCU_1103 | 4/30/1973 | 373
_CRCU_1103 | 5/31/1973 | 432
_CRCU_1103 | 6/30/1973 | 454
_CRCU_1103 | 7/31/1973 | 483
_CRCU_1103 | 8/31/1973 | 467
_CRCU_1103 | 9/30/1973 | 458
_CRCU_1104 | 1/31/1973 | 447
_CRCU_1104 | 2/28/1973 | 414
_CRCU_1104 | 3/31/1973 | 488
_CRCU_1104 | 4/30/1973 | 449
_CRCU_1104 | 5/31/1973 | 432
_CRCU_1104 | 6/30/1973 | 454
_CRCU_1104 | 7/31/1973 | 482
_CRCU_1104 | 8/31/1973 | 467
_CRCU_1104 | 9/30/1973 | 431
"""

Using the code below:

df=pd.read_csv("ProdByPattern.csv",delimiter="\t")
cumsums = df.groupby(['PATTERN_NAME','PROD_INJ_DATE']).sum().fillna(0).groupby(level=0).cumsum()
df.set_index(['PATTERN_NAME','PROD_INJ_DATE'], inplace=True)
df['cum_oil'] = cumsums
df.reset_index(inplace=True)
print(df)

I got the expected results.

PATTERN_NAME PROD_INJ_DATE OIL_PROD cum_oil
0 _CRCU_1101 1/31/1973 4752 4752
1 _CRCU_1101 2/28/1973 4428 9180
2 _CRCU_1101 3/31/1973 5316 14496
3 _CRCU_1101 4/30/1973 4776 19272
4 _CRCU_1101 5/31/1973 4860 24132
5 _CRCU_1101 6/30/1973 4905 29037
6 _CRCU_1101 7/31/1973 5208 34245
7 _CRCU_1101 8/31/1973 5034 39279
8 _CRCU_1101 9/30/1973 3240 42519
9 _CRCU_1103 1/31/1973 372 372
10 _CRCU_1103 2/28/1973 345 717
11 _CRCU_1103 3/31/1973 397 1114
12 _CRCU_1103 4/30/1973 373 1487
13 _CRCU_1103 5/31/1973 432 1919
14 _CRCU_1103 6/30/1973 454 2373
15 _CRCU_1103 7/31/1973 483 2856
16 _CRCU_1103 8/31/1973 467 3323
17 _CRCU_1103 9/30/1973 458 3781
18 _CRCU_1104 1/31/1973 447 447
19 _CRCU_1104 2/28/1973 414 861
20 _CRCU_1104 3/31/1973 488 1349
21 _CRCU_1104 4/30/1973 449 1798
22 _CRCU_1104 5/31/1973 432 2230
23 _CRCU_1104 6/30/1973 454 2684
24 _CRCU_1104 7/31/1973 482 3166
25 _CRCU_1104 8/31/1973 467 3633
26 _CRCU_1104 9/30/1973 431 4064
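
(Side note, not part of the original comment.) The pasted sample is pipe-separated with padding spaces, while the read_csv call above reads a ProdByPattern.csv file with delimiter="\t". If you want to reproduce the result from the inline string instead, a regex separator handles the pipes and the surrounding whitespace; the sep value here is an assumption based on how the sample is formatted:

from io import StringIO
import pandas as pd

# Assumption: parse the pipe-separated sample string ('data' above) rather than
# a tab-separated ProdByPattern.csv on disk. engine='python' is required for a
# regex separator; the pattern also swallows the spaces around each "|".
df = pd.read_csv(StringIO(data), sep=r'\s*\|\s*', engine='python')
# The grouped-cumsum recipe from the gist then applies unchanged. Note that
# PROD_INJ_DATE stays a string here; its lexical sort order happens to match
# chronological order for this sample (months 1-9 of 1973).
cumsums = df.groupby(['PATTERN_NAME', 'PROD_INJ_DATE']).sum().groupby(level=0).cumsum()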
