Skip to content

Instantly share code, notes, and snippets.

@toast38coza
Last active October 9, 2015 15:40
Show Gist options
  • Save toast38coza/7ffbf26e2c51a57a8d75 to your computer and use it in GitHub Desktop.
Save toast38coza/7ffbf26e2c51a57a8d75 to your computer and use it in GitHub Desktop.
Some useful snippets for analyzing data with pandas
"""
DataFrame:
project_id hours overtime day
0 3 4 False 2015-10-07
1 3 6 False 2015-10-07
2 3 7 False 2015-10-07
3 2 5 False 2015-10-07
4 2 4 False 2015-10-07
5 1 4 False 2015-10-07
6 1 3 False 2015-10-07
7 1 2 False 2015-10-07
8 1 1 False 2015-10-07
"""
# group by projects and get the sum of their respective hours
df.groupby(by=['project_id'])['hours'].sum()
"""
project_id
1 10
2 9
3 17
Name: hours, dtype: object
"""
# You can also aggregate with multiple functions:
df.groupby('project_id')['hours'].agg(['sum', 'count', 'prod'])
"""
sum count prod
project_id
1 10 4 24
2 9 2 20
3 17 3 168
"""
# create a new column by mapping ids from a lookup:
projects = {1: "P1", 2: "P2", 3: "P3"}
df['project_title'] = df['project_id'].map(projects)
"""
project_id hours overtime day project_title
0 3 4 False 2015-10-07 P3
1 3 6 False 2015-10-07 P3
2 3 7 False 2015-10-07 P3
3 2 5 False 2015-10-07 P2
4 2 4 False 2015-10-07 P2
5 1 4 False 2015-10-07 P1
6 1 3 False 2015-10-07 P1
7 1 2 False 2015-10-07 P1
8 1 1 False 2015-10-07 P1
"""
# calculate overtime hours by project:
...
# Show only data from project_id=1
```
df[df['project_id'] == 1]
```
# Show daily hours for this project:
import datetime
d1=datetime.date(2015,9,10)
d2=datetime.date(2015,9,20)
ts = df[['day', 'hours']].groupby('day').sum()['hours']
# we than can spit out a slice like so:
ts[d1:d2]
# Which gives us:
"""
In [207]: ts[d1:d2]
Out[207]:
day
2015-09-10 182.5
2015-09-11 164.0
2015-09-12 3
2015-09-13 5
2015-09-14 177.0
2015-09-15 173.0
2015-09-16 169
2015-09-17 158.5
2015-09-18 154.5
2015-09-20 9.5
Name: hours, dtype: object
"""
# Convert this to json:
ts[d1:d2].to_json(date_format='iso')
"""
Gives us:
'{"2015-09-10T00:00:00.000Z":182.5,"2015-09-11T00:00:00.000Z":164.0,"2015-09-12T00:00:00.000Z":3.0,"2015-09-13T00:00:00.000Z":5.0,"2015-09-14T00:00:00.000Z":177.0,"2015-09-15T00:00:00.000Z":173.0,"2015-09-16T00:00:00.000Z":169.0,"2015-09-17T00:00:00.000Z":158.5,"2015-09-18T00:00:00.000Z":154.5,"2015-09-20T00:00:00.000Z":9.5}'
"""
# we can fill in the blank days by simply reindexing:
ts.reindex(date_range(min(ts.index), max(ts.index)))
"""
2015-02-12 9
2015-02-13 NaN
2015-02-14 NaN
2015-02-15 NaN
2015-02-16 NaN
2015-02-17 9
2015-02-18 16
2015-02-19 NaN
2015-02-20 NaN
2015-02-21 NaN
2015-02-22 NaN
2015-02-23 10
"""
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment