@bchirico
Created November 30, 2013 11:53
Python for data analysis - chapter 2 - example
import json

import matplotlib.pyplot as plt  # used below as plt
import numpy as np  # used below as np
import pandas as pd
from pandas import DataFrame, Series
path = 'usagov_bitly_data2012-03-16-1331923249.txt'
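# Each line of the file is one JSON record; the with-block closes the file afterwards.
with open(path) as f:
    records = [json.loads(line) for line in f]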
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
frame = DataFrame(records)
# Label absent time zones 'Missing' and empty strings 'Unknown'.
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
tz_counts[:10]
tz_counts[:10].plot(kind='barh', rot=0)
plt.show()
results = Series([x.split()[0] for x in frame.a.dropna()])
results.value_counts()[:8]
cframe = frame[frame.a.notnull()]
oper_system = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')
oper_system[:10]
by_tz_os = cframe.groupby(['tz', oper_system])
agg_counts = by_tz_os.size().unstack().fillna(0)
agg_counts[:10]
# Positions that sort rows by total count, ascending; the last 10 are the busiest time zones.
indexer = agg_counts.sum(1).argsort()
indexer[:10]
count_subset = agg_counts.take(indexer)[-10:]
count_subset
normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind='barh', stacked=True)
plt.show()
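
The script builds a `time_zones` list but never uses it. A minimal standard-library sketch of the same time-zone count follows, assuming that absent or null `tz` values should be labelled 'Missing' and empty strings 'Unknown', as the pandas code does. The sample lines are made-up stand-ins for the data file, and `count_time_zones` is a hypothetical helper name, not part of the original gist.

```python
import json
from collections import Counter

# Made-up sample lines standing in for the usa.gov/bitly data file (assumption).
sample_lines = [
    '{"a": "Mozilla/5.0 (Windows NT 6.1)", "tz": "America/New_York"}',
    '{"a": "Mozilla/5.0 (Macintosh)", "tz": "America/Chicago"}',
    '{"a": "Mozilla/5.0 (Windows NT 6.1)", "tz": ""}',
    '{"a": "GoogleMaps/RochesterNY", "tz": "America/New_York"}',
    '{"_heartbeat_": 1331923249}',
]

records = [json.loads(line) for line in sample_lines]


def count_time_zones(records):
    """Count 'tz' values; absent or null -> 'Missing', empty string -> 'Unknown'."""
    counts = Counter()
    for rec in records:
        tz = rec.get('tz')
        if tz is None:
            tz = 'Missing'
        elif tz == '':
            tz = 'Unknown'
        counts[tz] += 1
    return counts


tz_counts = count_time_zones(records)
print(tz_counts.most_common(1))  # [('America/New_York', 2)]
```

`Counter.most_common(n)` plays the role that `value_counts()[:n]` plays in the pandas version.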