Pandas for machine learning in Python

Introduction

These are notes from this great video series by Kevin Markham: http://www.dataschool.io/easier-data-analysis-with-pandas/

Reading tabular data

import pandas as pd

pd.read_table('data/chipotle.tsv')
pd.read_table('http://bit.ly/chiporders')

read_table assumes the file is tab-separated and that the first row is a header row. In this case, both assumptions hold.

pd.read_table('http://bit.ly/movieusers')

Here, the original data is pipe-separated (|):

pd.read_table('http://bit.ly/movieusers', sep='|')

Also, the first row is not a header row:

pd.read_table('http://bit.ly/movieusers', sep='|', header=None)

Define the header yourself:

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols)

Select a pandas series from a dataframe

ufo = pd.read_table('http://bit.ly/uforeports', sep=',')

OR

ufo = pd.read_csv('http://bit.ly/uforeports')

type(ufo)  # pandas.core.frame.DataFrame

# first 5 rows
ufo.head()

ufo['City'] will select the 'City' Series (column).

type(ufo['City'])  # pandas.core.series.Series

# The following two are equivalent, but dot notation only works in certain conditions
ufo.City
ufo['City']

For column names with spaces, you have to use bracket notation:

ufo['Colors Reported']

Concatenating string series

ufo['Location'] = ufo.City + ufo.State

This puts the city next to the state in a new column.

You HAVE to use bracket notation to create a new column
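A minimal sketch of these notation rules, reusing the ufo data from above:

import pandas as pd

ufo = pd.read_csv('http://bit.ly/uforeports')

ufo['City']              # bracket notation always works
ufo.City                 # dot notation works: no spaces, no clash with DataFrame attributes

# ufo.Colors Reported    # invalid syntax: spaces break dot notation
ufo['Colors Reported']   # brackets handle spaces

ufo.shape                # this is the (rows, columns) attribute, NOT a column named 'shape'

ufo['Location'] = ufo.City + ufo.State   # creating a new column requires brackets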

ufo.describe(include=['object'])

  • ipynb tip: anytime you have a method or function and want to remember the arguments, click anywhere inside the parentheses and hit shift + tab (three times for a split-screen help pane)

Renaming columns

ufo.rename(columns={'Colors Reported': 'Colors_Reported', 'Shape Reported': 'Shape_Reported'}, inplace=True)

OR

ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']

ufo.columns = ufo_cols

(Overwrites the column names)

OR

ufo = pd.read_csv('http://bit.ly/uforeports', names=ufo_cols, header=0)

  • Tip: string manipulation on all columns

ufo.columns = ufo.columns.str.replace(' ', '_')

Dropping columns

ufo.drop('Colors Reported', axis=1, inplace=True)  # axis is a direction (axis 0 is the row axis, axis 1 is the column axis)

OR

ufo.drop(['Colors Reported', 'City'], axis=1, inplace=True)

  • tip: dropping a row

ufo.drop([0, 1], axis=0, inplace=True)  # removes the rows at index labels 0 and 1

Sorting a pandas DataFrame or Series

movies['title'].sort_values(ascending=False)

# keep the whole DataFrame, sorted by 'title'
movies.sort_values('title')

# sorting by two columns: first by content_rating, then by duration
movies.sort_values(['content_rating', 'duration'])

Filter rows of a DataFrame by column values

# Returns a Series of booleans
is_long = movies.duration >= 200
movies[is_long]  # filters the rows

Shorter:

movies[movies.duration >= 200]

Breaking it down: "movies.duration >= 200" returns a boolean Series whose length equals the number of rows. movies[booleans] then keeps only the rows where the value is True.

Multiple filter criteria

movies[(movies.duration >=200) | (movies.genre == 'Drama')]

movies[movies.genre.isin(['Crime', 'Drama', 'Action'])]

Using the axis parameter

# drop a column
drinks.drop('continent', axis=1).head()
# drop a row
drinks.drop(2, axis=0).head()

drinks.mean(axis=0)  # mean of each column
drinks.mean(axis=1)  # mean of each row

Using string methods

orders.item_name.str.upper()

# Returns a Series of booleans
orders.item_name.str.contains('Chicken')

orders.choice_description.str.replace('[', '')

# Replacing left and right brackets in one call, using a regex character class
orders.choice_description.str.replace(r'[\[\]]', '', regex=True)

Changing the data type of a pandas series

# Show the data types of all columns in the drinks DataFrame
drinks.dtypes

drinks['beer_servings'] = drinks.beer_servings.astype(float)

# Or set the dtype while reading from the file

drinks = pd.read_csv('path/to/csv', dtype={'beer_servings':float})

# Strip the dollar sign, convert to float, then take the mean
orders.item_price.str.replace('$', '').astype(float).mean()

  • tip: change a Series of booleans into 0s and 1s

orders.item_name.str.contains('Chicken').astype(int).head()

Using groupby

drinks.groupby('continent').beer_servings.mean()

# what groupby does for each continent, shown for Africa alone
drinks[drinks.continent == 'Africa'].beer_servings.mean()

drinks.groupby('continent').beer_servings.max() (or mean(), or min())

drinks.groupby('continent').beer_servings.agg(['count', 'min', 'max', 'mean'])
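agg also accepts a dict mapping columns to aggregation functions; a small sketch, assuming drinks was loaded from the dataset used in the series (http://bit.ly/drinksbycountry):

import pandas as pd

drinks = pd.read_csv('http://bit.ly/drinksbycountry')

# mean beer servings and max wine servings, per continent
drinks.groupby('continent').agg({'beer_servings': 'mean', 'wine_servings': 'max'})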

# If using ipynb/jupyter
%matplotlib inline

# Generates a bar visualization
drinks.groupby('continent').mean().plot(kind='bar')

Exploring a pandas Series

movies.genre.value_counts().head()

Categorical

# All unique values
movies.genre.unique()

# Number of unique values
movies.genre.nunique()

# Cross-tabulation: a tally of how many movies belong to each combination of genre and content rating
pd.crosstab(movies.genre, movies.content_rating)

Numeric

movies.duration.describe()

# Tally the number of movies for each duration value. Not that useful here with a numeric column
movies.duration.value_counts()

Visualization

# if in jupyter/ipynb
%matplotlib inline

movies.duration.plot(kind='hist')

movies.genre.value_counts().plot(kind='bar')

Handling missing values

# isnull() shows True for missing values and False for present ones
ufo.isnull().tail()

# the exact opposite
ufo.notnull().tail()

# number of missing values in each column
ufo.isnull().sum()

How this works:

# pandas can sum booleans (True is converted to 1, False to 0)
pd.Series([True, False, True]).sum()

# sum() selects axis=0 by default, summing down each column
ufo.isnull().sum(axis=0)

which is equivalent to ufo.isnull().sum()


# show the rows where City is missing
ufo[ufo.City.isnull()]

Options for handling missing data

# Drop any row in which at least one column has a missing value
ufo.dropna(how='any').shape

# Drop any row in which ALL the columns have missing values
ufo.dropna(how='all').shape

# Drop a row if either City or Shape Reported is missing
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape

# Drop a row only if both columns are missing
ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape


# With value counts, include missing values in the tally
ufo['Shape Reported'].value_counts(dropna=False)

# Fill missing values with a label
ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)

Things to know about indices in pandas

The index is sometimes called the row label.

The index is NOT counted in the DataFrame's dimensions (shape).

Why the index exists:

  • identification
  • selection
  • alignment

Identification

drinks[drinks.continent=='South America']

The index was not modified by this filtering, so it can keep track of rows despite modifications to the DataFrame.

Selection

drinks.loc[23, 'beer_servings']  # row with label 23, beer_servings value

# Make the "country" column the index. Cool!
drinks.set_index('country', inplace=True)

# print the index
drinks.index

# clear the index name (sometimes useful)
drinks.index.name = None

# reset the index
drinks.index.name = 'country'
drinks.reset_index(inplace=True)

  • tip: using describe

drinks.describe().loc['25%', 'beer_servings']

Series indices

# a Series taken from a DataFrame keeps the DataFrame's index
drinks.continent.head()  # prints the index alongside the values

drinks.set_index('country', inplace=True)

# Then you can do things like
drinks.continent.value_counts()['Africa']

# Sorting by index
drinks.continent.value_counts().sort_index()

people = pd.Series([50000, 850000], index=['Albania', 'Andorra'], name='population')

# Goal: get total beer servings by country (the initial DataFrame contains per-capita data)

drinks.beer_servings * people  # multiplication aligns the two Series on their shared index

  • tip: concatenating

# axis=1 means concatenate COLUMNS
pd.concat([drinks, people], axis=1)

Selecting multiple rows and columns

loc

import pandas as pd

# row 0, all columns
ufo.loc[0, :]

# first three rows, all columns; the following two are equivalent
ufo.loc[[0, 1, 2], :]
ufo.loc[0:2, :]

Note: loc is inclusive on both sides

# column selection
ufo.loc[:, ['City', 'State']]

# City through State
ufo.loc[:, 'City':'State']

# Boolean filtering, the old way:
ufo[ufo.City == 'Oakland']

With loc, also selecting a column:

ufo.loc[ufo.City == 'Oakland', 'State']

iloc

"i" stands for integer position

# All rows, columns at positions 0 and 3
ufo.iloc[:, [0, 3]]

# All rows, first four columns (positions 0 through 3)
ufo.iloc[:, 0:4]

Important note: iloc is INCLUSIVE of the first number and EXCLUSIVE of the second (so 0:4 returns the columns at positions 0, 1, 2 and 3), UNLIKE loc.
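A quick sketch of the contrast, reusing the ufo data:

import pandas as pd

ufo = pd.read_csv('http://bit.ly/uforeports')

ufo.loc[0:2, :]   # label-based: rows labeled 0, 1 AND 2 (inclusive on both ends)
ufo.iloc[0:2, :]  # position-based: rows at positions 0 and 1 only (end excluded)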

ix

Kind of a hybrid between loc and iloc: ix figures out whether you're referring to a position or a label.

drinks.ix['Albania', 0]

drinks.ix[1, 'beer_servings']

drinks.ix['Albania':'Andorra', 0:2]

Logic behind ix: when given a position, it follows iloc's inclusive:exclusive pattern.

ufo.ix[0:2, 0:2]

ix is confusing because you need to keep track of its internal logic, so it is not recommended; use loc and iloc instead. (ix was deprecated in pandas 0.20 and removed in 1.0.)

Making DataFrames smaller and faster

# display info about the DataFrame
drinks.info()

# Exhaustive memory usage information
drinks.info(memory_usage='deep')

# break memory usage down by column
drinks.memory_usage(deep=True)

drinks.memory_usage(deep=True).sum()

# Casting an object column as a category for memory efficiency

Object columns can take up a lot of space. How can I be more space-efficient?

drinks['continent'] = drinks.continent.astype('category')

# use the .cat accessor for category methods

drinks.continent.cat.codes shows how pandas represents the continent Series as integers.

# ordered categories

df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)

# Sorting in the logical order defined for the category
df.sort_values('quality')

# Using boolean conditions with the ordering
df.loc[df.quality > 'good', :]
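Note: the categories and ordered arguments to astype were removed in later pandas versions (0.25+); the modern equivalent uses CategoricalDtype. A sketch assuming the same df:

from pandas.api.types import CategoricalDtype

# define an ordered categorical dtype, then cast the column to it
quality_dtype = CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)
df['quality'] = df.quality.astype(quality_dtype)

df.loc[df.quality > 'good', :]  # ordered comparisons still work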

Using pandas with scikit-learn to create Kaggle submissions

# Titanic survival data set (!)

train = pd.read_csv('http://bit.ly/kaggletrain')

train.head()

feature_cols = ['Pclass', 'Parch']

X = train.loc[:, feature_cols]

y = train.Survived

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X, y)

test = pd.read_csv('http://bit.ly/kaggletest')

X_new = test.loc[:, feature_cols]
new_pred_class = logreg.predict(X_new)

# Kaggle asks for a .csv file with two columns: the passenger id from the testing set, and the predicted classes

pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': new_pred_class}).set_index('PassengerId').to_csv('/name/of/file.csv')

  • tip: save dataframe to disk

train.to_pickle('train.pkl')

# recover the DataFrame from the pickle file
pd.read_pickle('train.pkl')

Creating dummy variables

# Series map method
train['Sex_male'] = train.Sex.map({'female': 0, 'male': 1})

# This creates one column for every possible value of the column given as a parameter
pd.get_dummies(train.Sex)

Generally, if a column has K possible values, use K-1 dummy variables.

pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:]

Will produce a single column called "Sex_male".

embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked')

Note: 'Embarked' can take on 3 possible values, so this produces three dummy columns; append .iloc[:, 1:] as above to keep only K-1 = 2 of them.

# attach to the DataFrame, using axis=1 because we are concatenating columns
pd.concat([train, embarked_dummies], axis=1)

  • passing a DataFrame to pd.get_dummies

pd.get_dummies(train, columns=['Sex', 'Embarked'])

# for K-1 dummies
pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True)

Working with dates and times

ufo['Time'] = pd.to_datetime(ufo.Time)

Note: there is no need to specify the date format; pandas figures it out. This doesn't always work, but many fallback options exist.

ufo.head()
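When inference fails or is too slow, to_datetime accepts an explicit format and an errors argument; a sketch (the format string assumes times like '12/31/2000 23:59'):

# parse with an explicit format instead of letting pandas infer it
ufo['Time'] = pd.to_datetime(ufo.Time, format='%m/%d/%Y %H:%M')

# turn unparseable values into NaT instead of raising an error
pd.to_datetime(ufo.Time, errors='coerce')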

using the datetime format

ufo.Time.dt.hour
ufo.Time.dt.weekday_name  # day_name() in newer pandas versions

ts = pd.to_datetime('1/1/1999')

using timestamps for comparisons

ufo.loc[ufo.Time >= ts, :]

# Produces a Timedelta object
ufo.Time.max() - ufo.Time.min()
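The resulting Timedelta exposes handy attributes, e.g. the span in whole days:

(ufo.Time.max() - ufo.Time.min()).days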

# if in jupyter/ipynb

%matplotlib inline

ufo['Year'] = ufo.Time.dt.year
ufo.Year.value_counts().sort_index().plot()

Finding and removing duplicate rows in pandas

# True for each zip code that has already appeared earlier in the Series
users.zip_code.duplicated()

# count of duplicate zip codes
users.zip_code.duplicated().sum()

# flagging duplicated rows in the DataFrame (all columns must match)
users.duplicated()

users.loc[users.duplicated(), :]

# mark duplicates as True except the first instance
users.loc[users.duplicated(keep='first'), :]

# mark duplicates as True except the last instance
users.loc[users.duplicated(keep='last'), :]

# mark all duplicates as True
users.loc[users.duplicated(keep=False), :]

# Drop duplicates: let's say age + zip_code constitutes a unique id

# Now looking at duplicate age+zip_code combinations only
users.duplicated(subset=['age', 'zip_code']).sum()
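To actually drop the duplicates, drop_duplicates mirrors the duplicated API; a minimal sketch:

# keep the first occurrence of each age+zip_code pair, drop the rest
users.drop_duplicates(subset=['age', 'zip_code'], keep='first').shape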

Avoiding the SettingWithCopyWarning

Example 1

# tally of missing content rating values
movies.content_rating.isnull().sum()

movies[movies.content_rating.isnull()]

# converting a custom "missing value" label, e.g. NOT RATED => NaN

import numpy as np

movies[movies.content_rating == 'NOT RATED'].content_rating = np.nan

Yields a warning: SettingWithCopyWarning

Replace command with:

movies.loc[movies.content_rating == 'NOT RATED', 'content_rating'] = np.nan

Does not throw a warning

Why did it throw a warning for

movies[movies.content_rating=='NOT RATED'].content_rating = np.nan

get item: movies[movies.content_rating=='NOT RATED']
set item: .content_rating

Problem: pandas doesn't know whether the get item returned a view or a copy, so the warning effectively says "I'm not sure what happened".

Example 2

top_movies.loc[0, 'duration'] = 150

Also throws a SettingWithCopyWarning

Pandas is not sure whether top_movies is a reference (view) or a copy.

The problem came from the line of code that created top_movies; make an explicit copy to remove the ambiguity:

top_movies = movies.loc[movies.star_rating >= 9, :].copy()

Changing display options in pandas

What if I want to show all the rows?

pd.set_option('display.max_rows', None)

pd.reset_option('display.max_rows')
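A few other commonly tweaked options (the option names are real pandas options; the values are just examples):

pd.set_option('display.max_columns', None)  # show all columns
pd.set_option('display.max_colwidth', 100)  # widen truncated cells
pd.set_option('display.precision', 2)       # decimal places shown for floats

pd.get_option('display.max_rows')           # inspect an option's current value
pd.describe_option('display.max_rows')      # print an option's documentation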

Creating a DataFrame from another object

# the index param specifies row labels if needed
df = pd.DataFrame({'id': [100, 101, 102], 'color': ['red', 'blue', 'yellow']}, columns=['id', 'color'], index=['a', 'b', 'c'])

# another format: a list of lists

pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'yellow']], columns=['id', 'color'])

import numpy as np

# create a 4x2 numpy array of random numbers between 0 and 1
arr = np.random.rand(4, 2)
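To wrap the array in a DataFrame (the column names 'one' and 'two' are just illustrative):

pd.DataFrame(arr, columns=['one', 'two'])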

# student ids run from 100 through 109 (np.arange's stop value is exclusive)
# Also, use the student column as the index
pd.DataFrame({'student': np.arange(100, 110, 1), 'test': np.random.randint(60, 101, 10)}).set_index('student')

Creating a series

s = pd.Series(['round', 'square'], index=['c', 'b'], name='shape')

pd.concat([df, s], axis=1)

Note: the name of the Series will be the column name for the concatenated Series.

Applying a function to a pandas series or dataframe

train['Sex_num'] = train.Sex.map({'female': 0, 'male': 1})

apply as a series method

train['Name_length'] = train.Name.apply(len)
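apply also accepts a lambda; for instance, extracting the part of the name before the comma (a sketch assuming Name values look like 'Braund, Mr. Owen Harris'):

# split each name on the comma and keep the piece before it
train.Name.str.split(',').apply(lambda x: x[0])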

apply as a dataframe method

Applies a function along either axis of the dataframe.

Applying the function with axis=0 works down each column (the "row" direction) and returns the max value for each column:

drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=0)

applymap

Apply a function to EVERY element (cell) of a dataframe
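For example, a sketch casting every cell of the same numeric sub-frame to float (note: recent pandas versions prefer the equivalent DataFrame.map):

drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float)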
