@codecademydev
Created December 28, 2020 21:21
Codecademy export
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import codecademylib3_seaborn
#import glob
import glob
import seaborn as sns
#combine files to list
files = glob.glob('states*.csv')
#concatenate dataframes to us_census
list_files = []
for filename in files:
    x = pd.read_csv(filename)
    list_files.append(x)
us_census = pd.concat(list_files)
#inspect data for columns, dtypes, values
#print(us_census.columns)
#print(us_census.dtypes)
#print(us_census.head())
#Clean data in us_census.Income
us_census.Income = us_census.Income.str.split('$',expand=True)[1]
us_census.Income = pd.to_numeric(us_census.Income).round(2)
#divide GenderPop into two new columns, men and women
us_census['Women'] = us_census.GenderPop.str.split('_',expand=True)[1]
us_census['Men'] = us_census.GenderPop.str.split('_',expand=True)[0]
#Drop alpha
us_census.Women = us_census.Women.str.split('F',expand=True)[0]
us_census.Men = us_census.Men.str.split('M',expand=True)[0]
#convert to numeric
us_census.Women = pd.to_numeric(us_census.Women)
us_census.Men = pd.to_numeric(us_census.Men)
#fill missing Women values with TotalPop minus Men
us_census.Women = us_census.Women.fillna(us_census.TotalPop - us_census.Men)
#print(us_census.Women.head())
#print(us_census.Men.head())
#print(us_census[['TotalPop','Men','Women']].sum())
#Create scatterplot
plt.subplot(1,1,1)
plt.scatter(us_census.Women,us_census.Income)
plt.xlabel('Women')
plt.ylabel('Income')
plt.title('Correlation between Women and Income')
plt.show()
#Check for duplicates, eliminate if necessary
us_census = us_census.drop_duplicates(subset='State')
us_census = us_census.drop(['GenderPop','Unnamed: 0'], axis=1)
#Clean data for races
for column in us_census.columns[2:8]:
    us_census[column] = us_census[column].str.strip('%')
    us_census[column] = pd.to_numeric(us_census[column]).round(2)
#Create Histograms for races
print(us_census.columns)
plt.subplot(2,1,1)
us_census[['Hispanic','White','Black','Native','Asian','Pacific']].plot.hist(alpha=.4,bins=30)
plt.xlabel('Percent of Population')
plt.title('Distribution of Population by Race')
plt.show()
@rproner1

On line 25, using Series.str.replace() is probably more efficient:

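us_census.Income = us_census.Income.str.replace('$', '', regex=False)

Passing regex=False makes pandas treat '$' as a literal character. As a regular expression, '$' is the end-of-string anchor, so with regex=True recent pandas versions match only the end of each string and leave the dollar sign in place (older versions special-cased single-character patterns as literals). You could also use str.strip('$').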
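
For what it's worth, here is a minimal sketch of the literal-replacement idea on a toy Series. The sample values are made up for illustration and are not taken from the states*.csv files. It assumes a pandas version where `regex=False` is accepted by `Series.str.replace()`, and it compares the result with `str.lstrip('$')`.

```python
import pandas as pd

# Hypothetical sample values standing in for the Income column.
income = pd.Series(["$43296.36", "$70354.74", None])

# regex=False treats '$' as a literal character rather than the
# end-of-string anchor it would be in a regular expression.
cleaned = income.str.replace("$", "", regex=False)

# str.lstrip('$') removes only leading dollar signs; same result for this data.
stripped = income.str.lstrip("$")

# Missing values pass through both steps and become NaN after conversion.
income_numeric = pd.to_numeric(cleaned)
print(income_numeric)
```

Either form should be safe across pandas versions, since neither relies on how a single-character pattern is interpreted as a regular expression.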

Other than that, looks great! I like how you looped through the columns in order to remove '%', I didn't even think of that.
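
The same loop can be sketched on a small made-up frame, in case it helps anyone reading along. The rows and the `percent_columns` list below are hypothetical; naming the columns explicitly is just one option if you would rather not depend on their position, as `us_census.columns[2:8]` does.

```python
import pandas as pd

# Hypothetical rows; the real columns come from the states*.csv files.
df = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Hispanic": ["3.75%", "5.91%"],
    "White": ["61.88%", "60.91%"],
})

# Columns chosen by name rather than by position (an illustrative choice).
percent_columns = ["Hispanic", "White"]
for column in percent_columns:
    # rstrip('%') drops the trailing percent sign before numeric conversion.
    df[column] = pd.to_numeric(df[column].str.rstrip("%")).round(2)

print(df.dtypes)
```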

Could you explain to me what alpha=.4 does in .hist() on line 63? I don't see it as a keyword argument in the documentation.