Created
October 18, 2016 17:02
I Told You So! ...<R, ggplot2, knitr>
---
title: "I Told You So!"
author: "Conred Wang"
date: "10/10/2016"
output:
  ioslides_presentation:
    logo: assets/img/i.told.you.so.jpg
    css: assets/css/ioslides.css
abstract: |
  * R visualization using ggplot2.
  * Slide presentation is rendered using knitr.
  * Most Asian parents tell their children: [ Higher Education + Work Hard + Happy Marriage = Good Life ]. Using 1994 census data, we try to find out whether this is a myth or a fact.
---
```{r setup, include=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(knitr)
```
```{r, echo=FALSE}
logo <- "assets/img/i.told.you.so.jpg"
```
## Most Asian parents tell their children:
#### Higher Education + Work Hard + Happy Marriage = Good Life
### ?
### ? Myth or Fact ?
### ?
### Using **R** and **ggplot2**, and visually exploring the *Adult* dataset from the *UC Irvine Machine Learning Repository*, I may become the Myth Buster of:
### Higher Education + Work Hard + Happy Marriage = Good Life
----
```{r, engine='bash', echo=FALSE, eval=TRUE, results='hide'}
# Get adult.data from repository.
curl -O "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# Move adult.data to folder data (create the folder first if it is missing).
mkdir -p data
mv adult.data data/
# adult.data uses ", " (comma followed by one space) as field separator.
# Transform field separator to "," (a single comma).
sed -i -e 's/, /,/g' data/adult.data
# adult.data uses "?" for missing values.
# Replace missing values with "NA".
sed -i -e 's/?/NA/g' data/adult.data
```
## ETL
### Extract - Transform - Load
---
- The *Adult* dataset was originally used to predict whether income exceeds $50K/year, based on 1994 census data. Information can be found at [http://archive.ics.uci.edu/ml/datasets/Adult](http://archive.ics.uci.edu/ml/datasets/Adult).
- The *Adult* dataset can be downloaded at [http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data](http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data).
- The *Adult* dataset uses ", " (comma followed by one space) as field separator. I transform it to "," (a single comma).
- The *Adult* dataset uses "?" for missing values. I replace them with "NA".
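The two transformations above can also be applied while reading, without editing the file. This is a minimal sketch, assuming base R's `read.csv` with `strip.white = TRUE` (to drop the space after each comma) and `na.strings = "?"` (to mark missing values). The three-row `sample_text` is made up for illustration and is not taken from the dataset.

```r
# Hypothetical three-row sample in the same ", "-separated layout as adult.data.
sample_text <- c(
  "39, State-gov, Bachelors, <=50K",
  "54, ?, Some-college, >50K",
  "28, Private, ?, <=50K"
)

# strip.white = TRUE trims the blank after each comma;
# na.strings = "?" turns the "?" placeholder into NA while reading.
sample_adult <- read.csv(
  text = sample_text, header = FALSE,
  strip.white = TRUE, na.strings = "?",
  stringsAsFactors = TRUE
)

# Rows with any NA can then be dropped, as with the sed-cleaned file.
sample_clean <- na.omit(sample_adult)
dim(sample_adult)
dim(sample_clean)
```

With this approach the downloaded file stays untouched, so the cleaning step can be re-run without downloading again.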
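---
```{r}
# stringsAsFactors=TRUE keeps text columns as factors (no longer the default
# since R 4.0); the levels() renaming on later slides relies on factors.
adult = read.table('data/adult.data', header=FALSE, sep=',',
                   na.strings="NA", stringsAsFactors=TRUE)
```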
- I load the *Adult* dataset into the data frame *adult*, which has `r ncol(adult)` attributes and `r nrow(adult)` observations.
```{r, echo=TRUE}
dim(adult)
```
```{r}
adult = na.omit(adult)
```
- I remove the rows containing NA, after which it has `r ncol(adult)` attributes and `r nrow(adult)` observations.
```{r, echo=TRUE}
dim(adult)
```
---
- I give each attribute a name:
```{r, echo=TRUE}
colnames(adult) = c(
  'age', 'workclass', 'fnlwgt', 'education', 'education_num',
  'marital_status', 'occupation', 'relationship', 'race', 'sex',
  'capital_gain', 'capital_loss', 'hours_per_week',
  'native_country', 'class')
colnames(adult)
```
---
- I rename *Marital Status* for better display and counting:
```{r, echo=TRUE}
levels(adult$marital_status)[levels(adult$marital_status)
  =="Divorced"] = "Divor"
levels(adult$marital_status)[levels(adult$marital_status)
  =="Married-AF-spouse"] = "Marri"
levels(adult$marital_status)[levels(adult$marital_status)
  =="Married-civ-spouse"] = "Marri"
levels(adult$marital_status)[levels(adult$marital_status)
  =="Married-spouse-absent"] = "Marri"
levels(adult$marital_status)[levels(adult$marital_status)
  =="Never-married"] = "Never"
levels(adult$marital_status)[levels(adult$marital_status)
  =="Separated"] = "Separ"
levels(adult$marital_status)[levels(adult$marital_status)
  =="Widowed"] = "Widow"
levels(adult$marital_status)
```
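The same kind of renaming and merging can be written in one assignment. This is a sketch on a toy factor, assuming base R's `levels<-` with a named list, where each name is a new level and its value lists the old levels merged into it. The `status` vector is illustrative only.

```r
# Toy factor using the original category names from the dataset.
status <- factor(c("Divorced", "Married-AF-spouse", "Married-civ-spouse",
                   "Married-spouse-absent", "Never-married",
                   "Separated", "Widowed", "Married-civ-spouse"))

# Each list name is a new level; its value lists the old levels merged into it.
levels(status) <- list(
  Divor = "Divorced",
  Marri = c("Married-AF-spouse", "Married-civ-spouse", "Married-spouse-absent"),
  Never = "Never-married",
  Separ = "Separated",
  Widow = "Widowed"
)

levels(status)
table(status)
```

Any old level left out of the list would become NA, so the list has to cover every category that should be kept.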
---
- I rename *Work Class* for better display and counting:
```{r, echo=TRUE}
levels(adult$workclass)[levels(adult$workclass)=="Federal-gov"] = "GOV"
levels(adult$workclass)[levels(adult$workclass)=="Local-gov"] = "GOV"
levels(adult$workclass)[levels(adult$workclass)=="Never-worked"] =
  "NEVER"
levels(adult$workclass)[levels(adult$workclass)=="Private"] = "PRIV"
levels(adult$workclass)[levels(adult$workclass)=="Self-emp-inc"] =
  "SELF"
levels(adult$workclass)[levels(adult$workclass)=="Self-emp-not-inc"] =
  "SELF"
levels(adult$workclass)[levels(adult$workclass)=="State-gov"] = "GOV"
levels(adult$workclass)[levels(adult$workclass)=="Without-pay"] =
  "NOPAY"
levels(adult$workclass)
```
---
- I add a numeric "Ed Level" based on the factor "Education":
```{r, echo=FALSE, message=FALSE, warning=FALSE}
adult = mutate(adult, ed_level = 0)
obs = nrow(adult)
for (i in 1:obs) {
  # as.character() makes switch() match on the label; a bare factor
  # would be treated as its integer code and pick the wrong branch.
  switch(as.character(adult$education[i]),
    "Preschool" = { adult$ed_level[i] = 1 },
    "1st-4th" = { adult$ed_level[i] = 2 },
    "5th-6th" = { adult$ed_level[i] = 3 },
    "7th-8th" = { adult$ed_level[i] = 3 },
    "9th" = { adult$ed_level[i] = 4 },
    "10th" = { adult$ed_level[i] = 4 },
    "11th" = { adult$ed_level[i] = 4 },
    "12th" = { adult$ed_level[i] = 4 },
    "HS-grad" = { adult$ed_level[i] = 4 },
    "Some-college" = { adult$ed_level[i] = 4 },
    "Assoc-acdm" = { adult$ed_level[i] = 5 },
    "Assoc-voc" = { adult$ed_level[i] = 5 },
    "Prof-school" = { adult$ed_level[i] = 5 },
    "Bachelors" = { adult$ed_level[i] = 6 },
    "Masters" = { adult$ed_level[i] = 7 },
    "Doctorate" = { adult$ed_level[i] = 8 })}
```
```{r, echo=TRUE, eval=FALSE}
adult = mutate(adult, ed_level = 0)
obs = nrow(adult)
for (i in 1:obs) {
  switch(as.character(adult$education[i]),
    "Preschool" = { adult$ed_level[i] = 1 },
    "1st-4th" = { adult$ed_level[i] = 2 },
    "5th-6th" = { adult$ed_level[i] = 3 },
    "7th-8th" = { adult$ed_level[i] = 3 },
    "9th" = { adult$ed_level[i] = 4 },
    "10th" = { adult$ed_level[i] = 4 },
    "11th" = { adult$ed_level[i] = 4 },
    "12th" = { adult$ed_level[i] = 4 },
    "HS-grad" = { adult$ed_level[i] = 4 },
```
---
```{r, echo=TRUE, eval=FALSE}
    "Some-college" = { adult$ed_level[i] = 4 },
    "Assoc-acdm" = { adult$ed_level[i] = 5 },
    "Assoc-voc" = { adult$ed_level[i] = 5 },
    "Prof-school" = { adult$ed_level[i] = 5 },
    "Bachelors" = { adult$ed_level[i] = 6 },
    "Masters" = { adult$ed_level[i] = 7 },
    "Doctorate" = { adult$ed_level[i] = 8 })}
```
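The row-by-row loop can also be expressed as a single vectorized lookup. This is a sketch rather than the code used for the slides: `ed_lookup` and the toy `education` factor are names introduced here for illustration, and the level values are copied from the mapping above.

```r
# Named lookup vector: education label -> level, same values as the switch().
ed_lookup <- c(
  "Preschool" = 1, "1st-4th" = 2, "5th-6th" = 3, "7th-8th" = 3,
  "9th" = 4, "10th" = 4, "11th" = 4, "12th" = 4,
  "HS-grad" = 4, "Some-college" = 4,
  "Assoc-acdm" = 5, "Assoc-voc" = 5, "Prof-school" = 5,
  "Bachelors" = 6, "Masters" = 7, "Doctorate" = 8
)

# Toy factor standing in for adult$education (hypothetical values).
education <- factor(c("Bachelors", "HS-grad", "Doctorate", "7th-8th"))

# Index by the label, not by the factor's integer code, hence as.character().
ed_level <- unname(ed_lookup[as.character(education)])
ed_level
```

Indexing a named vector by label handles every row at once, which avoids the loop on a data frame with tens of thousands of rows.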
---
- I use the numeric *Education Years* from the original dataset as the education indicator, since it agrees with the factor *Education*.
```{r}
ggplot(adult, aes(x=ed_level, y=education_num)) +
  geom_smooth(method = "lm") +
  labs(x="Education Level", y="Education Years",
       title="More Education Years Indicate Higher Education") +
  theme_economist() + scale_fill_economist()
```
---
- I single out Asian adults from *adult* into the data frame *asian_adult*:
```{r, echo=TRUE}
levels(adult$race)
summarise(group_by(adult, race), n())
asian_adult = adult[adult$race == "Asian-Pac-Islander",]
```
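Subsetting a data frame keeps the factor levels of the rows that were removed. A small sketch of counting, filtering, and then dropping the unused levels with base R's `droplevels`; the `people` data frame is a made-up stand-in for *adult*.

```r
# Toy data frame standing in for `adult` (hypothetical rows).
people <- data.frame(
  race = factor(c("White", "Asian-Pac-Islander", "Black", "Asian-Pac-Islander")),
  age  = c(39, 28, 45, 52)
)

# Count rows per race, the base R counterpart of summarise(group_by(...), n()).
table(people$race)

# Keep one race, then drop factor levels that no longer occur.
asian_people <- droplevels(people[people$race == "Asian-Pac-Islander", ])
nrow(asian_people)
levels(asian_people$race)
```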
---
- I further single out Asian adults working in the private sector, since this sector has the most Asian adults.
```{r}
ggplot(asian_adult, aes(x=workclass)) +
  geom_bar(aes(fill=class)) + facet_wrap(~sex) +
  xlab("work class [ Government | Private | Self-employed | No-pay ]") +
  theme_economist() + scale_fill_economist() +
  ggtitle("Asian Adult - Which work class pays?")
```
---
## Higher Education + Work Hard + Happy Marriage = Good Life ???
```{r}
ggplot(asian_adult[asian_adult$workclass=="PRIV",], aes(x=education_num,y=hours_per_week)) +
  geom_point() + facet_grid(marital_status ~ class) +
  labs(x = "# education years", y = "work hours / week",
       title="Asian Adult - higher education + hard work + happy marriage = $$$ ?") +
  theme_economist() + scale_fill_economist()
```
---
## Higher Education - Dad & Mom: You are right.
```{r}
ggplot(asian_adult[asian_adult$workclass=="PRIV",], aes(x=class, y=education_num)) +
  geom_boxplot() + facet_wrap(~sex) +
  labs(x = "", y = "# education years",
       title = "Asian adult in private sector - Yes, higher education helps!") +
  theme_economist() + scale_fill_economist()
```
---
## Work Hard - Dad & Mom: You are right about boys.
```{r}
ggplot(asian_adult[asian_adult$workclass=="PRIV",], aes(x=class, y=hours_per_week)) +
  geom_boxplot() + facet_wrap(~sex) +
  labs(x = "", y = "work hours / week",
       title = "Asian adult in private sector - Well, men, you better work longer :( ") +
  theme_economist() + scale_fill_economist()
```
---
## Happy Marriage - Dad & Mom: You are right again!
```{r}
ggplot(asian_adult[asian_adult$workclass=="PRIV",], aes(x=marital_status)) +
  geom_bar(aes(fill=marital_status)) + facet_grid(class ~ sex) +
  xlab("marital status") +
  theme_economist() + scale_fill_economist() +
  ggtitle("Asian adult in private sector - Hey, married people do earn more.")
```
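Bar heights show counts, so a large group can look like it earns more simply because it has more people. One way to put the groups on the same scale is to compare the share of each marital status in each income class. The sketch below uses base R's `table` and `prop.table` on made-up rows; `toy` is a hypothetical stand-in for the private-sector subset, not real data.

```r
# Toy data standing in for the private-sector subset (hypothetical rows).
toy <- data.frame(
  marital_status = c("Marri", "Marri", "Marri", "Never", "Never", "Divor"),
  class          = c(">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K")
)

# Cross-tabulate counts, then convert each row to proportions (margin = 1).
counts <- table(toy$marital_status, toy$class)
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```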
## A few thoughts
Two out of three ain't bad, right? But I won't conclude that the formula is a fact, because the Asian adult sample in the 1994 *Adult* dataset (895) seems too small. A conclusion should be drawn only after exploring more samples from multiple years.

As an Asian myself, I listen to and obey my parents, as they always want the best for me. Most things change over time; thus, it is reasonable to say that very few things hold true forever. But parents' care for their children never changes.

You may ask, "Are you sure?" **"Yes, because I told you so!!!"**

Sincerely, Conred Wang.