Created October 18, 2016 17:02
I Told You So! ...<R, ggplot2, knitr>
title: "I Told You So!"
author: "Conred Wang"
date: "10/10/2016"
logo: assets/img/
css: assets/css/ioslides.css
* R visualization using ggplot2.
* Slide presentation is rendered using knitr.
* Most Asian parents tell their children: [Higher Education + Work Hard + Happy Marriage = Good Life]. Using 1994 census data, we try to find out whether this is a myth or a fact.
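```{r setup, include=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = FALSE)
# Packages used below: dplyr (mutate, summarise), ggplot2, ggthemes (theme_economist).
library(dplyr); library(ggplot2); library(ggthemes)
```
```{r, echo=FALSE}
logo <- "assets/img/"
```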
## Most Asian parents tell their children:
#### Higher Education + Work Hard + Happy Marriage = Good Life
### ?
### ? Myth or Fact ?
### ?
### Using **R** and **ggplot2** to visually explore the *Adult* dataset from the *UC Irvine Machine Learning Repository*, I may become the Myth Buster of:
### Higher Education + Work Hard + Happy Marriage = Good Life
```{r, engine='bash', echo=FALSE, eval=TRUE, results='hide'}
# Get from repository.
curl -O ""
# Move to folder data.
mv data/
# The raw file uses ", " (comma followed by one space) as field separator.
# Transform field separator to "," (a single comma).
sed -i -e 's/, /,/g' data/
# The raw file uses "?" for missing values.
# Replace missing values with "NA".
sed -i -e 's/?/NA/g' data/
```
## ETL
### Extract - Transform - Load
- The *Adult* dataset was originally used to predict whether income exceeds $50K/year, based on 1994 census data. Information can be found at the *UC Irvine Machine Learning Repository*.
- The *Adult* dataset can be downloaded from the same repository.
- The *Adult* dataset uses ", " (comma followed by one space) as field separator. I transform it to "," (a single comma).
- The *Adult* dataset uses "?" for missing values. I replace it with "NA".
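The two clean-up steps described above can also be done at load time in R. This is a minimal sketch, not what the slides actually run: it assumes a three-row inline sample (made up for illustration) in place of the real file, and the name `sample_adult` is hypothetical. `strip.white = TRUE` drops the space after each comma, and `na.strings = "?"` turns the placeholder into `NA`.

```r
# Hypothetical three-row sample in the raw format: ", " separators, "?" for missing.
raw <- c(
  "39, State-gov, Bachelors, <=50K",
  "50, ?, HS-grad, >50K",
  "38, Private, ?, <=50K"
)

sample_adult <- read.csv(
  text = raw,
  header = FALSE,
  strip.white = TRUE,      # trim the blank that follows each comma
  na.strings = "?",        # treat "?" as a missing value
  stringsAsFactors = TRUE  # keep text columns as factors, as the slides assume
)

sum(is.na(sample_adult))     # number of missing cells
nrow(na.omit(sample_adult))  # rows left after dropping incomplete ones
```

Doing it this way leaves the downloaded file untouched, at the cost of repeating the options on every load.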
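```{r, echo=TRUE}
# stringsAsFactors=TRUE keeps text columns as factors (not the default since R 4.0); the levels() calls below rely on it.
adult = read.table('data/', header=FALSE, sep=',', na.strings='NA', stringsAsFactors=TRUE)
```
- I load the *Adult* dataset into data frame *adult*, which has `r ncol(adult)` attributes and `r nrow(adult)` observations.
```{r, echo=TRUE}
adult = na.omit(adult)
```
- I remove rows containing NA; the data frame then has `r ncol(adult)` attributes and `r nrow(adult)` observations.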
- I give each attribute a name:
```{r, echo=TRUE}
colnames(adult) = c(
  'age', 'workclass', 'fnlwgt', 'education', 'education_num',
  'marital_status', 'occupation', 'relationship', 'race', 'sex',
  'capital_gain', 'capital_loss', 'hours_per_week',
  'native_country', 'class')
```
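- I rename *Marital Status* for better display and counting:
```{r, echo=TRUE}
levels(adult$marital_status)[levels(adult$marital_status)=="Divorced"] = "Divor"
levels(adult$marital_status)[levels(adult$marital_status)=="Married-AF-spouse"] = "Marri"
levels(adult$marital_status)[levels(adult$marital_status)=="Married-civ-spouse"] = "Marri"
levels(adult$marital_status)[levels(adult$marital_status)=="Married-spouse-absent"] = "Marri"
levels(adult$marital_status)[levels(adult$marital_status)=="Never-married"] = "Never"
levels(adult$marital_status)[levels(adult$marital_status)=="Separated"] = "Separ"
levels(adult$marital_status)[levels(adult$marital_status)=="Widowed"] = "Widow"
```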
- I rename *Work Class* for better display and counting:
```{r, echo=TRUE}
levels(adult$workclass)[levels(adult$workclass)=="Federal-gov"] = "GOV"
levels(adult$workclass)[levels(adult$workclass)=="Local-gov"] = "GOV"
levels(adult$workclass)[levels(adult$workclass)=="Never-worked"] =
levels(adult$workclass)[levels(adult$workclass)=="Private"] = "PRIV"
levels(adult$workclass)[levels(adult$workclass)=="Self-emp-inc"] =
levels(adult$workclass)[levels(adult$workclass)=="Self-emp-not-inc"] =
levels(adult$workclass)[levels(adult$workclass)=="State-gov"] = "GOV"
levels(adult$workclass)[levels(adult$workclass)=="Without-pay"] =
```
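The level-by-level renaming above can be written more compactly by assigning a named list to `levels()`, which merges several old levels into one new level. This is a sketch on a toy factor rather than the real `adult$workclass`. The labels "GOV" and "PRIV" come from the slides; "SELF" and "NOPAY" are assumed labels chosen for illustration, since the original values are not shown.

```r
# Toy factor with the work-class values named in the slides.
workclass <- factor(c("Federal-gov", "Local-gov", "State-gov", "Private",
                      "Self-emp-inc", "Self-emp-not-inc", "Without-pay",
                      "Never-worked", "Private"))

# Each list name is a new level; its value lists the old levels to merge into it.
# "SELF" and "NOPAY" are assumed labels, not taken from the original code.
levels(workclass) <- list(
  GOV   = c("Federal-gov", "Local-gov", "State-gov"),
  PRIV  = "Private",
  SELF  = c("Self-emp-inc", "Self-emp-not-inc"),
  NOPAY = c("Without-pay", "Never-worked")
)

table(workclass)
```

A side effect worth knowing: any old level not mentioned in the list becomes `NA`, so the list should cover every level present in the data.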
- I add numeric "Ed Level" based on factor "Education":
```{r, echo=FALSE, message=FALSE, warning=FALSE}
adult = mutate(adult, ed_level = 0)
obs = nrow(adult)
for (i in 1:obs) {
  # switch() needs a character string; a factor would be matched by its integer code.
  switch(as.character(adult$education[i]),
    "Preschool" = { adult$ed_level[i] = 1 },
    "1st-4th" = { adult$ed_level[i] = 2 },
    "5th-6th" = { adult$ed_level[i] = 3 },
    "7th-8th" = { adult$ed_level[i] = 3 },
    "9th" = { adult$ed_level[i] = 4 },
    "10th" = { adult$ed_level[i] = 4 },
    "11th" = { adult$ed_level[i] = 4 },
    "12th" = { adult$ed_level[i] = 4 },
    "HS-grad" = { adult$ed_level[i] = 4 },
    "Some-college" = { adult$ed_level[i] = 4 },
    "Assoc-acdm" = { adult$ed_level[i] = 5 },
    "Assoc-voc" = { adult$ed_level[i] = 5 },
    "Prof-school" = { adult$ed_level[i] = 5 },
    "Bachelors" = { adult$ed_level[i] = 6 },
    "Masters" = { adult$ed_level[i] = 7 },
    "Doctorate" = { adult$ed_level[i] = 8 })}
```
```{r, echo=TRUE, eval=FALSE}
adult = mutate(adult, ed_level = 0)
obs = nrow(adult)
for (i in 1:obs) {
  switch(as.character(adult$education[i]),
    "Preschool" = { adult$ed_level[i] = 1 },
    "1st-4th" = { adult$ed_level[i] = 2 },
    "5th-6th" = { adult$ed_level[i] = 3 },
    "7th-8th" = { adult$ed_level[i] = 3 },
    "9th" = { adult$ed_level[i] = 4 },
    "10th" = { adult$ed_level[i] = 4 },
    "11th" = { adult$ed_level[i] = 4 },
    "12th" = { adult$ed_level[i] = 4 },
    "HS-grad" = { adult$ed_level[i] = 4 },
```
```{r, echo=TRUE, eval=FALSE}
    "Some-college" = { adult$ed_level[i] = 4 },
    "Assoc-acdm" = { adult$ed_level[i] = 5 },
    "Assoc-voc" = { adult$ed_level[i] = 5 },
    "Prof-school" = { adult$ed_level[i] = 5 },
    "Bachelors" = { adult$ed_level[i] = 6 },
    "Masters" = { adult$ed_level[i] = 7 },
    "Doctorate" = { adult$ed_level[i] = 8 })}
```
- I use the numeric *Education Years* from the original dataset as the education indicator, since it agrees with the factor *Education*.
```{r, echo=FALSE}
ggplot(adult, aes(x=ed_level, y=education_num)) +
  geom_smooth(method = "lm") +
  labs(x="Education Level", y="Education Years",
       title="More/Longer Education Years indicate Higher Education") +
  theme_economist() + scale_fill_economist()
```
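The `ed_level` column built by the loop earlier can also be derived without a loop, and its agreement with `education_num` can be given a number instead of only a trend line. The sketch below is one possible way to do both, not the method used in the slides. The mapping `ed_map` is copied from the loop; the six rows in `toy` and the name `rho` are illustrative stand-ins, not real observations.

```r
# Mapping copied from the loop: education label -> numeric level.
ed_map <- c(
  "Preschool" = 1, "1st-4th" = 2, "5th-6th" = 3, "7th-8th" = 3,
  "9th" = 4, "10th" = 4, "11th" = 4, "12th" = 4, "HS-grad" = 4,
  "Some-college" = 4, "Assoc-acdm" = 5, "Assoc-voc" = 5, "Prof-school" = 5,
  "Bachelors" = 6, "Masters" = 7, "Doctorate" = 8
)

# Hypothetical stand-in for the education and education_num columns of adult.
toy <- data.frame(
  education     = c("Preschool", "7th-8th", "HS-grad", "Bachelors", "Masters", "Doctorate"),
  education_num = c(1, 4, 9, 13, 14, 16)
)

# Index the named vector by label; as.character() avoids indexing by factor codes.
toy$ed_level <- unname(ed_map[as.character(toy$education)])

# Spearman rank correlation: close to 1 when the two orderings agree.
rho <- cor(toy$ed_level, toy$education_num, method = "spearman")
rho
```

The lookup touches each row once and avoids growing the data frame cell by cell, which matters for a dataset of this size.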
- I single out Asian adults from *adult* into data frame *asian_adult*:
```{r, echo=TRUE}
summarise(group_by(adult, race), n())
asian_adult = adult[adult$race == "Asian-Pac-Islander",]
```
- I further single out Asian adults working in the private sector, since this sector has the most Asian adults.
```{r, echo=FALSE}
ggplot(asian_adult, aes(x=workclass)) +
  geom_bar(aes(fill=class)) + facet_wrap(~sex) +
  xlab("work class [ Government | Private | Self-employed | No-pay ]") +
  theme_economist() + scale_fill_economist() +
  ggtitle("Asian Adult - Which work class pays?")
```
## Higher Education + Work Hard + Happy Marriage = Good Life ???
```{r, echo=FALSE}
ggplot(asian_adult[asian_adult$workclass=="PRIV",], aes(x=education_num, y=hours_per_week)) +
  geom_point() + facet_grid(marital_status ~ class) +
  labs(x = "# education years", y = "work hours / week",
       title="Asian Adult - higher education + hard work + happy marriage = $$$ ?") +
  theme_economist() + scale_fill_economist()
```
## Higher Education - Dad & Mom: You are right.
```{r, echo=FALSE}
ggplot(asian_adult[asian_adult$workclass=="PRIV",], aes(x=class, y=education_num)) +
  geom_boxplot() + facet_wrap(~sex) +
  labs(x = "", y = "# education years",
       title = "Asian adult in private sector - Yes, higher education helps!") +
  theme_economist() + scale_fill_economist()
```
## Work Hard - Dad & Mom: You are right about boys.
```{r, echo=FALSE}
ggplot(asian_adult[asian_adult$workclass=="PRIV",], aes(x=class, y=hours_per_week)) +
  geom_boxplot() + facet_wrap(~sex) +
  labs(x = "", y = "work hours / week",
       title = "Asian adult in private sector - Well, men, you better work longer :(") +
  theme_economist() + scale_fill_economist()
```
## Happy Marriage - Dad & Mom: You are right again!
```{r, echo=FALSE}
ggplot(asian_adult[asian_adult$workclass=="PRIV",], aes(x=marital_status)) +
  geom_bar(aes(fill=marital_status)) + facet_grid(class ~ sex) +
  xlab("marital status") +
  theme_economist() + scale_fill_economist() +
  ggtitle("Asian adult in private sector - Hey, married people do earn more.")
```
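The bar chart above compares raw counts, so a larger group can show a taller bar in both income classes. A complementary view is the share of each income class within each marital status. The sketch below shows one way to compute it with `prop.table()`; the eight rows in `toy` are made up for illustration and are not taken from the *Adult* dataset.

```r
# Hypothetical rows, not taken from the Adult dataset.
toy <- data.frame(
  marital_status = c("Marri", "Marri", "Marri", "Marri",
                     "Never", "Never", "Never", "Divor"),
  class          = c(">50K", ">50K", "<=50K", "<=50K",
                     "<=50K", "<=50K", ">50K", "<=50K")
)

# Cross-tabulate marital status (rows) against income class (columns).
counts <- table(toy$marital_status, toy$class)

# margin = 1 normalizes each row, giving the income split within each marital status.
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

On the real data, the same two calls could be applied to the `marital_status` and `class` columns of the private-sector subset.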
## A few thoughts
Two out of three ain't bad, right? But I won't conclude the formula is a fact, as I think the Asian adult sample in the 1994 *Adult* dataset (895) is too small. A conclusion should be drawn only after exploring more samples from multiple years.
As an Asian myself, I listen to and obey my parents, as they always want the best for me. Most things change as time goes by; thus, it is reasonable to say very few things hold true forever. But parents' care for their children never changes.
You may ask, "Are you sure?" **"Yes, because I told you so!!!"**
Sincerely, Conred Wang.