@mGalarnyk
Created March 7, 2017 18:11
Getting and Cleaning Data Quiz 3 (Week 3), Johns Hopkins Data Science Specialization on Coursera. GitHub repo: https://github.com/mGalarnyk/datasciencecoursera/tree/master/3_Getting_and_Cleaning_Data

Getting and Cleaning Data Quiz 3 (JHU) Coursera

Question 1

The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

Create a logical vector that identifies the households on greater than 10 acres who sold more than $10,000 worth of agriculture products. Assign that logical vector to the variable agricultureLogical. Apply the which() function like this to identify the rows of the data frame where the logical vector is TRUE. which(agricultureLogical)

What are the first 3 values that result?

download.file('https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv'
              , 'ACS.csv'
              , method='curl' )

# Read data into data.frame
ACS <- read.csv('ACS.csv')

agricultureLogical <- ACS$ACR == 3 & ACS$AGS == 6
head(which(agricultureLogical), 3)

# Answer: 
# 125 238 262
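
The codes in the filter come from the code book: ACR == 3 is a house on ten or more acres, and AGS == 6 is $10,000 or more in agriculture sales. A minimal sketch on a made-up five-row data frame (the `toy` values are hypothetical, not taken from the survey) shows how which() skips rows where either column is NA:

```r
# Hypothetical stand-in for the ACS data; values are invented for illustration.
toy <- data.frame(
  ACR = c(3, 1, NA, 3, 3),
  AGS = c(6, 6, 6, NA, 6)
)

# Rows 3 and 4 evaluate to NA because one side of the comparison is missing.
toyLogical <- toy$ACR == 3 & toy$AGS == 6

# which() returns only the positions that are TRUE, dropping FALSE and NA.
toyRows <- which(toyLogical)
print(toyRows)
```

This is why the logical vector can be passed to which() directly, without first replacing the missing values.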

Question 2

Using the jpeg package read in the following picture of your instructor into R

https://d396qusza40orc.cloudfront.net/getdata%2Fjeff.jpg

Use the parameter native=TRUE. What are the 30th and 80th quantiles of the resulting data?

# install.packages('jpeg')
library(jpeg)

# Download the file
download.file('https://d396qusza40orc.cloudfront.net/getdata%2Fjeff.jpg'
              , 'jeff.jpg'
              , mode='wb' )

# Read the image
picture <- jpeg::readJPEG('jeff.jpg'
                          , native=TRUE)

# Get the sample quantiles corresponding to the given probabilities
quantile(picture, probs = c(0.3, 0.8) )

# Answer: 
#       30%       80% 
# -15259150 -10575416 
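
With native=TRUE, readJPEG() returns the image in R's native raster representation (packed integers) rather than an array of intensities, which would explain why the quantiles above are large negative integers. A small sketch, assuming a made-up integer matrix in place of the real image, shows that quantile() treats a matrix as one flat vector of values:

```r
# Made-up integer matrix standing in for the image data (not real pixel values).
fakePixels <- matrix(1:21, nrow = 3)

# quantile() ignores the matrix shape and works on all 21 values at once.
q <- quantile(fakePixels, probs = c(0.3, 0.8))
print(q)
```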

Question 3

Load the Gross Domestic Product data for the 190 ranked countries in this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv

Load the educational data from this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv

Match the data based on the country shortcode. How many of the IDs match? Sort the data frame in descending order by GDP rank. What is the 13th country in the resulting data frame?

Original data sources: http://data.worldbank.org/data-catalog/GDP-ranking-table http://data.worldbank.org/data-catalog/ed-stats

# install.packages("data.table")
library("data.table")


# Download data and read FGDP data into data.table
FGDP <- data.table::fread('https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv'
                          , skip=4
                          , nrows = 190
                          , select = c(1, 2, 4, 5)
                          , col.names=c("CountryCode", "Rank", "Economy", "Total")
                          )

# Download data and read the education (EDSTATS) data into a data.table
FEDSTATS_Country <- data.table::fread('https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv'
                                      )
                                      
mergedDT <- merge(FGDP, FEDSTATS_Country, by = 'CountryCode')

# How many of the IDs match?
nrow(mergedDT)

# Answer: 
# 189

# Sort the data frame in descending order by GDP rank (so United States is last). 
# What is the 13th country in the resulting data frame?
mergedDT[order(-Rank)][13,.(Economy)]

# Answer: 

#                Economy
# 1: St. Kitts and Nevis
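
The read, merge, and sort steps above can be sketched end to end on invented data. This version uses base R (read.csv and merge on data frames) instead of data.table so that it runs without extra packages; the file layout, country codes, and column names below are assumptions made up for illustration. Here skip and nrows play the same role as in the fread() call, and the column subset stands in for select:

```r
# Made-up file contents: two junk lines, three data rows, then a footer.
rawLines <- c(
  "Sample GDP table (invented for illustration)",
  "Code,Ranking,,Economy,Total",
  "AAA,1,,Alpha,300",
  "BBB,2,,Beta,200",
  "CCC,3,,Gamma,100",
  "Footer note that should not be read"
)

# skip drops the junk lines at the top; nrows stops before the footer.
gdp <- read.csv(
  text = paste(rawLines, collapse = "\n"),
  header = FALSE, skip = 2, nrows = 3,
  col.names = c("CountryCode", "Rank", "Blank", "Economy", "Total"),
  stringsAsFactors = FALSE
)

# Keep only the useful columns (the role select plays in fread).
gdp <- gdp[, c("CountryCode", "Rank", "Economy", "Total")]

# Made-up education table; DDD has no GDP row and CCC has no education row.
edu <- data.frame(
  CountryCode = c("AAA", "BBB", "DDD"),
  IncomeGroup = c("High income", "Low income", "Low income"),
  stringsAsFactors = FALSE
)

# merge() keeps only codes present in both tables (an inner join) by default.
merged <- merge(gdp, edu, by = "CountryCode")
matches <- nrow(merged)

# Descending rank puts the largest rank number (smallest economy) first.
sorted <- merged[order(-merged$Rank), ]
print(matches)
print(sorted$Economy)
```

Because the join is an inner join, the row count of the merged table is the number of matching IDs, which is how nrow(mergedDT) answers the first part of the question.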

Question 4

What is the average GDP ranking for the "High income: OECD" and "High income: nonOECD" group?

# "High income: OECD" 
mergedDT[`Income Group` == "High income: OECD"
         , lapply(.SD, mean)
         , .SDcols = c("Rank")
         , by = "Income Group"]

# Answer:
#
#         Income Group     Rank
# 1: High income: OECD 32.96667

# "High income: nonOECD"
mergedDT[`Income Group` == "High income: nonOECD"
         , lapply(.SD, mean)
         , .SDcols = c("Rank")
         , by = "Income Group"]

# Answer
#            Income Group     Rank
# 1: High income: nonOECD 91.91304
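
The two filtered queries above compute one group mean each. As a hedged alternative, the means for every income group can be computed in one pass and the two groups of interest picked out afterwards. The sketch below uses base R's tapply() on a made-up table (the ranks and the `IncomeGroup` column name are hypothetical):

```r
# Invented data: two countries per high-income group plus one other group.
toy <- data.frame(
  IncomeGroup = c("High income: OECD", "High income: OECD",
                  "High income: nonOECD", "High income: nonOECD",
                  "Low income"),
  Rank = c(10, 20, 50, 70, 150),
  stringsAsFactors = FALSE
)

# Mean rank per income group, computed for all groups at once.
groupMeans <- tapply(toy$Rank, toy$IncomeGroup, mean)

# Pick out the two groups the question asks about.
highIncome <- groupMeans[c("High income: OECD", "High income: nonOECD")]
print(highIncome)
```

In data.table the same idea would be a single grouped call with by = "Income Group" and no row filter.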

Question 5

Cut the GDP ranking into 5 separate quantile groups. Make a table versus Income.Group. How many countries are Lower middle income but among the 38 nations with highest GDP?

# Only data.table (loaded in Question 3) is needed for this question.

breaks <- quantile(mergedDT[, Rank], probs = seq(0, 1, 0.2), na.rm = TRUE)
mergedDT$quantileGDP <- cut(mergedDT[, Rank], breaks = breaks)
mergedDT[`Income Group` == "Lower middle income", .N, by = c("Income Group", "quantileGDP")]

# Answer: 5 countries (the (1,38.6] row below)
#           Income Group quantileGDP  N
# 1: Lower middle income (38.6,76.2] 13
# 2: Lower middle income   (114,152]  9
# 3: Lower middle income   (152,190] 16
# 4: Lower middle income  (76.2,114] 11
# 5: Lower middle income    (1,38.6]  5
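
One detail of cut() is worth knowing when the breaks come from quantile(): by default each interval is open on the left, so a value equal to the lowest break falls outside every group and becomes NA. A minimal sketch on made-up ranks 1 to 10 (not the GDP data) shows the effect and the include.lowest = TRUE option that closes the first interval:

```r
# Made-up ranks; the quintile breaks are 1, 2.8, 4.6, 6.4, 8.2, 10.
ranks <- 1:10
breaks <- quantile(ranks, probs = seq(0, 1, 0.2))

# Default cut(): intervals are open on the left, so the minimum rank gets NA.
groupsDefault <- cut(ranks, breaks = breaks)
naCount <- sum(is.na(groupsDefault))

# include.lowest = TRUE closes the first interval so rank 1 is kept.
groupsFixed <- cut(ranks, breaks = breaks, include.lowest = TRUE)
counts <- table(groupsFixed)
print(naCount)
print(counts)
```

In the answer above, this would only affect the country ranked 1, so the Lower middle income count is unchanged as long as that country belongs to another income group.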
@LadyLazy-77

Very helpful for a beginner. Thank you very much!

@Alisa-Swarna

Alisa-Swarna commented Jul 5, 2020

In question 1, why 10acres written in 3 and 10,000$ written in 6? Plz clear this confusion. @mGalarnyk

@JerryMN

JerryMN commented Jul 21, 2020

In question 1, why 10acres written in 3 and 10,000$ written in 6? Plz clear this confusion. @mGalarnyk

The data is coded that way by the authors. You can take a look at the code book and see this for yourself.

@Jerinrose

In question 3, why are skip, nrows, and select used?

@culithay

In question 1, why 10acres written in 3 and 10,000$ written in 6? Plz clear this confusion. @mGalarnyk

You can check the code book at this link:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf
ACR 1
Lot size
b .N/A (GQ/not a one-family house or mobile home)
1 .House on less than one acre
2 .House on one to less than ten acres
3 .House on ten or more acres

AGS 1
Sales of Agriculture Products
b .N/A (less than 1 acre/GQ/vacant/
.2 or more units in structure)
1 .None
2 .$ 1 - $ 999
3 .$ 1000 - $ 2499
4 .$ 2500 - $ 4999
5 .$ 5000 - $ 9999
6 .$10000+
