Skip to content

Instantly share code, notes, and snippets.

@kfeoktistoff
Created June 18, 2014 20:22
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 6 You must be signed in to fork a gist
  • Save kfeoktistoff/9fb48bed375d8496b209 to your computer and use it in GitHub Desktop.
Save kfeoktistoff/9fb48bed375d8496b209 to your computer and use it in GitHub Desktop.
The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in th…
## Write a function that reads a directory full of files and reports the number of completely observed cases in each data file.
## The function should return a data frame where the first column is the name of the file and the second column is the number
## of complete cases. A prototype of this function follows
complete <- function(directory, id = 1:332) {
## 'directory' is a character vector of length 1 indicating
## the location of the CSV files
## 'id' is an integer vector indicating the monitor ID numbers
## to be used
## Return a data frame of the form:
## id nobs
## 1 117
## 2 1041
## ...
## where 'id' is the monitor ID number and 'nobs' is the
## number of complete cases
comp <- data.frame(id=numeric(), nobs=numeric())
for (i in id) {
filename <- obsFileName(directory, i)
data <- read.csv(filename)
comp <- rbind(comp, data.frame(id=i, nobs=nrow(data[complete.cases(data), ])))
}
comp
}
## Write a function that takes a directory of data files and a threshold
## for complete cases and calculates the correlation between sulfate and
## nitrate for monitor locations where the number of completely observed
## cases (on all variables) is greater than the threshold. The function
## should return a vector of correlations for the monitors that meet the
## threshold requirement. If no monitors meet the threshold requirement,
## then the function should return a numeric vector of length 0.
corr <- function(directory, threshold = 0) {
## 'directory' is a character vector of length 1 indicating
## the location of the CSV files
## 'threshold' is a numeric vector of length 1 indicating the
## number of completely observed observations (on all
## variables) required to compute the correlation between
## nitrate and sulfate; the default is 0
## Return a numeric vector of correlations
source("complete.R")
source("obsFileName.R")
observations <- complete(directory, 1:332)
sulfate <- numeric()
nitrate <- numeric()
result <- numeric()
for (i in observations$id[observations$nobs > threshold]) {
filename <- obsFileName(directory, i)
data <- read.csv(filename)
result <- c(result, cor(data$sulfate, data$nitrate, use="complete.obs"))
}
result
}
## Return relative path to csv file by detector number
obsFileName <- function(directory, obs) {
if (obs<10) {
filename = paste(directory, "/","00", obs, ".csv", sep="")
} else if (obs >= 10 && obs < 100) {
filename = paste(directory, "/", "0", obs, ".csv", sep="")
} else {
filename = paste(directory, "/", obs, ".csv", sep="")
}
}
## Write a function named 'pollutantmean' that calculates the mean of a pollutant
## (sulfate or nitrate) across a specified list of monitors. The function
## 'pollutantmean' takes three arguments: 'directory', 'pollutant', and 'id'.
## Given a vector monitor ID numbers, 'pollutantmean' reads that monitors'
## particulate matter data from the directory specified in the 'directory' argument
## and returns the mean of the pollutant across all of the monitors,
## ignoring any missing values coded as NA
pollutantmean <- function(directory, pollutant, id = 1:332) {
## 'directory' is a character vector of length 1 indicating
## the location of the CSV files
## 'pollutant' is a character vector of length 1 indicating
## the name of the pollutant for which we will calculate the
## mean; either "sulfate" or "nitrate".
## 'id' is an integer vector indicating the monitor ID numbers
## to be used
## Return the mean of the pollutant across all monitors list
## in the 'id' vector (ignoring NA values)
source("obsFileName.R")
allData <- numeric()
for (i in id) {
filename <- obsFileName(directory, i)
data <- read.csv(filename)
if (pollutant == "sulfate") {
allData <- c(allData, data$sulfate)
} else if (pollutant == "nitrate") {
allData <- c(allData, data$nitrate)
}
}
mean(allData, na.rm=TRUE)
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment