@tomschenkjr
Last active May 5, 2017 22:34
An alpha of an export.socrata() function for the RSocrata package. See https://github.com/Chicago/RSocrata/issues/126
library(devtools)
install_github("Chicago/RSocrata", ref = "issue124") # RSocrata 1.7.2-7 or above
library(RSocrata)
#' Exports CSVs from Socrata data portals
#'
#' Given the URL of a data portal (e.g., "data.cityofchicago.org"), downloads
#' all CSV files (no other formats are supported) and saves them in
#' a single directory named after the root URL (e.g., "data.cityofchicago.org/").
#' Downloaded files are compressed to GZip format and timestamped so the download
#' time is recorded. No data is saved within the R workspace.
#' @param url - the base URL of a domain (e.g., "data.cityofchicago.org")
#' @return one Gzipped CSV file per dataset, with the four-by-four and the timestamp of when the download began in the filename
#' @author Tom Schenk Jr \email{tom.schenk@@cityofchicago.org}
#' @export
export.socrata <- function(url) {
  default_format <- "csv" # Only CSV is supported for now; used as the file extension
  dir.create(basename(url), showWarnings = FALSE) # Create directory based on URL
  ls <- ls.socrata(url = url)
  for (i in seq_len(nrow(ls))) {
    # Track timestamp before download
    downloadTime <- Sys.time()
    downloadTz <- Sys.timezone()
    # Download data
    downloadUrl <- ls$distribution[[i]]$downloadURL[1] # Currently grabs CSV, which is the first element
    d <- read.socrata(downloadUrl)
    # Construct the filename output
    downloadTimeChr <- gsub('\\s+', '_', downloadTime) # Replaces spaces with underscores
    downloadTimeChr <- gsub(':', '', downloadTimeChr) # Removes colons from timestamp to be a valid filename
    filename <- httr::parse_url(ls$identifier[i])
    filename$path <- substr(filename$path, 11, 19) # Keep only the four-by-four
    filename <- paste0(filename$hostname, "/", filename$path, "_", downloadTimeChr, ".", default_format, ".gz")
    # Write file
    write.csv(d, file = gzfile(filename))
  }
}
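The filename step above can be tried on its own without touching a portal. The following is only a minimal sketch in base R: `build_filename` is a hypothetical helper (not part of RSocrata), and it assumes the identifier looks like `https://<host>/api/views/<four-by-four>`, which is what the `substr(filename$path, 11, 19)` call appears to rely on. It also formats the timestamp explicitly rather than relying on how `Sys.time()` is coerced to character.

```r
# Hypothetical helper mirroring the filename logic in export.socrata().
# Assumption: `identifier` ends in a four-by-four such as "abcd-1234".
build_filename <- function(identifier, download_time, ext = "csv") {
  # Host portion of the identifier, e.g. "data.cityofchicago.org"
  host <- sub("^https?://([^/]+)/.*$", "\\1", identifier)
  # Four-by-four at the end of the identifier
  four_by_four <- sub("^.*/([a-z0-9]{4}-[a-z0-9]{4})$", "\\1", identifier)
  # Format the timestamp, replace whitespace with "_" and drop colons
  stamp <- format(download_time, "%Y-%m-%d %H:%M:%S")
  stamp <- gsub(":", "", gsub("\\s+", "_", stamp))
  paste0(host, "/", four_by_four, "_", stamp, ".", ext, ".gz")
}

# Made-up identifier and a fixed timestamp, for illustration only
example_name <- build_filename(
  "https://data.cityofchicago.org/api/views/abcd-1234",
  as.POSIXct("2017-05-05 22:34:00", tz = "UTC")
)
print(example_name)
```

Keeping this logic in a small function makes it possible to check the naming scheme without downloading anything.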
@nicklucius

I needed to add a comma to Line 2--here is the line with the comma added:

install_github("Chicago/RSocrata", ref = "issue124") # RSocrata 1.7.2-7 or above

Then I tried to download Cook County's Data Portal (I figure it will take less time than the City's). I am getting this error. I haven't looked into what's happening yet.

> test <- export.socrata("https://datacatalog.cookcountyil.gov")
Error in validateUrl(url, app_token) : 
  rows.csv is not a valid Socrata dataset unique identifier.

@nicklucius

I was able to get this up and running (not sure what I did to get the error reported above). I let it run for a while and it downloaded readable files from datacatalog.cookcountyil.gov and data.cityofchicago.org. It errored out when it hit a PDF file on the Cook County data portal.
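One possible way around the PDF failure is to choose the download URL by media type instead of assuming the CSV is the first element, and to skip datasets that have no CSV distribution. This is a rough sketch only: `pick_csv_url` is a hypothetical helper, and it assumes each element of `ls$distribution` is a data frame with `downloadURL` and `mediaType` columns, as in a data.json catalog entry. The mock data frames below stand in for real portal output.

```r
# Hypothetical helper: return the CSV download URL from one dataset's
# distribution table, or NA if the dataset offers no CSV.
pick_csv_url <- function(distribution) {
  # Assumption: `distribution` has `downloadURL` and `mediaType` columns
  if (is.null(distribution) || !"mediaType" %in% names(distribution)) {
    return(NA_character_)
  }
  is_csv <- !is.na(distribution$mediaType) & distribution$mediaType == "text/csv"
  if (!any(is_csv)) {
    return(NA_character_)
  }
  distribution$downloadURL[is_csv][1]
}

# Mock distributions standing in for ls$distribution[[i]] (made-up URLs)
tabular <- data.frame(
  downloadURL = c("https://example.org/api/views/abcd-1234/rows.json",
                  "https://example.org/api/views/abcd-1234/rows.csv"),
  mediaType = c("application/json", "text/csv"),
  stringsAsFactors = FALSE
)
pdf_only <- data.frame(
  downloadURL = "https://example.org/download/wxyz-5678/application/pdf",
  mediaType = "application/pdf",
  stringsAsFactors = FALSE
)

print(pick_csv_url(tabular))
print(pick_csv_url(pdf_only))
```

Inside the loop, a check such as `if (is.na(downloadUrl)) next` would then move past non-tabular entries instead of stopping the whole export.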

@tomschenkjr
Author

I've now moved this code over to its own branch. We can continue the discussion in the corresponding issue.
