@giocomai
Created December 22, 2018 23:16
Extracts Armenia's 2011 census data for all settlements from a PDF file issued by Armenia's statistical office
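# tabulizer wraps the Java library Tabula through rJava, so a working Java installation is required
library("tabulizer")
library("tidyverse")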
dir.create("data", showWarnings = FALSE)
dir.create(file.path("data", "original_files"), showWarnings = FALSE)
census_2011_pdf_url <- "https://www.armstat.am/file/article/1._bajin_1_182-311.pdf"
census_2011_pdf_file <- file.path("data", "original_files", "census_2011.pdf")
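# Download the PDF only if it is not already stored locally
if (!file.exists(census_2011_pdf_file)) {
  # mode = "wb" keeps the binary PDF intact on Windows
  download.file(url = census_2011_pdf_url, destfile = census_2011_pdf_file, mode = "wb")
}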
# Extract the tables on pages 36 to 101 (page 101 is the last page of table 1.3)
tables <- extract_tables(file = census_2011_pdf_file, pages = 36:101, output = "data.frame")
tables_cleaned <- vector("list", length(tables))
column_names <- c("Village", "Total_de_facto", "Men_de_facto", "Women_de_facto", "Total_de_jure", "Men_de_jure", "Women_de_jure")
for (i in seq_along(tables)) {
  # One problematic page is read with 8 columns; drop the extra fifth column
  if (ncol(tables[[i]]) == 8) {
    tables[[i]] <- tables[[i]][, -5]
  }
  tables_cleaned[[i]] <- set_names(as.data.frame(x = tables[[i]], stringsAsFactors = FALSE), nm = column_names) %>%
    mutate(Village = lag(Village)) %>% # take the settlement name from the preceding row
    filter(Total_de_facto != "") %>% # drop rows that carry no figures
    mutate_all(.funs = str_remove_all, pattern = ",") %>% # remove commas so the figures parse as numbers
    mutate_if(str_detect(string = colnames(.), pattern = "_"), .funs = as.numeric) # convert the count columns
}
# Combine the per-page tables into one data frame and save it as CSV
census_2011 <- bind_rows(tables_cleaned)
write_csv(census_2011, file.path("data", "armenia_census_2011.csv"))
@giocomai
Author

Warning: the extraction is not 100% accurate. It fails to recognise three lines on the last page of table 1.3 (page 101), and some other data may be missing.
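Since some rows may be missing or misread, a quick consistency check on the output can help locate them. The sketch below is only an illustration, not part of the original script: it assumes the column names used in the script, assumes each total should equal men plus women, and runs on a small made-up data frame. The helper name `flag_suspect_rows` is hypothetical.

```r
library(dplyr)

# Made-up rows standing in for the real output; the values are NOT census figures
census_check <- data.frame(
  Village = c("A", "B", "C"),
  Total_de_facto = c(100, 250, NA),
  Men_de_facto = c(48, 120, 30),
  Women_de_facto = c(52, 131, 35),
  Total_de_jure = c(110, 260, 70),
  Men_de_jure = c(53, 125, 33),
  Women_de_jure = c(57, 135, 37),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return rows where a total is missing
# or does not equal the sum of men and women
flag_suspect_rows <- function(df) {
  df %>%
    mutate(
      de_facto_ok = Total_de_facto == Men_de_facto + Women_de_facto,
      de_jure_ok = Total_de_jure == Men_de_jure + Women_de_jure
    ) %>%
    filter(is.na(de_facto_ok) | is.na(de_jure_ok) | !de_facto_ok | !de_jure_ok)
}

suspect <- flag_suspect_rows(census_check)
print(suspect$Village)
```

On the real data the same helper could be called as `flag_suspect_rows(census_2011)`. Rows it returns are candidates for manual comparison against the PDF. It cannot detect settlements that were dropped entirely.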
