@debruine
Last active October 17, 2021 13:03
Get data from an image
library(dplyr) # for data processing
library(tidyr) # for data processing
library(magick) # for image reading and OCR
library(tesseract) # for OCR
# read image
dataimg <- magick::image_read("https://pbs.twimg.com/media/FBv8P8XXEBgCBvS?format=jpg&name=medium")
# convert image to text
text <- magick::image_ocr(dataimg)
# clean and split text into rows
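split_data <- gsub("rs\nCOUNTY VACCINES AL \\|\n\n", "", text) %>% # remove header text
  gsub("ll\\. ", "11. ", .) %>% # fix OCR misread of "11." as "ll."
  gsub("\n", " ", .) %>% # flatten to a single line
  strsplit("\\d+\\. ") %>% # split on row numbers ("1. ", "2. ", ...)
  lapply(trimws)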
# split rows into columns and clean
data <- data.frame(x = split_data[[1]][-1]) %>% # [-1] drops the text before the first row number
  mutate(county = trimws(gsub("\\d.*$", "", x)), # text before the first digit
         number = gsub("^[^0-9]+", "", x)) %>% # everything from the first digit on
  separate(col = number,
           into = c("number", "percent"),
           sep = "\\(|%\\)",
           extra = "drop") %>%
  mutate(number = as.integer(gsub(",", "", number)),
         percent = as.double(percent)) %>%
  select(-x)
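
To see what the splitting and cleaning steps do without downloading the image or running OCR, here is a minimal base-R sketch that applies the same regular expressions to a made-up string. The sample text and the names Alpha, Beta and Gamma are invented for illustration and are not the image's actual content. The dot in the "ll." pattern is escaped here so that it matches only a literal period.

```r
# Made-up stand-in for OCR output (NOT the real image's text)
text <- "1. Alpha 1,234 (56.7%)\n2. Beta 890 (12.3%)\nll. Gamma 45,678 (90.1%)\n"

rows <- gsub("ll\\. ", "11. ", text)     # repair "11." misread as "ll."
rows <- gsub("\n", " ", rows)            # flatten to a single line
rows <- strsplit(rows, "\\d+\\. ")[[1]]  # split on the row numbers
rows <- trimws(rows)[-1]                 # drop the empty piece before "1. "

# Strip the county name, then the bracketed percentage, to isolate the count
counts <- sub(" *\\(.*$", "", sub("^[^0-9]+", "", rows))

parsed <- data.frame(
  county  = trimws(gsub("\\d.*$", "", rows)),                      # text before first digit
  number  = as.integer(gsub(",", "", counts)),                     # remove thousands separators
  percent = as.double(sub("^.*\\(([0-9.]+)%\\).*$", "\\1", rows)), # value inside "(...%)"
  stringsAsFactors = FALSE
)
print(parsed)
```

Each element of `rows` holds one table row, such as "Alpha 1,234 (56.7%)", so the three columns can be pulled out with independent patterns. If the OCR output for a different image uses another layout, the patterns would need adjusting.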