@agricolamz
Created March 29, 2024 14:26
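library(tidyverse)

# OCR the scanned glossary pages; the result is one string per page
# (pdf_ocr_text() needs the tesseract package installed)
pages <- pdftools::pdf_ocr_text("Khan 2008 Jewish Neo-Aramaic Dialect of Urmi-465-497.pdf")

# Split pages into entries at blank lines, one entry per row
tibble(text = str_split(pages, "\n\n") |> unlist()) |>
  # drop the running header and very short OCR fragments
  filter(!str_detect(text, "GLOSSARY OF VERBS"),
         nchar(text) > 4) |>
  # drop the first two remaining chunks
  slice(-c(1:2)) |>
  # the headword is the first run of non-whitespace characters
  mutate(verb = str_extract(text, "\\S+\\s"),
         verb = str_squish(verb),
         text = str_remove_all(text, "\n")) |>
  select(verb, text) |>
  writexl::write_xlsx("result.xlsx")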