@arbelt
Last active August 1, 2018 20:11
Cities
library(tidyverse)  # attaches dplyr, tidyr, tibble and stringr
library(stringi)    # stri_trans_general(), stri_match_all_regex()
# Sample free-text responses that mention zero or more cities.
df <- tibble::tribble(
  ~id, ~response,
  1, "Paris to Berlin blah blah blah",
  2, "Hello there stuff berlin London to Madrid berlini-Stuff Dover",
  3, "Białystok to Port-au-Prince",
  4, "I went to Saudi",
  5, "I went to Saudi arabia"
)
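# Each entry is a regex fragment, so optional suffixes such as "( Arabia)?" are allowed.
# "Bialystok" is spelled in plain ASCII because responses are transliterated before matching.
cities <- c("Paris", "Berlin", "London", "Madrid", "Dover(-Foxcroft)?",
            "Bialystok", "Port-au-Prince", "Beijing", "Saudi( Arabia)?")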
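# Side example: match hyphen-separated chains of airport codes such as "BOS-CDG-LAX".
# Group 1 is the first code and group 2 is the rest of the chain; the inner group
# repeats, so it retains only the last code it matched.
codes_rex <- "\\b(BOS|CDG|LAX)((?:-(BOS|CDG|LAX))+)\\b"
stri_match_all_regex("BOS-CDG-LAX and other stuff", codes_rex)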
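# Combine the fragments into one case-insensitive alternation bounded by word
# boundaries, so "berlin" matches but "berlini" does not.
cities_rx <- str_c("(?i)\\b(", str_c(cities, collapse = "|"), ")\\b")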
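# Extract every city mentioned in each response, one row per (id, city) pair.
df_ <- df %>%
  # Strip diacritics so "Białystok" matches the ASCII pattern "Bialystok".
  mutate(response = stri_trans_general(response, "latin-ascii")) %>%
  # str_extract_all() returns a list-column with all matches per response.
  mutate(cities = str_extract_all(response, cities_rx)) %>%
  select(id, cities) %>%
  # Name the list-column explicitly; a bare unnest() is deprecated in tidyr >= 1.0.
  unnest(cities) %>%
  # Normalise capitalisation, e.g. "berlin" becomes "Berlin".
  mutate(cities = str_to_title(cities))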