Skip to content

Instantly share code, notes, and snippets.

@Tadge-Analytics
Created October 31, 2019 02:59
Show Gist options
  • Save Tadge-Analytics/9ce7f6f5333339cd48c1ad4278a08905 to your computer and use it in GitHub Desktop.
Save Tadge-Analytics/9ce7f6f5333339cd48c1ad4278a08905 to your computer and use it in GitHub Desktop.
# We also want to identify services stations which are the same...
# but where the address is written differently
# in a way not corrected for, above...
unique_stations %>%
distinct(full_name, new_address) %>%
group_by(full_name) %>%
mutate(rank = row_number()) %>%
spread(rank, new_address) %>%
mutate(response = NA_character_) %>%
select(full_name, response, everything()) %>%
openxlsx::write.xlsx("are these dupliate names.xlsx", asTable = T)
# if there are duplicates...
# How should we idenitify them?
# and what should we do?
# If they are the same identify the Rank of the one that you will assign all entries
recoded_addresses <- readxl::read_excel("reference/are these dupliate names.xlsx", col_types = "text") %>%
filter(!is.na(response)) %>%
gather(key, new_address, -full_name, -response) %>%
filter(!is.na(new_address))
addresses_ready <- recoded_addresses %>%
left_join(recoded_addresses %>%
filter(!is.na(as.numeric(response)) & as.numeric(response) == key) %>%
select(-key) %>%
rename(replacement_address = new_address),
by = c("full_name", "response")) %>%
mutate(replacement_address = if_else(is.na(replacement_address), new_address, replacement_address)) %>%
select(-response, -key)
# take the replacements into account
stations_we_are_using <- unique_stations %>%
inner_join(addresses_ready, by = c("full_name", "new_address")) %>%
select(-new_address) %>%
mutate(unique_store_id = paste0(full_name, replacement_address, sep = "|")) %>%
semi_join(valid_stores, by = c("unique_store_id" = "Unique Store Id"))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment