@geotheory
Last active March 18, 2019 20:11
Parse the Mozilla-initiated Public Suffix List of TLDs and public subdomains into a usable R data.frame, with a function to return the root private subdomain.
# see https://publicsuffix.org/
library(stringr)
library(dplyr)
library(tibble)  # provides enframe()
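# Download the list, keep everything between the ICANN start marker and the
# private-domains end marker, and split it back into one rule per line.
ps = readLines('https://publicsuffix.org/list/public_suffix_list.dat') %>% paste(collapse='$') %>%
  str_extract('(?<=BEGIN ICANN DOMAINS===).*(?=// ===END PRIVATE DOMAINS)') %>% str_split('[$]') %>% .[[1]] %>%
  enframe(name = NULL) %>% rename(subdom = value) %>% mutate(subdom = str_trim(subdom)) %>%
  # drop blank lines and // comments
  filter(subdom != '') %>% filter(!str_detect(subdom, '^//')) %>%
  # tld = last label; strip a leading "*." wildcard marker from the rule
  mutate(tld = subdom %>% str_remove('.*[.]'),
         subdom = str_remove(subdom, '^[*]?[.]')) %>% arrange(tld, subdom)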
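# Function to return the root of any domain (i.e. the first subdomain of a public suffix domain).
# u: a single URL or hostname; domlist: character vector of public suffixes (e.g. ps$subdom).
root_domain = function(u, domlist, keep_unmatched = FALSE){
  # strip the scheme, a leading www./www1. prefix and any path
  u = str_remove(u, '^https?://') %>% str_remove('^www[0-9]?[.]') %>% str_remove('/.*')
  u_spl = str_split(u, '[.]')[[1]]
  n = length(u_spl)
  # test candidate suffixes from longest to shortest
  for(i in seq_len(n)){
    if(paste(u_spl[i:n], collapse='.') %in% domlist){
      # keep one label to the left of the suffix; a bare suffix is returned as-is
      return(paste(u_spl[max(i - 1, 1):n], collapse='.'))
    }
  }
  if(keep_unmatched) u else NA_character_
}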
## Example
#
# x = c("https://www.news.bbc.co.uk/128631826.htm", "uk", "co.uk", "bbc.co.uk",
# "news.bbc.co.uk", "sub.not-me.uk", "does.not.exist") # sub.not-me.uk is false positive test for me.uk
#
# purrr::map_chr(x, ~ root_domain(.x, ps$subdom))
# [1] "bbc.co.uk" "uk" "co.uk" "bbc.co.uk" "bbc.co.uk" "not-me.uk" NA
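
The lookup can also be tried offline. The following is a minimal sketch, assuming a tiny hand-written sample in the Public Suffix List file format rather than the real list; `sample_lines`, `suffixes` and `root_of` are hypothetical names introduced only for this illustration. It applies the same cleaning steps (trim, drop blanks and `//` comments, strip a leading `*.`) and the same longest-suffix-first walk as `root_domain`.

```r
library(stringr)

# Hypothetical miniature of the list file format (NOT the real list).
sample_lines <- c(
  "// ===BEGIN ICANN DOMAINS===",
  "// uk : example comment line",
  "uk",
  "co.uk",
  "me.uk",
  "",
  "*.ck",
  "// ===END ICANN DOMAINS==="
)

# Trim, drop blank lines and comments, strip a leading "*." wildcard marker.
rules <- str_trim(sample_lines)
rules <- rules[rules != "" & !str_detect(rules, "^//")]
suffixes <- str_remove(rules, "^[*]?[.]")

# Hypothetical helper mirroring root_domain(): try candidate suffixes from
# longest to shortest and keep one extra label to the left of the match.
root_of <- function(host, suffixes) {
  labels <- str_split(host, "[.]")[[1]]
  n <- length(labels)
  for (i in seq_len(n)) {
    if (paste(labels[i:n], collapse = ".") %in% suffixes) {
      return(paste(labels[max(i - 1, 1):n], collapse = "."))
    }
  }
  NA_character_
}

hosts <- c("news.bbc.co.uk", "sub.not-me.uk", "co.uk", "does.not.exist")
roots <- vapply(hosts, root_of, character(1), suffixes = suffixes, USE.NAMES = FALSE)
print(roots)
# "bbc.co.uk" "not-me.uk" "co.uk" NA
```

Note that, as in the script above, a wildcard rule such as `*.ck` is reduced to the plain suffix `ck`, so this sketch does not apply the wildcard or `!` exception semantics defined by the list format.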