@swuyts
Last active June 14, 2018 16:36
library(tidyverse)
library(rvest)
# Read in the website
site <- read_html("https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_squads")
# Parse website for player tables
players <- site %>%
  html_table(fill = TRUE) %>%
  .[1:32] # Keep only the tables related to the 32 teams
# Parse website for team names
teams <- site %>%
  html_nodes("h3 .mw-headline") %>%
  html_text() %>%
  .[1:32] # Keep only the first 32 hits
# Parse website for coach names
coaches <- site %>%
  html_nodes("h3+ p") %>%
  html_text() %>%
  .[1:32] %>% # Keep only the first 32 hits
  str_replace_all("Coach: ", "") %>% # Strip the "Coach: " label
  str_trim() # Remove leading and trailing whitespace
# Parse website to figure out in which group the team competes
group <- site %>%
  html_nodes("h2 .mw-headline") %>%
  html_text() %>%
  .[1:8] %>% # Keep only the first 8 hits
  rep(4) %>% # Repeat the 8 groups so there is one entry per team
  sort() # Sort so each group appears 4 times in a row, matching the team order
# Now that we have all of the tables separately, let's combine them into one
table <- tibble(team = teams,
                coach = coaches,
                group = group,
                player = players) %>%
  unnest(player) %>% # The players column is a list of tables, so unnest it
  rename(position = `Pos.`) %>%
  mutate(position = str_sub(position, 2, 3)) %>% # Fix parsing error: drop the leading digit
  rename(age = `Date of birth (age)`) %>%
  mutate(age = as.integer(str_sub(age, -4, -2))) # Keep only the age before the closing ")"
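
# A minimal offline sketch of the two string fixes above, useful for checking
# the str_sub() offsets without hitting Wikipedia. The sample cells below are
# assumptions about what html_table() returns for this page (a sort-key digit
# glued to the position, and the age in trailing parentheses); they are
# illustrative values, not scraped data.
library(stringr)
sample_position <- c("1GK", "2DF", "3MF", "4FW")
sample_birth <- c("(1986-03-27)27 March 1986 (aged 32)",
                  "(1998-12-20)20 December 1998 (aged 19)")
clean_position <- str_sub(sample_position, 2, 3) # Drop the leading sort-key digit
clean_age <- as.integer(str_sub(sample_birth, -4, -2)) # Digits just before ")"
print(clean_position)
print(clean_age)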