Forked from paulrougieux/scraplinks.R
Extract link texts and urls from a web page into an R data frame

kguidonimartins commented Mar 21, 2020

How to use:

# loading packages (installing them first if missing;
# note that require() alone does not load a package it has just installed)
if (!require("tidyverse")) { install.packages("tidyverse"); library(tidyverse) }
if (!require("rvest")) { install.packages("rvest"); library(rvest) }

# loading function
#' Extract link texts and urls from a web page
#' @param url character, a URL
#' @return a tibble of link texts and URLs
#' @examples
#' \dontrun{
#' scraplinks("http://localhost/")
#' glinks <- scraplinks("http://google.com/")
#' }
#' @export
scraplinks <- function(url){
  # Create an html document from the url
  webpage <- xml2::read_html(url)
  # Extract the URLs
  url_ <- webpage %>%
    rvest::html_nodes("a") %>%
    rvest::html_attr("href")
  # Extract the link text
  link_ <- webpage %>%
    rvest::html_nodes("a") %>%
    rvest::html_text()
  return(tibble::tibble(link = link_, url = url_))
}
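
Note that rvest::html_attr("href") returns hrefs exactly as they appear in the page, so many of them will be relative paths. If you need fully qualified URLs, here is a minimal sketch of a variant that resolves them with xml2::url_absolute(); the name scraplinks_abs is hypothetical, not part of the original gist:

# variant that resolves relative hrefs against the page URL
# NOTE: `scraplinks_abs` is a hypothetical name for illustration
scraplinks_abs <- function(url){
  webpage <- xml2::read_html(url)
  href_ <- webpage %>%
    rvest::html_nodes("a") %>%
    rvest::html_attr("href")
  link_ <- webpage %>%
    rvest::html_nodes("a") %>%
    rvest::html_text()
  # xml2::url_absolute() joins each relative href with the base URL
  tibble::tibble(link = link_, url = xml2::url_absolute(href_, base = url))
}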

# getting the links
results <- scraplinks("http://google.com/")

# viewing the tibble
View(results)
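
Anchors without an href attribute come back as NA in the url column. If you want to drop those rows before saving, a quick cleanup looks like this (a sketch; dplyr is attached via tidyverse, and results_clean is a hypothetical name):

# optional cleanup: drop anchors with a missing or empty href
results_clean <- results %>%
  dplyr::filter(!is.na(url), url != "")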

# saving the resulting tibble
results %>% 
  write_csv("results.csv")

# extra: you can read results.csv back in with:
df <- read_csv("results.csv")
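
As a quick sanity check on the scraped data, you can tabulate which URLs appear most often on the page (a sketch using dplyr::count, attached via tidyverse):

# most frequently linked URLs on the page
df %>%
  dplyr::count(url, sort = TRUE)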
