@paulrougieux
Last active October 3, 2021 06:29
Extract link texts and urls from a web page into an R data frame
@Telecastro

Hi Paul,
thanks for your code. I tried to use it to extract only the link text from a standalone hyperlink string,
something, but it does not work for me...

Could you please help me extract the word "something" from that hyperlink?

Thanks a lot !

Viktor
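
For a single standalone hyperlink string like the one Viktor describes, a minimal sketch with rvest might look like the following. The anchor string and its URL are placeholders invented for illustration (the original string is not shown in the comment), and html_node() is used on the assumption that it is available in the installed rvest version.

```r
library(rvest)

# Hypothetical standalone hyperlink string; the URL is a placeholder.
link_string <- '<a href="https://example.com/page">something</a>'

# Parse the fragment, then select the first <a> element.
anchor <- html_node(read_html(link_string), "a")

link_text <- html_text(anchor)          # text between <a> and </a>
link_url  <- html_attr(anchor, "href")  # value of the href attribute

print(link_text)
print(link_url)
```

The same two calls, html_text() and html_attr(), are what the gist's function relies on for a whole page; here they are applied to one node only.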

@ajatoledo

ajatoledo commented Apr 5, 2019

ln20 should read:

return(data.frame(link = link_, url = url_))

@philipbard

ln20 should read:

return(data.frame(link = link_, url = url_))

don't you mean line 20?

@paulrougieux
Author

@ajatoledo dplyr::data_frame has been replaced with dplyr::tibble. You have probably seen the message: "data_frame() is deprecated, use tibble()." I replaced this in the function above.

@ajatoledo

@paulrougieux data.frame() != data_frame(); my suggestion returns a data frame, while tibble() returns a tibble. The approach is yours, but data.frame() is base R while tibble() is tidyverse.

@paulrougieux
Author

paulrougieux commented Jun 22, 2020

@ajatoledo data.frame creates factor variables by default in R versions before 4.0.0, so you should add the argument stringsAsFactors=FALSE to get character variables: data.frame(link = link_, url = url_, stringsAsFactors=FALSE). tibble(link = link_, url = url_) is preferable in this simple example, because it creates the link and url columns as character variables. In addition, rvest is also part of the tidyverse suite of packages, and of course tibble() is made available by the very well-known dplyr package.
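
To make the difference concrete, here is a small base R sketch. The link_ and url_ vectors are made-up sample values standing in for what the function extracts, and stringsAsFactors is set explicitly in both calls so the result does not depend on the R version's default.

```r
# Hypothetical sample values standing in for the extracted links.
link_ <- c("Home", "About")
url_  <- c("https://example.com/", "https://example.com/about")

# Character columns: what the comment above recommends.
df_chr <- data.frame(link = link_, url = url_, stringsAsFactors = FALSE)

# Factor columns: what happens when strings are converted to factors.
df_fct <- data.frame(link = link_, url = url_, stringsAsFactors = TRUE)

print(class(df_chr$link))
print(class(df_fct$link))
```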

@ajatoledo

@paulrougieux I missed the dplyr calls for the url and link items, which mean tibble() is already in the namespace. However, the description says the function outputs a data frame, which could be misleading because the returned object is a tibble. That could be problematic for some users, as a tibble doesn't always play well with other functions that expect data.frame inputs, e.g. write.csv().

As for stringsAsFactors=FALSE in data.frame: if you wanted to set that, you'd likely want to do it in the initial function arguments, so the user is stuck to the pre-defined arguments in the function.

Either way, do what you want; each use case is different. The original function gave me a starting point when I needed it and got me where I needed to go.
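
If a downstream function does insist on a plain data frame, one possible workaround is to convert the tibble with as.data.frame() before passing it on. The sketch below assumes the tibble package is installed; the sample values and the temporary file are illustrative only.

```r
library(tibble)

# Hypothetical result in the shape the function returns.
tb <- tibble(link = c("Home", "About"),
             url  = c("https://example.com/", "https://example.com/about"))

# A tibble inherits from data.frame but carries extra classes.
print(class(tb))

# Drop the tibble classes to get a plain base R data frame.
df <- as.data.frame(tb)
print(class(df))

# Write the plain data frame to a temporary CSV file.
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)
```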

@goodyonsen

@paulrougieux;
Hi Paul,
I want to use this code, but I first need to follow a link (href=database.htm) on the main page of the website, and then extract information from many more links there.
How do I modify this code accordingly?
Thanks.
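
One way to approach this two-step pattern, sketched under assumptions: the main page below is an inlined, made-up HTML string and base_url is a placeholder address, so the example runs without network access. It only shows how to find the database.htm link and resolve it to a full URL; fetching the second page is left as a comment.

```r
library(rvest)

# Hypothetical main page, inlined so the sketch runs without network access.
main_html <- '<html><body>
  <a href="about.htm">About</a>
  <a href="database.htm">Database</a>
</body></html>'
base_url <- "https://example.com/index.htm"  # placeholder address

main_page <- read_html(main_html)
hrefs <- html_attr(html_nodes(main_page, "a"), "href")

# Keep the link of interest and resolve it against the page address.
target <- hrefs[hrefs == "database.htm"]
next_url <- xml2::url_absolute(target, base_url)
print(next_url)

# With a real site, the next step would be read_html(next_url),
# followed by the same html_nodes()/html_attr() calls on that page.
```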

@paulrougieux
Author

@goodyonsen you will have better luck asking on Stack Overflow. To get you started, prepare answers to the following questions: What have you tried so far? What is the error message? Also make a reproducible example with a public URL.
