Extract link texts and urls from a web page into an R data frame

Telecastro commented Mar 19, 2019

Hi Paul,
thanks for your code. I tried to use it to extract only the link text from a standalone hyperlink string,
<a href="...">something</a>, but it does not work for me.

Could you please help me extract the word "something" from that hyperlink?

Thanks a lot!

Viktor
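
For a standalone hyperlink string like the one described above, a minimal sketch with rvest might look like the following. The URL and the variable names (link_string, anchor, link_text, link_url) are made up for illustration, and this is not necessarily how the gist's function works. html_nodes() is used because it exists in both older and newer rvest releases; rvest 1.0.0 and later offer html_elements() as its replacement.

```r
library(rvest)

# Hypothetical input: a standalone hyperlink string (the URL is made up).
link_string <- '<a href="https://example.com/page">something</a>'

# read_html() accepts literal HTML and wraps the fragment in a minimal
# document, so the <a> node can be selected with a CSS selector as usual.
anchor <- html_nodes(read_html(link_string), "a")

link_text <- html_text(anchor)          # text between <a> and </a>
link_url  <- html_attr(anchor, "href")  # value of the href attribute

print(link_text)  # "something"
print(link_url)   # "https://example.com/page"
```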


ajatoledo commented Apr 5, 2019

ln20 should read:

return(data.frame(link = link_, url = url_))

philipbard commented Jun 20, 2020

> ln20 should read:
>
> return(data.frame(link = link_, url = url_))

Don't you mean line 20?

Owner Author

paulrougieux commented Jun 22, 2020

@ajatoledo dplyr::data_frame has been replaced with dplyr::tibble; you have probably seen the message "data_frame() is deprecated, use tibble()." I replaced it in the function above.


ajatoledo commented Jun 22, 2020

@paulrougieux data.frame() != data_frame(); my suggestion returns a data frame, whereas tibble() returns a tibble. The approach is yours, but data.frame() is base R while tibble() is tidyverse.

Owner Author

paulrougieux commented Jun 22, 2020

@ajatoledo In R versions before 4.0.0, data.frame converts strings to factor variables by default, so you should add the argument stringsAsFactors=FALSE to get character variables: data.frame(link = link_, url = url_, stringsAsFactors=FALSE). tibble(link = link_, url = url_) is preferable in this simple example because it always creates the link and url columns as character variables. In addition, rvest is also part of the tidyverse suite of packages, and tibble() is re-exported by the well-known dplyr package.
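
To make the comparison concrete, here is a small sketch with made-up link_ and url_ values (the real ones would come from the scraped page). It assumes the tibble package is installed. Passing stringsAsFactors explicitly gives character columns regardless of the R version, since the default only changed to FALSE in R 4.0.0.

```r
# Made-up example values standing in for the scraped link texts and URLs.
link_ <- c("something", "another link")
url_  <- c("https://example.com/a", "https://example.com/b")

# Base R: the explicit argument yields character columns on every R version.
df <- data.frame(link = link_, url = url_, stringsAsFactors = FALSE)
print(sapply(df, class))  # both columns are "character"

# tibble() never converts strings to factors.
tb <- tibble::tibble(link = link_, url = url_)
print(sapply(tb, class))  # both columns are "character"

# A tibble still inherits from data.frame, with extra classes in front.
print(class(df))  # "data.frame"
print(class(tb))  # "tbl_df" "tbl" "data.frame"
```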


ajatoledo commented Jun 22, 2020

@paulrougieux I missed the dplyr usage for the url and link items, which means tibble() is already in the namespace. However, the description says the function outputs a data frame, which could be misleading because the returned object is a tibble. That could be problematic for some users, as tibbles don't always play well with functions that expect data.frame inputs, e.g. write.csv().

As for stringsAsFactors=FALSE in data.frame: if you wanted to set that, you'd likely want to expose it in the function's own arguments so the user isn't stuck with values pre-defined inside the function.

Either way, do what you want; each use case is different. The original function gave me a starting point when I needed it and got me where I needed to go.
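
The idea of exposing stringsAsFactors as a function argument could be sketched roughly as below. build_link_frame is a hypothetical helper, not the gist's function: it takes already-extracted link texts and URLs so that the example runs without fetching a page, and the example values are made up.

```r
# Hypothetical helper (not the gist's function): builds the result from
# already-extracted link texts and URLs. The caller decides whether strings
# become factors; the default keeps them as character.
build_link_frame <- function(link_, url_, stringsAsFactors = FALSE) {
  data.frame(link = link_, url = url_, stringsAsFactors = stringsAsFactors)
}

# Default: a plain base R data frame with character columns.
res <- build_link_frame("something", "https://example.com/page")
print(class(res))       # "data.frame"
print(class(res$link))  # "character"

# Opting in to factors when a caller wants them.
res_f <- build_link_frame("something", "https://example.com/page",
                          stringsAsFactors = TRUE)
print(class(res_f$link))  # "factor"
```

A caller who receives a tibble but needs a plain data frame can also convert it afterwards with as.data.frame().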
