@phileas-condemine
Last active November 4, 2019 18:50
A script to collect article teasers (link, title, abstract, date) from the Le Monde.fr site for exploratory purposes (text mining)
library(rvest)
library(data.table)

# Years to scrape; up to 250 result pages are requested per year.
annees = 1980:2019
# Make sure the output directory exists before saving into it.
dir.create("lemonde_scraping", showWarnings = FALSE)

pbapply::pblapply(annees, function(annee) {
  pbapply::pblapply(1:250, function(i) {
    # An error on one page (network, parsing) skips that page instead of stopping the run.
    tryCatch({
      url = sprintf(paste0("https://www.lemonde.fr/recherche/",
                           "?search_keywords=a&start_at=01/01/%s&",
                           "end_at=31/12/%s&search_sort=relevance_desc&page=%s"),
                    annee, annee, i)
      # html_session() was renamed session() in rvest 1.0.0.
      page = rvest::html_session(url)
      teasers = page %>% html_nodes("section.teaser")
      href = teasers %>% html_node("a") %>% html_attr("href")
      title = teasers %>% html_node("h3") %>% html_text()
      abstract = teasers %>% html_node("p") %>% html_text()
      date = teasers %>% html_node("span.meta__date") %>% html_text()
      data = data.frame(href = href, title = title, abstract = abstract,
                        date = date, stringsAsFactors = F)
      save(data, file = paste0("lemonde_scraping/a_", annee, "_", i, ".RData"))
    }, error = function(e) NULL)
  })
})
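
# Keep only the years for which all 250 pages were saved.
files = list.files("lemonde_scraping/")
# Characters 3 to 6 of "a_<year>_<page>.RData" hold the year.
annees = table(substr(files, 3, 6))
annees = names(annees[annees == 250])
grid = expand.grid(annee = as.numeric(annees), nb = 1:250, stringsAsFactors = F)

# Load every saved page and stack the results into a single table.
scrape = pbapply::pbapply(grid, 1, function(x) {
  print(x)
  load(paste0("lemonde_scraping/a_", x[1], "_", x[2], ".RData"))
  data
})
scrape_dt = rbindlist(scrape)
# Drop duplicate rows.
scrape_dt = unique(scrape_dt)
save(scrape_dt, file = "lemonde_scraping_10k_per_annee.RData")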
@phileas-condemine

Beware: the site is updated frequently, so the search URL and CSS selectors in this script may need adjusting.
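
Since the markup can change without notice, one possible safeguard is to fail loudly when the expected nodes are missing rather than silently saving empty tables. The sketch below is only an illustration: `extract_teasers` is a hypothetical helper that is not part of the script, and the HTML fragment is hand-written to mimic the markup the script's selectors assume, not copied from the live site.

```r
library(rvest)

# Hypothetical helper (not in the original script): extract the teaser fields
# from a parsed page and stop if the teaser selector matches nothing.
extract_teasers <- function(page, teaser_css = "section.teaser") {
  teasers <- html_nodes(page, teaser_css)
  if (length(teasers) == 0) {
    stop("No node matches '", teaser_css, "': the page layout may have changed.")
  }
  data.frame(
    href = html_attr(html_node(teasers, "a"), "href"),
    title = html_text(html_node(teasers, "h3")),
    abstract = html_text(html_node(teasers, "p")),
    date = html_text(html_node(teasers, "span.meta__date")),
    stringsAsFactors = FALSE
  )
}

# Offline check on a hand-written fragment (assumed markup, not the real page).
sample_page <- read_html(paste0(
  "<html><body><section class='teaser'>",
  "<h3>Title</h3><a href='/x.html'>link</a><p>Summary</p>",
  "<span class='meta__date'>01/01/1980</span>",
  "</section></body></html>"
))
teasers <- extract_teasers(sample_page)
```

Inside the scraping loop, a call such as `extract_teasers(page)` could replace the four separate extraction lines; combined with an error handler that logs the message, a layout change would then show up as a visible error instead of a run of empty files.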