@roger35972134
Last active September 27, 2017 07:41
Crawler preparation for news analysis
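require 'nokogiri'
require 'open-uri'

# Fetch the udn.com page-view ranking page and parse it as HTML.
# URI.open is used because Kernel#open no longer accepts URLs as of Ruby 3.0.
ranking = Nokogiri::HTML(URI.open('https://udn.com/rank/pv/2/0/1'))

# Collect the article URLs from the ranking list, skipping anchors without an href.
news = ranking.css('dt h2 a').map { |link| link['href'] }.compact

# Download each article and concatenate the text of all its <p> elements.
text = news.map do |url|
  article = Nokogiri::HTML(URI.open(url))
  article.css('p').map(&:content).join
end.join

# Save the combined article text for later analysis.
File.open('article.txt', 'w') { |file| file.write(text) }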
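
The script passes each `href` straight to the fetch call, which only works if every link on the ranking page is an absolute URL. The sketch below is one way to make that step more robust, assuming the page might also return relative or protocol-relative links, blanks, or duplicates. The helper name `normalize_links` and the sample paths are hypothetical and are not part of the original gist.

```ruby
require 'uri'

# Base URL of the ranking page used in the script above.
BASE_URL = 'https://udn.com/rank/pv/2/0/1'

# Hypothetical helper: resolve each href against the base URL and
# drop nil, blank, and duplicate entries.
def normalize_links(hrefs, base = BASE_URL)
  hrefs.compact
       .map(&:strip)
       .reject(&:empty?)
       .map { |href| URI.join(base, href).to_s }
       .uniq
end

# Made-up sample input mixing relative, protocol-relative, and absolute links.
sample = [
  '/news/story/1/2',
  '//udn.com/news/story/3/4',
  'https://udn.com/news/story/1/2',
  nil
]

puts normalize_links(sample)
```

If this fits the actual markup, the `news` array could be run through `normalize_links` before the article loop, so that each article is fetched once and always by an absolute URL.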