Created August 7, 2013 13:07
Scrape the content of a Tianya Yidu (天涯易读) thread and print it
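# encoding: utf-8
# File name: tuoshui.rb
# Scrapes the content of a Tianya Yidu (天涯易读) thread and prints it.
# Usage:
#   ruby tuoshui.rb [Tianya Yidu thread id] > result.txt
# If no thread id is given when running the script, the default id is 40489.
require 'watir-webdriver'

DEFAULT_ID = 40489
DEFAULT_PAGE_COUNT = 10

# Build the URL template for a thread. The escaped %%d survives as a literal
# %d placeholder, which page_url later fills in with the page number.
def build_url(id)
  sprintf('http://www.tianyayidu.cc/article-a-%d-%%d.html', id)
end

# Fill the page number into the URL template.
# Named page_url so it does not clash with a local variable called page.
def page_url(index, url)
  sprintf(url, index)
end

id = ARGV.first.nil? ? DEFAULT_ID : ARGV.first.to_i
url = build_url(id)
puts url

b = Watir::Browser.new :chrome
begin
  b.goto page_url(1, url)
  # Read the page count from the pager; use the default if no number is found.
  page_text = b.div(:class, 'pageNum1').text
  m = page_text.match(/(\d+)/)
  page_count = m ? m[1].to_i : DEFAULT_PAGE_COUNT

  # Visit every page and print the text of each matching list item.
  (1..page_count).each do |n|
    b.goto page_url(n, url)
    b.lis(:class, 'at c h2').each { |li| puts li.text }
  end
ensure
  # Close the browser even if scraping fails partway through.
  b.quit
end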
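
The script relies on two small pieces of logic that can be tried without a browser: a two-stage URL template (the escaped `%%d` is left behind as a plain `%d` for the page number) and a regex that pulls the page count out of the pager text. The sketch below isolates both. The helper names `thread_url_template`, `page_url_for` and `page_count_from` are hypothetical and do not appear in the script, and the sample pager text is an assumption, since the script does not show what the real `pageNum1` element contains.

```ruby
# Stage 1: fill in the thread id and keep a literal "%d" for the page number.
# (Hypothetical helper name, for illustration only.)
def thread_url_template(thread_id)
  format('http://www.tianyayidu.cc/article-a-%d-%%d.html', thread_id)
end

# Stage 2: fill the page number into the template from stage 1.
def page_url_for(template, page_number)
  format(template, page_number)
end

# Treat the first run of digits in the pager text as the page count,
# falling back to a default when the text contains no digits.
def page_count_from(pager_text, default = 10)
  match = pager_text.match(/(\d+)/)
  match ? match[1].to_i : default
end

template = thread_url_template(40489)
puts template                            # http://www.tianyayidu.cc/article-a-40489-%d.html
puts page_url_for(template, 3)           # http://www.tianyayidu.cc/article-a-40489-3.html
puts page_count_from('total 25 pages')   # 25 (sample text is an assumption)
puts page_count_from('')                 # 10
```

Because only the first run of digits is used, pager text such as "1 of 25" would yield 1 rather than 25, so the pattern may need adjusting to match the actual page markup.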