Gist cherenkov/2410393
Created April 18, 2012 01:30
Every "link-target URL and link text" on a given web page, neatly.. - 人力検索はてな (Hatena Question) http://q.hatena.ne.jp/1334564001
Usage:
After running gem install mechanize, run:
ruby get_anchor.rb url_list.txt
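Each URL in url_list.txt produces one report file under ./output. As a rough sketch of how the script derives that file name (host, path and fragment joined, then every non-word character removed), the snippet below wraps the same expression in a helper. The name `output_name` is hypothetical and does not appear in the script.

```ruby
require 'uri'

# Hypothetical helper (not part of get_anchor.rb): builds the report file
# name with the same expression the script's main loop uses.
def output_name(url)
  uri = URI.parse(url)
  # Join host, path and fragment (a missing fragment joins as an empty
  # string), then drop everything that is not a letter, digit or underscore.
  [uri.host, uri.path, uri.fragment].join.gsub(/\W/, '') + '.txt'
end

puts output_name('http://q.hatena.ne.jp/1334564001')
puts output_name('http://www.ruby-lang.org/ja/old-man/html/String.html#gsub.21')
```

Note that the query string is not part of the name, so two URLs that differ only in their query would write to the same file.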
get_anchor.rb:
# coding: utf-8
# Every "link-target URL and link text" on a given web page, neatly.. - 人力検索はてな
# http://q.hatena.ne.jp/1334564001
require 'fileutils' # FileUtils.mkdir_p is used in create_dir
require 'mechanize'
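class String
  # Strips leading and trailing whitespace in place, including the
  # full-width space (U+3000) that String#strip leaves alone.
  # The first character of the original character set is presumably U+3000,
  # given the method name; it is written as an escape here so it stays visible.
  def strip_with_full_size_space!
    # \A and \z anchor to the whole string rather than to each line.
    gsub!(/\A[\u3000\s]*|[\u3000\s]*\z/, '')
  end

  # Non-destructive variant: works on a copy and returns it.
  def strip_with_full_size_space
    dup.strip_with_full_size_space!
  end
end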
# Fetches url and returns a report with one "absolute URL --> link text"
# line for every link on the page.
def get_anchor(url)
  result = ''
  agent = Mechanize.new
  page = agent.get(url)
  result << "====================\n#{page.title}\n#{url}\n#{Time.now}\n====================\n\n"
  page.links.each do |link|
    # This simply joins the page URL with the (possibly relative) link path,
    # but it seems to yield the correct absolute URL.
    result << "#{URI.join(url, link.uri.to_s)} --> #{link.text.gsub(/[\r\n]/, '')}\n"
  end
  result
end
def create_dir(path)
  FileUtils.mkdir_p(path) unless FileTest.exist?(path)
end

create_dir('./output')
# Process each URL (one per line) in the file given as the first argument.
IO.foreach(ARGV[0]) do |s|
  s.strip_with_full_size_space!
  next if s.empty?

  output = get_anchor(s)
  uri = URI.parse(s)
  # File name: host, path and fragment with all non-word characters removed.
  filepath = [uri.host, uri.path, uri.fragment].join.gsub(/\W/, '') + '.txt'
  filepath = './output/' + filepath
  # The block form closes the file even if the write raises.
  File.open(filepath, 'w') { |f| f.write(output) }
end
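The comment inside get_anchor notes that joining the page URL with each link path seems to give the right absolute URL. A minimal sketch of why that works: `URI.join` resolves its second argument as a reference relative to the first. The helper name `absolutize` and the three hrefs are made up for illustration; they are not taken from the script or from the page.

```ruby
require 'uri'

# Hypothetical helper for illustration; the script calls URI.join inline.
def absolutize(page_url, href)
  URI.join(page_url, href).to_s
end

page_url = 'http://q.hatena.ne.jp/1334564001'

# The hrefs below are invented examples, not links found on that page.
puts absolutize(page_url, '/list')                         # root-relative path
puts absolutize(page_url, 'help')                          # relative to the page's directory
puts absolutize(page_url, 'http://www.ruby-lang.org/ja/')  # already absolute, returned unchanged
```

An href that is not a valid URI reference raises `URI::InvalidURIError`, so a more defensive version of the loop might rescue that error and skip the link.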
url_list.txt:
http://q.hatena.ne.jp/1334564001
http://hatena.g.hatena.ne.jp/hatenaquestion/20120306/1331018692
http://www.ruby-lang.org/ja/old-man/html/String.html#gsub.21
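The last URL in the list is the String#gsub! reference, which matters for the strip helper in get_anchor.rb: `gsub!` returns nil when the pattern matches nowhere. As far as I can tell, the helper avoids that case only because its pattern uses `*` and can therefore match the empty string at the start of the input. A small sketch of the difference, using simplified patterns rather than the script's own:

```ruby
# A pattern that matches nowhere: gsub! signals this by returning nil.
no_match = 'abc'.dup.gsub!(/x/, '')
p no_match

# A pattern that can match the empty string always matches, so gsub!
# returns the (unchanged) receiver instead of nil.
empty_match = 'abc'.dup.gsub!(/\Ax*/, '')
p empty_match

# Same shape as the strip helper, with plain \s standing in for its character set.
stripped = "  abc \n".dup.gsub!(/\A\s*|\s*\z/, '')
p stripped
```

If the pattern were ever changed to use `+` instead of `*`, the non-destructive variant would start returning nil for already-clean strings, so switching it to `gsub` would be the safer choice.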