@cherenkov
Created April 18, 2012 01:30
All "link URLs and link text" on a given web page, neatly.. - 人力検索はてな http://q.hatena.ne.jp/1334564001
Usage:
After running gem install mechanize, run:
ruby get_anchor.rb url_list.txt
# coding: utf-8
# All "link URLs and link text" on a given web page, neatly.. - 人力検索はてな
# http://q.hatena.ne.jp/1334564001
require 'fileutils'
require 'uri'
require 'mechanize'
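class String
  # Strips leading and trailing whitespace, including the full-width space (U+3000).
  def strip_with_full_size_space!
    s = "\u3000\s\r\n\t\f\v"
    # \A and \z anchor to the whole string; ^ and $ would also match at every line break.
    gsub!(/\A[#{s}]*|[#{s}]*\z/o, '')
  end

  def strip_with_full_size_space
    clone.strip_with_full_size_space!
  end
end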
def get_anchor(url)
  result = ''
  agent = Mechanize.new
  agent.get(url)
  result << "====================\n#{agent.page.title}\n#{url}\n#{Time.now}\n====================\n\n"
  agent.page.links.each do |link|
    # Joining the page URL with each (possibly relative) href looks crude,
    # but it seems to yield the correct absolute URL.
    result << "#{URI.join(url, link.uri.to_s)} --> #{link.text.gsub(/[\r\n]/, '')}\n"
  end
  result
end
def create_dir(path)
  FileUtils.mkdir_p(path) unless FileTest.exist?(path)
end
create_dir('./output')
IO.foreach(ARGV[0]) do |s|
  s.strip_with_full_size_space!
  next if s.empty?

  output = get_anchor(s)
  uri = URI.parse(s)
  # File name: host + path + fragment with all non-word characters removed.
  filepath = './output/' + [uri.host, uri.path, uri.fragment].join.gsub(/\W/, '') + '.txt'
  File.write(filepath, output)
end
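The comment in get_anchor notes that joining the page URL with each href seems to give correct absolute URLs. Here is a minimal sketch of why, using only Ruby's standard uri library: URI.join resolves its second argument as a reference against the first, so relative, root-relative, and already-absolute hrefs all come out absolute. The absolutize helper and the example.com / example.org URLs are hypothetical and appear only for illustration; they are not part of the script.

```ruby
require 'uri'

# Hypothetical helper: resolve an href found on page_url into an absolute URL string.
def absolutize(page_url, href)
  URI.join(page_url, href.to_s).to_s
end

page = 'http://example.com/docs/index.html'

puts absolutize(page, 'page.html')            # relative to the page's directory
puts absolutize(page, '../img/a.png')         # climbs to the parent directory
puts absolutize(page, '/top')                 # root-relative
puts absolutize(page, 'http://example.org/x') # already absolute, kept as-is
puts absolutize(page, '')                     # empty reference resolves to the page URL itself
```

Because already-absolute hrefs pass through unchanged, the script can apply the same call to every link without first checking whether it is relative.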
http://q.hatena.ne.jp/1334564001
http://hatena.g.hatena.ne.jp/hatenaquestion/20120306/1331018692
http://www.ruby-lang.org/ja/old-man/html/String.html#gsub.21
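As a rough illustration of how the script names its output files, the sketch below applies the same host + path + fragment rule to the three URLs listed above. The output_name helper is hypothetical (the script inlines this expression). The query string is not part of the name, so under this rule two URLs that differ only in their query would be written to the same file.

```ruby
require 'uri'

# Hypothetical helper mirroring the script's naming expression:
# concatenate host, path, and fragment, then drop every non-word character.
def output_name(url)
  uri = URI.parse(url)
  [uri.host, uri.path, uri.fragment].join.gsub(/\W/, '') + '.txt'
end

urls = [
  'http://q.hatena.ne.jp/1334564001',
  'http://hatena.g.hatena.ne.jp/hatenaquestion/20120306/1331018692',
  'http://www.ruby-lang.org/ja/old-man/html/String.html#gsub.21'
]

urls.each do |url|
  puts "#{url} -> #{output_name(url)}"
end
```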