@j138
Last active August 29, 2015 13:57
Collect the thumbnail images that appear in Google Image Search results
#!/usr/local/bin/ruby
# coding: utf-8
require 'net/http'
require 'uri'
require 'openssl'
require 'digest/md5'
require 'nokogiri'
# ================================
# TODO: seeq txt
kw_list = %W(lol Diablo blizzard)
target_file = 'jawiki-20140225-all-titles'
save_dir = 'img/'
# ================================
# generate Query for google Image Search
def gen_crawl_uri(kw)
  # SafeSearch levels; index 1 selects 'medium'
  safe_querys = %w(off medium high)
  safe_query = safe_querys[1]
  host = 'https://www.google.co.jp/search?'
  query_hash = {
    q: kw, safe: safe_query,
    tbm: 'isch', source: 'og',
    ie: 'UTF-8', oe: 'UTF-8', hl: 'ja', lr: 'ja', client: 'firefox-a',
    hs: 'zfJ', bav: 'on.2,or.r_cp.', biw: '1440', bih: '694', um: '1', pws: 0
  }
  # URI.escape was removed in Ruby 3.0; encode_www_form percent-encodes every value
  host + URI.encode_www_form(query_hash)
end
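def http_request(uri)
  uri_parsed = URI.parse(uri)
  http = Net::HTTP.new(uri_parsed.host, uri_parsed.port)
  http.use_ssl = (uri_parsed.scheme == 'https')
  # verify the server certificate; VERIFY_NONE would disable TLS validation
  http.verify_mode = OpenSSL::SSL::VERIFY_PEER
  # http.set_debug_output $stderr
  # request only the path and query, not the absolute URI
  http.get(uri_parsed.request_uri).body
end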
# e.g. save_dir/md5(url).jpg
def save_file(uri, save_dir, filename)
  # assumes every thumbnail is a JPEG
  path = File.join(save_dir, filename + '.jpg')
  File.open(path, 'wb') do |file|
    # write, not puts: puts appends a newline when the data lacks one and corrupts the binary
    file.write Net::HTTP.get_response(URI.parse(uri)).body
  end
  printf("saved \"%s\" as %s\n", uri, path)
end
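# A minimal sketch, not part of the original script: save_file assumes every
# thumbnail is a JPEG. Assuming the server sends an accurate Content-Type
# header, a hypothetical helper such as extension_for could choose the file
# extension instead. The mapping is illustrative and falls back to '.jpg'.
def extension_for(content_type)
  # Drop parameters such as "; charset=..." and normalize case.
  mime = content_type.to_s.split(';').first.to_s.strip.downcase
  { 'image/jpeg' => '.jpg', 'image/png' => '.png',
    'image/gif' => '.gif', 'image/webp' => '.webp' }.fetch(mime, '.jpg')
end
# Possible usage: extension_for(response['Content-Type']), where response is
# the Net::HTTPResponse returned by Net::HTTP.get_response.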
def crawl_google_image(kw, save_dir)
  doc = Nokogiri::HTML.parse(http_request(gen_crawl_uri(kw)))
  doc.css('a img').each do |node|
    uri = node['src']
    next if uri.nil? # skip <img> tags without a src attribute
    # the file name is the MD5 hex digest of the image URL
    save_file(uri, save_dir, Digest::MD5.hexdigest(uri))
  end
end
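# A hedged sketch, not in the original: img src values on a results page are
# not always fetchable URLs (inline data: URIs, for example). A hypothetical
# predicate such as http_image_src? could filter them out before save_file
# is called. It accepts only absolute http/https URLs that have a host.
require 'uri'

def http_image_src?(src)
  uri = URI.parse(src.to_s)
  # URI::HTTPS is a subclass of URI::HTTP, so both schemes pass this check.
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end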
# crawl every keyword in kw_list
kw_list.each { |kw| crawl_google_image(kw, save_dir) }
# crawl keywords read from target_file, one per line
count = 0
File.foreach(target_file) do |line|
  # chomp, not chomp!: chomp! returns nil when there is nothing to strip
  crawl_google_image(line.chomp, save_dir)
  count += 1
  break if count > 50
end