@kimoto
Created August 18, 2015 08:42
fashion press crawler
#!/usr/bin/env ruby
# encoding: utf-8
require 'net/http'
require 'nokogiri'

# Index page that lists every brand by its English name.
index_uri = URI("http://www.fashion-press.net/brands/en")
doc = Nokogiri::HTML(Net::HTTP.get(index_uri))
results = []
doc.search("#brandlist > table > tr > td > ul > li > a").each do |a|
  en_name = a.text.strip
  # Resolve the href against the index URL so that relative paths,
  # absolute URLs, and hrefs with query strings are all handled.
  brand_uri = index_uri.merge(a["href"])
  resp = Net::HTTP.get_response(brand_uri)
  if resp.code.to_i == 200
    # Each brand page holds the kana reading in the .title_sub element.
    kana = Nokogiri::HTML(resp.body).search(".title_sub").text.strip
    line = [en_name, kana].join("\t")
    puts line
    results << line
  else
    warn "failed fetch [#{resp.code}] #{brand_uri} ignored"
  end
  sleep 0.2 # pause between requests, including failed ones
end
File.write("fashion_brand_master.tsv", results.join("\n"))
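The script above writes one `en_name<TAB>kana` pair per line. As a minimal sketch of how that file might be read back, the helper below loads it into a Hash keyed by the English name. `load_brand_master` is a hypothetical name that is not part of the gist, and the sketch assumes the file is UTF-8 and that brand names contain no tab characters.

```ruby
# Hypothetical helper (not part of the original gist): load the TSV
# written by the crawler into a Hash of English name => kana reading.
# Assumes UTF-8 content and no tab characters inside the brand names.
def load_brand_master(path)
  table = {}
  File.foreach(path, encoding: "UTF-8") do |line|
    # Split on the first tab only; anything after it is the kana field.
    en_name, kana = line.chomp.split("\t", 2)
    next if en_name.nil? || en_name.empty? # skip blank lines
    table[en_name] = kana.to_s             # missing kana becomes ""
  end
  table
end
```

Calling `load_brand_master("fashion_brand_master.tsv")` after a crawl would return the lookup table. If the same English name appears more than once, the last row wins.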