@chezou
Created February 28, 2013 13:57
Split a URL file into several files. URLs with the same host are written to the same file.
#!/usr/bin/env ruby
# -*- coding: utf-8 -*-
require 'uri' # the library file name is lowercase; 'URI' fails on case-sensitive filesystems
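file = ARGV.shift
num_split = ARGV.shift.to_i
# A missing file or a split count below 1 would break log10 and the modulo below.
abort "usage: #{$0} URL_FILE NUM_SPLIT (NUM_SPLIT >= 1)" if file.nil? || num_split < 1

# Zero-padding width for the numeric suffix of the output file names.
order = Math.log10(num_split).truncate + 1
url_hash = {}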
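File.foreach(file) do |line|
  line.chomp!
  uri = begin
    URI.parse(line)
  rescue URI::InvalidURIError
    next # skip lines that are not valid URIs
  end
  # Skip lines with no host (blank lines, relative paths); nil has no #hash bucket we want.
  next if uri.host.nil?
  # Same host => same key, so every URL of a host lands in the same bucket.
  # String#hash is only stable within one run, which is all this script needs.
  key = uri.host.hash % num_split
  (url_hash[key] ||= []) << line
end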
# Write one file per non-empty bucket, numbered in order of first appearance.
url_hash.each_with_index do |(_key, urls), i|
  filename = "%s_%0#{order}d" % [file, i]
  File.open(filename, "w") do |f|
    urls.each { |url| f.puts url }
  end
end
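
The grouping step above depends on mapping each host to a bucket index. Ruby's built-in `hash` values are seeded per process, so a given host can land in a different bucket on each run. If reproducible buckets matter, one option is a checksum such as CRC32. The sketch below is a swapped-in technique, not what the script does, and `bucket_for` is a hypothetical helper name.

```ruby
require 'uri'
require 'zlib'

# Hypothetical helper (not part of the original script): map a URL to a
# bucket index in 0...num_split based on its host. Returns nil when the
# line is not a valid URI or has no host.
def bucket_for(url, num_split)
  host = URI.parse(url).host
  return nil if host.nil?
  # CRC32 of the lowercased host gives the same bucket on every run.
  Zlib.crc32(host.downcase) % num_split
rescue URI::InvalidURIError
  nil
end

a = bucket_for("http://example.com/a", 4)
b = bucket_for("http://example.com/b?x=1", 4)
puts a == b # same host, so both URLs share a bucket
```

To use it in the script, the `key = ...` line would call `bucket_for(line, num_split)` and skip the line when the result is `nil`.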