@jronallo
jronallo / common_crawl_hostname_count.rb
Last active September 29, 2017 23:12
Ruby scripts for parsing the output from the Common Crawl URL index: https://github.com/trivio/common_crawl_index/blob/master/bin/remote_read
#!/usr/bin/env ruby
# A quick, simple script that partially parses the output of https://github.com/trivio/common_crawl_index/blob/master/bin/remote_read
# and outputs subdomains in order of count.
url_counts = {}
total_urls = 0
File.readlines(ARGV[0]).each do |line|
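  # The first whitespace-separated field on each line is the URL entry.
  url = line.split(' ').first
  # Everything before the first '/' is the hostname (reversed, as the variable name indicates).
  reverse_hostname = url.split('/').first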