@finoradin
Created September 3, 2012 05:42
Analyzes a CSV of browsing data, where row[0] is the timestamp and row[1] is the URL. For each day, it takes the first five unique hostnames visited between 6am and 11am, then counts how many days each hostname made that list.
import csv
from collections import Counter, defaultdict
from datetime import datetime
from urllib.parse import urlsplit

indiv = Counter()  # hostname -> number of days it was among the first five
domains = defaultdict(lambda: defaultdict(int))  # date -> hostnames counted that day

with open("history.csv", newline="") as f:
    for timestr, url in csv.reader(f):
        dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
        if 6 <= dt.hour < 11:  # between 6am and 11am
            today_domains = domains[dt.date()]
            domain = urlsplit(url).hostname
            # Count a hostname only if it is new today and fewer than five are recorded.
            if len(today_domains) < 5 and domain not in today_domains:
                today_domains[domain] += 1
                indiv[domain] += 1

for domain in indiv:
    print('%s,%d' % (domain, indiv[domain]))
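To try the same counting logic without a history.csv on disk, it can be wrapped in a function that takes any iterable of (timestamp, url) rows. The following is only a sketch under a few assumptions: it targets Python 3, the function name `count_morning_domains` and its parameters are hypothetical, the sample rows are made up, and rows are assumed to arrive in chronological order. Unlike the script above, it skips URLs that have no hostname.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime
from urllib.parse import urlsplit


def count_morning_domains(rows, start_hour=6, end_hour=11, per_day=5):
    """Tally hostnames that are among the first `per_day` unique ones each day.

    `rows` is any iterable of (timestamp, url) pairs, assumed chronological.
    Only visits with start_hour <= hour < end_hour are considered.
    """
    seen_by_day = defaultdict(set)  # date -> hostnames already counted that day
    totals = Counter()              # hostname -> number of days it made the cut
    for timestr, url in rows:
        dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
        if not (start_hour <= dt.hour < end_hour):
            continue
        host = urlsplit(url).hostname
        seen = seen_by_day[dt.date()]
        # Assumption: rows without a parseable hostname are ignored.
        if host is not None and len(seen) < per_day and host not in seen:
            seen.add(host)
            totals[host] += 1
    return totals


# Hypothetical sample data standing in for history.csv.
sample = io.StringIO(
    "2012-09-01 07:15:00,http://example.com/a\n"
    "2012-09-01 07:20:00,http://example.com/b\n"
    "2012-09-01 08:00:00,http://news.example.org/\n"
    "2012-09-01 13:00:00,http://late.example.net/\n"
    "2012-09-02 06:30:00,http://example.com/\n"
)
totals = count_morning_domains(csv.reader(sample))
for host, n in sorted(totals.items()):
    print("%s,%d" % (host, n))
```

With the sample rows, example.com is counted once on each of the two days, news.example.org once, and the 1pm visit falls outside the window. Using a set per day is enough here because only membership and size are checked.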