@gedankenstuecke
Created January 25, 2018 00:44
month downloads days_with_data
01 9257598 31
02 9854264 27
03 12990139 30
04 8911780 21
05 13959687 30
06 14123204 30
07 18410282 31
08 17418110 30
09 15235240 30
10 4071933 8
11 15878921 30
12 10764704 31
@gedankenstuecke
Author

Generated by:

from collections import defaultdict

dl_per_month = defaultdict(int)    # month -> number of log lines (one line per download)
days_per_month = defaultdict(set)  # month -> set of days that appear in the log

# The first tab-separated column is expected to begin with a YYYY-MM-DD date.
with open('2017.statistics.tab', 'r') as log:
    for line in log:
        la = line.strip().split("\t")
        date = la[0].split(' ')[0].split('-')  # [year, month, day]
        dl_per_month[date[1]] += 1
        days_per_month[date[1]].add(date[2])

print("month\tdownloads\tdays_with_data")
for key in sorted(dl_per_month):  # sort so months always print in calendar order
    downloads = dl_per_month[key]
    days_in_month = len(days_per_month[key])
    print("{}\t{}\t{}".format(key, downloads, days_in_month))
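Because some months have incomplete logs (October has only 8 days with data), the monthly totals are not directly comparable. A minimal sketch of one way to normalize them, assuming the table above is available as tab-separated text; the function name `downloads_per_day` is hypothetical and not part of the original script:

```python
import csv
import io


def downloads_per_day(table_text):
    """Return {month: downloads / days_with_data} from the tab-separated table."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return {
        row["month"]: int(row["downloads"]) / int(row["days_with_data"])
        for row in reader
    }


# Two rows copied from the table above, used only to exercise the function.
sample = "month\tdownloads\tdays_with_data\n10\t4071933\t8\n11\t15878921\t30\n"

for month, rate in downloads_per_day(sample).items():
    print("{}\t{:.0f}".format(month, rate))
```

This only averages over days that appear in the log; it says nothing about why the other days are missing.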

@dhimmel

dhimmel commented Jan 25, 2018

@gedankenstuecke nice. It would also be good to have visitors in addition to downloads. I suspect visitors (unique IP addresses) will be less affected by automated downloads. As per our conversation here regarding the lack of column labels, how confident are you that the third column is an IP address code? If we're not confident, we should avoid reporting visitors and stick with downloads.
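A rough sketch of how visitors per month could be counted, under the unconfirmed assumption raised above that the third tab-separated column is an IP address code. The function name `visitors_per_month` and the sample rows are made up for illustration; only the date handling follows the script above:

```python
from collections import defaultdict


def visitors_per_month(lines):
    """Count distinct third-column values per month.

    Assumption: the third column identifies the visitor (an IP address
    code). The log has no column labels, so this is not confirmed.
    """
    visitors = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip lines without a third column
        month = fields[0].split(" ")[0].split("-")[1]
        visitors[month].add(fields[2])
    return {month: len(ids) for month, ids in sorted(visitors.items())}


# Made-up rows: date, a placeholder second column, a placeholder visitor code.
sample = [
    "2017-01-02 10:00:00\titem-a\tcode1\n",
    "2017-01-02 10:05:00\titem-b\tcode1\n",
    "2017-01-15 08:30:00\titem-a\tcode2\n",
    "2017-02-01 12:00:00\titem-c\tcode1\n",
]

print("month\tvisitors")
for month, count in visitors_per_month(sample).items():
    print("{}\t{}".format(month, count))
```

On the real log, the same function could be called with the open file handle for `2017.statistics.tab` in place of `sample`. If the third column turns out to be something else, the counts would be meaningless, so this is only worth running once the column's meaning is settled.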

@dhimmel

dhimmel commented Jan 25, 2018

Thanks @gedankenstuecke for putting this together! It helped inspire this notebook where I calculate these numbers for the old and new logs.
