@tripleee
Created April 22, 2018 11:25
Metasmoke Tumblr hits
#!/usr/bin/env python3
import json, fileinput

# Maps each Tumblr site (scheme + host) to its hit count and first-seen date.
tumblrs = dict()
for line in fileinput.input():
    # Each input line is expected to hold a JSON array of post records.
    data = json.loads(line)
    for rec in data:
        for field in 'title', 'body':
            if '.tumblr.com/' in rec[field]:
                # Crude link extraction: split on the start of each anchor tag.
                splits = rec[field].split('<a href="')
                for item in splits[1:]:
                    url = item.split('"')[0]
                    if '.tumblr.com/' in url:
                        # Keep only scheme and host, e.g. http://name.tumblr.com
                        tumblr = '/'.join(url.split('/')[0:3])
                        if tumblr in tumblrs:
                            tumblrs[tumblr]['count'] += 1
                        else:
                            tumblrs[tumblr] = {
                                'count': 1,
                                'date': rec['created_at']}

# Report sites by descending (count, date); tally 2015 and 2016 hits separately.
skipped = 0
for tumblr in reversed(sorted(tumblrs, key=lambda rec: (tumblrs[rec]['count'], tumblrs[rec]['date']))):
    if tumblrs[tumblr]['date'].startswith(('2015-', '2016-')):
        skipped += tumblrs[tumblr]['count']
        continue
    print('%i %s %s' % (
        tumblrs[tumblr]['count'], tumblrs[tumblr]['date'], tumblr))
print('skipped %i' % skipped)
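
The script does not document its input, so here is a minimal sketch of the link extraction run on a made-up record. It assumes each input line is a JSON array of objects with `title`, `body`, and `created_at` keys, which is what the code reads. The helper name `tumblr_origins`, the sample URLs, and the date format are hypothetical and only for illustration.

```python
import json

def tumblr_origins(html):
    """Return the scheme-and-host prefix of every Tumblr link in an HTML string."""
    origins = []
    # Same crude parsing as the script: split on the start of each anchor tag.
    for item in html.split('<a href="')[1:]:
        url = item.split('"')[0]
        if '.tumblr.com/' in url:
            # 'https://x.tumblr.com/post/1' splits into
            # ['https:', '', 'x.tumblr.com', 'post', '1']; keep the first three parts.
            origins.append('/'.join(url.split('/')[0:3]))
    return origins

# Hypothetical input line: a JSON array holding one post record (made-up values).
sample_line = json.dumps([{
    "title": "Example",
    "body": '<p><a href="https://example.tumblr.com/post/1">x</a> '
            '<a href="https://example.org/">y</a></p>',
    "created_at": "2018-04-01T00:00:00Z",
}])

for rec in json.loads(sample_line):
    print(tumblr_origins(rec["body"]), rec["created_at"])
```

Only the Tumblr link survives, reduced to its scheme and host, so several posts on the same blog are counted under one key.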
@tripleee
Added the tag #tumblr-deleted to these, and added #drugs to most of them. Many of them aren't actually drug spam, though: random stuff, typically with a single spam post each.
