
Stephen Merity (Smerity)

@Smerity
Smerity / babi_rnn.py
Created August 17, 2015 11:32
Epoch tuning through early stopping for a bAbI RNN in Keras
from __future__ import absolute_import
from __future__ import print_function
from functools import reduce
import re
import tarfile
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.callbacks import EarlyStopping
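
The preview above cuts off at the imports; below is a minimal sketch of how EarlyStopping is typically wired into training. It uses current Keras call signatures and a placeholder dense model on random data, not the gist's actual bAbI RNN.

# Sketch only: placeholder model and random data, not the gist's bAbI setup.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

X_train = np.random.random((1000, 100))          # stand-in for vectorised stories
y_train = np.random.randint(2, size=(1000, 1))   # stand-in for answers

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(100,)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

# Stop once validation loss stops improving instead of hand-tuning the epoch count.
early_stopping = EarlyStopping(monitor='val_loss', patience=2)
model.fit(X_train, y_train, epochs=40, validation_split=0.1,
          callbacks=[early_stopping])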
@Smerity
Smerity / knn.cpp
Created April 7, 2014 04:37
KNN C++ implementation for Kaggle LSHTC
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <map>
#include <set>
#include <sstream>
#include <unordered_map>
#include <vector>
@Smerity
Smerity / failed_logins
Created June 6, 2017 22:09
List of failed SSH logins produced by `egrep -o "invalid user ([^ ]+?) " /var/log/auth.log | cut -d ' ' -f 3 | sort | uniq -c | sort -nk 1`
1 .+?
1 [^
2 0000
2 010101
2 1111
2 1234
2 12345
2 666666
2 adm
2 anna
@Smerity
Smerity / cartpole.py
Last active May 26, 2017 13:47
Script for Cartpole using policy gradient via Chainer, two layer MLP, dropout, and rejection sampling of historical memories
''' Script for Cartpole using policy gradient via Chainer, two layer MLP, dropout, and rejection sampling of historical memories '''
import gym
import numpy as np
import chainer
from chainer import optimizers
from chainer import ChainList, Variable
import chainer.functions as F
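
Only the imports survive in this preview; the following is a rough sketch of what a two-layer MLP policy with dropout might look like in Chainer. The layer sizes, dropout ratio, and use of chainer.links are assumptions, not the gist's actual network.

# Sketch only: a two-layer MLP policy for CartPole (4 observations -> 2 actions).
import numpy as np
import chainer.functions as F
import chainer.links as L
from chainer import ChainList

class MLPPolicy(ChainList):
    def __init__(self, n_hidden=64):
        super(MLPPolicy, self).__init__(
            L.Linear(4, n_hidden),
            L.Linear(n_hidden, 2),
        )

    def __call__(self, x):
        # ReLU hidden layer with dropout, softmax over the two actions
        h = F.dropout(F.relu(self[0](x)))
        return F.softmax(self[1](h))

policy = MLPPolicy()
obs = np.zeros((1, 4), dtype=np.float32)   # placeholder observation from gym
probs = policy(obs).data[0]
action = np.random.choice(2, p=probs / probs.sum())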
@Smerity
Smerity / get_all_urls.py
Created June 23, 2015 01:05
Collect all URLs for NYTimes in the Common Crawl URL Index
import requests
show_pages = 'http://index.commoncrawl.org/CC-MAIN-2015-18-index?url={query}&output=json&showNumPages=true'
get_page = 'http://index.commoncrawl.org/CC-MAIN-2015-18-index?url={query}&output=json&page={page}'
query = 'nytimes.com/*'
show = requests.get(show_pages.format(query=query))
pages = show.json()['pages']
results = set()
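
The preview stops before the pagination loop. A plausible continuation of the snippet above (reusing its requests import and the pages, get_page, query, and results variables), assuming the index returns one JSON record per line containing a 'url' field:

import json

# Continuation sketch: walk every result page and collect the URLs.
for page in range(pages):
    resp = requests.get(get_page.format(query=query, page=page))
    for line in resp.text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        results.add(record['url'])

print('Collected', len(results), 'unique URLs')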
@Smerity
Smerity / gist:2704d3d65aa191ff5f27
Last active May 1, 2017 19:45
About the data

Data Location

The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. Downloading it is free from any instance on Amazon EC2, via either S3 or HTTP.

As the Common Crawl Foundation has evolved over the years, so have the format and metadata that accompany the crawls themselves.

  • [ARC] Archived Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/ - crawl data from 2008/2010
  • [ARC] Archived Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/ - crawl data from 2009/2010
  • [ARC] Archived Crawl #3 - s3://aws-publicdatasets/common-crawl/parse-output/ - crawl data from 2012
  • [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/
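
Because the bucket is public, any of the s3:// prefixes above can also be fetched over plain HTTP by rewriting the path. A small sketch, assuming the standard warc.paths.gz listing file sits at the crawl prefix (bucket names and layouts have changed over time, so treat the exact key as an example):

import requests

def s3_to_http(s3_path):
    # s3://bucket/key -> https://bucket.s3.amazonaws.com/key
    bucket, _, key = s3_path[len('s3://'):].partition('/')
    return 'https://{}.s3.amazonaws.com/{}'.format(bucket, key)

# Example key only: the per-crawl listing of WARC files for CC-MAIN-2013-20.
url = s3_to_http('s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/warc.paths.gz')
resp = requests.get(url, stream=True)
print(resp.status_code, resp.headers.get('Content-Length'))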
@Smerity
Smerity / count_wikitext.py
Created February 9, 2017 23:00
Count the number of unique tokens in WikiText-2 and/or WikiText-103
vocab = set()
for i, line in enumerate(open('wiki.train.tokens')):
    words = [x for x in line.split(' ') if x]
    [vocab.add(word) for word in words]
    if i < 10: print(words)
print('Vocab size:', len(vocab))
@Smerity
Smerity / part-r-00000
Created April 6, 2014 23:38
Output from the Common Crawl HTML tag frequency count run over a single compressed 859MB WARC file
0 48
0000 6
0l 1
0xdc00 13
1 69
10 11
100 3
1001 1
100154 1
1004 1
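
The Hadoop job that produced this output isn't included in the gist; a rough single-machine sketch of the same idea, counting start tags across the HTML responses in a WARC file, is below. The warcio and html.parser libraries and the input filename are assumptions, not what the original job used.

# Sketch only: count HTML start tags in one WARC file on a single machine.
from collections import Counter
from html.parser import HTMLParser
from warcio.archiveiterator import ArchiveIterator

class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

counter = TagCounter()
with open('example.warc.gz', 'rb') as stream:    # hypothetical input file
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            payload = record.content_stream().read()
            counter.feed(payload.decode('utf-8', 'replace'))

for tag, count in counter.counts.most_common(20):
    print(tag, count)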
@Smerity
Smerity / buggy_cartpole.py
Last active September 2, 2016 00:09
Buggy (but preserved for posterity) script for Cartpole using policy gradient via Chainer, two layer MLP, dropout, and vaguely rejection sampling of historical memories
""" Quick script for Cartpole using policy gradient via Chainer, two layer MLP, dropout, and vaguely rejection sampling of historical memories """
import gym
import numpy as np
import chainer
from chainer import optimizers
from chainer import ChainList, Variable
import chainer.functions as F
@Smerity
Smerity / README
Created September 5, 2013 15:26
Instructions to install the required Python packages for CS109 on Ubuntu using virtualenv
#!/bin/bash
# If you'd like, you can actually run this file
# It likely makes more sense to read it, understand it, and run the instructions yourself
# Create the virtual environment
virtualenv env
# Enter into the virtual environment
source ./env/bin/activate