@mittenchops
mittenchops / count_hist.sh
Created February 4, 2014 20:45
Count unique URLs when url is the first field on each line of a JSON file, and get the top 10 repeats
awk '{print $2}' file.json | sort | uniq -c | sort -rn | head
mittenchops / rsplit.R
Created February 6, 2014 21:57
rsplit for R
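# Return the last piece of each string after splitting on chr (treated as a regex by strsplit)
rsplit <- function(mydf, chr){
  sapply(sapply(mydf, strsplit, chr, USE.NAMES=FALSE), function(x){x[length(x)]})
}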
# USAGE
# val <- rsplit(df$long_url,"/")
mittenchops / htmlclean.py
Created February 9, 2014 18:45
HTML Cleaner
import requests
from lxml.html.clean import Cleaner
url = "http://en.wikipedia.org/wiki/Zipf%27s_law"
html = requests.get(url).text
# allow_tags cannot be combined with remove_unknown_tags=True; remove_tags takes bare tag names
cleaner = Cleaner(allow_tags=[''], remove_unknown_tags=False, remove_tags=['div'])
# Also strip scripts, page-structure tags, JavaScript, and styles
cleaner.scripts = True
cleaner.page_structure = True
cleaner.javascript = True
cleaner.style = True
mittenchops / alldigits.py
Created February 12, 2014 19:45
regexes for a number with an optional decimal place+decimal digits
import re
# It took 15 tries.
digits = re.findall(r'[0-9]+', cleaned_text)
digits2 = re.findall(r'[0-9]\d*(\.\d+)?', cleaned_text)
digits3 = re.findall(r'[0-9]+((\.([0-9]+)))?', cleaned_text)
digits4 = re.findall(r'[0-9]+(\.)?([0-9]+)?', cleaned_text)
digits5 = re.findall(r'\d+\.?\d*?', cleaned_text)
digits6 = re.findall(r'\d+(\.\d*)?', cleaned_text)
digits7 = re.findall(r'\d+(\.?\d*)?', cleaned_text)
digits8 = re.findall(r'\d+(\.?\d*)', cleaned_text)
digits9 = re.findall(r'\d+(\.{1}\d*)?', cleaned_text)
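A likely reason the attempts above were frustrating: when a pattern contains a capturing group, `re.findall` returns the group contents instead of the whole match. A minimal sketch of one way around that, using a non-capturing group `(?:...)`; the `sample_text` string is a made-up stand-in for `cleaned_text`, not data from the gist.

```python
import re

# Hypothetical sample standing in for cleaned_text.
sample_text = "Zipf measured 3.14 units in 2014, then 42 more."

# With a capturing group, findall returns only what the group captured.
captured = re.findall(r'\d+(\.\d+)?', sample_text)

# With a non-capturing group, findall returns each full match:
# one or more digits, optionally followed by a dot and more digits.
NUMBER = re.compile(r'\d+(?:\.\d+)?')
numbers = NUMBER.findall(sample_text)

print(captured)  # ['.14', '', '']
print(numbers)   # ['3.14', '2014', '42']
```

This pattern requires at least one digit after the decimal point, so a trailing period at the end of a sentence is not swallowed into the number.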
mittenchops / histmaker.py
Last active August 29, 2015 13:56
Prepare a histogram in python, functional like whoa
mylist = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
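The preview above is cut off before any histogram code appears, so the gist's actual approach is not visible here. As a rough sketch of one way to tally and print such a list, assuming `collections.Counter` is acceptable and using a short made-up `sample` list in place of the truncated `mylist`:

```python
from collections import Counter

# Hypothetical short list; the original mylist is truncated in the preview.
sample = [0] * 3 + [1] * 5 + [2] * 4 + [3] * 2

# Tally how many times each value occurs.
counts = Counter(sample)

# Build one text row per distinct value, in ascending order of value.
rows = ["{}: {}".format(value, "#" * n) for value, n in sorted(counts.items())]
print("\n".join(rows))
```

The bars are plain text, so this only suits small counts; for larger data, `matplotlib.pyplot.hist` would be the more usual choice.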
mittenchops / pylist2rlist.py
Created February 13, 2014 22:34
Export python list to R
lambda x: 'c({})'.format(x).replace("[","").replace("]","")
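For illustration, here is the same expression wrapped in a named function with a usage example. The name `py_list_to_r` is made up for this sketch and does not appear in the gist.

```python
def py_list_to_r(values):
    """Format a flat Python list as an R c(...) vector literal."""
    # str() of a list gives "[1, 2.5, 3]"; dropping the brackets leaves the
    # comma-separated items, which are then wrapped in c( ... ).
    return 'c({})'.format(values).replace("[", "").replace("]", "")

r_vector = py_list_to_r([1, 2.5, 3])
print(r_vector)  # c(1, 2.5, 3)
```

Because the brackets are stripped by plain string replacement, nested lists are flattened and any string element that itself contains `[` or `]` is altered, so this is best kept to flat lists of numbers or simple strings.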
mittenchops / TrueOCR.sh
Last active August 29, 2015 13:56
Convert a PDF with no text data into a text file
# https://launchpad.net/~gezakovacs/+archive/pdfocr
pdfocr -i "$file" -o /tmp/tmp.pdf
pdftotext /tmp/tmp.pdf "$(basename "$file" .pdf).txt"
mittenchops / learn.py
Last active August 29, 2015 13:56
Learning pandas
import pandas
import numpy as np
import string
import random
import matplotlib.pyplot as plt
from pandas import DataFrame
#import statsmodels.formula.api as sm
df = DataFrame(np.random.randn(10,3))
# string.letters exists only in Python 2; string.ascii_letters works in both 2 and 3
df['3'] = random.sample(string.ascii_letters,10)
mittenchops / mongo agg
Created February 21, 2014 23:19
Mongo Aggregation reminder
> db.coll.aggregate( [ { $group : {_id:0, minS : {$min: "$variabletomin"}, maxS : {$max : "$variabletomax"} } } ] )
mittenchops / groupby.py
Last active August 29, 2015 13:56
In which I get a better handle on using groupby in Python, so that it is almost as useful as R's similar native features.
from itertools import groupby, islice
from operator import itemgetter
from pprint import pprint
>>> iseven = lambda n: n % 2 == 0  # assumed definition; not shown in the original
>>> gb = groupby(sorted(range(0,11),key=iseven),iseven)
>>> [','.join(map(str,g)) for k,g in gb]
['1,3,5,7,9', '0,2,4,6,8,10']
>>> sent = "This is a long sentence where I want to group words of similar length using the python groupby function"
>>> gb = groupby(sorted(sent.split(),key=len),len)
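The preview stops before the grouped result is consumed. As a sketch of one way to finish the idea (not necessarily what the gist does next), the groups can be collected into a dictionary keyed by word length; the name `words_by_length` is introduced here for illustration.

```python
from itertools import groupby

sent = ("This is a long sentence where I want to group words "
        "of similar length using the python groupby function")

# groupby only merges adjacent items, so sort by the same key first.
# sorted() is stable, so words of equal length keep their sentence order.
words_by_length = {
    length: list(words)
    for length, words in groupby(sorted(sent.split(), key=len), len)
}

print(words_by_length[1])  # ['a', 'I']
print(words_by_length[5])  # ['where', 'group', 'words', 'using']
```

Each group is an iterator that is invalidated once `groupby` advances, which is why the words are copied into a list inside the comprehension.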