David Yerrington dyerrington

@dyerrington
dyerrington / removeblank.sh
Created April 25, 2015 00:39
Remove blank lines and comments
grep -Ev '^[[:space:]]*(#|$)' /etc/ssh/sshd_config
@dyerrington
dyerrington / removeext
Created April 25, 2015 00:40
Remove files with a specific extension, recursively.
find . -type f -name '*.ext' -print0 | xargs -0 rm -f
@dyerrington
dyerrington / word_counts
Created May 1, 2015 06:59
Count words in files recursively
find . -type f -print0 | xargs -0 cat | wc -w
@dyerrington
dyerrington / reject_outliers
Created May 6, 2015 22:40
Pandas: keep only rows within +/- 3 standard deviations of the mean
sql_df[np.abs(sql_df['score'].values - sql_df['score'].values.mean())<=(3*sql_df['score'].values.std())]
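A minimal sketch of the same filter wrapped in a reusable function. The function name `reject_outliers`, the `n_std` parameter, and the sample data are assumptions for illustration; only the `sql_df` / `score` names come from the one-liner above.

```python
import numpy as np
import pandas as pd


def reject_outliers(df, column, n_std=3.0):
    """Return the rows of df whose `column` value lies within
    n_std standard deviations of the column mean."""
    values = df[column].to_numpy()
    # NumPy's std() is the population standard deviation (ddof=0),
    # which matches the `.values.std()` call in the one-liner.
    mask = np.abs(values - values.mean()) <= n_std * values.std()
    return df[mask]


# Hypothetical sample data: twenty typical scores and one extreme value.
sql_df = pd.DataFrame({"score": [10.0] * 20 + [1000.0]})
filtered = reject_outliers(sql_df, "score")
```

Note that the mean and standard deviation are computed with the outliers included, so a single extreme value in a very small sample may inflate the threshold enough to survive the filter.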
@dyerrington
dyerrington / preprocess_corpus.py
Last active August 29, 2015 14:21
Preprocessing pipeline for documents with Gensim. Easily manage text data to build data frames, run classification, etc.
import numpy as np, pandas as pd, os, seaborn as sns, codecs
from gensim import corpora, models, similarities
from gensim.parsing.preprocessing import STOPWORDS
class preprocess_corpus(object):
    files = []
    dirs = []
    def __init__(self, dir, directory=False, stopwords_file=False):
@dyerrington
dyerrington / probability.py
Created May 30, 2015 20:04
Simple probability and stat functions
import bisect
import random
def Mean(t):
    """Computes the mean of a sequence of numbers.
    Args:
        t: sequence of numbers
    Returns:
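import numpy as np
import scipy.stats

def mean_confidence_interval(data, confidence=0.95):
    """Return the sample mean and the bounds of its confidence interval."""
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    # Half-width of the interval from the Student t quantile
    # (public ppf rather than the private _ppf).
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    return m, m - h, m + h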
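As a cross-check, the same interval can likely be obtained from SciPy's built-in `scipy.stats.t.interval`. This is a sketch under assumptions: the sample values are made up, and the confidence level is passed positionally because its keyword name differs between SciPy versions.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, for illustration only.
sample = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
confidence = 0.95

m = sample.mean()
se = stats.sem(sample)  # standard error of the mean (uses ddof=1)

# Student t interval with n-1 degrees of freedom, centred on the mean
# and scaled by the standard error.
lower, upper = stats.t.interval(confidence, len(sample) - 1, loc=m, scale=se)
```

The interval is symmetric around the mean, so `upper - m` should equal the half-width `h` computed in the function above.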
from geopy.geocoders import Bing
geolocator = Bing("your key here")
location = geolocator.geocode('your location here')
try:
    if not location: continue
    geo_location = {
        'origin_address': location.address,
        'origin_latitude': location.latitude,
        'origin_longitude': location.longitude
@dyerrington
dyerrington / grouper.py
Last active November 30, 2015 05:02
When I work with dates, it's nice to have them as actual "datetime" values rather than generic objects. When you first import the CSV, the "Time" feature has dtype "object". Once it is converted to a "datetime" type, Pandas Grouper() can group by a time period (e.g. days, weeks, months, years).
import pandas as pd
# Step 1: convert Time after loading
ufo = pd.read_csv('https://raw.githubusercontent.com/sinanuozdemir/SF_DAT_17/master/data/ufo.csv') # CSVs can also be read directly from the web
ufo['Time'] = pd.to_datetime(ufo['Time'])
# Step 2: group by calendar day
ufo.groupby([pd.Grouper(key='Time',freq='1D')])[['Shape Reported']].count()
# You can also concatenate Year, Month, and Day into a new feature and group by that. As an engineer, I much prefer to work with strict types and use the method above.
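For trying this out without a network connection, here is a small sketch that uses an in-memory frame in place of the CSV. The three rows and the date format string are assumptions made up for illustration; only the column names "Time" and "Shape Reported" come from the snippet above.

```python
import pandas as pd

# Hypothetical stand-in for the UFO CSV: "Time" starts out as plain strings,
# so its dtype is "object" just like after read_csv.
ufo = pd.DataFrame({
    "Time": ["6/1/1930 22:00", "6/1/1930 23:30", "6/3/1930 20:00"],
    "Shape Reported": ["TRIANGLE", "OTHER", "OVAL"],
})

# Convert the whole column at once to a datetime dtype.
ufo["Time"] = pd.to_datetime(ufo["Time"], format="%m/%d/%Y %H:%M")

# One row per calendar day between the first and last sighting.
daily = ufo.groupby(pd.Grouper(key="Time", freq="1D"))[["Shape Reported"]].count()
```

Days with no rows still appear in `daily` with a count of 0, because a frequency-based Grouper bins the whole time range rather than only the dates that occur. Changing `freq` to `"W"` groups by week in the same way.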
@dyerrington
dyerrington / auto_coefficients.py
Created October 9, 2015 07:28
Singular linear coefficients
def auto_coefficients(df):
    sorted_coefs = list()
    coefs = df.corr()
    for row_index, row_values in enumerate(coefs.values):
        for col_index, col_value in enumerate(row_values):
            if coefs.columns[row_index] == coefs.columns[col_index]: