David Yerrington dyerrington

dyerrington /
Created Apr 25, 2015
Remove blank lines and comments
grep -Ev '^[[:space:]]*(#|$)' /etc/ssh/sshd_config
dyerrington / removeext
Created Apr 25, 2015
Remove files with specific extension, recursively.
find . -type f -name '*.ext' -print0 | xargs -0 rm -f
dyerrington / word_counts
Created May 1, 2015
Count words in files recursively
find . -type f -print0 | xargs -0 cat | wc -w
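A rough Python counterpart to the pipeline above, offered as a sketch and not as a drop-in replacement. The function name `count_words` is hypothetical. It reads each file as bytes and counts whitespace-separated tokens, which is close to what `wc -w` does, though it counts per file instead of over one concatenated stream.

```python
import os


def count_words(root):
    """Count whitespace-separated tokens in every regular file under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Mirror `find -type f`: skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, 'rb') as handle:
                # bytes.split() with no argument splits on runs of ASCII whitespace.
                total += len(handle.read().split())
    return total
```

Calling `count_words('.')` walks the current directory, the same starting point the `find` command uses.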
dyerrington / reject_outliers
Created May 6, 2015
Pandas: keep only rows whose score is within +/- 3 standard deviations of the mean
sql_df[np.abs(sql_df['score'].values - sql_df['score'].values.mean())<=(3*sql_df['score'].values.std())]
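The one-liner above can be wrapped in a small helper and tried on made-up data. This is a minimal sketch: the function name `reject_outliers`, the `n_std` parameter, and the sample scores are assumptions for illustration, while `sql_df` and `score` are taken from the gist. Like the original, it uses NumPy's population standard deviation (ddof=0).

```python
import numpy as np
import pandas as pd


def reject_outliers(df, column, n_std=3.0):
    """Return the rows of df whose `column` lies within n_std standard deviations of the mean."""
    values = df[column].to_numpy(dtype=float)
    # Same test as the one-liner: |x - mean| <= n_std * std (NumPy std, ddof=0).
    keep = np.abs(values - values.mean()) <= n_std * values.std()
    return df[keep]


# Hypothetical data: 90 ordinary scores plus one extreme value.
scores = np.append(np.tile([48.0, 50.0, 52.0], 30), 500.0)
sql_df = pd.DataFrame({'score': scores})

filtered = reject_outliers(sql_df, 'score')
```

With very small samples a single extreme value inflates the standard deviation enough to hide itself, so a rule like this is more useful on larger frames.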
dyerrington /
Last active Aug 29, 2015
Preprocessing pipeline for processing documents with Gensim. Easily manage text data to format data frames, run classification, etc.
import numpy as np, pandas as pd, os, seaborn as sns, codecs
from gensim import corpora, models, similarities
from gensim.parsing.preprocessing import STOPWORDS
class preprocess_corpus(object):
    files = []
    dirs = []
    def __init__(self, dir, directory=False, stopwords_file=False):
dyerrington /
Created May 30, 2015
Simple probability and stat functions
import bisect
import random
def Mean(t):
    """Computes the mean of a sequence of numbers.
    t: sequence of numbers
View gist:3d4cdd4d4c2a7f4a66b7
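import numpy as np
import scipy.stats

def mean_confidence_interval(data, confidence=0.95):
    # Returns the sample mean and the lower and upper bounds of its two-sided t interval.
    a = np.asarray(data, dtype=float)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    # Half-width: standard error times the t critical value (public ppf, not the private _ppf).
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    return m, m - h, m + h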
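As a cross-check on the calculation above, SciPy can produce the same interval directly through `scipy.stats.t.interval`. The sketch below assumes a small made-up sample (`data` is hypothetical) and compares the manual half-width with SciPy's bounds.

```python
import numpy as np
import scipy.stats

# Hypothetical measurements, for illustration only.
data = [2.1, 2.5, 1.9, 2.3, 2.8, 2.2]

a = np.asarray(data, dtype=float)
n = len(a)
m = a.mean()
se = scipy.stats.sem(a)  # standard error of the mean (ddof=1)

# Manual half-width: standard error times the t critical value with n-1 degrees of freedom.
h = se * scipy.stats.t.ppf((1 + 0.95) / 2.0, n - 1)

# The same 95% interval straight from SciPy: confidence level, degrees of freedom, loc, scale.
low, high = scipy.stats.t.interval(0.95, n - 1, loc=m, scale=se)
```

Both routes should agree to floating-point precision, so `low` matches `m - h` and `high` matches `m + h`.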
View gist:1bcbd0378d65f6562cd9
from geopy.geocoders import Bing
geolocator = Bing("your key here")
location = geolocator.geocode('your location here')
# geocode() returns None when nothing matches, so check before reading attributes.
if location:
    geo_location = {
        'origin_address': location.address,
        'origin_latitude': location.latitude,
        'origin_longitude': location.longitude
    }
dyerrington /
Last active Nov 30, 2015
When I work with dates, it’s nice to have them as actual “datetime” values rather than generic objects. When you first import the CSV, the “Time” column has dtype “object”, meaning plain strings. Once that column is converted to a “datetime” type, Pandas Grouper() can group rows by a time period (e.g. days, weeks, months, years).
import pandas as pd
# Step 1: convert Time after loading
ufo = pd.read_csv('') # can also read csvs directly from the web!
ufo['Time'] = pd.to_datetime(ufo['Time'])
# Step 2: group by calendar day
ufo.groupby([pd.Grouper(key='Time', freq='1D')])[['Shape Reported']].count()
# You can also concatenate Year, Month, and Day into a new feature and group by that. As an engineer, I much prefer to work with strict types and use the Grouper approach shown here.
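Since the CSV path is left blank in the gist, here is a self-contained sketch of the same two steps on a tiny made-up frame. The four rows are hypothetical; only the column names `Time` and `Shape Reported` come from the snippet above.

```python
import pandas as pd

# Hypothetical stand-in for the UFO CSV: Time starts out as plain strings (dtype object).
ufo = pd.DataFrame({
    'Time': ['1930-06-01 22:00', '1930-06-01 23:30',
             '1930-06-30 20:00', '1931-02-15 14:00'],
    'Shape Reported': ['TRIANGLE', 'OTHER', 'OVAL', 'DISK'],
})

# Step 1: convert the string column to datetime64.
ufo['Time'] = pd.to_datetime(ufo['Time'])

# Step 2: one row per calendar day between the first and last sighting.
daily = ufo.groupby(pd.Grouper(key='Time', freq='1D'))[['Shape Reported']].count()

# Days with no sightings appear with a count of 0; drop them if only active days matter.
active_days = daily[daily['Shape Reported'] > 0]
```

Changing `freq` to another offset alias, such as `'7D'`, regroups the same data into wider bins without touching the rest of the code.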
dyerrington /
Created Oct 9, 2015
Singular linear coefficients
def auto_coefficients(df):
    sorted_coefs = list()
    coefs = df.corr()
    for row_index, row_values in enumerate(coefs.values):
        for col_index, col_value in enumerate(row_values):
            if coefs.columns[row_index] == coefs.columns[col_index]: