
Russell Jurney rjurney

rjurney /
Created Dec 2, 2019
Example of spaCy object Labeling Function
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{'POS': 'VERB'}, {'POS': 'ADP'}, {'POS': 'PROPN'}]
matcher.add("VERB_ADP_PROPN", None, pattern)

def lf_verb_in_noun(x):
    """Return positive if the pattern matches."""
    sp = x['spacy']
    matches = matcher(sp)

window = 5
candidates = []
for index, row in df.iterrows():
    doc = nlp(row['_Body'])
    for ent in doc.ents:
        rec = {}
        rec['body'] = doc.text
        rec['entity'] = ent
        rec['entity_text'] = ent.text
        rec['entity_start'] = ent.start
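The preview cuts the labeling function off before it votes. A minimal, pure-Python sketch of the idea, using a hand-rolled POS-trigram check as a stand-in for spaCy's Matcher (the `POSITIVE`/`ABSTAIN` vote values and the function name are invented here):

```python
POSITIVE, ABSTAIN = 1, -1  # hypothetical vote values

def lf_verb_adp_propn(pos_tags):
    """Vote POSITIVE when a VERB ADP PROPN trigram appears in the
    part-of-speech sequence, otherwise abstain."""
    for a, b, c in zip(pos_tags, pos_tags[1:], pos_tags[2:]):
        if (a, b, c) == ('VERB', 'ADP', 'PROPN'):
            return POSITIVE
    return ABSTAIN
```

In the gist itself the trigram matching is done by the `Matcher` built above, and `x['spacy']` carries the pre-parsed `Doc`.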
rjurney / tty.txt
Created Nov 11, 2019
What /dev/ttyS* port does this correspond to?
T: Bus=01 Lev=01 Prnt=01 Port=08 Cnt=04 Dev#= 5 Spd=12 MxCh= 0
D: Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1
P: Vendor=051d ProdID=0002 Rev=00.90
S: Manufacturer=American Power Conversion
S: Product=Back-UPS ES 850M2 FW:931.a7 .D USB FW:a7
S: SerialNumber=4B1716P37698
C: #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=2mA
I: If#= 0 Alt= 0 #EPs= 1 Cls=03(HID ) Sub=00 Prot=00 Driver=usbhid
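One note on the listing itself: the interface reports `Driver=usbhid`, so the kernel binds this UPS as a HID device, not a serial one, and it likely gets no `/dev/ttyS*` node at all (tools such as apcupsd talk to APC units over the HID/USB layer instead). To see which tty nodes, if any, hang off a USB device, one rough approach is to walk sysfs:

```shell
# Print each tty device alongside the sysfs device path it belongs to;
# USB-backed ttys show a path under the USB bus.
for tty in /sys/class/tty/*; do
    dev="$tty/device"
    [ -e "$dev" ] || continue
    printf '%s -> %s\n' "${tty##*/}" "$(readlink -f "$dev")"
done
```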
rjurney /
Created Nov 4, 2019
Writing Predictions to MongoDB using Kafka and Structured Streaming
# Make the prediction
predictions = rfc.transform(final_vectorized_features)
# Drop the features vector and prediction metadata to give the original fields
predictions = predictions.drop("Features_vec")
final_predictions = predictions.drop("indices").drop("values").drop("rawPrediction").drop("probability")
# Store the results to MongoDB
class MongoWriter:
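The preview stops at the class declaration. Under Structured Streaming's `foreach` sink, a writer object needs `open`/`process`/`close` methods; a hedged sketch of how `MongoWriter` might continue (the connection URI, database, and collection names are hypothetical, and `pymongo` is imported lazily inside `open` so the driver only loads on the executors):

```python
class MongoWriter:
    """ForeachWriter-style sink that inserts each streaming row into MongoDB."""

    def __init__(self, uri='mongodb://localhost:27017', db='agile_data_science',
                 collection='predictions'):
        self.uri = uri
        self.db_name = db
        self.collection = collection
        self.client = None

    def open(self, partition_id, epoch_id):
        # Lazy import: only the executors need the MongoDB driver
        import pymongo
        self.client = pymongo.MongoClient(self.uri)
        return True

    def process(self, row):
        # Convert the Spark Row to a plain dict and insert it
        self.client[self.db_name][self.collection].insert_one(row.asDict())

    def close(self, error):
        if self.client is not None:
            self.client.close()
```

It would be wired up with something like `final_predictions.writeStream.foreach(MongoWriter()).start()`.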
rjurney /
Created Oct 22, 2019
Custom padding of dense vectors with min/max or mean
import numpy as np

padded_posts = []
for post in encoded_docs:
    # Pad short posts with alternating min/max vectors
    if len(post) < MAX_LENGTH:
        # Method 1
        pointwise_min = np.minimum.reduce(post)
        pointwise_max = np.maximum.reduce(post)
        padding = [pointwise_max, pointwise_min]
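The preview ends before the padding vectors are appended. A pure-Python sketch of the same alternating min/max padding (plain lists stand in for the NumPy arrays, and the `pad_post`/`max_length` names are invented here):

```python
def pad_post(post, max_length):
    """Pad a list of equal-length vectors out to max_length by appending
    the pointwise max and pointwise min vectors in alternation."""
    pointwise_min = [min(col) for col in zip(*post)]
    pointwise_max = [max(col) for col in zip(*post)]
    padding = [pointwise_max, pointwise_min]
    padded = list(post)
    i = 0
    while len(padded) < max_length:
        padded.append(padding[i % 2])
        i += 1
    return padded
```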
rjurney /
Last active Oct 22, 2019
Encoding tokenized text with gensim.models.Word2Vec
import os

from gensim.models import Word2Vec

w2v_model = None
model_path = 'models/word2vec.model'

# Load the Word2Vec model if it exists, otherwise train one
if os.path.exists(model_path):
    w2v_model = Word2Vec.load(model_path)
else:
    w2v_model = Word2Vec(
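The training call is cut off above. Once a model exists, the encoding the title describes amounts to looking each token up in the model's vectors; a small sketch (`encode_docs` is a name invented here; in gensim you would pass `w2v_model.wv`, which supports the same `in`/indexing protocol as the plain dict used below):

```python
def encode_docs(tokenized_docs, wv):
    """Map each document's tokens to their embedding vectors,
    silently skipping out-of-vocabulary tokens."""
    encoded_docs = []
    for doc in tokenized_docs:
        encoded_docs.append([wv[token] for token in doc if token in wv])
    return encoded_docs
```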
rjurney /
Created Oct 17, 2019
How to create a 0.7/0.2/0.1 Train/Test/Dev split of a dataset
from sklearn.model_selection import train_test_split

# Split off 70% for training; the remaining 30% is then split 2/3 test,
# 1/3 dev, giving 0.7/0.2/0.1 overall (X and y hold the full dataset)
X_train, X_test_dev, y_train, y_test_dev = train_test_split(
    X, y, test_size=0.3
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_test_dev, y_test_dev, test_size=2/3
)
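To make the proportions concrete, here is a dependency-free sketch of the same 0.7/0.2/0.1 split (`train_test_dev_split` is a name invented here; the gist does this with two chained `train_test_split` calls):

```python
import random

def train_test_dev_split(X, y, seed=42):
    """Shuffle indices once, then slice 70% train / 20% test / 10% dev."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_train = int(len(idx) * 0.7)
    n_test = int(len(idx) * 0.2)
    splits = (idx[:n_train],
              idx[n_train:n_train + n_test],
              idx[n_train + n_test:])
    return [([X[i] for i in ids], [y[i] for i in ids]) for ids in splits]
```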
rjurney /
Created Oct 4, 2019
Function to put in ~/.bash_aliases to list only directories
# List only directories
lsd () {
    if [ $# -eq 0 ]; then
        ls -ld -- */
    else
        ls -ld -- "$@"*/
    fi
}
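A quick self-contained check of the idea (the function is repeated here so the snippet runs on its own; the scratch tree is invented):

```shell
lsd () {
    if [ $# -eq 0 ]; then
        ls -ld -- */
    else
        ls -ld -- "$@"*/
    fi
}

# In a scratch tree, only the directories should be listed
tmp=$(mktemp -d)
mkdir -p "$tmp/alpha" "$tmp/beta"
touch "$tmp/file.txt"
out=$(cd "$tmp" && lsd)
printf '%s\n' "$out"
```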
rjurney / pete_josh_quote.txt
Created Sep 26, 2019
How Josh Wills got a quote in Weakly Supervised Learning
[Quoting Pete] He went on to say in 2019, “Data labeling is a good proxy for whether machine learning is cost effective for a problem. If you can build labeling into normal user activities you track like Facebook, Google and Amazon consumer applications you have a shot. Otherwise, you burn money paying for labeled data. Many people still try to apply machine learning on high profile problems without oxygen, and burn lots of money in the process without solving them.” (Josh Wills responded with, “I want a quote in the book,” and so he received one.)
rjurney / load_parquet_fro,
Created Aug 28, 2019
How does one load Parquet from S3 in Pandas/PyArrow?
import pandas as pd
import pyarrow
import s3fs

posts_df = pd.read_parquet(
    columns=['_Body'] + ['label_{}'.format(i) for i in range(0, 24)],