Russell Jurney rjurney

@rjurney
rjurney / pre.py
Last active Dec 4, 2019
How do you chain a preprocessor for an LF to occur AFTER SpacyPreprocessor?
View pre.py
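from snorkel.preprocess import preprocessor
from snorkel.preprocess.nlp import SpacyPreprocessor

spacy = SpacyPreprocessor(
    text_field='body',
    doc_field='spacy',
    memoize=True,
    language='en_core_web_lg',
    disable=['vectors']
)

# `pre` lists the preprocessors that run before this one
@preprocessor(memoize=True, pre=[spacy])
def restore_entity(x):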
rjurney / matcher_lf.py
Created Dec 2, 2019
Example of spaCy object Labeling Function
View matcher_lf.py
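from snorkel.labeling import labeling_function
from spacy.matcher import Matcher

# `nlp` is assumed to be a spaCy pipeline loaded earlier
matcher = Matcher(nlp.vocab)
pattern = [{'POS': 'VERB'}, {'POS': 'ADP'}, {'POS': 'PROPN'}]
# spaCy v2 signature; spaCy v3 expects matcher.add("VERB_ADP_PROPN", [pattern])
matcher.add("VERB_ADP_PROPN", None, pattern)

@labeling_function()
def lf_verb_in_noun(x):
    """Return positive if the pattern matches"""
    sp = x['spacy']
    matches = matcher(sp)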
View candidates.py
window = 5
candidates = []
for index, row in df.iterrows():
    doc = nlp(row['_Body'])
    for ent in doc.ents:
        rec = {}
        rec['body'] = doc.text
        rec['entity'] = ent
        rec['entity_text'] = ent.text
        rec['entity_start'] = ent.start
rjurney / tty.txt
Created Nov 11, 2019
What /dev/ttyS* port does this correspond to?
View tty.txt
T: Bus=01 Lev=01 Prnt=01 Port=08 Cnt=04 Dev#= 5 Spd=12 MxCh= 0
D: Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1
P: Vendor=051d ProdID=0002 Rev=00.90
S: Manufacturer=American Power Conversion
S: Product=Back-UPS ES 850M2 FW:931.a7 .D USB FW:a7
S: SerialNumber=4B1716P37698
C: #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=2mA
I: If#= 0 Alt= 0 #EPs= 1 Cls=03(HID ) Sub=00 Prot=00 Driver=usbhid
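One way to approach the question is to read the interface (`I:`) line of the listing and check which driver claimed the device. The sketch below parses that line into fields and compares the driver against a small, illustrative, non-exhaustive set of Linux USB serial drivers; the helper names `parse_fields` and `looks_like_serial`, and the contents of `SERIAL_DRIVERS`, are assumptions made for this example. In the listing above the only interface is HID class bound to `usbhid`, which suggests the UPS is exposed through the HID subsystem rather than as a serial port.

```python
import re

# Interface line copied from the listing above.
INTERFACE_LINE = "I: If#= 0 Alt= 0 #EPs= 1 Cls=03(HID ) Sub=00 Prot=00 Driver=usbhid"

# Illustrative, non-exhaustive set of drivers that create /dev/ttyUSB* or
# /dev/ttyACM* nodes (an assumption for this sketch, not a complete list).
SERIAL_DRIVERS = {"cdc_acm", "ftdi_sio", "pl2303", "cp210x", "ch341"}


def parse_fields(line):
    """Parse 'Key= value' pairs from one line of the USB device listing."""
    body = line.split(":", 1)[1]
    return dict(re.findall(r"(\S+?)=\s*(\S+)", body))


def looks_like_serial(fields):
    """Return True if the bound driver is one of the known serial drivers."""
    return fields.get("Driver") in SERIAL_DRIVERS


fields = parse_fields(INTERFACE_LINE)
print(fields["Driver"], fields["Cls"], looks_like_serial(fields))
```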
rjurney / spark_mongo_kafka_predictions.py
Created Nov 4, 2019
Writing Predictions to MongoDB using Kafka and Structured Streaming
View spark_mongo_kafka_predictions.py
# Make the prediction
predictions = rfc.transform(final_vectorized_features)

# Drop the features vector and prediction metadata to give the original fields
predictions = predictions.drop("Features_vec")
final_predictions = predictions.drop("indices", "values", "rawPrediction", "probability")

# Store the results to MongoDB
class MongoWriter:
rjurney / pad.py
Created Oct 22, 2019
Custom padding of dense vectors with min/max or mean
View pad.py
padded_posts = []
for post in encoded_docs:
    # Pad short posts with alternating min/max
    if len(post) < MAX_LENGTH:
        # Method 1
        pointwise_min = np.minimum.reduce(post)
        pointwise_max = np.maximum.reduce(post)
        padding = [pointwise_max, pointwise_min]
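The preview above stops before the padding is applied. As a minimal sketch of how alternating min/max padding could work, the example below uses plain Python lists instead of NumPy so it runs without dependencies. The function name `pad_with_min_max` and the choice to append padding rows at the end of the post are assumptions for illustration; the original gist may apply the padding differently.

```python
def pad_with_min_max(post, max_length):
    """Pad a list of equal-length vectors up to max_length rows,
    alternating the pointwise max and pointwise min of the post."""
    if not post or len(post) >= max_length:
        return list(post)

    # Column-wise min and max across all vectors in the post.
    pointwise_min = [min(col) for col in zip(*post)]
    pointwise_max = [max(col) for col in zip(*post)]
    padding = [pointwise_max, pointwise_min]

    padded = list(post)
    i = 0
    while len(padded) < max_length:
        # Alternate: max, min, max, min, ...
        padded.append(padding[i % 2])
        i += 1
    return padded


post = [[1.0, 5.0], [3.0, 2.0]]
padded = pad_with_min_max(post, 5)
print(padded)
```

Posts that already have `max_length` rows or more are returned unchanged here; truncating long posts would be a separate step.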
rjurney / gensim_word2vec.py
Last active Oct 22, 2019
Encoding tokenized text with gensim.models.Word2Vec
View gensim_word2vec.py
import os

from gensim.models import Word2Vec

w2v_model = None
model_path = 'models/word2vec.model'

# Load the Word2Vec model if it exists
if os.path.exists(model_path):
    w2v_model = Word2Vec.load(model_path)
else:
    w2v_model = Word2Vec(
rjurney / train_test_dev_split.py
Created Oct 17, 2019
How to create a 0.7/0.2/0.1 Train/Test/Dev split of a dataset
View train_test_dev_split.py
from sklearn.model_selection import train_test_split

X_train, X_test_dev, y_train, y_test_dev = train_test_split(
    df['_Body'],
    df['_Index'],
    test_size=0.3,
    random_state=1337,
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_test_dev,
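The second call is cut off in the preview, so its split ratio is not visible. The arithmetic behind a 0.7/0.2/0.1 split done in two steps is: hold out 0.3 first, then give 2/3 of that holdout to test (0.3 × 2/3 = 0.2) and the remaining 1/3 to dev (0.3 × 1/3 = 0.1). The sketch below illustrates this with the standard library only; the function `train_test_dev_split` is a hypothetical stand-in for illustration, not scikit-learn's API.

```python
import random


def train_test_dev_split(items, seed=1337):
    """Split items 0.7/0.2/0.1 into train/test/dev using two successive cuts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # First cut: hold out 30% of the data for test + dev.
    n_holdout = round(len(shuffled) * 0.3)
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]

    # Second cut: 2/3 of the holdout becomes test (0.2 overall),
    # the remaining 1/3 becomes dev (0.1 overall).
    n_test = round(len(holdout) * 2 / 3)
    test, dev = holdout[:n_test], holdout[n_test:]
    return train, test, dev


train, test, dev = train_test_dev_split(range(1000))
print(len(train), len(test), len(dev))
```

With scikit-learn, the equivalent second cut would pass a fraction of the holdout rather than of the full dataset, which is an easy detail to get wrong.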
rjurney / lsd.sh
Created Oct 4, 2019
Function to put in ~/.bash_aliases to list only directories
View lsd.sh
# List only directories
lsd () {
    if [ $# -eq 0 ]
    then
        LS_PATH=""
    else
        LS_PATH="$1/"
    fi
    # Use the computed path prefix so `lsd somedir` lists directories inside somedir
    ls -ld -- "$LS_PATH"*/
rjurney / pete_josh_quote.txt
Created Sep 26, 2019
How Josh Wills got a quote in Weakly Supervised Learning
View pete_josh_quote.txt
[Quoting Pete] He went on to say in 2019, “Data labeling is a good proxy for whether machine learning is cost effective for a problem. If you can build labeling into normal user activities you track like Facebook, Google and Amazon consumer applications you have a shot. Otherwise, you burn money paying for labeled data. Many people still try to apply machine learning on high profile problems without oxygen, and burn lots of money in the process without solving them.” (Josh Wills responded with, “I want a quote in the book,” and he thusly received.)