Russell Jurney rjurney

@rjurney
rjurney / airflow.env
Created December 14, 2020 19:23
A working docker-compose setup for Apache Airflow
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
AIRFLOW__CORE__EXECUTOR=LocalExecutor
@rjurney
rjurney / libpostal.macro.Dockerfile
Last active December 14, 2020 14:12
An example of a potential macro to install libpostal that doesn't need to sub-class anything but enables composition
# Install CRFSuite and dependencies
RUN wget https://github.com/downloads/chokkan/liblbfgs/liblbfgs-1.10.tar.gz \
    && tar -xvzf liblbfgs-1.10.tar.gz \
    && cd liblbfgs-1.10 \
    && ./configure \
    && make \
    && make install \
    && cd \
    && wget https://github.com/downloads/chokkan/crfsuite/crfsuite-0.12.tar.gz \
    && tar -xvzf crfsuite-0.12.tar.gz \
@rjurney
rjurney / a_original.py
Created December 10, 2020 20:21
I did not know you could wrap a chain of method calls in parentheses and get the same object without the backslashes for line breaks
counts_per_field = sanctioned.flatMap(map_count_fields) \
    .groupBy(lambda x: x[0]) \
    .map(
        lambda x: (
            x[0],
            {
                "count": len(x[1]),
                "proportion": np.around(len(x[1]) / float(total_records), 2),
                "sample": random.sample(x[1], min(5, len(x[1]))),
            },
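The parenthesized alternative mentioned in the description can be sketched as follows. This is a minimal illustration on plain Python strings rather than the Spark RDD used above; the `words` list and both variable names are hypothetical stand-ins.

```python
# Hypothetical stand-in data; the gist itself chains calls on a Spark RDD.
words = ["  Spark ", "dask", "  Airflow"]

# Backslash style: every continued line needs a trailing backslash.
with_backslashes = ",".join(words) \
    .lower() \
    .replace(" ", "") \
    .split(",")

# Parenthesized style: the open parenthesis lets the expression span
# lines freely, so no backslashes are needed.
with_parentheses = (
    ",".join(words)
    .lower()
    .replace(" ", "")
    .split(",")
)

# Both spellings evaluate to the same value.
assert with_backslashes == with_parentheses
```

The parentheses only group the expression, so the result is the same object the backslash version would produce.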
@rjurney
rjurney / labeling_functions.py
Created December 7, 2020 06:32
An example of some LabelingFunctions to create labels for a programming language extractor
import re
import jsonlines, sys
from snorkel.labeling import labeling_function, LabelingFunction
from snorkel.preprocess import preprocessor
from snorkel.preprocess.nlp import SpacyPreprocessor
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1
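To show how the three label constants are typically used, here is a minimal sketch of one labeling rule in plain Python. It deliberately does not use the Snorkel decorators imported above; the function name `lf_written_in`, the regular expression, and the two-argument signature are assumptions made for illustration, not the gist's actual functions.

```python
import re

POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1

# Hypothetical heuristic: a word following "written in" / "implemented in" /
# "coded in" is probably being used as a programming language name.
LANGUAGE_CONTEXT = re.compile(
    r"\b(?:written|implemented|coded)\s+in\s+(\w+)", re.IGNORECASE
)


def lf_written_in(text, candidate):
    """Vote on whether `candidate` is used as a programming language in `text`."""
    match = LANGUAGE_CONTEXT.search(text)
    if match is None:
        return ABSTAIN  # The heuristic found no evidence either way.
    if match.group(1).lower() == candidate.lower():
        return POSITIVE  # The candidate appears in a "written in X" context.
    return NEGATIVE  # The context names a different word.
```

A rule like this abstains far more often than it votes, which is the usual trade-off in weak supervision: many narrow, noisy rules are combined later by a label model.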
@rjurney
rjurney / email.md
Last active December 7, 2020 05:29
An email I sent summarizing my work on creating a Stack Overflow tagger for text documents with 10,000 or more labels

An Email Sent to a Nice Professor about Balancing Data for a Massively Multi-Label Classifier

The paper referred to in this email is On the Stratification of Multi-label Data, which I used to implement something workable that is more of a hack than the authors' method.

[Dear Sirs], I wanted to get back to you about what I'm working on. A summary follows. Thank you for taking the time, I really appreciate it.

I see weakly supervised learning and weak supervision (along with Snorkel) as a growth area because neural networks are so data-hungry and labeling data is often prohibitively expensive. Accordingly, I'm writing a book, Weakly Supervised Learning, that takes the Stack Exchange data dump and applies techniques from NLP along with strategies such as semi-supervised learning, transfer learning, weak supervision and distant supervision to create real-world applications using d

@rjurney
rjurney / metrics.py
Created November 15, 2020 21:27
A Tensorflow/Keras implementation of Adjusted R Squared
import typing
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow_addons.utils.types import AcceptableDTypes
from typeguard import typechecked
class AdjustedRSquared(tfa.metrics.RSquare):
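For reference, the standard adjusted R-squared formula is 1 - (1 - R^2)(n - 1) / (n - p - 1), where n is the number of examples and p the number of regressors. The sketch below computes it in plain Python from an already-computed R^2; it is not the TensorFlow implementation in the gist, and the function and parameter names are hypothetical.

```python
def adjusted_r_squared(r_squared, num_examples, num_regressors):
    """Adjust R^2 for the number of regressors (standard textbook formula)."""
    degrees_of_freedom = num_examples - num_regressors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more examples than regressors + 1")
    return 1.0 - (1.0 - r_squared) * (num_examples - 1) / degrees_of_freedom


# With R^2 = 0.75, 21 examples and 4 regressors:
# 1 - 0.25 * 20 / 16 = 0.6875
example = adjusted_r_squared(0.75, num_examples=21, num_regressors=4)
```

The adjustment always lowers the score when p > 0 and R^2 < 1, which penalizes models that add regressors without improving the fit.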
@rjurney
rjurney / awswrangler.py
Created October 29, 2020 18:17
Something is wrong: the local load takes longer than the S3 load over a bad connection
# How can I be faster?
# Set up a session with credentials
boto3_session = BarUtils.boto_session(
    aws_access_key_id=s3_key,
    aws_secret_access_key=s3_secret,
)
df = wr.s3.read_parquet(
    path=path,
@rjurney
rjurney / pyarrow.parquet.2.0.py
Last active October 24, 2020 19:03
PyArrow now takes forever to load partitioned Parquet data. Why?
# Prepare the partition filter
filters = [
    [('Ticker', 'in', tickers)]
]
dataset = pq.ParquetDataset(
    path_or_paths=LOCAL_PATH,
    filesystem=filesystem,
    filters=filters,
    metadata_nthreads=4,
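The nested `filters` list above is written in disjunctive normal form: as the PyArrow documentation describes it, each inner list is a conjunction (AND) of `(column, op, value)` predicates, and the outer list is a disjunction (OR) of those conjunctions. The plain-Python sketch below illustrates those semantics on dictionaries; `row_matches` and `OPS` are hypothetical names, and this is not how PyArrow evaluates filters internally.

```python
import operator

# Hypothetical operator table covering the common predicate operators.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def row_matches(row, filters):
    """Return True if `row` satisfies any inner list in which all predicates hold."""
    return any(
        all(OPS[op](row[column], value) for column, op, value in conjunction)
        for conjunction in filters
    )


tickers = ["AAPL", "MSFT"]
filters = [[("Ticker", "in", tickers)]]
kept = row_matches({"Ticker": "AAPL"}, filters)
dropped = row_matches({"Ticker": "IBM"}, filters)
```

With a single inner list holding a single predicate, as in the gist, the filter reduces to "keep partitions whose `Ticker` is in `tickers`".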
@rjurney
rjurney / r2_scores.py
Created October 20, 2020 22:21
Implementations of R^2 Score
def r2_score(y_true, y_pred):
    """Implements the Coefficient of Determination, R^2 or R-squared"""
    SS_res = kb.sum(kb.square(y_true - y_pred))
    SS_tot = kb.sum(kb.square(y_true - kb.mean(y_true)))
    return (1 - SS_res / (SS_tot + kb.epsilon()))
def inverse_r2_score(y_true, y_pred):
    """Implements the inverse Coefficient of Determination, R^2 or R-squared"""
    SS_res = kb.sum(kb.square(y_true - y_pred))
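The same R^2 arithmetic can be checked by hand without a Keras backend. The sketch below mirrors the residual and total sums of squares in plain Python; the name `r2_score_plain` and the default `epsilon` of 1e-7 (chosen to resemble a small backend epsilon) are assumptions for illustration.

```python
def r2_score_plain(y_true, y_pred, epsilon=1e-7):
    """R^2 = 1 - SS_res / SS_tot, with epsilon guarding against division by zero."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / (ss_tot + epsilon)


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_plain(y_true, y_true)        # no residual error
baseline = r2_score_plain(y_true, [2.5] * 4)    # always predicting the mean
```

A perfect prediction scores 1, while always predicting the mean of `y_true` scores approximately 0, which is the usual interpretation of R^2 as improvement over the mean baseline.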
@rjurney
rjurney / start_dask.sh
Last active February 9, 2023 07:52
A script to start single node Dask with as many workers as your machine has processor cores
#!/bin/bash
# Launch the scheduler
nohup dask scheduler --host 127.0.0.1 --port 9000 --protocol tcp --dashboard --no-show >> /tmp/dask.log 2>&1 &
# Launch the workers
nohup dask worker --nworkers=-1 tcp://127.0.0.1:9000 >> /tmp/dask.log 2>&1 &
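As I read the Dask CLI documentation, a negative `--nworkers` value is resolved as CPU_COUNT + 1 + nworkers, which is why `-1` yields one worker process per core. The sketch below reproduces only that arithmetic in plain Python; `resolve_nworkers` is a hypothetical helper, not a Dask function, and the rule itself should be checked against the installed Dask version.

```python
import os


def resolve_nworkers(nworkers, cpu_count=None):
    """Mirror the documented rule for negative --nworkers values (assumed)."""
    if cpu_count is None:
        # os.cpu_count() can return None on unusual platforms; fall back to 1.
        cpu_count = os.cpu_count() or 1
    if nworkers < 0:
        return cpu_count + 1 + nworkers
    return nworkers


# On a hypothetical 8-core machine, -1 resolves to 8 worker processes.
workers_on_8_cores = resolve_nworkers(-1, cpu_count=8)
```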