Russell Jurney rjurney

@rjurney
rjurney / train_test_dev_split.py
Created Oct 17, 2019
How to create a 0.7/0.2/0.1 Train/Test/Dev split of a dataset
from sklearn.model_selection import train_test_split

# First split: 70% train, 30% held out for test + dev
X_train, X_test_dev, y_train, y_test_dev = train_test_split(
    df['_Body'],
    df['_Index'],
    test_size=0.3,
    random_state=1337,
)
# Second split: divide the 30% hold-out into test and dev
# (test_size value assumed from the title: 0.67 of 30% is roughly 20% test, leaving roughly 10% dev)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_test_dev,
    y_test_dev,
    test_size=0.67,
    random_state=1337,
)
@rjurney
rjurney / lsd.sh
Created Oct 4, 2019
Function to put in ~/.bash_aliases to list only directories
# List only directories
lsd () {
    if [ $# -eq 0 ]
    then
        LS_PATH=""
    else
        LS_PATH="$1/"
    fi
    ls -ld -- "$LS_PATH"*/
}
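Once this is sourced from ~/.bash_aliases, running `lsd` with no arguments lists only the directories in the current directory, and `lsd some/path` lists only the directories under some/path.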
@rjurney
rjurney / pete_josh_quote.txt
Created Sep 26, 2019
How Josh Wills got a quote in Weakly Supervised Learning
[Quoting Pete] He went on to say in 2019, “Data labeling is a good proxy for whether machine learning is cost effective for a problem. If you can build labeling into normal user activities you track like Facebook, Google and Amazon consumer applications you have a shot. Otherwise, you burn money paying for labeled data. Many people still try to apply machine learning on high profile problems without oxygen, and burn lots of money in the process without solving them.” (Josh Wills responded with, “I want a quote in the book,” and thus he received one.)
@rjurney
rjurney / load_parquet_fro,_s3.py
Created Aug 28, 2019
How does one load Parquet from S3 in Pandas/PyArrow?
import pandas as pd
import pyarrow
import s3fs  # lets pandas/pyarrow read s3:// paths directly

# Read only the post body and the 24 label columns from the Parquet file on S3
posts_df = pd.read_parquet(
    's3://stackoverflow-events/08-05-2019/Questions.Stratified.Final.50000.parquet',
    columns=['_Body'] + ['label_{}'.format(i) for i in range(0, 24)],
    engine='pyarrow'
)
posts_df.head(5)
@rjurney
rjurney / final_gpu_to_cpu.py
Last active Aug 23, 2019
Concatenate the CPU arrays into one large Numpy array and clear all GPU RAM
import numpy as np

# Create a single numpy array out of the others
np_cpu_posts = np.concatenate(cpu_posts, axis=0)

# Free any remaining GPU RAM (mempool and pinned_mempool are the CuPy
# memory pools set up in gpu_concat.py)
mempool.free_all_blocks()
pinned_mempool.free_all_blocks()

# Delete the list of arrays and keep only the single concatenated array
del cpu_posts
cpu_posts = np_cpu_posts
@rjurney
rjurney / gpu_concat.py
Last active Aug 23, 2019
Creating a large flat array using CuPy and a GPU to rapidly accelerate the operation
import cupy as cp
import sys

# CuPy memory management: grab the default memory pools so GPU RAM can be
# freed explicitly between batches
mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()

ITERATOR_MULTIPLE = 25
# MAX_LENGTH and EMBEDDING_SIZE are defined in 2_cupy.py below
row_length = MAX_LENGTH * EMBEDDING_SIZE
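The preview cuts off here. A minimal sketch of the batching loop it implies, assuming padded_posts comes from pad_posts_min_max.py below and cpu_posts is the list consumed by final_gpu_to_cpu.py above (batch_size and the dtype are assumptions, not the gist's actual values):
import numpy as np

batch_size = ITERATOR_MULTIPLE * 1000  # assumed batch size
cpu_posts = []
for i in range(0, len(padded_posts), batch_size):
    batch = np.asarray(padded_posts[i:i + batch_size], dtype=np.float32)
    gpu_batch = cp.asarray(batch)                      # copy the batch to the GPU
    flat = gpu_batch.reshape(len(batch), row_length)   # one flat row per post
    cpu_posts.append(cp.asnumpy(flat))                 # copy the result back to CPU RAM
    del gpu_batch, flat
    mempool.free_all_blocks()                          # release GPU allocations each pass
    pinned_mempool.free_all_blocks()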
@rjurney
rjurney / pad_posts_min_max.py
Created Aug 23, 2019
Pad Word2Vec Posts with the Min/Max of the Entire Corpus
from math import ceil
import numpy as np

padded_posts = []
for post in encoded_docs:
    # Pad short posts with alternating min/max
    if len(post) < MAX_LENGTH:
        pointwise_min = np.minimum.reduce(post)
        pointwise_max = np.maximum.reduce(post)
        padding = [pointwise_max, pointwise_min]
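The preview stops mid-loop. A minimal, self-contained sketch of the alternating min/max padding it describes (the ceil-based pair count and the truncation of long posts are assumptions, not the gist's code):
import numpy as np
from math import ceil

def pad_post(post, max_length):
    """Pad a short post with alternating pointwise max/min rows; truncate a long one."""
    post = np.asarray(post)
    if len(post) >= max_length:
        return post[:max_length]
    pointwise_min = np.minimum.reduce(post)
    pointwise_max = np.maximum.reduce(post)
    # Repeat the [max, min] pair enough times to fill the gap, then trim to length
    n_pairs = ceil((max_length - len(post)) / 2)
    padding = np.array([pointwise_max, pointwise_min] * n_pairs)
    return np.concatenate([post, padding])[:max_length]

padded_posts = [pad_post(post, MAX_LENGTH) for post in encoded_docs]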
@rjurney
rjurney / 2_cupy.py
Last active Aug 23, 2019
Creating a gensim Word2Vec Encoding
from os import path
from gensim.models import Word2Vec
VOCAB_SIZE = 5000
MAX_LENGTH = 100
EMBEDDING_SIZE = 50
NUM_CORES = 64
w2v_model = None
model_path = "data/word2vec.50000.model"
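The snippet ends at the model path. A minimal sketch of the load-or-train pattern it sets up, assuming a `documents` list of tokenized posts and gensim 3.x parameter names:
if path.exists(model_path):
    # Reuse the cached model if it has already been trained
    w2v_model = Word2Vec.load(model_path)
else:
    w2v_model = Word2Vec(
        documents,                    # assumed: a list of token lists, one per post
        size=EMBEDDING_SIZE,          # gensim 3.x name; vector_size in gensim 4.x
        max_final_vocab=VOCAB_SIZE,
        workers=NUM_CORES,
    )
    w2v_model.save(model_path)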
@rjurney
rjurney / model.py
Created Aug 21, 2019
Working code using MirroredStrategy
## Model imports
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping
from tensorflow.keras.layers import (Input, Embedding, GlobalMaxPooling1D, Conv1D, Dense, Activation,
                                     Dropout, Lambda, BatchNormalization, concatenate)
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit imports
from tensorflow.keras.losses import hinge, mae, binary_crossentropy, kld, Huber, squared_hinge
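The preview only shows the imports. A minimal sketch of the MirroredStrategy pattern the description refers to, with an assumed toy architecture (the real layers, sizes, and loss are not shown in the preview):
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync: {}'.format(strategy.num_replicas_in_sync))

# Build and compile the model inside the strategy scope so its variables
# are mirrored across all visible GPUs
with strategy.scope():
    inputs = Input(shape=(MAX_LENGTH, EMBEDDING_SIZE))  # padded Word2Vec posts
    x = Conv1D(128, 3, activation='relu')(inputs)
    x = GlobalMaxPooling1D()(x)
    x = Dense(64, activation='relu')(x)
    outputs = Dense(24, activation='sigmoid')(x)  # 24 labels, per the Parquet gist above
    model = Model(inputs, outputs)
    model.compile(optimizer=Adam(), loss=binary_crossentropy, metrics=['accuracy'])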
@rjurney
rjurney / code.py
Created Aug 16, 2019
Why aren't all 4 GPUs being utilized?
# Model imports
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping
from tensorflow.keras.layers import (Input, Embedding, GlobalMaxPooling1D, Conv1D, Dense, Activation,
                                     Dropout, Lambda, BatchNormalization, concatenate)
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import multi_gpu_model

# Fit imports
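The preview cuts off after the imports. A minimal sketch of how multi_gpu_model is typically used (build_model, X_train, y_train, and the batch size are placeholders, not the gist's code); multi_gpu_model only spreads work across the GPUs when each batch is large enough to shard, and it was later deprecated in favor of tf.distribute.MirroredStrategy:
import tensorflow as tf

# Build the template model on the CPU, then wrap it for the 4 GPUs
with tf.device('/cpu:0'):
    model = build_model()  # placeholder: returns an uncompiled Keras Model

parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

# Each batch is split into 4 sub-batches, one per GPU; small batches or a
# CPU-bound input pipeline can leave the GPUs underutilized
parallel_model.fit(X_train, y_train, batch_size=256, epochs=10)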