GitHub gists by Schaun Wheeler (schaunwheeler)
schaunwheeler / pyspark_minhash_jaccard.py
Last active June 8, 2023 23:22
Use MinHash to get Jaccard similarity in PySpark
from numpy.random import RandomState
import pyspark.sql.functions as f
from pyspark import StorageLevel

def hashmin_jaccard_spark(
        sdf, node_col, edge_basis_col, suffixes=('A', 'B'),
        n_draws=100, storage_level=None, seed=42, verbose=False):
    """
    Calculate a sparse Jaccard similarity matrix using MinHash.
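The preview stops inside the docstring. As a rough illustration of the MinHash idea itself (a sketch under assumptions, not the gist's PySpark implementation): the Jaccard similarity of two sets equals the probability that a random hash function assigns the same minimum hash to both sets, so averaging that event over many salted hashes estimates the similarity.

# Hedged sketch of the MinHash estimate over plain Python sets.
from numpy.random import RandomState

def minhash_jaccard(set_a, set_b, n_draws=100, seed=42):
    rng = RandomState(seed)
    salts = rng.randint(0, 2**31 - 1, size=n_draws)  # one salt per hash draw
    matches = 0
    for salt in salts:
        min_a = min(hash((int(salt), x)) for x in set_a)
        min_b = min(hash((int(salt), x)) for x in set_b)
        matches += int(min_a == min_b)
    return matches / n_draws

# minhash_jaccard({'a', 'b', 'c'}, {'b', 'c', 'd'}) comes out near 0.5
# (the true Jaccard similarity here is 2/4).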
schaunwheeler / amortize.r
Last active March 13, 2023 16:24
Amortization function
# The MIT License (MIT)
#
# Copyright (c) 2012 Schaun Jacob Wheeler
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
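The preview shows only the MIT license header. As a hedged sketch of what a standard amortization function computes (the gist itself is R, and its interface is not shown), the fixed payment comes from the closed-form annuity formula payment = P * r / (1 - (1 + r) ** -n):

# Hypothetical Python sketch of a standard amortization schedule; the gist's
# R function may differ in interface and detail. Assumes a nonzero rate.
def amortize(principal, annual_rate, n_months):
    r = annual_rate / 12.0  # monthly interest rate
    payment = principal * r / (1.0 - (1.0 + r) ** -n_months)
    balance = principal
    schedule = []
    for month in range(1, n_months + 1):
        interest = balance * r
        principal_paid = payment - interest
        balance -= principal_paid
        schedule.append((month, payment, interest, principal_paid, max(balance, 0.0)))
    return schedule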
schaunwheeler / xlsxToR.r
Last active December 11, 2020 16:41
Import an xlsx file into R by parsing the file's XML structure.
# The MIT License (MIT)
# Copyright (c) 2012 Schaun Jacob Wheeler
# (preview shows the same MIT license header as the gist above)
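The description is the substantive part: an .xlsx file is a zip archive of XML documents, so it can be read without Excel by unzipping it and parsing the XML. A minimal Python illustration of that file structure (the gist does the real work in R):

# Sketch showing the zip-of-XML structure of .xlsx; the gist's R code goes
# much further (sheets, cell types, styles). Assumes the workbook contains
# at least one string cell, so xl/sharedStrings.xml exists.
import zipfile
import xml.etree.ElementTree as ET

SSML = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'

def list_shared_strings(xlsx_path):
    # sharedStrings.xml holds the workbook's de-duplicated cell strings.
    with zipfile.ZipFile(xlsx_path) as zf:
        root = ET.fromstring(zf.read('xl/sharedStrings.xml'))
    return [t.text for t in root.iter('{%s}t' % SSML)]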
schaunwheeler / doc_to_spans.py
Last active May 6, 2020 16:39
Example of how to use spaCy to process many texts at once
from spacy import load as spacy_load

# This loads the largest English model, which must be downloaded separately
# from the package installation. Other choices are available.
nlp = spacy_load('en_core_web_lg')

def doc_to_spans(list_of_texts, join_string=' ||| '):
    all_docs = nlp(join_string.join(list_of_texts))
    # Indices of the '|||' separator tokens (assumes the default separator).
    split_inds = [i for i, token in enumerate(all_docs) if token.text == '|||'] + [len(all_docs)]
    starts = [0] + [i + 1 for i in split_inds[:-1]]
    # Completion beyond the preview (a plausible reconstruction, not the
    # gist's exact code): one Span per input text, separators skipped.
    return [all_docs[start:end] for start, end in zip(starts, split_inds)]
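A usage sketch, assuming a list of raw strings; joining them into one document means spaCy's pipeline overhead is paid once rather than per text:

texts = ['The quick brown fox.', 'It jumps over the lazy dog.']
for span in doc_to_spans(texts):
    # Each Span behaves like a mini-Doc: tokens, lemmas, entities, vectors.
    print([token.lemma_ for token in span])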
schaunwheeler / randomforestregressor_predict.scala
Last active April 23, 2019 13:43
An example of using Scala to call the predict function from a Scikit-Learn RandomForestRegressor
import rapture.json.jsonBackends.jawn._
import rapture.json.Json

import scala.annotation.tailrec

case class RandomForestTree(
  treeId: Int,
  undefinedIndex: Int,
  features: Array[Int],
  thresholds: Array[Double],
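The case class mirrors scikit-learn's flat tree arrays, where a node's feature index of -2 marks a leaf. The traversal those arrays support looks like this (a Python sketch of the technique; the Scala gist presumably walks the same structure, e.g. with a @tailrec helper):

# Sketch of evaluating one decision tree stored as parallel arrays, using
# scikit-learn's conventions: features[node] == -2 marks a leaf, and a
# sample goes left when inputs[feature] <= threshold.
def predict_tree(inputs, features, thresholds,
                 children_left, children_right, values, undefined=-2):
    node = 0
    while features[node] != undefined:
        if inputs[features[node]] <= thresholds[node]:
            node = children_left[node]
        else:
            node = children_right[node]
    return values[node]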
schaunwheeler / rfr_example.json
Created April 19, 2019 13:21
Example JSON output for single tree of RandomForestRegressor
{
  "i": 0,
  "tree_undefined": -2,
  "features": [
    3,
    3,
    2,
    3,
    -2,
    -2,
schaunwheeler / pure_python_rfr.py
Created April 13, 2019 12:18
Create a function in pure Python that calculates predictions from a Scikit-Learn RandomForestRegressor
from sklearn.tree import _tree

tree_template = '''
def tree{i}(inputs):
    tree_undefined = {tree_undefined}
    features = {features}
    thresholds = {thresholds}
    children_left = {children_left}
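The template is presumably rendered once per fitted tree by substituting that tree's arrays; a hedged sketch of that code-generation step (render_tree_source is a hypothetical name, and the real template continues past this preview):

# Hypothetical sketch: pull the flat arrays out of one fitted estimator and
# substitute them into the (truncated) template above.
def render_tree_source(estimator, i):
    tree = estimator.tree_
    return tree_template.format(
        i=i,
        tree_undefined=_tree.TREE_UNDEFINED,  # equals -2
        features=tree.feature.tolist(),
        thresholds=tree.threshold.tolist(),
        children_left=tree.children_left.tolist(),
    )

# Each rendered source string can then be exec'd or written to a module,
# giving predictions with no scikit-learn dependency at serving time.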
schaunwheeler / rfr_to_json.py
Created April 13, 2019 12:12
Function to dump a trained Scikit-Learn RandomForestRegressor to JSON
from json import dumps

def rfr_to_json(rfr_object, feature_list, json_filepath=None):
    '''
    Function to convert a scikit-learn RandomForestRegressor object to JSON.
    '''
    output_dict = dict()
    output_dict['name'] = 'rf_regression_pipeline'
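The rfr_example.json gist above hints at the per-tree payload; a hedged sketch of how the body might continue (keys inferred from that example, not copied from the full gist):

# Hypothetical continuation of rfr_to_json, inside the function body.
output_dict['feature_list'] = feature_list
output_dict['trees'] = [
    {
        'i': i,
        'tree_undefined': -2,
        'features': est.tree_.feature.tolist(),
        'thresholds': est.tree_.threshold.tolist(),
        'children_left': est.tree_.children_left.tolist(),
        'children_right': est.tree_.children_right.tolist(),
        'values': est.tree_.value.ravel().tolist(),
    }
    for i, est in enumerate(rfr_object.estimators_)
]
json_string = dumps(output_dict)
if json_filepath is not None:
    with open(json_filepath, 'w') as fh:
        fh.write(json_string)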
schaunwheeler / spacy_pyspark_wordvec_udf.py
Created April 13, 2019 11:44
Example of using spaCy on Spark
import pyspark.sql.types as t
import pyspark.sql.functions as f

def spacy_word2vec_grouped(cat_list, id_col, string_col):
    """
    Example usage:

        vec_sdf = (
            sdf
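The preview ends inside the docstring. A minimal sketch of the general pattern it implies, loading the model lazily once per executor process and exposing vectors through a UDF (an assumption, not the gist's exact approach):

# Hedged sketch: lazy per-process model load plus a word-vector UDF. The
# spaCy model must already be installed on every executor.
_nlp = None

def _get_nlp():
    global _nlp
    if _nlp is None:
        import spacy
        _nlp = spacy.load('en_core_web_lg')
    return _nlp

@f.udf(returnType=t.ArrayType(t.DoubleType()))
def text_to_vec(text):
    doc = _get_nlp()(text)
    return [float(x) for x in doc.vector]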
schaunwheeler / ds_prod_scale1.py
Last active March 12, 2019 19:29
Data science productionization: scale - example 1.py
from pandas import DataFrame
from pyspark.sql import types as t, functions as f
df = DataFrame({'ids': [1, 2, 3], 'words': ['abracadabra', 'hocuspocus', 'shazam']})
sdf = sparkSession.createDataFrame(df)
normalize_word_udf = f.udf(normalize_word, t.StringType())
stops = f.array([f.lit(c) for c in STOPCHARS])
results = sdf.select('ids', normalize_word_udf(f.col('words'), stops).alias('norms'))
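The snippet leans on three names defined elsewhere: sparkSession, normalize_word, and STOPCHARS. Hypothetical stand-ins that make it runnable (these are assumptions, not the series' actual definitions):

# Hypothetical stand-ins for names the snippet assumes; the real
# definitions are not shown in this preview.
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName('ds_prod_scale1').getOrCreate()

STOPCHARS = ['a', 'o']  # made-up list of characters to strip

def normalize_word(word, stops):
    # Lower-case the word and drop any character found in `stops`; Spark
    # passes `stops` in as an array column, which arrives here as a list.
    return ''.join(ch for ch in word.lower() if ch not in stops)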