W.P. McNeill wpm

@wpm
wpm / ItemSet.java
Created September 13, 2011 18:35
ItemSet: a Hadoop ArrayWritable of Text
package wpmcn.structure;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import java.util.*;
/**
wpm / spark_parallel_boost.py
Last active December 3, 2018 02:56
A simple example of how to integrate the Spark parallel computing framework and the scikit-learn machine learning toolkit. This script randomly generates test and train data sets, trains an ensemble of decision trees using boosting, and applies the ensemble to the test set. The ensemble training is done in parallel.
from pyspark import SparkContext
import numpy as np
from sklearn.cross_validation import train_test_split, Bootstrap
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
def run(sc):
wpm / poll.js
Last active November 14, 2019 09:59
JavaScript Polling with Promises
var Promise = require('bluebird');
/**
* Periodically poll a signal function until either it returns true or a timeout is reached.
*
* @param signal function that returns true when the polled operation is complete
* @param interval time interval between polls in milliseconds
* @param timeout period of time before giving up on polling
* @returns true if the signal function returned true, false if the operation timed out
*/
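The preview stops before the function body, so the contract described in the comment can only be sketched. The version below is a synchronous Python analogue, written in Python to match the other examples on this page; it is not the Bluebird-based implementation from the gist. The name `poll` and the use of seconds (rather than milliseconds) for `interval` and `timeout` are assumptions made for this illustration.

```python
import time
from typing import Callable


def poll(signal: Callable[[], bool], interval: float, timeout: float) -> bool:
    """Call signal() repeatedly until it returns True or the timeout expires.

    interval and timeout are in seconds here (an assumption of this sketch).
    Returns True if signal() returned True, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        if signal():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))


# Usage: a signal that becomes true on its third call.
calls = []


def ready() -> bool:
    calls.append(1)
    return len(calls) >= 3


succeeded = poll(ready, interval=0.01, timeout=1.0)
timed_out = poll(lambda: False, interval=0.01, timeout=0.05)
```

The signal is checked once before any waiting, so an operation that is already complete returns immediately.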
wpm / multi_join.py
Last active May 31, 2023 10:57
Pandas multi-table join
import pandas
"""
Join an arbitrary number of data frames, using a multi-index label for each data frame.
For example say you have three data frames each of which lists the classroom and
number of students a teacher has in a given period.
Classroom Students
Teacher
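The docstring is cut off before the example tables, so the following is only a sketch of one way to get a labelled multi-table join, not necessarily the approach the gist takes. It relies on `pandas.concat` with `axis=1` and `keys`, which places each frame's columns under a top-level label. The function name `multi_join`, the period labels, and all teacher, classroom, and student values are made up for illustration.

```python
import pandas as pd


def multi_join(frames, labels):
    """Outer-join frames on their index, labelling each frame's columns.

    The result has two column levels: the label, then the original column.
    """
    return pd.concat(frames, axis=1, keys=labels)


def period(rows):
    """Build one per-period frame indexed by teacher (made-up data)."""
    frame = pd.DataFrame.from_dict(
        rows, orient="index", columns=["Classroom", "Students"]
    )
    frame.index.name = "Teacher"
    return frame


period1 = period({"Smith": [101, 25], "Jones": [102, 30]})
period2 = period({"Smith": [103, 28], "Lee": [101, 22]})
period3 = period({"Jones": [104, 27], "Lee": [102, 31]})

joined = multi_join(
    [period1, period2, period3], ["Period 1", "Period 2", "Period 3"]
)

# Select one frame's columns back out by its label.
second_period = joined["Period 2"]
```

Teachers missing from a given period get NaN in that period's columns, because the join is an outer join on the index.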
wpm / simple_mnist.py
Last active June 22, 2016 20:59
Minimal TensorFlow Example
"""
A minimal implementation of the MNIST handwritten digits classification task in TensorFlow.
This runs MNIST images through a single hidden layer and softmax loss function.
It demonstrates in a single Python source file the basics of creating a model, training and evaluating data sets, and
writing summaries that can be visualized by TensorBoard.
"""
from __future__ import division
wpm / stanford_sentiment_to_csv.py
Created December 3, 2017 19:24
Create CSV files from the Stanford Sentiment Treebank
"""
Put all the Stanford Sentiment Treebank phrase data into test, training, and dev CSVs.
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013). Recursive Deep Models
for Semantic Compositionality Over a Sentiment Treebank. Presented at the Conference on Empirical Methods in Natural
Language Processing (EMNLP).
https://nlp.stanford.edu/sentiment/
"""
wpm / Entity Highlighting in Context.ipynb
Created December 4, 2017 16:12
Entity Highlighting in Context
wpm / spacy_paragraph_segmenter.py
Created December 20, 2017 16:58
Segment a spaCy document into "paragraphs", treating whitespace tokens containing more than one newline as paragraph delimiters.
def paragraphs(document):
    start = 0
    for token in document:
        if token.is_space and token.text.count("\n") > 1:
            yield document[start:token.i]
            start = token.i
    yield document[start:]
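To see what the generator yields without installing spaCy, the sketch below runs the same function over a plain list of stand-in tokens. `FakeToken` is a hypothetical class invented for this illustration; it mimics only the three attributes the function reads (`text`, `i`, `is_space`), and the sample words are made up. With a real spaCy `Doc`, the slices would be `Span` objects instead of lists.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token (illustration only)."""
    text: str
    i: int  # position of the token in the document

    @property
    def is_space(self) -> bool:
        return self.text.isspace()


def paragraphs(document):
    start = 0
    for token in document:
        # A whitespace token with two or more newlines ends a paragraph.
        if token.is_space and token.text.count("\n") > 1:
            yield document[start:token.i]
            start = token.i
    yield document[start:]


words = ["First", "paragraph", ".", "\n\n", "Second", "one", "."]
document = [FakeToken(text, i) for i, text in enumerate(words)]
chunks = [[token.text for token in span] for span in paragraphs(document)]
```

Because `start` is set to the delimiter's own index, each paragraph after the first begins with the whitespace token that separated it from the previous one.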
wpm / spacy_pattern_match.py
Created December 22, 2017 19:30
Utility that matches text patterns in spaCy/Prodigy training data
import json
from json import JSONDecodeError
from typing import Sequence, Iterable, List
import click
import spacy
from spacy.matcher import Matcher
def match_patterns(nlp, patterns: Sequence[dict], corpus: Iterable[str]) -> Iterable[str]:
wpm / json_to_jsonl.py
Created December 22, 2017 19:33
Tool to convert a JSON list into a JSONL file.
import json
from json import JSONDecodeError
from typing import Sequence
import click
class JSONList(click.ParamType):
    def convert(self, value: str, _, __) -> Sequence:
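The preview ends before the conversion logic, so here is a minimal sketch of the transformation the description names: a JSON list in, one JSON object per line out. It uses only the standard library rather than `click`, and the function name `json_list_to_jsonl` and the sample records are assumptions for illustration, not taken from the gist.

```python
import io
import json
from typing import IO


def json_list_to_jsonl(text: str, out: IO[str]) -> int:
    """Write each element of a JSON list to out as one line of JSON.

    Returns the number of lines written. Raises ValueError if the
    top-level JSON value is not a list.
    """
    items = json.loads(text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list at the top level")
    for item in items:
        out.write(json.dumps(item) + "\n")
    return len(items)


# Usage with an in-memory buffer; a real tool would pass an open file.
buffer = io.StringIO()
count = json_list_to_jsonl('[{"text": "a"}, {"text": "b"}]', buffer)
lines = buffer.getvalue().splitlines()
```

Each output line can be parsed independently with `json.loads`, which is what makes JSONL convenient for streaming large corpora.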