Peter M. Landwehr pmlandwehr

@pmlandwehr
pmlandwehr / send_to_hbase_in_chunks.py
Last active January 26, 2017 01:00
Sends a list of values to hbase in chunks.
import happybase
from tqdm import tqdm
from tqdm import trange
def send_to_hbase_in_chunks(
        key_kvdict_list,
        table_name,
        hbase_ip,
        chunksize=1000):
    """
def parallel_apply(df, func, new_col=None, axis=0,
                   nproc=-1,
                   min_rows_per_chunk=2000, max_rows_per_chunk=5000):
    """
    Split a data frame into chunks,
    apply a lambda func to each group of rows in parallel,
    return the results or the original data frame with the results as a new column.
    :param pandas.DataFrame | pandas.Series df: DataFrame or series on which to apply
    :param func: Function to apply
    :param str new_col: if not None, return original df with new column named after func.
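The preview above is cut off before the function body, so the following is only a minimal sketch of the split-apply-combine idea the docstring describes, not the gist's implementation. It works on a plain list instead of a DataFrame and swaps in a thread pool from the standard library; `split_rows`, `parallel_apply_rows`, and `n_workers` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, n_chunks):
    """Split `rows` into at most `n_chunks` contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(rows)))
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional row.
        stop = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:stop])
        start = stop
    return chunks


def parallel_apply_rows(rows, func, n_workers=4):
    """Apply `func` to every row, one chunk per task, preserving row order."""
    def apply_chunk(chunk):
        return [func(row) for row in chunk]

    chunks = split_rows(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map yields chunk results in the order the chunks were submitted.
        chunk_results = list(pool.map(apply_chunk, chunks))
    return [value for result in chunk_results for value in result]
```

Threads keep the sketch self-contained and let it accept lambdas; CPU-bound work would more likely call for a process pool, which requires picklable functions.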
@pmlandwehr
pmlandwehr / vanilla.yaml
Last active August 18, 2018 17:27
Vanilla Conda Forge Recipe
{% set name = "package_name" %}
{% set version = "insert_real_version" %}
{% set bundle = "tar.gz" %}
{% set hash_type = "sha256" %}
{% set hash = "insert-real-hash" %}
package:
  name: {{ name|lower }}
  version: {{ version }}
@pmlandwehr
pmlandwehr / dummify_df.py
Last active May 20, 2016 00:24
Take a pandas DataFrame, a list of columns, and a list of separators. Convert the columns to dummies.
def dummify_df(df,
               cols_to_dummy,
               seps,
               keep_covariates='none',
               max_vars=2,
               vals_to_drop='nan'):
    """
    get_dummies() on a df has some issues with dataframe-level operations
    when the column has multiple values.
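The body of `dummify_df` is not shown, so here is a rough sketch of what turning a multi-valued column into dummy variables can look like, written with the standard library only. `multi_value_dummies` is a hypothetical helper, and treating `|` as the default separator is an assumption made for the example.

```python
def multi_value_dummies(values, sep="|"):
    """Return (labels, rows): sorted labels and one 0/1 indicator list per value."""
    # Split each cell into its set of distinct, non-empty parts.
    split_values = [
        {part.strip() for part in value.split(sep) if part.strip()}
        for value in values
    ]
    # The dummy columns are the union of every part seen in any cell.
    labels = sorted(set().union(*split_values))
    rows = [
        [1 if label in parts else 0 for label in labels]
        for parts in split_values
    ]
    return labels, rows
```

For a single pandas column, `Series.str.get_dummies(sep=...)` covers a similar case; the gist presumably exists to handle several columns and separators at once.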
@pmlandwehr
pmlandwehr / column_disagger.py
Last active October 14, 2016 21:27
Takes a columnar file (.tsv, .csv, etc.) with multi-valued cells ('thing A|thing B') and splits them
from sys import argv
from tqdm import tqdm
from itertools import product
sep_map = {'tab': '\t',
           'comma': ',',
           ',': ',',
           'semicolon': ';',
           ';': ';',
           'space': ' ',
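The preview stops inside `sep_map`, so the splitting logic itself is not visible. Given the `itertools.product` import, one plausible reading is that each row is expanded into every combination of its cell values. The sketch below illustrates that reading only; `disaggregate_row` and `value_sep` are hypothetical names, not taken from the script.

```python
from itertools import product


def disaggregate_row(fields, value_sep="|"):
    """Expand one row with multi-valued cells into one row per combination."""
    # Each cell becomes a list of its individual values.
    options = [field.split(value_sep) for field in fields]
    # The Cartesian product enumerates every combination, rightmost cell varying fastest.
    return [list(combo) for combo in product(*options)]
```

A row with cells `x`, `thing A|thing B`, and `1|2` would expand to four rows under this assumption, since the second and third cells each hold two values.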
@pmlandwehr
pmlandwehr / seaborn_stacked_bar.py
Last active October 5, 2016 20:22
Draw a stacked bar plot from a pandas dataframe using seaborn (some issues, I think...)
"""
See https://gist.github.com/randyzwitch/b71d47e0d380a1a6bef9#file-seaborn-stacked-bar-py
for the stacked bar plot that was the basis for this function
"""
def stacked_bar(data, x_column, y_columns,
                normalize=False, legend=True,
                x_label=None, y_label=None,
                x_axis_labels=None):
    """
@pmlandwehr
pmlandwehr / get_token_count_tfidf_df.py
Last active December 24, 2021 12:24
Take a list of texts, preprocess and tokenize them, and return the counts and TF-IDF values for each feature.
def get_token_count_tfidf_df(texts, tokenizer=None, preprocessor=None, analyzer='word', ngram_range=(1, 1)):
    """
    Take a list of texts, preprocess and tokenize them, and return the counts and TF-IDF values for each feature.
    :param list|pd.Series texts: collection of texts
    :param tokenizer: tokenizer for the vectorizers. By default tries to load the punkt sentence tokenizer from NLTK.
    :param preprocessor: Preprocessor for texts. By default converts numbers to "<NUM>"
    :return pandas.DataFrame: DataFrame of sorted features, counts, and TF-IDFs.
    """
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
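The rest of the function is not shown, so how the gist combines the two vectorizers is unknown. As a rough illustration of the quantities involved, the sketch below computes corpus-wide token counts and a textbook TF-IDF (count times log of inverse document frequency) with the standard library. It is not scikit-learn's formula, which smooths and normalizes by default, and `token_count_tfidf` is a hypothetical name.

```python
import math
from collections import Counter


def token_count_tfidf(texts, tokenizer=str.split):
    """Return (token, count, tfidf) tuples sorted by descending count, then token."""
    tokenized = [tokenizer(text.lower()) for text in texts]
    # Total occurrences of each token across the whole corpus.
    counts = Counter(token for tokens in tokenized for token in tokens)
    # Number of documents that contain each token at least once.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    n_docs = len(tokenized)
    rows = []
    for token, count in counts.items():
        idf = math.log(n_docs / doc_freq[token])
        rows.append((token, count, count * idf))
    return sorted(rows, key=lambda row: (-row[1], row[0]))
```

Under this unsmoothed definition, a token that appears in every document gets a TF-IDF of zero, which is why library implementations usually add smoothing.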
@pmlandwehr
pmlandwehr / sandy_background.md
Created August 1, 2015 22:46
Background information for labelling tweets related to Hurricane Sandy

Hurricane Sandy

Basic History

A brief timeline

  • October 24: Hurricane Sandy, south of Kingston in Jamaica, begins to move north.
  • October 29, 7 AM: Sandy reaches peak intensity.
  • October 29, 6:30 PM: Sandy makes landfall near Brigantine, NJ and starts moving west-northwest.
  • October 31, 7 AM: Sandy breaks up over Pennsylvania.

Data Notes

The data in the Sandy collection comes from the Northeast, primarily the area around Manhattan. It runs from October 25, when Hurricane Sandy was in the news and states were preparing for its onslaught, through November 3 (though data from this late period are relatively sparse). As such, the tweets in the Sandy data should cover both disaster preparation and some reports of cleanup in the storm's aftermath.

@pmlandwehr
pmlandwehr / wildfires_background.md
Last active November 22, 2015 18:51
Background information for labelling tweets related to the 2012 Colorado Wildfires

The 2012 Colorado Wildfire Season

Basic History

General Description

The 2012 wildfire season ran from March through July, and is considered one of the worst that Colorado has experienced in recent memory. Many large and small fires burned across the countryside; I've found different news sources reporting either twelve or sixteen large fires that damaged significant acreage, and small fires were more numerous still.

The Waldo Canyon Fire, which is the most prominent fire in the data, began on June 23 about four miles northwest of Colorado Springs. As it expanded over the next several days, several local towns were evacuated, and on June 26 Mayor Steve Bach ordered that Colorado Springs be evacuated. The fire spread to the city, and by the early morning of June 27 estimates put the number of homes destroyed at about 300. Firefighters continued to work against the blaze, and on June 29 President Obama visited Colorado to discuss the problem.

Data

@pmlandwehr
pmlandwehr / haiyan_background.md
Last active September 26, 2015 20:55
Background information for labelling tweets related to Typhoon Haiyan/Yolanda

Typhoon Haiyan

Basic History

Basic timeline

  • November 2: The pressure systems that will become Haiyan are first noted by the Japan Meteorological Agency to the southeast of Micronesia.
  • November 5: Haiyan rapidly intensifies and is classified as a typhoon.
  • November 7: Haiyan continues to build as it moves westward, and at 8:40 PM UTC it makes landfall at Guiuan in Eastern Samar. It makes three additional landfalls as it crosses the Philippines.
  • November 8: Haiyan leaves the islands, weakened but still moving west.
  • November 11: Haiyan breaks up over China.

General notes