Sam Shleifer sshleifer
@sshleifer
sshleifer / pandorable_notes.md
Last active May 14, 2016 22:49
Notes on http://tomaugspurger.github.io/ Modern Pandas blogposts

Will immediately incorporate

  • df.assign(px2=lambda x: x.px * 2) # the callable receives the DataFrame as x; assign needs a keyword naming the new column (px2 is a placeholder). This will save us mucho code.
  • df.loc[df.index.get_level_values(1) == 'donger'] can be df.loc[pd.IndexSlice[:, 'donger'], :]
  • ser.sort_values(ascending=False).head() can be ser.nlargest(5). nsmallest also exists.
  • df.add_suffix is built into pandas
  • df.dropna(thresh=4) keeps only rows with at least thresh non-missing values; rows with fewer are dropped.
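
The idioms in the list above can be tried on a toy frame. This is only a sketch: the data, the `px2` column name, the `_raw` suffix and the `thresh=2` value are made up for illustration, and only the pandas calls themselves come from the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a two-level index (date, ticker) with a price column `px`.
index = pd.MultiIndex.from_product(
    [["2016-05-01", "2016-05-02"], ["donger", "other"]],
    names=["date", "ticker"],
)
df = pd.DataFrame({"px": [1.0, 2.0, 3.0, 4.0]}, index=index)

# assign takes keyword arguments; the callable receives the DataFrame.
doubled = df.assign(px2=lambda x: x.px * 2)

# Two ways to select rows whose second index level is 'donger'.
by_mask = df.loc[df.index.get_level_values(1) == "donger"]
by_slice = df.loc[pd.IndexSlice[:, "donger"], :]

# nlargest(5) matches sort_values(ascending=False).head() when there are no ties.
ser = pd.Series([3, 9, 1, 7, 5, 8])
top = ser.nlargest(5)

# add_suffix renames every column.
suffixed = df.add_suffix("_raw")

# dropna(thresh=2) keeps rows that have at least 2 non-missing values.
gaps = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan],
    "c": [1.0, 2.0, 3.0],
})
kept = gaps.dropna(thresh=2)
```

The last row of `gaps` has a single non-missing value, so it is the only one dropped.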

Could be useful

  • pd.TimeGrouper('H') (removed in pandas 1.0; pd.Grouper(freq='H') is the replacement)
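
In current pandas the same time-based grouping is spelled with `pd.Grouper`. A minimal sketch on made-up half-hourly readings follows. The `value` column and the timestamps are hypothetical, and `"60min"` is used as the hourly frequency string because the spelling of the hour alias has changed between pandas versions.

```python
import pandas as pd

# Hypothetical half-hourly readings spanning two hours.
stamps = pd.date_range("2016-05-14 09:00", periods=4, freq="30min")
readings = pd.DataFrame({"value": [1, 2, 3, 4]}, index=stamps)

# Group the rows into one-hour bins on the DatetimeIndex and sum each bin.
hourly = readings.groupby(pd.Grouper(freq="60min"))["value"].sum()
```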
sshleifer / zimmerman_chap2.md
Last active July 4, 2016 01:11
Notes on Chapter 2 of Tom Zimmermann's Dissertation

[Paper](https://dash.harvard.edu/bitstream/handle/1/17467320/ZIMMERMANN-DISSERTATION-2015.pdf?sequence=1)

Intro: Econom(etr)ics vs. ML

  • Economics focuses on empirical relationships between features and outcomes; ML focuses on predicting outcomes.
  • Beta vs. yhat. cv.coeffs vs cv.metrics.fscore
  • TZ: Can test a relationship by seeing whether including the variable in a big model improves predictions, thereby avoiding omitted-control issues.
  • This requires an ML approach (feature engineering) on investor behavior datasets!
  • Implementation details and robustness checks are more valuable than the actual results on the disposition effect.
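
The "does including the variable improve predictions" test from the notes above could look roughly like the sketch below. Everything here is an assumption for illustration: the data is synthetic, plain least squares stands in for the "big model", and a single holdout split stands in for cross-validation. None of it comes from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
controls = rng.normal(size=(n, 3))   # hypothetical control features
candidate = rng.normal(size=n)       # hypothetical variable under test
outcome = (
    controls @ np.array([1.0, -0.5, 0.2])
    + 2.0 * candidate
    + rng.normal(scale=0.1, size=n)
)

def holdout_mse(features, target, n_train=300):
    """Fit least squares on the first n_train rows, score on the remaining rows."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design[:n_train], target[:n_train], rcond=None)
    residual = target[n_train:] - design[n_train:] @ coef
    return float(np.mean(residual ** 2))

# Compare out-of-sample error without and with the candidate variable.
mse_without = holdout_mse(controls, outcome)
mse_with = holdout_mse(np.column_stack([controls, candidate]), outcome)
```

If `mse_with` is clearly lower than `mse_without`, the candidate variable carries predictive information beyond the controls.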
sshleifer / kernel_trick.md
Last active October 19, 2016 19:19
Attempt at explaining the kernel trick in preparation for 6.867 Midterm

Problem: Transforming X into φ(X) space can be expensive, and it is usually used as an intermediate result inside of a dot product like <φ(x[i]), φ(x[j])>.

Trick to save computation time: Conditional on having a φ where we know how to compute <φ(x[i]), φ(x[j])> through a shortcut, we can use the shortcut instead of explicitly calling φ and storing the long intermediate result. The savings stem from (a) saving calls to φ, and (b) making the dot product operate on shorter vectors.

Example

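φ(x) = (x[1]**2, sqrt(2)*x[1]*x[2], x[2]**2)

<φ(x), φ(z)> = (x[1]**2)*(z[1]**2) + 2*x[1]*x[2]*z[1]*z[2] + (x[2]**2)*(z[2]**2) = (x[1]*z[1] + x[2]*z[2])**2 = <x, z>**2

So the shortcut is to take the dot product in the original 2-d space and square it, without ever building φ(x) or φ(z).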
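
A quick numerical check of this example, as a sketch: the helper names `phi`, `dot` and `kernel` and the two test points are made up, and Python's zero-based indexing is used, so x[0] and x[1] here correspond to x[1] and x[2] in the notes.

```python
import math

def phi(x):
    """Explicit feature map from the example (2 inputs -> 3 features)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def kernel(x, z):
    """Shortcut: square the dot product taken in the original 2-d space."""
    return dot(x, z) ** 2

x = (1.0, 2.0)
z = (3.0, 4.0)

explicit = dot(phi(x), phi(z))  # builds both 3-d vectors first
shortcut = kernel(x, z)         # never calls phi
```

Both routes give the same number, but `shortcut` only ever touches the 2-d inputs.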
sshleifer / imagerive.md
Created June 7, 2018 17:16
Imagerive Notes

WHERE IS THE DATA? SSH into {FIXME} while connected to the ImageRive VPN (must be from a Windows machine). All data is in /merantix_core/data/hospitals/imagerive/export:

  • Anonymized reports in reports
  • anonymized_dicoms/
  • export/cases_new.json
  • export/patients_new.json

Normal Windows VPN connection.

sshleifer / generate_boxes_from_masks.py
Created May 1, 2019 17:21
Script for going from mask to bboxes (bbox branch)
import numpy as np
import pandas as pd
import pickle as pkl
import nrrd
import glob
import os
import sys
def find_bounding_box(mask, point, label):
    visited = set()
import SimpleITK as sitk
import numpy as np
mask_file = '/data/ct-cspine/test_set_w_masks_2019_05_01/cspine_fx_seg/Cspine_fx_seg/5616571.nrrd'
array_file = '/data/ct-cspine/processed-studies/data_20180524_161757/anonymized_data/images/test/5616571.npy'
def projectImage(reference, moving, interpolate='linear'):
    # projects moving image onto reference image space
    # use interpolate='NN' for segmentation masks
    resample = sitk.ResampleImageFilter()
    resample.SetReferenceImage(reference)
"""Modified from https://github.com/gan3sh500/mixmatch-pytorch/blob/master/layer.py
Implementation of """
def mixmatch(X_labeled, y, X_unlabeled, model, augment_fn, T=0.5, K=2, alpha=0.75):
    """Generate labeled and unlabeled batches for mixmatch. Helpers are below. Use in dataloader."""
    xb = augment_fn(X_labeled)
    n_labeled = len(xb)
    ub = [augment_fn(X_unlabeled) for _ in range(K)]  # unlabeled
    qb = sharpen(sum(map(model, ub)) / K, T)
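
The `sharpen` helper is not shown in this preview. As a rough sketch of what temperature sharpening does in MixMatch (raise each class probability to the power 1/T, then renormalize), here is a NumPy version. It is an assumption for illustration, not the gist's actual helper, which presumably operates on PyTorch tensors.

```python
import numpy as np

def sharpen(probs, T):
    """Raise each probability to 1/T and renormalize every row to sum to 1."""
    powered = np.asarray(probs, dtype=float) ** (1.0 / T)
    return powered / powered.sum(axis=1, keepdims=True)

# Hypothetical averaged predictions for two unlabeled examples, two classes.
avg_guess = np.array([[0.6, 0.4], [0.5, 0.5]])
sharp = sharpen(avg_guess, T=0.5)
```

With T below 1 the larger class probability grows (0.6 becomes roughly 0.69 here), while a uniform row stays uniform.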
sshleifer / hardness_grid.py
Created May 29, 2019 15:02
ideal grid/api for hardness sampling
pg1 = update_batch_size(ParameterGrid({
    'lr': [1e-4, 1e-3, 3e-3, 1e-2, .05, 1e-1],
    'label_smoothing': [True, False],
    'size': [128],
    'bs': [256],
    'hardness_percentile': [.75, .5, .25, .1]  # top 50%, top 25%
}))
sshleifer / summarize.py
Created February 21, 2020 15:02
Example Fairseq bart-large-cnn summary
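# pip install fairseq
import torch

# Download the pretrained summarization model through torch.hub.
bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')
bart.eval()  # inference mode (disables dropout)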
article = '(CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC\'s founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians\' efforts to join the body.
torch_device = 'cuda'
FRANCE_ARTICLE = ' Marseille, France (CNN)The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane. Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation." He added, "A person who has such a video needs to immediately give it to the investigators." Robin\'s comments follow claims by two magazines, German daily Bild and French Paris Match, of a cell phone video showing the harrowing final seconds from on board Germanwings Flight 9525 as it crashed into the French Alps. All 150 on board were killed. Paris Match and Bild reported that the video was recovered from a phone at the wreckage site. The two publications described the supposed video, but did not post it on their websites. The publications said that they watched the video, which was found by a source close to the investigation. "One can hear cries of \'My God\' in seve