@wch
wch / fork-github.md
Last active September 19, 2021 09:09
How to fork a GitHub repo and work on your fork

This document shows how to fork a GitHub repo and work from your fork to create clean branches and pull requests. The key is to keep your fork in sync with the upstream repo's master branch.

Setup

First, fork the repo by clicking the Fork button on GitHub. For example, you could fork the tidyverse/ggplot2 repo at https://github.com/tidyverse/ggplot2.

@giefferre
giefferre / store_and_reuse_dataframe_schema.py
Last active July 10, 2021 18:01
Save the schema of a Spark DataFrame to be able to reuse it when reading json files.
# read a part of the whole datalake just to extract the schema
part = spark.read.json("s3a://path/to/json/part")
# create a temporary rdd in order to store the schema as binary file
temp_rdd = sc.parallelize(part.schema)
temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")
# the schema is now saved at the destination path.
# passing it explicitly to later reads lets Spark skip schema inference, which speeds up reading json files.
@amberjrivera
amberjrivera / Pipeline-guide.md
Created January 26, 2018 05:02
Quick tutorial on Sklearn's Pipeline constructor for machine learning

If You've Never Used Sklearn's Pipeline Constructor...You're Doing It Wrong

How To Use sklearn Pipelines, FeatureUnions, and GridSearchCV With Your Own Transformers

By Emily Gill and Amber Rivera

What's a Pipeline and Why Use One?

The Pipeline constructor from sklearn allows you to chain transformers and estimators together into a sequence that functions as one cohesive unit. For example, if your model involves feature selection, standardization, and then regression, those three steps, each as its own class, could be encapsulated together via Pipeline.

Benefits: readability, reusability and easier experimentation.
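To make the feature selection, standardization, and regression example concrete, here is a minimal sketch. The synthetic dataset, the step names, and the specific classes (`SelectKBest`, `StandardScaler`, `LinearRegression`) are assumptions chosen for illustration, not necessarily what the authors use later in the tutorial.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=0.1, random_state=0
)

# Each step is a (name, object) pair; all but the last must be transformers.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=3)),  # feature selection
    ("scale", StandardScaler()),                            # standardization
    ("regress", LinearRegression()),                        # final estimator
])

# fit() runs fit_transform on each transformer in order, then fits the regressor.
pipe.fit(X, y)

# predict() pushes new data through the same fitted transformers before predicting.
predictions = pipe.predict(X)
r2 = pipe.score(X, y)
```

Because the whole chain is one object, it can be handed to tools such as `GridSearchCV` or `cross_val_score` as if it were a single estimator.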
@AustinRochford
AustinRochford / MRPyMC3.ipynb
Last active October 9, 2023 01:49
MRPyMC3-Multilevel Regression and Poststratification with PyMC3
@klmr
klmr / generator.md
Last active August 28, 2022 02:26
Python-like generators in R

A little experiment using restarts.

(And while we’re at it, let’s torture R’s syntax a little.)

![screenshot][]

In the following we will be using R’s “restarts” feature to implement the state machine that drives generators in languages such as Python. Generators allow lazily generating values on demand: a consumer invokes a generator, and consumes values as they are produced. A new value is only produced once the previous one has been consumed.
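For reference, the Python behaviour being imitated can be sketched as follows. The function name `count_up_to` and the `events` log are made up for this illustration; the log only exists to show when the generator body actually runs.

```python
events = []  # records when the generator body does work


def count_up_to(limit):
    """Yield 1..limit, logging each value at the moment it is produced."""
    n = 1
    while n <= limit:
        events.append(f"producing {n}")
        yield n  # execution pauses here until the consumer asks for more
        n += 1


gen = count_up_to(3)
events_before = list(events)       # creating the generator runs none of its body

first = next(gen)                  # runs the body up to the first yield
events_after_first = list(events)  # only the first value has been produced

rest = list(gen)                   # consuming the rest produces the remaining values
```

Each call to `next` resumes the function exactly where it paused, which is the state machine the R implementation has to reproduce with restarts.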

@seanjtaylor
seanjtaylor / gist:568141f04a16d518be24
Created February 11, 2015 01:46
Reshaping a Pandas dataframe into a sparse matrix
import pandas as pd
import scipy.sparse as sps
df = pd.DataFrame({'tag1': ['sean', 'udi', 'bogdan'], 'tag2': ['sean', 'udi', 'udi'], 'freq': [1,2,3]})
# tag1 -> rows, tag2 -> columns
df.set_index(['tag1', 'tag2'], inplace=True)
# MultiIndex.labels was removed in pandas 1.0; MultiIndex.codes holds the same integer positions
mat = sps.coo_matrix((df.freq, (df.index.codes[0], df.index.codes[1])))
print(mat.todense())
@johanmeiring
johanmeiring / gist:3002458
Created June 27, 2012 08:32
"git lg" alias for pretty git log
# From http://garmoncheg.blogspot.com/2012/06/pretty-git-log.html
git config --global alias.lg "log --color --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit --"