
💭
I may be slow to respond.

David Yerrington dyerrington

@dyerrington
dyerrington / subplots.py
Created Mar 29, 2017
Plotting multiple figures with seaborn and matplotlib using subplots.
##
# Create a figure space matrix consisting of 3 columns and 2 rows
#
# Here is a useful template to use for working with subplots.
#
##################################################################
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 5), ncols=3, nrows=2)
left = 0.125  # the left side of the subplots of the figure
right = 0.9   # the right side of the subplots of the figure
@dyerrington
dyerrington / generate_udf_js_big_query.py
Created Sep 18, 2019
Python code that builds, essentially, a pivot from a nested BigQuery dataset. Based on the original method in the Google BigQuery documentation.
# fighting == most common event type
def build_udf_prototype(event_types):
    null = "null"  # default all types to null in the UDF function
    PIVOT_FEATURES = str(
        {"col_" + event_name.replace("-", "_"): null for event_name in event_types.tolist()}
    ).replace("'null'", "null")
    SQL_RETURN = "STRUCT<"
    for event_type in event_types.tolist():
        event_type = event_type.replace("-", "_")
        SQL_RETURN += f"col_{event_type} INT64, "
View hiring_guidelines.md

Great Data Science Project Criteria:

  • Problem statement that defines a measurable and/or falsifiable outcome. “Frequency of [specific event] is influential over [some outcome].” “Users who use [some feature in app] are differentiable from users who less frequently use [some feature in app].” Etc. If you can’t frame a data problem properly, none of it has purpose. The biggest challenge in data science is making sense of, and defining, the gray areas of business problems. This also comes with experience.
  • EDA, EDA, EDA. Define your scope. Report only what is necessary and relevant to your problem statement. If the model reports only 4-5 common variables as parameters (logistic regression, for instance), focus on those when summarizing your work in terms of EDA.
  • How much data is necessary to make this analysis work? Are you sampling? Is a t-test necessary to gain assurance, or a rank-order test?
  • Explain which model makes the most sense to use. Are you trying to gain inference about a data problem?
View sf_slicing_apply_map.ipynb
@dyerrington
dyerrington / readme.md
Last active Jan 9, 2019
This is a very basic data generator for testing recommender systems. A future version may simulate the actual sparseness of ratings data with a simple bootstrap function, but for now the numpy generator does the job.

RecData

To use this snippet, install faker:

pip install faker
View dsi_student_install_guide.md
View parse_jupyter.md

Parse Jupyter

This is a basic class that makes it convenient to parse notebooks. I built a larger version of this that was used for clustering documents to create semantic indices that linked related content together, for a personal project. You can use this to parse notebooks for things like NLP or preprocessing.

Usage

parser = ParseJupyter("./Untitled.ipynb")
parser.get_cells(source_only=True, source_as_string=True)
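The class body is not shown in this listing, so here is a minimal sketch that matches the usage above, assuming only that `.ipynb` files are plain JSON with a top-level `cells` list (the nbformat layout). The internals are reconstructed from the interface, not the gist.

```python
import json

class ParseJupyter:
    # Minimal sketch matching the usage above; the real gist's internals
    # are not shown, so this reimplements the interface from its signature.
    def __init__(self, path):
        with open(path) as f:
            self.notebook = json.load(f)  # .ipynb files are plain JSON

    def get_cells(self, source_only=False, source_as_string=False):
        cells = self.notebook.get("cells", [])
        if source_only:
            # nbformat stores each cell's source as a list of lines
            sources = [cell.get("source", []) for cell in cells]
            if source_as_string:
                return ["".join(source) for source in sources]
            return sources
        return cells
```

With `source_only=True, source_as_string=True`, each cell comes back as a single string, which is the convenient shape for feeding into an NLP pipeline.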
View machine_learning_flashcards.py
import tweepy
import wget
import os

oauth = {
    "consumer_key": "",
    "consumer_secret": ""
}
access = {
View sf_review.ipynb
@dyerrington
dyerrington / my_little_pony_lstm.py
Created Jul 16, 2018
As a point of comparison with the default Nietzsche example from the Keras repo, this little experiment swaps out the dataset for forum comments from the My Little Pony subreddit.
View my_little_pony_lstm.py
'''Example script to generate text from Nietzsche's writings.
At least 20 epochs are required before the generated text
starts sounding coherent.
It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.
If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.
'''
from __future__ import print_function