@IwanThomas
IwanThomas / tally.py
Last active May 24, 2017 14:09
A little recipe to add a tally count label to each member of a groupby object
data['view_number'] = data.groupby('search_id')['search_id'].cumcount()
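As a small illustration of the recipe above, here is a sketch on a made-up frame. The `result` column and the sample values are assumptions for demonstration only; `cumcount()` numbers the rows of each group in order of appearance, starting at 0.

```python
import pandas as pd

# Hypothetical search log: one row per viewed result, keyed by search_id.
data = pd.DataFrame({
    "search_id": [101, 101, 102, 101, 102],
    "result": ["a", "b", "c", "d", "e"],
})

# Number the rows within each search_id group, starting at 0.
data["view_number"] = data.groupby("search_id").cumcount()

# Add 1 if a 1-based tally is preferred.
data["view_number_1based"] = data["view_number"] + 1

print(data)
```

Selecting a column before `cumcount()`, as the one-liner above does, gives the same numbering; the count depends only on the grouping key.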
@IwanThomas
IwanThomas / DroppingDuplicateColumns.py
Last active May 24, 2017 14:09
Pandas DataFrame Manipulation
df.T.drop_duplicates().T
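A minimal sketch of what the double transpose does, using a made-up frame in which `a_copy` repeats the values of `a`. The column names are assumptions. The second variant is an alternative that keeps the original dtypes, which can matter because transposing a frame with mixed dtypes converts the values to `object`.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "a_copy": [1, 2, 3],  # same values as "a"
})

# Transpose so columns become rows, drop duplicate rows, transpose back.
deduped = df.T.drop_duplicates().T
print(list(deduped.columns))

# Alternative: build a boolean mask of duplicated columns and select with it,
# which leaves the dtypes of the surviving columns untouched.
deduped_mask = df.loc[:, ~df.T.duplicated()]
print(list(deduped_mask.columns))
```

Both variants compare columns by value, so two columns with different names but identical contents count as duplicates and only the first is kept.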
@IwanThomas
IwanThomas / Approach.md
Last active June 29, 2017 13:54
General Approach to Data Analysis prior to Model Building

Below is a fairly comprehensive way of analysing your dataset. However, it can be easy to get bogged down in the details here, and you want to avoid this. Make a quick, dirty, hacky, end-to-end solution to your problem. Then, once you have something very basic in place, it's time to get creative and start iterating on your initial approach. Try to improve each component of your solution and measure the impact to see where to spend your time. Often, acquiring more data or improving the data cleaning and preprocessing steps has a higher ROI than optimizing the machine learning models themselves.

Now, a more comprehensive list of preprocessing steps to go through once you have a quick and dirty solution in place:

  • Create some hypotheses at the beginning of your analysis. This helps you engage with the problem and think about how different variables affect the outcome variable.
    • What other features would you like?
    • What features might you like to engineer?
  • Then begin the process of EDA to test these h
@IwanThomas
IwanThomas / replace_cat_with_mean.py
Created May 18, 2017 10:04
Function to replace ordinal variable value with mean value for that group
def replace_cat_with_mean(df, feature, target):
    """
    Function to replace ordinal variable value with mean value for that group
    Motivation
    -----------
    Enumerate ordinal variables to avoid one-hot encoding every feature
    and significantly increasing the feature space.
    As the distance between each category is unknown replace each
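The body of `replace_cat_with_mean` is not shown here. As a rough sketch of the idea its docstring describes (replacing each category with the mean of the target for that category), something like the following could work. The name `mean_encode`, the sample frame and its column names are all hypothetical, and this is not the author's implementation.

```python
import pandas as pd

def mean_encode(df, feature, target):
    """Return a copy of df with `feature` replaced by the mean of `target` per category."""
    # Mean of the target within each category of the feature.
    group_means = df.groupby(feature)[target].mean()
    encoded = df.copy()
    # Map every category label to its group mean.
    encoded[feature] = df[feature].map(group_means)
    return encoded

# Hypothetical data: an ordinal quality rating and a numeric target.
houses = pd.DataFrame({
    "quality": ["low", "high", "low", "mid", "high"],
    "price": [100.0, 300.0, 120.0, 200.0, 340.0],
})

encoded = mean_encode(houses, "quality", "price")
print(encoded["quality"].tolist())
```

Because the encoding uses the target, the group means should be computed on the training data only and then applied to validation or test data, otherwise information about the target leaks into the features.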
@IwanThomas
IwanThomas / NB Algorithms.md
Last active July 7, 2017 12:02
Naive Bayes Algorithm

GaussianNB

  • When dealing with continuous data, a typical assumption is that the values are distributed normally.
  • The algorithm computes the mean and variance of each feature within each class and then uses these to compute the probability of a value given a class.
  • Then classify the sample according to the greatest probability.
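The three steps above can be sketched in plain Python. This is a minimal illustration rather than scikit-learn's `GaussianNB` implementation; the function names, the small variance floor and the toy data are assumptions.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature means and per-feature variances of each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # small floor avoids division by zero
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(prior, means, variances):
        # Sum of Gaussian log densities, one per feature (the "naive" independence assumption).
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=lambda label: log_posterior(*model[label]))

# Toy data: two well-separated classes with two continuous features each.
X = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = ["a", "a", "a", "b", "b", "b"]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]))
print(predict_gaussian_nb(model, [3.1, 4.0]))
```

Working in log space avoids numerical underflow when many small per-feature probabilities are multiplied together.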

MultinomialNB

  • With a multinomial event model, samples represent the frequencies with which certain events have been generated.
  • This is the event model typically used for document classification, with events representing the occurrence of a word in a single document.
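To make the word-count event model concrete, here is a minimal plain-Python sketch for document classification. It is not scikit-learn's `MultinomialNB`; the function names, the add-one (Laplace) smoothing parameter `alpha` and the toy documents are assumptions.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities for each class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_counts[label] / len(docs))
        # Additive smoothing keeps unseen (class, word) pairs from getting zero probability.
        log_probs = {
            word: math.log((word_counts[label][word] + alpha) / (total + alpha * len(vocab)))
            for word in vocab
        }
        model[label] = (log_prior, log_probs)
    return model

def predict_multinomial_nb(model, doc):
    """Return the class with the highest log posterior for one document."""
    def score(label):
        log_prior, log_probs = model[label]
        # Words outside the training vocabulary are ignored.
        return log_prior + sum(log_probs[w] for w in doc.split() if w in log_probs)
    return max(model, key=score)

docs = [
    "cheap pills buy now",
    "buy cheap watches",
    "meeting agenda attached",
    "lunch meeting tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

model = fit_multinomial_nb(docs, labels)
print(predict_multinomial_nb(model, "buy cheap pills"))
print(predict_multinomial_nb(model, "agenda for the meeting"))
```

Each occurrence of a word contributes its log probability once, so a word that appears twice in a document counts twice, which is what distinguishes the multinomial model from a presence/absence (Bernoulli) model.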
@IwanThomas
IwanThomas / Parameter Names in GridSearchCV.py
Last active May 24, 2017 14:11
Parameter Names in GridSearchCV
# To get the names of parameters to optimise in a GridSearchCV, use the following:
clf.get_params().keys()
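A short sketch of why this is handy, assuming `clf` is a scikit-learn `Pipeline`. The step names `scale` and `model` and the grid values are made up for illustration. Nested parameters are exposed as `<step>__<parameter>`, and those are the keys a `GridSearchCV` parameter grid expects.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-step pipeline; the step names are arbitrary.
clf = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# List every tunable parameter name, including nested "<step>__<param>" names.
param_names = sorted(clf.get_params().keys())
print(param_names)

# Grid keys must match those names exactly.
param_grid = {"model__C": [0.1, 1.0, 10.0]}
unknown = set(param_grid) - set(param_names)
print(unknown)  # an empty set means every grid key is a valid parameter name

# The search object can now be built and later fitted with search.fit(X, y).
search = GridSearchCV(clf, param_grid, cv=3)
```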
@IwanThomas
IwanThomas / HandlingMissingData.py
Last active May 24, 2017 14:11
Useful scripts when handling missing data
# plot the number of missing values per column
missing = df.isnull().sum()
missing = missing[missing > 0]
missing.sort_values().plot(kind='bar', figsize=(12, 5))
# identify qualitative and quantitative columns with missing values
missing_names = list(missing.index)
missing_qual = [column for column in df.columns if column in missing_names and df.dtypes[column] == 'object']
missing_quant = [column for column in df.columns if column in missing_names and df.dtypes[column] != 'object']
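A self-contained sketch of the same idea on a made-up frame, without the plot. The column names and values are assumptions. It uses `pd.api.types.is_numeric_dtype` rather than comparing the dtype with `'object'`, which is a deliberate swap: it also classifies string and categorical dtypes as qualitative.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in one numeric and one text column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Leeds", None, "Cardiff", None],
    "income": [30000, 42000, 51000, 38000],
})

# Count missing values per column and keep only columns that have any.
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values()
print(missing.to_dict())

# Split the affected columns into quantitative and qualitative.
missing_quant = [c for c in missing.index if pd.api.types.is_numeric_dtype(df[c])]
missing_qual = [c for c in missing.index if not pd.api.types.is_numeric_dtype(df[c])]
print(missing_quant, missing_qual)
```

The split is useful because the two groups usually call for different imputation strategies, for example a median for numeric columns and a mode or an explicit "missing" category for text columns.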
@IwanThomas
IwanThomas / Containers.md
Last active May 24, 2017 14:11
Containers
  • Containers are a lightweight, fast alternative to a traditional VM.
  • Unlike VMs, containers don't bundle a full OS, just the libraries and settings required for the software.
  • Pros: light and quick
  • Cons: containers share the host's kernel, so the guest OS must be the same as the host OS
@IwanThomas
IwanThomas / Virtual Machine.md
Last active May 24, 2017 14:11
Virtual Machine
  • A simulated computer that runs on another computer.
  • Getting a package to run on every single computer can be hard.
  • Using a VM gets around the issue of "It works on my computer but not yours"
  • A VM lets you:
    • install a new OS
    • create a new, clean environment. This helps with reliability and reproducibility.
  • Can snapshot a VM, store it and ship it to colleagues
@IwanThomas
IwanThomas / plot_transform.py
Last active May 19, 2017 11:01
Function to plot a side-by-side figure comparing the histogram of a variable before and after a data transformation
def plot_transform_log(df, feature):
    "Plot the histogram of a variable before and after a log(1+x) transformation."
    import matplotlib.pyplot as plt
    import seaborn as sns
    # create figure
    fig, ax = plt.subplots(1, 2, figsize=(12, 5))
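The rest of `plot_transform_log` is not shown here. As a hedged sketch of one way such a side-by-side comparison could be completed, the version below uses plain matplotlib histograms instead of seaborn. The function name `plot_log1p_comparison`, the bin count and the demo data are assumptions, not the author's code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_log1p_comparison(df, feature, bins=30):
    """Plot histograms of df[feature] before and after a log(1 + x) transformation."""
    values = df[feature].dropna()
    fig, ax = plt.subplots(1, 2, figsize=(12, 5))
    # Left panel: the raw distribution.
    ax[0].hist(values, bins=bins)
    ax[0].set_title(f"{feature} (original)")
    # Right panel: the same values after log(1 + x), which compresses a long right tail.
    ax[1].hist(np.log1p(values), bins=bins)
    ax[1].set_title(f"log(1 + {feature})")
    return fig, ax

# Hypothetical right-skewed data.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"price": rng.lognormal(mean=3.0, sigma=1.0, size=500)})

fig, ax = plot_log1p_comparison(demo, "price")
```

`np.log1p` is defined for values greater than -1, so this kind of transformation suits non-negative, right-skewed features such as prices or counts.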