Iwan Thomas IwanThomas

## tally.py
# number each row within its search_id group, starting at 0
data['view_number'] = data.groupby('search_id')['search_id'].cumcount()
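
A small illustration of what `cumcount()` produces here. The DataFrame and its `result` column are made up for the example; only `search_id` and `view_number` come from the snippet.

```python
import pandas as pd

# Hypothetical search log: each row is one result viewed within a search.
data = pd.DataFrame({
    "search_id": [10, 10, 10, 20, 20, 30],
    "result": ["a", "b", "c", "d", "e", "f"],
})

# cumcount() numbers the rows of each group 0, 1, 2, ... in their original order.
data["view_number"] = data.groupby("search_id")["search_id"].cumcount()
print(data)
```

The first row of every `search_id` gets 0, so add 1 if a 1-based view number is preferred.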

## DroppingDuplicateColumns.py
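# drop columns whose values duplicate an earlier column:
# transpose, drop the duplicate rows, transpose back
df = df.T.drop_duplicates().T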
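
A minimal sketch of the transpose trick on a toy frame. The column names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "a_copy": [1, 2, 3],  # same values as "a" under a different name
})

# Duplicate columns become duplicate rows after transposing,
# so drop_duplicates() keeps only the first of each.
deduped = df.T.drop_duplicates().T
print(list(deduped.columns))
```

One caveat: transposing a frame with mixed dtypes turns every column into `object`, so the dtypes may need restoring afterwards, and the double transpose can be slow on wide frames.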

## Approach.md

      
General Approach to Data Analysis prior to Model Building
Last active June 29, 2017 13:54

Below is a fairly comprehensive way of analysing your dataset. However, it can be easy to get bogged down in the details here, and you want to avoid this. Make a quick, dirty, hacky, end-to-end solution to your problem. Then, once you have something very basic in place, it's time to get creative and start iterating on your initial approach. Try to improve each component of your solution and measure the impact to see where to spend your time. Often, acquiring more data or improving the data cleaning and preprocessing steps has a higher ROI than optimizing the machine learning models themselves.
Now, a more comprehensive list of preprocessing steps to go through once you have a quick and dirty solution in place:

Create some hypotheses at the beginning of your analysis. This helps you engage with the problem and think about how different variables affect the outcome variable.

What other features would you like?
What features might you like to engineer?


Then begin the process of EDA to test these hypotheses


## replace_cat_with_mean.py
def replace_cat_with_mean(df, feature, target):
    """
    Function to replace ordinal variable value with mean value for that group

    Motivation
    -----------
    Enumerate ordinal variables to avoid one-hot encoding every feature
    and significantly increasing the feature space.

    As the distance between each category is unknown replace each
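
The snippet above is cut off before the function body. What follows is only a sketch of what such a function might look like, based on its name and the first line of its docstring. The name `replace_cat_with_group_mean`, the body, and the toy data are assumptions, not the original implementation.

```python
import pandas as pd

def replace_cat_with_group_mean(df, feature, target):
    """Return a copy of df where each category in `feature` is replaced
    by the mean of `target` over the rows in that category.
    (Hypothetical helper; the original body is not shown in the source.)"""
    out = df.copy()
    group_means = out.groupby(feature)[target].mean()
    out[feature] = out[feature].map(group_means)
    return out

# Made-up data for illustration.
houses = pd.DataFrame({
    "quality": ["low", "low", "high", "high", "mid"],
    "price": [100.0, 120.0, 300.0, 340.0, 200.0],
})
encoded = replace_cat_with_group_mean(houses, "quality", "price")
print(encoded)
```

Because the encoding uses the target, the group means should probably be computed on the training data only and then applied to validation or test data, otherwise information about the target leaks into the feature.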

## NB Algorithms.md

      
Naive Bayes Algorithm
Last active July 7, 2017 12:02

GaussianNB

When dealing with continuous data, a typical assumption is that the values within each class are normally distributed.
The algorithm computes the mean and variance of each feature for each class and uses them to compute the probability of a value given a class.
It then assigns the sample to the class with the greatest probability.
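
The steps described above can be sketched in plain NumPy. This is an illustrative from-scratch version, not scikit-learn's `GaussianNB`. The function names, the small variance constant, and the toy data are all assumptions made for the example.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the prior, mean and variance of every feature for each class."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[label] = (
            len(rows) / len(X),       # class prior
            rows.mean(axis=0),        # per-feature mean
            rows.var(axis=0) + 1e-9,  # per-feature variance; the constant avoids division by zero
        )
    return params

def predict_gaussian_nb(params, X):
    """Pick, for each sample, the class with the largest log posterior."""
    labels = list(params)
    scores = []
    for label in labels:
        prior, mean, var = params[label]
        # Sum of per-feature Gaussian log densities (the "naive" independence assumption).
        log_likelihood = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var).sum(axis=1)
        scores.append(np.log(prior) + log_likelihood)
    return np.array(labels)[np.argmax(np.array(scores), axis=0)]

# Two well-separated toy classes.
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [5.0, 6.1], [5.2, 5.8], [4.9, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = fit_gaussian_nb(X, y)
predictions = predict_gaussian_nb(model, np.array([[1.1, 2.0], [5.1, 6.0]]))
print(predictions)
```

Working with log probabilities rather than raw products avoids numerical underflow when there are many features.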

MultinomialNB

With a multinomial event model, samples represent the frequencies with which certain events have been generated.
This is the event model typically used for document classification, with events representing the occurrence of a word in a single document.
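
As a rough illustration of the multinomial event model on word counts, here is a NumPy sketch. The vocabulary, the counts, the labels, and the use of add-one (Laplace) smoothing are all assumptions of this example; in practice `sklearn.naive_bayes.MultinomialNB` would normally be used instead.

```python
import numpy as np

vocabulary = ["cheap", "offer", "meeting", "report"]
# Word counts per document (rows) over the vocabulary (columns).
counts = np.array([
    [3, 2, 0, 0],  # spam
    [2, 3, 0, 1],  # spam
    [0, 0, 2, 3],  # work
    [0, 1, 3, 2],  # work
])
labels = np.array(["spam", "spam", "work", "work"])

classes = np.unique(labels)
log_prior = np.log(np.array([(labels == c).mean() for c in classes]))

# Per-class word probabilities with add-one smoothing (an assumption of this sketch),
# so a word never seen in a class does not force its probability to zero.
word_totals = np.array([counts[labels == c].sum(axis=0) for c in classes]) + 1
log_word_prob = np.log(word_totals / word_totals.sum(axis=1, keepdims=True))

def classify(doc_counts):
    """Return the class with the highest log posterior for one count vector."""
    scores = log_prior + log_word_prob @ doc_counts
    return classes[np.argmax(scores)]

predicted = classify(np.array([2, 1, 0, 0]))
print(predicted)
```

Each word occurrence contributes its class-specific log probability once per count, which is what makes word frequency, not just presence, matter in this model.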


## Parameter Names in GridSearchCV.py
# To get the names of parameters to optimise in a GridSearchCV, use the following:

clf.get_params().keys()
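
A short sketch of how those names feed into a grid search. The pipeline, its step names (`scale`, `model`), and the grid values are chosen only for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# Every name listed here is a valid key for param_grid.
param_names = sorted(clf.get_params().keys())
print(param_names)

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"model__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(clf, param_grid, cv=3)
```

Calling `search.fit(X, y)` on your own data would then try each value of `model__C`.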

## HandlingMissingData.py
# plot the number of missing values per column
missing = df.isnull().sum()
missing = missing[missing > 0]
missing.sort_values().plot(kind='bar', figsize=(12, 5))

# split the columns with missing values into qualitative (object dtype) and quantitative
missing_names = list(missing.index)
missing_qual = [column for column in missing_names if df.dtypes[column] == 'object']
missing_quant = [column for column in missing_names if df.dtypes[column] != 'object']

## Containers.md

      
Containers
Last active May 24, 2017 14:11

Containers are a lightweight, fast alternative to traditional VMs.
Unlike VMs, containers don't bundle a full OS, just the libraries and settings required for the software.
Pros: light and quick
Cons: the guest OS must be the same as the host OS, since containers share the host's kernel


## Virtual Machine.md

      
Virtual Machine
Last active May 24, 2017 14:11

A simulated computer that runs on another computer.
Getting a package to run on every single computer can be hard.
Using a VM gets around the issue of "It works on my computer but not yours".
A VM lets you:

install a new OS
create a new, clean environment. This helps with reliability and reproducibility.


You can also snapshot a VM, store it and ship it to colleagues.


## plot_transform.py
def plot_transform_log(df, feature):

    "Plot the histogram of a variable before and after a log(1+x) transformation."

    import matplotlib.pyplot as plt
    import seaborn as sns

    # create figure
    fig, ax = plt.subplots(1,2, figsize=(12,5))
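
The function above is cut off after the figure is created. The following is a sketch of how the before and after histograms might be drawn, using plain matplotlib rather than seaborn. The function name `plot_log_transform_sketch`, the bin count, and the synthetic `income` data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_log_transform_sketch(df, feature):
    """Histogram of df[feature] before and after log(1 + x). Hypothetical helper."""
    values = df[feature].dropna()
    fig, ax = plt.subplots(1, 2, figsize=(12, 5))
    ax[0].hist(values, bins=30)
    ax[0].set_title(f"{feature} (raw)")
    # log1p computes log(1 + x), which is defined at x = 0
    ax[1].hist(np.log1p(values), bins=30)
    ax[1].set_title(f"log(1 + {feature})")
    return fig, ax

# Synthetic right-skewed data for the example.
rng = np.random.default_rng(0)
skewed = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=1, size=500)})
fig, ax = plot_log_transform_sketch(skewed, "income")
```

Call `plt.show()` afterwards in a script; in a notebook the figure is usually displayed automatically.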