@IwanThomas
IwanThomas / KSTest.py
Created May 16, 2017 13:51
Python function for the Kolmogorov–Smirnov test
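def kstest(df, feature):
    "Check if the distribution of df[feature] fits the null hypothesis (normality)"
    import scipy.stats as st
    # Index by column name: df.feature would look for a column literally called "feature"
    mean = df[feature].mean()
    std = df[feature].std()
    # Compare the column against a normal CDF with the same mean and standard deviation
    return st.kstest(df[feature], st.norm.cdf, args=(mean, std))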
IwanThomas / Checks_Regression.py
Last active May 19, 2017 15:25
Regression Analysis Functions
def regression_checks(expected, predicted):
"""
Verify data meets the assumptions of regression analysis.
Parameters
----------
Xtrain : array-like
Training vector
ytrain : array-like
IwanThomas / Assumptions in MLR.md
Last active May 24, 2017 14:11
Multiple Linear Regression

The assumptions of simple regression also hold for multiple regression. We can check the first four assumptions before running our regression; assumptions 5 and 6 are checked after running the regression analysis. They are:

  1. The outcome variable should be continuous and cover a wide range.

  2. Each value of the outcome variable should be independent of each other value. For example, this assumption would be violated if there were some sort of time dependency in the data.

  3. The relationship between each predictor variable and the outcome variable should be approximately linear. This is checked by plotting a scatter plot.

  4. The continuous variables should be approximately normally distributed and not contain extreme outliers. We can check this by plotting a histogram and computing the K-S statistic. A transformation of some variables may be required.
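As a rough sketch of the check described in assumption 4, the snippet below computes the K-S statistic against a normal distribution fitted to the data and counts extreme outliers by z-score. The function name `check_normality`, the 3-standard-deviation cut-off and the synthetic samples are illustrative assumptions, not part of the original notes.

```python
import numpy as np
import scipy.stats as st


def check_normality(values, z_cutoff=3.0):
    """Return the K-S statistic, its p-value and the number of extreme outliers.

    z_cutoff is an assumed rule of thumb: points further than this many
    standard deviations from the mean are counted as extreme outliers.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std(ddof=1)  # sample standard deviation
    # Compare the sample with a normal distribution that has the same mean and sd
    statistic, p_value = st.kstest(values, "norm", args=(mean, std))
    n_outliers = int(np.sum(np.abs(values - mean) > z_cutoff * std))
    return statistic, p_value, n_outliers


# Synthetic data: one roughly normal variable and one strongly right-skewed variable
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50, scale=5, size=500)
skewed_sample = rng.exponential(scale=5, size=500)

stat_normal, p_normal, outliers_normal = check_normality(normal_sample)
stat_skewed, p_skewed, outliers_skewed = check_normality(skewed_sample)

print(f"normal: D={stat_normal:.3f}, p={p_normal:.3f}, outliers={outliers_normal}")
print(f"skewed: D={stat_skewed:.3f}, p={p_skewed:.3g}, outliers={outliers_skewed}")
```

A larger K-S statistic means a worse fit to the normal curve. Because the mean and standard deviation are estimated from the same data, the p-values should be read as approximate.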

IwanThomas / t_tests.md
Last active June 26, 2017 16:16
t tests

Three types of t-test:

  • one-sample t-test: compare a sample with a known mean and sd against a population with a known mean but unknown sd. We use the sample standard deviation to compute the standard error.
  • independent two-sample t-test: compare two independent samples. H0 is that both samples come from populations with the same mean. Ha is that they don't. Our degrees of freedom are df = n1 + n2 - 2. We assume that both samples are independent, that each is approximately normally distributed and that the variances are approximately equal. We can test this last assumption, the homogeneity of variance, by running Bartlett's test. If this condition is not met, we can run Welch's t-test, which does not assume equal variances.
  • dependent samples t-test: used for dependent (paired) data. These might be pairs of data points recorded for a given individual before and after they do something, e.g. take a drug. We might also treat a husband and wife as dependent samples (they are unlikely to be independent).
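The three tests can be sketched with `scipy.stats` as follows. The measurements are made up for illustration, and the 0.05 threshold used to choose between the pooled test and Welch's test is an assumed convention rather than something stated in the notes.

```python
import scipy.stats as st

# Hypothetical measurements, invented for this example
before = [72, 75, 78, 71, 69, 80, 74, 77]
after = [70, 72, 75, 70, 68, 76, 73, 74]
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 5.2]
group_b = [6.1, 6.4, 5.9, 6.8, 6.3, 6.6, 6.0, 6.5]

# One-sample t-test: does the mean of `before` differ from a population mean of 70?
one_sample = st.ttest_1samp(before, popmean=70)

# Bartlett's test for homogeneity of variance between the two groups
bartlett = st.bartlett(group_a, group_b)

# Independent two-sample t-test; switch to Welch's test if the variances look unequal
equal_var = bool(bartlett.pvalue > 0.05)
two_sample = st.ttest_ind(group_a, group_b, equal_var=equal_var)

# Degrees of freedom for the pooled-variance two-sample test
degrees_of_freedom = len(group_a) + len(group_b) - 2

# Dependent (paired) t-test on the before/after measurements
paired = st.ttest_rel(before, after)

print("one-sample:", one_sample.statistic, one_sample.pvalue)
print("two-sample:", two_sample.statistic, two_sample.pvalue, "equal_var =", equal_var)
print("paired:", paired.statistic, paired.pvalue)
```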
IwanThomas / PIVOT.sql
Created May 12, 2017 12:31
The PIVOT operator
-- Example taken from StackOverflow
-- http://stackoverflow.com/documentation/sql-server/591/pivot-unpivot/8325/simple-pivot-unpivot-t-sql#t=201705121229519428917
CREATE TABLE tbl_stock(item NVARCHAR(10), weekday NVARCHAR(10), price INT);
INSERT INTO tbl_stock VALUES
('Item1', 'Mon', 110), ('Item2', 'Mon', 230), ('Item3', 'Mon', 150),
('Item1', 'Tue', 115), ('Item2', 'Tue', 231), ('Item3', 'Tue', 162),
('Item1', 'Wed', 110), ('Item2', 'Wed', 240), ('Item3', 'Wed', 162),
IwanThomas / Datetime Comparisons.md
Last active May 24, 2017 14:13
Datetime Comparisons

Be careful with datetime comparisons in SQL. For example,

Date < '2017-01-30'

is the same as

Date < '2017-01-30 00:00:00' 

That is, it does not include any of that day. To account for the day as well, use the following:
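One common way to do this (an assumption here, not necessarily the query the note had in mind) is to compare against midnight at the start of the following day. The sketch below uses Python's built-in `sqlite3` module as a stand-in database. The `orders` table and its columns are hypothetical, and SQLite compares these ISO-formatted values as text, which gives the same ordering as a datetime comparison.

```python
import sqlite3

# In-memory SQLite database; table and column names are hypothetical
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2017-01-29 18:30:00"),
        (2, "2017-01-30 00:00:00"),
        (3, "2017-01-30 09:15:00"),
        (4, "2017-01-31 00:00:00"),
    ],
)

# The bound is midnight at the start of 2017-01-30, so rows on that day are excluded
excluding_day = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM orders WHERE order_date < '2017-01-30' ORDER BY id"
    )
]

# Comparing against the start of the next day includes all of 2017-01-30
including_day = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM orders WHERE order_date < '2017-01-31' ORDER BY id"
    )
]

print(excluding_day)
print(including_day)
```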

IwanThomas / EXIST vs IN Operator.sql
Created May 12, 2017 12:25
How does the EXISTS operator differ from the IN operator? Let's illustrate this with an example
select bk_title, bk_publisher
from books where exists
(select * from location where
loc_shelf = 4)
-- This query would return all 12 books
-- and not just the ones on shelf 4.
-- The EXISTS predicate only requires that the
-- subquery returns at least one row.
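The difference can be reproduced with a small sketch using Python's built-in `sqlite3` module. The schema is hypothetical: the `bk_id` and `loc_bk_id` columns that link the two tables are assumptions, and three books are used instead of twelve.

```python
import sqlite3

# In-memory SQLite database with a hypothetical schema linking books to shelves
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE books (bk_id INTEGER, bk_title TEXT, bk_publisher TEXT);
    CREATE TABLE location (loc_bk_id INTEGER, loc_shelf INTEGER);
    INSERT INTO books VALUES (1, 'Book A', 'Pub X'), (2, 'Book B', 'Pub Y'), (3, 'Book C', 'Pub X');
    INSERT INTO location VALUES (1, 4), (2, 2), (3, 4);
    """
)


def titles(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


# Uncorrelated EXISTS: the subquery returns a row, so every book passes the filter
uncorrelated = titles(
    "SELECT bk_title FROM books "
    "WHERE EXISTS (SELECT * FROM location WHERE loc_shelf = 4) "
    "ORDER BY bk_id"
)

# Correlated EXISTS: the subquery is evaluated per book, so only shelf-4 books pass
correlated = titles(
    "SELECT bk_title FROM books "
    "WHERE EXISTS (SELECT * FROM location WHERE loc_bk_id = bk_id AND loc_shelf = 4) "
    "ORDER BY bk_id"
)

# IN: compares each book's id against the list of ids found on shelf 4
with_in = titles(
    "SELECT bk_title FROM books "
    "WHERE bk_id IN (SELECT loc_bk_id FROM location WHERE loc_shelf = 4) "
    "ORDER BY bk_id"
)

print(uncorrelated)
print(correlated)
print(with_in)
```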
IwanThomas / Clean HTML Files.py
Last active May 24, 2017 14:14
Extracting Text from HTML Files
# Source: http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python
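from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")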
# kill all script and style elements
IwanThomas / Hide Code Jupyter.py
Last active September 14, 2017 14:37
Hiding Code in Jupyter Notebook
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
IwanThomas / Identify Types.py
Last active May 24, 2017 14:14
Identifying type of data in Pandas DF
qualitative = [f for f in train.columns if train.dtypes[f] == 'object']
quantitative = [f for f in train.columns if train.dtypes[f] != 'object']
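A similar split can be sketched with `DataFrame.select_dtypes`. The `train` frame below is a made-up stand-in for the real data. Note that this variant selects on numeric dtypes, so it differs slightly from the comprehensions above: non-numeric columns such as datetimes or booleans would land in `qualitative` rather than `quantitative`.

```python
import pandas as pd

# Hypothetical training data, invented for this example
train = pd.DataFrame(
    {
        "price": [100, 150, 200],
        "area": [50.5, 70.0, 91.2],
        "city": ["Leeds", "York", "Bath"],
    }
)

# Numeric columns are treated as quantitative, everything else as qualitative
quantitative = train.select_dtypes(include="number").columns.tolist()
qualitative = train.select_dtypes(exclude="number").columns.tolist()

print(quantitative)
print(qualitative)
```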