Iwan Thomas IwanThomas

## KSTest.py
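def kstest(df, feature):
  "Run a one-sample K-S test of df[feature] against a normal distribution fitted to it"

  import scipy.stats as st
  # df.feature would look up a column literally named "feature"; index by name instead
  mean = df[feature].mean()
  std = df[feature].std()

  return st.kstest(df[feature], st.norm.cdf, args = (mean, std))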

## Checks_Regression.py
def regression_checks(expected, predicted):
    """
    Verify data meets the assumptions of regression analysis.

    Parameters
    ----------
    Xtrain : array-like
        Training vector

    ytrain : array-like

## Assumptions in MLR.md

      
              4 files
            
          
              0 forks
            
          
                0 comments
              
            
              0 stars
            
          
                IwanThomas
                / Assumptions in MLR.md
            
            
              Last active
              May 24, 2017 14:11
            
              
                Multiple Linear Regression
              
          
The assumptions of simple regression also hold for multiple regression. They are:

1. The outcome variable should be continuous and cover a wide range.
2. Each value of the outcome variable should be independent of every other value. For example, this assumption would be violated if there were some sort of time dependency in the data.
3. The relationship between each predictor variable and the outcome variable should be approximately linear. This is checked with a scatter plot of each predictor against the outcome.
4. The continuous variables should be approximately normally distributed and not contain extreme outliers. We can check this by plotting a histogram and computing the K-S statistic. A transformation of some variables may be required.

We can check these four assumptions before running our regression. Assumptions 5 and 6 are checked after running our regression analysis.
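
As a rough illustration of the K-S check in assumption 4, the sketch below computes the K-S distance between a sample and a normal distribution fitted to it, using only the Python standard library. The function name `ks_statistic_normal` and the synthetic samples are assumptions made for this example, not part of the original gists. Because the mean and standard deviation are estimated from the same data, the result is best read as a relative measure of non-normality rather than as input to standard K-S p-value tables.

```python
import random
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(values):
    """Return the K-S distance between the sample and a normal fitted to it."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Hypothetical data: one roughly normal sample, one strongly skewed sample.
random.seed(0)
normal_sample = [random.gauss(10, 2) for _ in range(500)]
skewed_sample = [random.expovariate(1.0) for _ in range(500)]

d_normal = ks_statistic_normal(normal_sample)
d_skewed = ks_statistic_normal(skewed_sample)
print(f"normal sample D = {d_normal:.3f}, skewed sample D = {d_skewed:.3f}")
```

A noticeably larger distance for a variable suggests it may need a transformation before it is used in the regression.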


## t_tests.md

      
              1 file
            
          
              0 forks
            
          
                0 comments
              
            
              0 stars
            
          
                IwanThomas
                / t_tests.md
            
            
              Last active
              June 26, 2017 16:16
            
              
                t tests
              
          
Three types of t-test:

1. One-sample t-test: compares a sample with a known mean and sd against a population with a known mean but unknown sd. We use the sample standard deviation to compute the standard error.
2. Independent two-sample t-test: compares two independent samples. H0 is that both datasets come from the same distribution (equal means); Ha is that they don't. Our degrees of freedom are df = n1 + n2 - 2. We assume that both datasets are independent, that the distributions are approximately normal and that the variances are approximately equal. We can test this last assumption, the homogeneity of variance, by running a Bartlett test. If this condition is not met, we can run Welch's test.
3. Dependent-samples t-test: used for dependent data. These might be pairs of datapoints recorded for a given individual before and after they do something, such as taking a drug. We might also treat a husband and wife as dependent samples, since they are unlikely to be independent.
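
To make the two-sample and paired cases concrete, here is a minimal standard-library sketch of the test statistics and their degrees of freedom. The function names and the small datasets are hypothetical, chosen only for illustration. It computes the statistics but not p-values; in practice a statistics library would normally be used for those.

```python
import math
from statistics import mean, stdev, variance


def pooled_t(a, b):
    """Independent two-sample t statistic assuming equal variances."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / se, df


def welch_t(a, b):
    """Welch's t statistic, which does not assume equal variances."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def paired_t(before, after):
    """Dependent-samples t statistic computed on the pairwise differences."""
    diffs = [y - x for x, y in zip(before, after)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1


# Hypothetical measurements.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.4, 4.0, 4.8, 4.7]
t_pooled, df_pooled = pooled_t(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)

before = [10, 12, 9, 11]
after = [12, 13, 11, 12]
t_paired, df_paired = paired_t(before, after)
```

With equal group sizes the pooled and Welch statistics coincide; only the degrees of freedom differ, with Welch's never exceeding n1 + n2 - 2.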


## PIVOT.sql

-- Example taken from StackOverflow
-- http://stackoverflow.com/documentation/sql-server/591/pivot-unpivot/8325/simple-pivot-unpivot-t-sql#t=201705121229519428917

CREATE TABLE tbl_stock(item NVARCHAR(10), weekday NVARCHAR(10), price INT);

INSERT INTO tbl_stock VALUES
('Item1', 'Mon', 110), ('Item2', 'Mon', 230), ('Item3', 'Mon', 150),
('Item1', 'Tue', 115), ('Item2', 'Tue', 231), ('Item3', 'Tue', 162),
('Item1', 'Wed', 110), ('Item2', 'Wed', 240), ('Item3', 'Wed', 162),

## Datetime Comparisons.md

      
              1 file
            
          
              0 forks
            
          
                0 comments
              
            
              0 stars
            
          
                IwanThomas
                / Datetime Comparisons.md
            
            
              Last active
              May 24, 2017 14:13
            
              
                Datetime Comparisons
              
          
Be careful with datetime comparisons in SQL. For example,

`Date < '2017-01-30'`

is the same as

`Date < '2017-01-30 00:00:00'`

That is, it does not include any part of that day. To account for the day as well, use the following:
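
The original example is cut off at this point. As a sketch of one common fix (an assumption, not necessarily what the author intended), the comparison can be made against midnight of the following day. The Python snippet below uses `datetime` to mirror the SQL behaviour; the `event` timestamp is a made-up value for illustration.

```python
from datetime import datetime, timedelta

# A date-only string is interpreted as midnight at the start of that day.
cutoff = datetime.fromisoformat("2017-01-30")

# Hypothetical row timestamp that falls during the cutoff day.
event = datetime(2017, 1, 30, 14, 11)

# Comparing against the bare date excludes anything after 00:00:00.
excluded = event < cutoff

# Comparing against the start of the next day includes the whole day.
next_day = cutoff + timedelta(days=1)
included = event < next_day

print(excluded, included)
```

The SQL equivalent would be a strict `<` against the next day's date rather than `<=` against the same day.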

  
## EXIST vs IN Operator.sql

select bk_title, bk_publisher
from books where exists
(select * from location where
loc_shelf = 4)

-- This query would return all 12 books
-- and not just the ones on shelf 4.
-- The EXISTS predicate only requires that the
-- subquery returns at least one row.
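
The behaviour described in the comments above can be reproduced with a small in-memory SQLite database driven from Python. This is only a sketch: the join columns `bk_id` and `loc_bk_id` and the three sample rows are assumptions, since the original schema is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (bk_id INTEGER PRIMARY KEY, bk_title TEXT, bk_publisher TEXT);
CREATE TABLE location (loc_bk_id INTEGER, loc_shelf INTEGER);
INSERT INTO books VALUES (1, 'A', 'P1'), (2, 'B', 'P2'), (3, 'C', 'P3');
INSERT INTO location VALUES (1, 4), (2, 7), (3, 4);
""")

# Uncorrelated EXISTS: the subquery is true for every row, so all books return.
uncorrelated = conn.execute("""
    SELECT bk_title FROM books
    WHERE EXISTS (SELECT * FROM location WHERE loc_shelf = 4)
    ORDER BY bk_title
""").fetchall()

# Correlated EXISTS: the subquery refers to the outer row, so it filters.
correlated = conn.execute("""
    SELECT bk_title FROM books
    WHERE EXISTS (SELECT * FROM location
                  WHERE loc_bk_id = bk_id AND loc_shelf = 4)
    ORDER BY bk_title
""").fetchall()

# IN compares the outer key against the subquery's result set.
with_in = conn.execute("""
    SELECT bk_title FROM books
    WHERE bk_id IN (SELECT loc_bk_id FROM location WHERE loc_shelf = 4)
    ORDER BY bk_title
""").fetchall()

conn.close()
```

Only the correlated EXISTS and the IN version restrict the result to books on shelf 4.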

## Clean HTML Files.py
# Source: http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python

from urllib.request import urlopen  # Python 3; Python 2 used urllib.urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# kill all script and style elements

## Hide Code Jupyter.py
from IPython.display import HTML

HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }

## Identify Types.py
qualitative = [f for f in train.columns if train.dtypes[f] == 'object']
quantitative = [f for f in train.columns if train.dtypes[f] != 'object']