Iwan Thomas IwanThomas

## Handling Variables.md

      
Last active: June 21, 2017 10:38

Handling Variables in Regression

Sources:
The paper "Learning When to Be Discrete: Continuous vs. Categorical Predictors" by David Pasta.

There can be instances where it makes sense to treat a continuous predictor as categorical, or a categorical predictor as continuous.

Treating a continuous predictor as categorical

If the continuous variable has a linear relationship with the outcome, converting it into a categorical variable throws away information.
On the other hand, if the relationship is not linear, treating the variable as categorical can make sense, because it lets the model capture more complicated relationships.
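
The trade-off can be illustrated with a small sketch. Everything here is an assumption made for illustration: the simulated `age` and `outcome` columns, the U-shaped relationship between them, and the bin edges. It compares a straight-line fit on the continuous predictor with a fit that predicts each band by its own mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the outcome has a U-shaped (non-linear) relationship with age.
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=500)
outcome = (age - 50) ** 2 / 100 + rng.normal(0, 1, size=500)
df = pd.DataFrame({"age": age, "outcome": outcome})

# Treat the continuous predictor as categorical by binning it (bin edges are arbitrary).
df["age_band"] = pd.cut(df["age"], bins=[20, 35, 50, 65, 80], include_lowest=True)

# Continuous treatment: a straight-line fit.
slope, intercept = np.polyfit(df["age"], df["outcome"], deg=1)
linear_pred = slope * df["age"] + intercept

# Categorical treatment: each band is predicted by its own mean outcome.
band_pred = df.groupby("age_band", observed=True)["outcome"].transform("mean")


def r_squared(y, y_hat):
    """Share of the variance in y explained by the predictions y_hat."""
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot


r2_linear = r_squared(df["outcome"], linear_pred)
r2_banded = r_squared(df["outcome"], band_pred)
print(f"linear: {r2_linear:.2f}, banded: {r2_banded:.2f}")
```

With a relationship this far from linear, the banded version explains much more of the variance. If the simulated relationship were a straight line instead, binning would only discard information.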


## SQL Database Search.sql
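-- Find tables whose name contains a search term (put the term between the % wildcards).
select name from sys.tables where name like '%%';

-- Find columns whose name contains a search term, along with the table each one belongs to.
select t.name as TableName, c.name as ColumnName
from sys.columns c join sys.tables t on c.object_id = t.object_id
where c.name like '%%';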

## Identify Types.py
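# `train` is assumed to be a pandas DataFrame; object-dtype columns are treated as qualitative.
qualitative = [f for f in train.columns if train.dtypes[f] == 'object']
# Every other column is treated as quantitative.
quantitative = [f for f in train.columns if train.dtypes[f] != 'object']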
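
A runnable version of the same split, as a sketch: the small `train` DataFrame and its column names are made up for illustration. `select_dtypes` is shown as an alternative way to express the same idea.

```python
import pandas as pd

# Hypothetical stand-in for the `train` DataFrame.
train = pd.DataFrame({
    "neighborhood": pd.Series(["A", "B", "A"], dtype="object"),
    "rooms": [3, 4, 2],
    "price": [200.0, 310.5, 150.0],
})

# The split from the snippet above.
qualitative = [f for f in train.columns if train.dtypes[f] == "object"]
quantitative = [f for f in train.columns if train.dtypes[f] != "object"]

# The same split expressed with select_dtypes.
qualitative_alt = train.select_dtypes(include="object").columns.tolist()
quantitative_alt = train.select_dtypes(exclude="object").columns.tolist()

print(qualitative, quantitative)
```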

## Hide Code Jupyter.py
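from IPython.display import HTML

# Inject a script that hides or shows every code input cell.
# Assumes the classic Jupyter Notebook, where jQuery ($) and div.input are available.
HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Toggle code"></form>''')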

## Clean HTML Files.py
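# Source: http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python

from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()

# get the remaining visible text
text = soup.get_text()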

## EXIST vs IN Operator.sql

select bk_title, bk_publisher
from books where exists
(select * from location where
loc_shelf = 4);

-- This query would return all 12 books
-- and not just the ones on shelf 4.
-- The subquery is not correlated with books, and
-- the EXISTS predicate only requires that the
-- subquery returns at least one row.
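
The difference can be reproduced with Python's built-in `sqlite3`. This is a sketch under assumptions: the three sample rows and the `bk_id` and `loc_bk_id` linking columns are hypothetical, since the original schema is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (bk_id INTEGER PRIMARY KEY, bk_title TEXT, bk_publisher TEXT);
    CREATE TABLE location (loc_bk_id INTEGER, loc_shelf INTEGER);
    INSERT INTO books VALUES (1, 'Book A', 'Pub X'), (2, 'Book B', 'Pub Y'), (3, 'Book C', 'Pub X');
    INSERT INTO location VALUES (1, 4), (2, 2), (3, 4);
""")

# Uncorrelated EXISTS: true for every row of books as long as any book sits on shelf 4.
uncorrelated = conn.execute("""
    SELECT bk_title FROM books
    WHERE EXISTS (SELECT * FROM location WHERE loc_shelf = 4)
    ORDER BY bk_title
""").fetchall()

# Correlated EXISTS: the subquery is evaluated per book, so only shelf-4 books match.
correlated = conn.execute("""
    SELECT bk_title FROM books b
    WHERE EXISTS (SELECT 1 FROM location l
                  WHERE l.loc_bk_id = b.bk_id AND l.loc_shelf = 4)
    ORDER BY bk_title
""").fetchall()

# IN: filters on the set of ids returned by the subquery, giving the same result.
with_in = conn.execute("""
    SELECT bk_title FROM books
    WHERE bk_id IN (SELECT loc_bk_id FROM location WHERE loc_shelf = 4)
    ORDER BY bk_title
""").fetchall()

print(uncorrelated, correlated, with_in)
```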

## Datetime Comparisons.md

      
Last active: May 24, 2017 14:13

Datetime Comparisons

Be careful with datetime comparisons in SQL. For example,
Date < '2017-01-30'
is the same as
Date < '2017-01-30 00:00:00'
That is, it does not include any of that day. To account for the day as well, use the following:
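
One common way to include the whole day is to compare against midnight of the following day with a strict `<`. The sketch below uses Python's built-in `sqlite3` with a hypothetical `events` table and `event_date` column holding ISO-8601 text. SQLite compares these values as strings, which for this format gives the same ordering as the datetime comparison described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [
        ("2017-01-29 23:59:59",),
        ("2017-01-30 00:00:00",),
        ("2017-01-30 09:15:00",),
        ("2017-01-31 00:00:00",),
    ],
)


def count_rows(where_clause):
    """Count the rows of the hypothetical events table matching a WHERE clause."""
    query = "SELECT COUNT(*) FROM events WHERE " + where_clause
    return conn.execute(query).fetchone()[0]


# Upper bound at midnight on the 30th: nothing from the 30th is included.
excludes_day = count_rows("event_date < '2017-01-30'")

# Upper bound at midnight on the 31st: all of the 30th is included, none of the 31st.
includes_day = count_rows("event_date < '2017-01-31'")

print(excludes_day, includes_day)
```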

  
## PIVOT.sql

-- Example taken from StackOverflow
-- http://stackoverflow.com/documentation/sql-server/591/pivot-unpivot/8325/simple-pivot-unpivot-t-sql#t=201705121229519428917

-- Long format: one row per item and weekday.
CREATE TABLE tbl_stock(item NVARCHAR(10), weekday NVARCHAR(10), price INT);

INSERT INTO tbl_stock VALUES
('Item1', 'Mon', 110), ('Item2', 'Mon', 230), ('Item3', 'Mon', 150),
('Item1', 'Tue', 115), ('Item2', 'Tue', 231), ('Item3', 'Tue', 162),
('Item1', 'Wed', 110), ('Item2', 'Wed', 240), ('Item3', 'Wed', 162);
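
For comparison, the same long-to-wide reshaping can be sketched in pandas using the rows shown above. The orientation chosen here (weekdays as rows, items as columns) is an assumption and may differ from the StackOverflow example.

```python
import pandas as pd

# The rows inserted into tbl_stock above, as a long-format DataFrame.
stock = pd.DataFrame(
    [
        ("Item1", "Mon", 110), ("Item2", "Mon", 230), ("Item3", "Mon", 150),
        ("Item1", "Tue", 115), ("Item2", "Tue", 231), ("Item3", "Tue", 162),
        ("Item1", "Wed", 110), ("Item2", "Wed", 240), ("Item3", "Wed", 162),
    ],
    columns=["item", "weekday", "price"],
)

# Pivot: one row per weekday, one column per item.
wide = stock.pivot(index="weekday", columns="item", values="price")

# Unpivot: melt turns the wide table back into one row per (weekday, item).
long_again = wide.reset_index().melt(
    id_vars="weekday", var_name="item", value_name="price"
)

print(wide)
```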

## t_tests.md

      
Last active: June 26, 2017 16:16

t tests

Three types of t-test:

1. One-sample t-test: compares a sample, whose mean and standard deviation we can compute, against a population with a known mean but unknown standard deviation. We use the sample standard deviation to compute the standard error.
2. Independent two-sample t-test: compares two independent samples. H0 is that both samples come from populations with the same mean; Ha is that they do not. The degrees of freedom are df = n1 + n2 - 2. We assume that the two samples are independent, that both are approximately normally distributed, and that their variances are approximately equal. We can test this last assumption, the homogeneity of variance, by running Bartlett's test. If it is not met, we can run Welch's t-test instead, which does not assume equal variances.
3. Dependent (paired) samples t-test: used for dependent data. The data might be pairs of measurements recorded for the same individual before and after some intervention, such as taking a drug. A husband and wife might also be treated as dependent samples, since they are unlikely to be independent.
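
The three tests map onto functions in `scipy.stats`. The sketch below runs them on simulated data; the group sizes, means and standard deviations are arbitrary assumptions chosen only to make the example run.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data: two independent groups and one before/after pairing.
group_a = rng.normal(loc=100, scale=15, size=40)
group_b = rng.normal(loc=110, scale=15, size=35)
before = rng.normal(loc=120, scale=10, size=30)
after = before - 5 + rng.normal(loc=0, scale=2, size=30)

# 1. One-sample t-test against a known population mean of 100.
one_sample = stats.ttest_1samp(group_a, popmean=100)

# The same statistic by hand: the standard error uses the sample standard deviation.
standard_error = group_a.std(ddof=1) / np.sqrt(len(group_a))
t_manual = (group_a.mean() - 100) / standard_error

# 2. Independent two-sample t-test, with Bartlett's test for equal variances.
bartlett = stats.bartlett(group_a, group_b)
pooled = stats.ttest_ind(group_a, group_b, equal_var=True)   # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test
df_pooled = len(group_a) + len(group_b) - 2

# 3. Dependent (paired) samples t-test on the before/after measurements.
paired = stats.ttest_rel(before, after)

print(one_sample.pvalue, bartlett.pvalue, pooled.pvalue, welch.pvalue, paired.pvalue)
```

A small Bartlett p-value would suggest unequal variances, in which case the Welch result is the one to read.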


## Assumptions in MLR.md

      
Last active: May 24, 2017 14:11

Multiple Linear Regression

The assumptions of simple regression also hold for multiple regression. They are:

1. The outcome variable should be continuous and cover a wide range.
2. Each value of the outcome variable should be independent of every other value. For example, this assumption would be violated if there were some sort of time dependency in the data.
3. The relationship between each predictor variable and the outcome variable should be approximately linear. This can be checked with a scatter plot.
4. The continuous variables should be approximately normally distributed and should not contain extreme outliers. We can check this by plotting a histogram and computing the Kolmogorov-Smirnov (K-S) statistic. A transformation of some variables may be required.

We can check these four assumptions before running our regression. Assumptions 5 and 6 are checked after running our regression analysis.
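
The normality check and the effect of a transformation can be sketched with `scipy.stats.kstest`. The `income` variable is simulated and purely hypothetical. Because the mean and standard deviation are estimated from the same data, the K-S p-value here is only approximate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical right-skewed continuous variable.
income = rng.lognormal(mean=10, sigma=0.8, size=400)


def ks_against_normal(x):
    """Standardise x and compare it with a standard normal distribution."""
    z = (x - x.mean()) / x.std(ddof=1)
    return stats.kstest(z, "norm")


raw = ks_against_normal(income)
logged = ks_against_normal(np.log(income))  # a log transformation reduces the skew

print(f"raw:    D={raw.statistic:.3f}, p={raw.pvalue:.4f}")
print(f"logged: D={logged.statistic:.3f}, p={logged.pvalue:.4f}")
```

A smaller K-S statistic after the transformation indicates that the transformed variable is closer to a normal distribution.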