@IwanThomas
IwanThomas / Handling Variables.md
Last active June 21, 2017 10:38
Handling Variables in Regression

Sources:

The paper "Learning When to Be Discrete: Continuous vs. Categorical Predictors" by David Pasta

There can be instances where it makes sense to treat a continuous predictor as categorical and a categorical predictor as continuous.

Treating a continuous predictor as categorical

  • If the continuous variable has a linear relationship with the outcome, converting it into a categorical variable can remove information.
  • On the other hand, if the relationship is not linear, treating the variable as categorical can make sense, because it lets the model capture more complicated relationships.
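
The trade-off in the two bullets above can be illustrated with a small sketch. It is not taken from the paper: the data, the bin edges, and the helper names `linear_sse` and `binned_sse` are made up for illustration. It compares the squared error of a straight-line fit (continuous treatment) against per-bin means (categorical treatment).

```python
from statistics import mean

def linear_sse(xs, ys):
    """Sum of squared errors of an ordinary least-squares line y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def binned_sse(xs, ys, edges):
    """Sum of squared errors when x is cut into bins and each bin predicts its own mean."""
    groups = {}
    for x, y in zip(xs, ys):
        bin_index = sum(x >= edge for edge in edges)  # number of edges at or below x
        groups.setdefault(bin_index, []).append(y)
    return sum((y - mean(g)) ** 2 for g in groups.values() for y in g)

xs = list(range(12))
edges = [3, 6, 9]                          # hypothetical cut points: four bins of three values
straight = [2 * x + 1 for x in xs]         # perfectly linear outcome
u_shaped = [(x - 5.5) ** 2 for x in xs]    # clearly non-linear outcome

print("linear outcome:   line", linear_sse(xs, straight), "bins", binned_sse(xs, straight, edges))
print("U-shaped outcome: line", linear_sse(xs, u_shaped), "bins", binned_sse(xs, u_shaped, edges))
```

For the linear outcome the line fits essentially perfectly and binning throws information away; for the U-shaped outcome the bins fit much better than the line.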
@IwanThomas
IwanThomas / SQL Database Search.sql
Last active May 24, 2017 14:14
Searching around a Database
select name from sys.tables where name like '%%';
select t.name as TableName, c.name as ColumnName
from sys.columns c join sys.tables t on c.object_id = t.object_id
where c.name like '%%'
@IwanThomas
IwanThomas / Identify Types.py
Last active May 24, 2017 14:14
Identifying type of data in Pandas DF
qualitative = [f for f in train.columns if train.dtypes[f] == 'object']
quantitative = [f for f in train.columns if train.dtypes[f] != 'object']
@IwanThomas
IwanThomas / Hide Code Jupyter.py
Last active September 14, 2017 14:37
Hiding Code in Jupyter Notebook
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
@IwanThomas
IwanThomas / Clean HTML Files.py
Last active May 24, 2017 14:14
Extracting Text from HTML Files
# Source: http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python
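# Python 3: urlopen lives in urllib.request (urllib.urlopen is Python 2 only)
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")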
# kill all script and style elements
@IwanThomas
IwanThomas / EXIST vs IN Operator.sql
Created May 12, 2017 12:25
How does the EXISTS operator differ from the IN operator? Let's illustrate this with an example.
select bk_title, bk_publisher
from books where exists
(select * from location where
loc_shelf = 4)
-- This query would return all 12 books,
-- not just the ones on shelf 4.
-- The subquery does not refer to the outer row, and the
-- EXISTS predicate only requires that it return at least one row.
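
The behaviour described in the comments above can be reproduced in a small SQLite sketch. The schema here is an assumption: the key columns `bk_id` and `loc_bk_id`, the sample rows, and the use of SQLite are all hypothetical, since the original tables are not shown. It contrasts the uncorrelated EXISTS with an IN subquery and with a correlated EXISTS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema: bk_id and loc_bk_id are assumed key columns.
CREATE TABLE books (bk_id INTEGER PRIMARY KEY, bk_title TEXT, bk_publisher TEXT);
CREATE TABLE location (loc_bk_id INTEGER, loc_shelf INTEGER);
INSERT INTO books VALUES (1, 'Book A', 'Pub X'), (2, 'Book B', 'Pub Y'), (3, 'Book C', 'Pub X');
INSERT INTO location VALUES (1, 4), (2, 7), (3, 4);
""")

def titles(sql):
    """Run a query and return the book titles as a plain list."""
    return [row[0] for row in conn.execute(sql)]

# Uncorrelated EXISTS: true for every outer row as long as shelf 4 holds anything.
uncorrelated = titles("""
    SELECT bk_title FROM books
    WHERE EXISTS (SELECT * FROM location WHERE loc_shelf = 4)
    ORDER BY bk_id""")

# IN: keeps only books whose id appears in the subquery result.
with_in = titles("""
    SELECT bk_title FROM books
    WHERE bk_id IN (SELECT loc_bk_id FROM location WHERE loc_shelf = 4)
    ORDER BY bk_id""")

# Correlated EXISTS: the subquery refers to the outer row, so it filters like IN.
correlated = titles("""
    SELECT bk_title FROM books
    WHERE EXISTS (SELECT 1 FROM location
                  WHERE loc_bk_id = books.bk_id AND loc_shelf = 4)
    ORDER BY bk_id""")

print(uncorrelated, with_in, correlated)
```

Only the uncorrelated EXISTS returns every book; once the subquery is tied to the outer row, EXISTS and IN select the same rows.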
@IwanThomas
IwanThomas / Datetime Comparisons.md
Last active May 24, 2017 14:13
Datetime Comparisons

Be careful with datetime comparisons in SQL. For example,

Date < '2017-01-30'

is the same as

Date < '2017-01-30 00:00:00' 

That is, it does not include any of that day. To account for the day as well, use the following:
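
The author's own fix is not shown in this excerpt. One common approach, sketched here under assumptions, is to compare against the start of the next day with a strict `<`. The `events` table and its rows are hypothetical, and SQLite is used only for convenience: it stores these timestamps as ISO-8601 text and compares them as strings, so the details may differ on other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, ts TEXT)")  # hypothetical table
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "2017-01-29 23:59:59"),
    (2, "2017-01-30 00:00:00"),
    (3, "2017-01-30 15:30:00"),
    (4, "2017-01-31 00:00:00"),
])

def ids_before(bound):
    """Return the ids of events whose timestamp is strictly before the bound."""
    rows = conn.execute("SELECT id FROM events WHERE ts < ? ORDER BY id", (bound,))
    return [row[0] for row in rows]

excluding_30th = ids_before("2017-01-30")  # stops at midnight at the start of the 30th
including_30th = ids_before("2017-01-31")  # strict upper bound on the next day

print(excluding_30th, including_30th)
```

Using the next day as an exclusive upper bound avoids having to spell out the last representable instant of the day.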

@IwanThomas
IwanThomas / PIVOT.sql
Created May 12, 2017 12:31
The PIVOT operator
-- Example taken from StackOverflow
-- http://stackoverflow.com/documentation/sql-server/591/pivot-unpivot/8325/simple-pivot-unpivot-t-sql#t=201705121229519428917
CREATE TABLE tbl_stock(item NVARCHAR(10), weekday NVARCHAR(10), price INT);
INSERT INTO tbl_stock VALUES
('Item1', 'Mon', 110), ('Item2', 'Mon', 230), ('Item3', 'Mon', 150),
('Item1', 'Tue', 115), ('Item2', 'Tue', 231), ('Item3', 'Tue', 162),
('Item1', 'Wed', 110), ('Item2', 'Wed', 240), ('Item3', 'Wed', 162),
@IwanThomas
IwanThomas / t_tests.md
Last active June 26, 2017 16:16
t tests

Three types of t-test:

  • one-sample t-test: compares a sample, whose mean and standard deviation we can compute, against a population with a known mean but unknown standard deviation. We use the sample standard deviation to compute the standard error.
  • independent two-sample t-test: compares two independent samples. H0 is that the two population means are equal; Ha is that they differ. Our degrees of freedom are df = n1 + n2 - 2. We assume that both datasets are independent, that both are approximately normally distributed, and that their variances are approximately equal. We can test this last assumption, the homogeneity of variance, by running Bartlett's test. If this condition is not met, we can run Welch's test instead.
  • dependent-samples t-test: used for dependent data. These might be pairs of data points recorded for a given individual before and after they do something, such as taking a drug. We might also treat a husband and wife as dependent samples (they are unlikely to be independent).
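
As a rough sketch of how these statistics are computed, the functions below implement the pooled two-sample, Welch, and paired t statistics using only the standard library. The function names and sample data are made up for illustration, and the sketch stops at the t statistic and degrees of freedom (no p-values).

```python
import math
from statistics import mean, stdev, variance

def pooled_t(a, b):
    """Independent two-sample t statistic assuming equal variances; df = n1 + n2 - 2."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

def welch_t(a, b):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom (unequal variances)."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def paired_t(before, after):
    """Dependent-samples t statistic computed on the per-pair differences; df = n - 1."""
    diffs = [y - x for x, y in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical data for illustration only.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
print(pooled_t(group_a, group_b))
print(welch_t(group_a, group_b))
print(paired_t([10, 12, 9, 11], [12, 13, 11, 14]))
```

With equal sample sizes the pooled and Welch t statistics coincide; only the degrees of freedom differ, with Welch's shrinking as the variances diverge.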
@IwanThomas
IwanThomas / Assumptions in MLR.md
Last active May 24, 2017 14:11
Multiple Linear Regression

The assumptions of simple regression also hold for multiple regression. We can check the first four assumptions before running our regression; assumptions 5 and 6 are checked after running the regression analysis. They are:

  1. The outcome variable should be continuous and cover a wide range.

  2. Each value of the outcome variable should be independent of every other value. For example, this assumption would be violated if there were some sort of time dependency in the data.

  3. The relationship between each predictor variable and the outcome variable should be approximately linear. This is checked with a scatter plot of each predictor against the outcome.

  4. The continuous variables should be approximately normally distributed and not contain extreme outliers. We can check this by plotting a histogram and computing the Kolmogorov-Smirnov (K-S) statistic. A transformation of some variables may be required.
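
Two of the pre-regression checks can be roughed out numerically. The sketch below is an assumption-laden stand-in, not the checks named above: it uses a lag-1 autocorrelation as a crude screen for time dependency (assumption 2) and a z-score rule, rather than the K-S statistic, as a crude screen for extreme outliers (assumption 4). The function names, threshold, and data are hypothetical.

```python
from statistics import mean, stdev

def lag1_autocorrelation(values):
    """Correlation of a series with itself shifted by one step; values far from 0 suggest time dependency."""
    m = mean(values)
    numerator = sum((a - m) * (b - m) for a, b in zip(values, values[1:]))
    denominator = sum((v - m) ** 2 for v in values)
    return numerator / denominator

def extreme_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

trending = list(range(1, 21))   # steadily increasing outcome: strong positive dependence
alternating = [1, -1] * 10      # flips sign every step: strong negative dependence
measurements = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
                10, 11, 9, 10, 12, 8, 10, 11, 9, 100]

print(lag1_autocorrelation(trending))
print(lag1_autocorrelation(alternating))
print(extreme_outliers(measurements))
```

These screens only flag candidates for a closer look; plots remain the primary check.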