Srishti Saha srishtis

  • Mu Sigma
  • Bangalore
@srishtis
srishtis / SQL_map.sql
Last active November 7, 2018 06:46
map control parts to test parts
SELECT *
FROM
(
SELECT TEST_DEVICE, DEVICE_OS_TEST, USAGE_TEST,
CONTROL_DEVICE, DEVICE_OS_CTRL, USAGE_CTRL,
RANK () OVER (PARTITION BY TEST_DEVICE ORDER BY EUCLIDEAN_DEV_5 ) AS DEV_RANK_5,
RANK () OVER (PARTITION BY TEST_DEVICE ORDER BY EUCLIDEAN_DEV_3 ) AS DEV_RANK_3,
RANK () OVER (PARTITION BY TEST_DEVICE ORDER BY EUCLIDEAN_DEV_1 ) AS DEV_RANK_1
FROM EUCLIDEAN_TABLE
) RANKED -- most SQL dialects require an alias on the derived table
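With the ranks computed, the closest control part per test part falls out of a simple filter; a minimal sketch, assuming the 5-variable distance drives the final mapping:

SELECT TEST_DEVICE, CONTROL_DEVICE
FROM
(
SELECT TEST_DEVICE, CONTROL_DEVICE,
RANK () OVER (PARTITION BY TEST_DEVICE ORDER BY EUCLIDEAN_DEV_5 ) AS DEV_RANK_5
FROM EUCLIDEAN_TABLE
) RANKED
WHERE DEV_RANK_5 = 1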
@srishtis
srishtis / SQL_euclidean.sql
Last active November 7, 2018 06:40
compute the combination of Euclidean distances
CREATE TABLE EUCLIDEAN_TABLE AS
SELECT DEVICE_ID_TEST AS TEST_DEVICE, DEVICE_OS_TEST, USAGE_TEST, DEVICE_ID_CTRL AS CONTROL_DEVICE, DEVICE_OS_CTRL, USAGE_CTRL
, SQRT (
(POWER (RETAIL_PRICE,2))+
(POWER (RAM,2))+
(POWER (AGE,2))+
(POWER (STORAGE,2))+
(POWER (SCREEN_SIZE,2))
) AS EUCLIDEAN_DEV_5
-- the preview truncates below; EUCLIDEAN_DEV_3 and EUCLIDEAN_DEV_1 (referenced in SQL_map.sql above)
-- presumably repeat this pattern over subsets of the metric differences, e.g.:
, SQRT (
(POWER (RETAIL_PRICE,2))+
(POWER (RAM,2))+
(POWER (AGE,2))
) AS EUCLIDEAN_DEV_3
, SQRT (POWER (RETAIL_PRICE,2)) AS EUCLIDEAN_DEV_1
FROM JOINED_DIFFS -- hypothetical table holding the output of SQL_join.sql
@srishtis
srishtis / SQL_join.sql
Last active November 7, 2018 06:35
Join test and control groups and compute metric differences
SELECT A.DEVICE_ID_TEST, A.DEVICE_OS_TEST, A.USAGE_TEST, B.DEVICE_ID_CTRL, B.DEVICE_OS_CTRL, B.USAGE_CTRL,
(A.CURR_RETAIL_PRC_TEST)-(B.CURR_RETAIL_PRC_CTRL) AS RETAIL_PRICE,
(A.DEVICE_RAM_TEST)-(B.DEVICE_RAM_CTRL) AS RAM,
(A.AGE_TEST)-(B.AGE_CONTROL) AS AGE,
(A.DEVICE_STORAGE_TEST)-(B.DEVICE_STORAGE_CTRL) AS STORAGE,
(A.SCREEN_INCH_TEST)-(B.SCREEN_INCH_CTRL) AS SCREEN_SIZE
FROM
(
SELECT DEVICE_ID AS DEVICE_ID_TEST
, DEVICE_DESC AS DEVICE_DESC_TEST
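, CURR_RETAIL_PRC AS CURR_RETAIL_PRC_TEST
-- ... remaining test columns aliased as in SQL_select.sql below (preview truncated) ...
FROM TABLE_NAME
) A
CROSS JOIN -- join type assumed; a cross join pairs every test part with every control part
(
-- the control-side derived table is not shown in the preview; it presumably mirrors
-- the test selection with _CTRL aliases
SELECT DEVICE_ID AS DEVICE_ID_CTRL
, CURR_RETAIL_PRC AS CURR_RETAIL_PRC_CTRL
-- ... remaining control columns ...
FROM TABLE_NAME
) B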
@srishtis
srishtis / SQL_select.sql
Last active November 7, 2018 06:27
Select test parts from the base table.
SELECT DEVICE_ID
, DEVICE_DESC
, CURR_RETAIL_PRC AS CURR_RETAIL_PRC_TEST
, DEVICE_RAM AS DEVICE_RAM_TEST
, AGE AS AGE_TEST
, DEVICE_STORAGE AS DEVICE_STORAGE_TEST
, SCREEN_INCH AS SCREEN_INCH_TEST
, DEVICE_OS AS DEVICE_OS_TEST
, USAGE AS USAGE_TEST
FROM TABLE_NAME
@srishtis
srishtis / hpp_normalize.py
Created October 9, 2018 14:18
Normalize hpp dataset
# normalize the dataset (note: sklearn's normalize scales each sample/row to unit norm,
# rather than mapping each variable to the 0-1 range)
from sklearn import preprocessing
normalized_total_df = preprocessing.normalize(total_df)
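If column-wise 0-to-1 scaling is the actual intent, MinMaxScaler is the matching tool; a hedged sketch, assuming total_df is entirely numeric at this point:

from sklearn import preprocessing
import pandas as pd

# MinMaxScaler maps each column to [0, 1] via (x - min) / (max - min)
scaler = preprocessing.MinMaxScaler()
normalized_total_df = pd.DataFrame(scaler.fit_transform(total_df), columns=total_df.columns)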
@srishtis
srishtis / hpp_ohe_le.py
Created October 9, 2018 14:12
One-hot encoding and label encoding features in Kaggle HPP
# create a list of ordinal variables
ordinal_variables = ['HeatingQC','KitchenQual','FireplaceQu','GarageQual','PoolQC',
                     'ExterQual','BsmtQual','Fence','BsmtCond','GarageCond',
                     'ExterCond','OverallCond','OverallQual','TotalHomeQual']
# label-encode each ordinal column
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for c in ordinal_variables:
    le.fit(total_df[c])
    total_df[c] = le.transform(total_df[c])
# create a list of categorical columns for one hot encoding
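The preview stops at the one-hot step; a minimal completion sketch, with the column list assumed since the actual one is cut off:

import pandas as pd

# hypothetical nominal columns -- the real list is truncated in the gist
categorical_variables = ['MSZoning', 'Neighborhood', 'HouseStyle']
# one-hot encode the nominal columns into 0/1 indicator features
total_df = pd.get_dummies(total_df, columns=categorical_variables)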
@srishtis
srishtis / hpp_feature_engg_1.py
Created October 9, 2018 13:51
Creating features for hpp Kaggle
# Let us start with the variables having the highest correlation with the target variable:
# OverallQual, GrLivArea, GarageCars and TotalBsmtSF
# Since OverallQual is among the variables most correlated with the response,
# we can create a quadratic term that might enter the regression equation
import numpy as np

total_df["OverallQual_2"] = total_df["OverallQual"].astype(int) ** 2
# also creating a cubic term
total_df["OverallQual_3"] = total_df["OverallQual"].astype(int) ** 3
# and a square-root transformation
total_df["OverallQual_sqrt"] = np.sqrt(total_df["OverallQual"].astype(int))
@srishtis
srishtis / hpp_missing_values_3.py
Created October 9, 2018 11:45
HPP missing value treatment part 3
# function to min-max scale a column to the 0-1 range
def norm_minmax(col):
    return (col - col.min()) / (col.max() - col.min())

# By business definition, LotFrontage is the linear feet of street connected to the house property
# Intuitively it should be highly correlated with variables like LotArea
# It should also depend on LotShape and LotConfig
# Let us build a simple linear regressor to impute the missing values
# convert categoricals to dummies
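The preview ends at the dummy-conversion step; a hedged sketch of the imputation the comments describe, with the predictor set assumed:

import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical predictor set -- the actual columns are cut off in the gist
predictors = pd.get_dummies(total_df[['LotArea', 'LotShape', 'LotConfig']])
known = total_df['LotFrontage'].notnull()

# fit on rows where LotFrontage is known, then predict the missing rows
reg = LinearRegression().fit(predictors[known], total_df.loc[known, 'LotFrontage'])
total_df.loc[~known, 'LotFrontage'] = reg.predict(predictors[~known])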
@srishtis
srishtis / hpp_missing_values_2.py
Created October 9, 2018 11:32
hpp imputing missing values part 2
# Let us first focus on the columns with the lower null percentages (except LotFrontage)
# Let us see the distribution of data across these fields
# first up: Utilities
total_df.groupby(['Utilities']).size() # only one 'NoSeWa' value and 2 nulls
train_df.groupby(['Utilities']).size() # the train data contains the 'NoSeWa', i.e. the test data has none
# the 2 null values come from the test data
## intuitively this column will not play a significant role in our model prediction
# for now let us populate the nulls with the most frequent value 'AllPub' -- we can drop the column later
total_df['Utilities'] = total_df['Utilities'].fillna(total_df['Utilities'].mode()[0])
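If the column is indeed dropped later as suggested, a one-liner suffices:

# drop the near-constant Utilities column before modelling (optional, per the note above)
total_df = total_df.drop(columns=['Utilities'])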
@srishtis
srishtis / hpp_missing_values_1.py
Created October 9, 2018 11:25
Imputing missing values part 1
# columns where attributes like Pool, Fence etc. are marked NaN indicate the absence of these features
attributes_with_na = ['PoolQC','MiscFeature','Alley','Fence','MasVnrType','FireplaceQu',
                      'GarageQual','GarageCond','GarageFinish','GarageType',
                      'BsmtExposure','BsmtCond','BsmtQual','BsmtFinType1','BsmtFinType2']
# replace NaN with 'None' in these columns
for col in attributes_with_na:
    total_df[col].fillna('None', inplace=True)
# an NA in the masonry veneer columns indicates no masonry veneer, so MasVnrArea is replaced with 0
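The fill implied by that last comment, as a short sketch:

# no masonry veneer -> zero veneer area
total_df['MasVnrArea'] = total_df['MasVnrArea'].fillna(0)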