Welcome to the fourth week of our course! This time we reach the most important topic: linear models. Most probably, once you've prepared your data and are ready to train a model, you'll first try either linear or logistic regression, depending on your task (regression or classification).

This week's material covers both the theory of linear models and practical aspects of using them in real-world tasks. There is a lot of math today, and we don't even try to render all the formulas on Medium, so we share our Jupyter notebooks instead. In the assignment, you'll beat two simple benchmarks in a Kaggle competition whose task is to identify a user based on the sequence of websites visited in a session.

Outline

Kaggle Inclass competition "Catch Me If You Can"

We will be solving an intruder detection problem by analyzing user behavior on the Internet. It is a complicated and interesting problem that combines data analysis and behavioral psychology. As an illustration of one such task, Yandex solves a mailbox intruder detection problem based on the user's behavior patterns. In a nutshell, an intruder's behavior patterns may differ from the owner's:

  • the intruder might not delete emails right after reading them, as the mailbox owner might do
  • the intruder might mark emails and even move the cursor differently
  • etc.

So the intruder can be detected and kicked out of the mailbox, and the owner can be asked to authenticate via an SMS code.

Similar approaches are being developed in Google Analytics and are described in research papers. You can find more on this topic by searching for "Traversal Pattern Mining" and "Sequential Pattern Mining".

In this competition we are going to solve a similar problem: our algorithm must analyze the sequence of websites visited consecutively by a particular person and predict whether this person is a certain user named Alice or an intruder (somebody else). As a metric we will use ROC AUC.
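
If ROC AUC is new to you, here is a minimal toy example (the labels and scores are made up) of computing it with sklearn's roc_auc_score, the same function we will use for validation later:

from sklearn.metrics import roc_auc_score

# Toy example: true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# ROC AUC equals the probability that a randomly chosen positive example
# receives a higher score than a randomly chosen negative one
print(roc_auc_score(y_true, y_scores))  # 0.75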

Data Downloading and Transformation

Register on Kaggle, if you have not done it before. Go to the competition page and download the data.

First, read the training and test sets. Then explore the data and perform a couple of simple exercises:

%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

import pickle
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse import hstack
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
# Read the training and test data sets
train_df = pd.read_csv('../../data/websites_train_sessions.csv',
                       index_col='session_id')
test_df = pd.read_csv('../../data/websites_test_sessions.csv',
                      index_col='session_id')

# Switch time1, ..., time10 columns to datetime type
times = ['time%s' % i for i in range(1, 11)]
train_df[times] = train_df[times].apply(pd.to_datetime)
test_df[times] = test_df[times].apply(pd.to_datetime)

# Sort the data by time
train_df = train_df.sort_values(by='time1')

# Look at the first rows of the training set
train_df.head()
site1 time1 site2 time2 site3 time3 site4 time4 site5 time5 ... time6 site7 time7 site8 time8 site9 time9 site10 time10 target
session_id
21669 56 2013-01-12 08:05:57 55.0 2013-01-12 08:05:57 NaN NaT NaN NaT NaN NaT ... NaT NaN NaT NaN NaT NaN NaT NaN NaT 0
54843 56 2013-01-12 08:37:23 55.0 2013-01-12 08:37:23 56.0 2013-01-12 09:07:07 55.0 2013-01-12 09:07:09 NaN NaT ... NaT NaN NaT NaN NaT NaN NaT NaN NaT 0
77292 946 2013-01-12 08:50:13 946.0 2013-01-12 08:50:14 951.0 2013-01-12 08:50:15 946.0 2013-01-12 08:50:15 946.0 2013-01-12 08:50:16 ... 2013-01-12 08:50:16 948.0 2013-01-12 08:50:16 784.0 2013-01-12 08:50:16 949.0 2013-01-12 08:50:17 946.0 2013-01-12 08:50:17 0
114021 945 2013-01-12 08:50:17 948.0 2013-01-12 08:50:17 949.0 2013-01-12 08:50:18 948.0 2013-01-12 08:50:18 945.0 2013-01-12 08:50:18 ... 2013-01-12 08:50:18 947.0 2013-01-12 08:50:19 945.0 2013-01-12 08:50:19 946.0 2013-01-12 08:50:19 946.0 2013-01-12 08:50:20 0
146670 947 2013-01-12 08:50:20 950.0 2013-01-12 08:50:20 948.0 2013-01-12 08:50:20 947.0 2013-01-12 08:50:21 950.0 2013-01-12 08:50:21 ... 2013-01-12 08:50:21 946.0 2013-01-12 08:50:21 951.0 2013-01-12 08:50:22 946.0 2013-01-12 08:50:22 947.0 2013-01-12 08:50:22 0

5 rows × 21 columns

The training data set contains the following features:

  • site1 – id of the first visited website in the session
  • time1 – visiting time for the first website in the session
  • ...
  • site10 – id of the tenth visited website in the session
  • time10 – visiting time for the tenth website in the session
  • target – target variable; equals 1 for Alice's sessions and 0 for other users' sessions

User sessions are selected so that they last no longer than half an hour and contain no more than ten websites; i.e., a session is considered over either once a user has visited ten websites or once the session has lasted more than thirty minutes.
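
The competition data already comes split into such sessions, but for intuition, here is a minimal sketch of how sessions like these could be cut from a raw click log of a single user (the visits dataframe and its columns are hypothetical, and the exact rule used to prepare the competition data may differ):

import pandas as pd

# Hypothetical raw click log for a single user, sorted by timestamp
visits = pd.DataFrame({
    'site': [56, 55, 56, 55, 946],
    'timestamp': pd.to_datetime(['2013-01-12 08:05:57', '2013-01-12 08:05:57',
                                 '2013-01-12 08:37:23', '2013-01-12 08:37:23',
                                 '2013-01-12 09:10:00'])
})

sessions, current, session_start = [], [], None
for site, ts in zip(visits['site'], visits['timestamp']):
    # close the current session once it has 10 sites or has lasted over 30 minutes
    if current and (len(current) == 10 or ts - session_start > pd.Timedelta(minutes=30)):
        sessions.append(current)
        current = []
    if not current:
        session_start = ts
    current.append(site)
if current:
    sessions.append(current)

print(sessions)  # [[56, 55], [56, 55], [946]]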

There are some empty values in the table, which means that some sessions contain fewer than ten websites. Replace the empty values with 0 and change the column types to integer. Also load the website dictionary and take a look at it:

# Change site1, ..., site10 columns type to integer and fill NA-values with zeros
sites = ['site%s' % i for i in range(1, 11)]
train_df[sites] = train_df[sites].fillna(0).astype('int')
test_df[sites] = test_df[sites].fillna(0).astype('int')

# Load websites dictionary
with open(r"../../data/site_dic.pkl", "rb") as input_file:
    site_dict = pickle.load(input_file)

# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
print(u'Websites total:', sites_dict.shape[0])
sites_dict.head()
Websites total: 48371
site
25075 www.abmecatronique.com
13997 groups.live.com
42436 majeureliguefootball.wordpress.com
30911 cdt46.media.tourinsoft.eu
8104 www.hdwallpapers.eu

In order to train our first model we need to prepare the data. First, exclude the target variable from the training set. Now the training and test sets have the same number of columns, so we can concatenate them into one dataframe. This way, all transformations will be performed on the training and test sets simultaneously.

On the one hand, this guarantees that both data sets share one feature space (you don't have to worry about forgetting to transform a feature in one of them). On the other hand, processing time increases, and for enormously large data sets it may even be impossible to transform them together (sometimes transformations have to be split into several stages, separately for the training and test sets). In our case, with this particular data set, we will perform all transformations on the combined dataframe at once and, before training the model or making predictions, simply take the appropriate part of it.

# Our target variable
y_train = train_df['target']

# United dataframe of the initial data 
full_df = pd.concat([train_df.drop('target', axis=1), test_df])

# Index to split the training and test data sets
idx_split = train_df.shape[0]

For the very first basic model, we will use only the websites visited in the session (and ignore the timestamp features for now). The reasoning behind this selection is: Alice has her favorite sites, and the more often they appear in a session, the higher the probability that the session is Alice's, and vice versa.

Let us prepare the data: we will take only the features site1, site2, ..., site10 from the combined dataframe. Keep in mind that missing values have been replaced with zero. Here is what the first rows of the dataframe look like:

# Dataframe with indices of visited websites in session
full_sites = full_df[sites]
full_sites.head()
site1 site2 site3 site4 site5 site6 site7 site8 site9 site10
session_id
21669 56 55 0 0 0 0 0 0 0 0
54843 56 55 56 55 0 0 0 0 0 0
77292 946 946 951 946 946 945 948 784 949 946
114021 945 948 949 948 945 946 947 945 946 946
146670 947 950 948 947 950 952 946 951 946 947

Sessions are sequences of website indices, and data in this representation is inconvenient for linear methods. According to our hypothesis (Alice has favorite websites), we need to transform this dataframe so that each website has a corresponding feature (column) whose value equals the number of visits to that website within the session. This can be done in two lines:

# sequence of indices
sites_flatten = full_sites.values.flatten()

# and the matrix we are looking for
full_sites_sparse = csr_matrix(([1] * sites_flatten.shape[0],
                                sites_flatten,
                                range(0, sites_flatten.shape[0]  + 10, 10)))[:, 1:]

If you understand what just happened here, then you can skip the next section (and perhaps you can handle logistic regression on your own too). If not, let us figure it out.

Sparse Matrices

Let us estimate how much memory it would take to store our data from the example above. Our combined dataframe contains about 336 thousand samples with 48 thousand integer features each. It's easy to roughly calculate the required amount of memory:

$$336\text{K} \times 48\text{K} \times 8 \text{ bytes} \approx 16\text{G} \times 8 \text{ bytes} = 128 \text{ GB},$$

which is a rough estimate. Obviously, ordinary mortals don't have that much memory (strictly speaking, Python may let you create such a matrix, but it will not be easy to do anything with it). An interesting fact is that most elements of our matrix are zeros: counting the non-zero elements gives about 1.8 million, i.e. only around 0.01% of all matrix elements. A matrix in which most elements are zeros is called sparse, and the ratio of the number of zero elements to the total number of elements is called the sparsity of the matrix.
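
As a quick sanity check, here is the same back-of-the-envelope calculation in code, contrasting the dense estimate with a rough estimate of what a CSR matrix needs for about 1.8 million non-zero values (the exact layout depends on dtypes, so treat these as order-of-magnitude figures):

n_samples, n_features = 336_000, 48_000
n_nonzero = 1_866_898  # taken from full_sites_sparse.count_nonzero() below

# Dense storage: one 8-byte value per cell
dense_bytes = n_samples * n_features * 8
print('dense:  ~%.0f GB' % (dense_bytes / 1e9))   # ~129 GB, i.e. the rough 128 GB above

# CSR storage: the non-zero values plus a column index per value
# and one row pointer per row (assuming 8-byte values and 4-byte indices)
sparse_bytes = n_nonzero * 8 + n_nonzero * 4 + (n_samples + 1) * 4
print('sparse: ~%.0f MB' % (sparse_bytes / 1e6))  # ~24 MB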

To work with such matrices you can use the scipy.sparse library; check the documentation to see what types of sparse matrices exist, how to work with them, and in which cases they are most effective. You can learn how they are laid out, for example, by reading the Wikipedia article on sparse matrices. Note that a sparse matrix stores only the non-zero elements, and you can get the allocated memory size like this (the memory savings are obvious):

# How much memory does a sparse matrix occupy?
print('{0} elements * {1} bytes = {2} bytes'.format(full_sites_sparse.count_nonzero(), 8, 
                                                    full_sites_sparse.count_nonzero() * 8))
# Or just like this:
print('sparse_matrix_size = {0} bytes'.format(full_sites_sparse.data.nbytes))
1866898 elements * 8 bytes = 14935184 bytes
sparse_matrix_size = 14935184 bytes
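
It may also help to peek at how a CSR matrix stores its contents internally. Here is a tiny illustration; the three arrays printed are exactly the ones we will construct by hand in the mini example that follows:

# A small 2x3 matrix and its internal CSR representation
m = csr_matrix([[1, 0, 2],
                [0, 0, 3]])
print(m.data)     # [1 2 3]   the non-zero values
print(m.indices)  # [0 2 2]   the column index of each non-zero value
print(m.indptr)   # [0 2 3]   row i occupies data[indptr[i]:indptr[i+1]]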

Let us explore how the matrix with the websites has been formed using a mini example. Suppose we have the following table with user sessions:

id site1 site2 site3
1 1 0 0
2 1 3 1
3 2 3 4

There are 3 sessions, with no more than 3 websites in each. Users visited four different sites in total (the table cells contain numbers from 1 to 4). Let us assume that these site ids correspond to:

  1. vk.com
  2. habrahabr.ru
  3. yandex.ru
  4. ods.ai

If the user visited fewer than 3 websites during the session, the last values are zero. We want to convert the original dataframe so that each session corresponds to a row showing the number of visits to each particular site, i.e. we want to transform the previous table into the following form:

id vk.com habrahabr.ru yandex.ru ods.ai
1 1 0 0 0
2 2 0 1 0
3 0 1 1 1

To do this, use the csr_matrix((data, indices, indptr)) constructor to create a frequency table (see the examples, code, and comments at the links above to understand how it works). Here we set all the parameters explicitly for greater clarity:

# data: a list of ones whose length equals the number of elements in the initial dataframe (9).
# Summing the ones that end up in the same cell gives the frequency,
# i.e. the number of visits to a particular site within a session
data = [1] * 9

# For that, the ones have to be distributed into the right cells.
# indices: website ids, i.e. columns of the new matrix; the ones are summed up within each session (row)
indices = [1, 0, 0, 1, 3, 1, 2, 3, 4]

# indptr: indices that split the data into rows (sessions).
# For example, row 0 holds the elements between indices [0; 3) - the right boundary is not included,
# row 1 holds the elements between indices [3; 6),
# row 2 holds the elements between indices [6; 9)
indptr = [0, 3, 6, 9]

# Pass these three variables as a tuple to compose the matrix
# To display the matrix on the screen, convert it to an ordinary "dense" matrix
csr_matrix((data, indices, indptr)).todense()
matrix([[2, 1, 0, 0, 0],
        [0, 2, 0, 1, 0],
        [0, 0, 1, 1, 1]])

As you might have noticed, the resulting matrix has not four columns (the number of distinct websites) but five. A column for site id 0 has been added; it counts how many sites short of the maximum session length (three in our mini example) each session was. This column is redundant and should be removed from the matrix.
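
In the mini example, dropping this redundant column is just a matter of slicing it off, which is exactly what the [:, 1:] did in the two-line transformation above:

# Drop the column for site id 0 (the padding counter) from the mini example
csr_matrix((data, indices, indptr))[:, 1:].todense()
matrix([[1, 0, 0, 0],
        [2, 0, 1, 0],
        [0, 1, 1, 1]])

The result matches the table above with the columns vk.com, habrahabr.ru, yandex.ru, and ods.ai.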

Another benefit of sparse matrices is that there are specialized implementations of both matrix operations and machine learning algorithms for them, which can significantly speed things up thanks to the peculiarities of the data structure. This applies to logistic regression as well. Now everything is ready to build our first model.

Training the First Model

So, we have an algorithm and data for it. Let us build our first model, using the logistic regression implementation from sklearn with default parameters. We will use the first 90% of the data for training (the training set is sorted by time) and the remaining 10% for validation. Let's write a simple function that returns the quality of the model and then train our first classifier:

def get_auc_lr_valid(X, y, C=1.0, seed=17, ratio=0.9):
    # Split the data into the training and validation sets
    idx = int(round(X.shape[0] * ratio))
    # Classifier training
    lr = LogisticRegression(C=C, random_state=seed, 
                            solver='lbfgs', n_jobs=-1).fit(X[:idx, :], y[:idx])
    # Prediction for validation set
    y_pred = lr.predict_proba(X[idx:, :])[:, 1]
    # Calculate the quality
    score = roc_auc_score(y[idx:], y_pred)
    
    return score
%%time
# Select the training set from the united dataframe (where we have the answers)
X_train = full_sites_sparse[:idx_split, :]

# Calculate metric on the validation set
print(get_auc_lr_valid(X_train, y_train))
0.9198622553850315
CPU times: user 138 ms, sys: 77.1 ms, total: 216 ms
Wall time: 2.74 s

The first model achieved a ROC AUC of approximately 0.92 on the validation set. Let's take it as the first baseline and a starting point. To make predictions on the test set, **we need to train the model again on the entire training set** (until this moment our model used only part of the data for training), which will increase its generalization ability:

# Function for writing predictions to a file
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)
# Train the model on the whole training data set
# Use random_state=17 for repeatability
# Parameter C=1 by default, but here we set it explicitly
lr = LogisticRegression(C=1.0, solver='lbfgs', random_state=17).fit(X_train, y_train)

# Make a prediction for test data set
X_test = full_sites_sparse[idx_split:,:]
y_test = lr.predict_proba(X_test)[:, 1]

# Write the predictions to a file that can be submitted to the competition
write_to_submission_file(y_test, 'baseline_1.csv')

If you follow these steps and upload the result to the competition page, you should get ROC AUC = 0.91707 on the public leaderboard.

Assignment #4

In the assignment, your task will be to further improve the model through feature engineering, feature scaling, and regularization. Feature engineering is one of the most interesting parts of a Data Scientist's work, and it can often boost performance on your ML task. You'll first try adding some obvious features (the hour and day of the visit, the number of sites in a session, etc.). We encourage you to try new ideas and models throughout the course and to participate in the competition - it's fun!
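
To give a flavor of such features, here is a minimal sketch (the feature names are just illustrative) that derives the session start hour, the day of the week, and the number of sites in a session from the data prepared above; scaling these and combining them with the sparse site matrix is part of the assignment:

# A sketch of "obvious" time-based features (an illustration, not the assignment solution)
new_features = pd.DataFrame(index=full_df.index)
new_features['start_hour'] = full_df['time1'].dt.hour        # hour the session started
new_features['day_of_week'] = full_df['time1'].dt.dayofweek  # 0 = Monday, ..., 6 = Sunday
new_features['n_sites'] = (full_df[sites] != 0).sum(axis=1)  # number of sites in the session

# Dense features like these can be scaled and stacked with the sparse site matrix, e.g.:
# X_full = hstack([full_sites_sparse, StandardScaler().fit_transform(new_features)])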
