Welcome to the fourth week of our course! This time we reach the most important topic: linear models. Most probably, once you've prepared your data and are ready to train a model, you'll first try either linear or logistic regression, depending on your task (regression or classification).

This week's material covers both the theory of linear models and practical aspects of using them in real-world tasks. There is a lot of math today, and we don't even try to render all the formulas on Medium, so we share our Jupyter notebooks instead. In the assignment, you'll beat two simple benchmarks in a Kaggle competition whose task is to identify a user based on the sequence of websites visited in a session.

Outline

Kaggle Inclass competition "Catch Me If You Can"

We will be solving an intruder detection problem by analyzing user behavior on the Internet. It is a complicated and interesting problem that combines data analysis and behavioral psychology. As an illustration of one such task, Yandex solves a mailbox intruder detection problem based on the user's behavior patterns. In a nutshell, an intruder's behavior patterns may differ from the owner's:

  • the intruder might not delete emails right after reading them, as the mailbox owner might do
  • the intruder might mark emails and even move the cursor differently
  • etc.

So the intruder can be detected and kicked out of the mailbox, and the owner can be asked to authenticate via an SMS code.

Similar approaches are being developed in Google Analytics and are described in research papers. You can find more on this topic by searching for "Traversal Pattern Mining" and "Sequential Pattern Mining".

In this competition we are going to solve a similar problem: our algorithm must analyze the sequence of websites visited consecutively by a particular person and predict whether this person is a certain user named Alice or an intruder (somebody else). As a metric we will use ROC AUC.
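
If ROC AUC is new to you, here is a minimal toy example (the labels and scores are made up) of computing it with sklearn's roc_auc_score, the same function we will use for validation later:

from sklearn.metrics import roc_auc_score

# Toy example: true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# ROC AUC equals the probability that a randomly chosen positive example
# receives a higher score than a randomly chosen negative one
print(roc_auc_score(y_true, y_scores))  # 0.75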

Data Downloading and Transformation

Register on Kaggle, if you have not done it before. Go to the competition page and download the data.

First, read the training and test sets. Then explore the data and perform a couple of simple exercises:

%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

import pickle
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse import hstack
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
# Read the training and test data sets
train_df = pd.read_csv('../../data/websites_train_sessions.csv',
                       index_col='session_id')
test_df = pd.read_csv('../../data/websites_test_sessions.csv',
                      index_col='session_id')

# Switch time1, ..., time10 columns to datetime type
times = ['time%s' % i for i in range(1, 11)]
train_df[times] = train_df[times].apply(pd.to_datetime)
test_df[times] = test_df[times].apply(pd.to_datetime)

# Sort the data by time
train_df = train_df.sort_values(by='time1')

# Look at the first rows of the training set
train_df.head()
site1 time1 site2 time2 site3 time3 site4 time4 site5 time5 ... time6 site7 time7 site8 time8 site9 time9 site10 time10 target
session_id
21669 56 2013-01-12 08:05:57 55.0 2013-01-12 08:05:57 NaN NaT NaN NaT NaN NaT ... NaT NaN NaT NaN NaT NaN NaT NaN NaT 0
54843 56 2013-01-12 08:37:23 55.0 2013-01-12 08:37:23 56.0 2013-01-12 09:07:07 55.0 2013-01-12 09:07:09 NaN NaT ... NaT NaN NaT NaN NaT NaN NaT NaN NaT 0
77292 946 2013-01-12 08:50:13 946.0 2013-01-12 08:50:14 951.0 2013-01-12 08:50:15 946.0 2013-01-12 08:50:15 946.0 2013-01-12 08:50:16 ... 2013-01-12 08:50:16 948.0 2013-01-12 08:50:16 784.0 2013-01-12 08:50:16 949.0 2013-01-12 08:50:17 946.0 2013-01-12 08:50:17 0
114021 945 2013-01-12 08:50:17 948.0 2013-01-12 08:50:17 949.0 2013-01-12 08:50:18 948.0 2013-01-12 08:50:18 945.0 2013-01-12 08:50:18 ... 2013-01-12 08:50:18 947.0 2013-01-12 08:50:19 945.0 2013-01-12 08:50:19 946.0 2013-01-12 08:50:19 946.0 2013-01-12 08:50:20 0
146670 947 2013-01-12 08:50:20 950.0 2013-01-12 08:50:20 948.0 2013-01-12 08:50:20 947.0 2013-01-12 08:50:21 950.0 2013-01-12 08:50:21 ... 2013-01-12 08:50:21 946.0 2013-01-12 08:50:21 951.0 2013-01-12 08:50:22 946.0 2013-01-12 08:50:22 947.0 2013-01-12 08:50:22 0

5 rows × 21 columns

The training data set contains the following features:

  • site1 – id of the first visited website in the session
  • time1 – visiting time for the first website in the session
  • ...
  • site10 – id of the tenth visited website in the session
  • time10 – visiting time for the tenth website in the session
  • target – target variable; equals 1 for Alice's sessions and 0 for other users' sessions

User sessions are selected so that they last no longer than half an hour and contain no more than ten websites; i.e., a session is considered over either once a user has visited ten websites or once the session has lasted more than thirty minutes.
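
The competition data already comes split into such sessions, but for intuition, here is a minimal sketch of how sessions like these could be cut from a raw click log of a single user (the visits dataframe and its columns are hypothetical, and the exact rule used to prepare the competition data may differ):

import pandas as pd

# Hypothetical raw click log for a single user, sorted by timestamp
visits = pd.DataFrame({
    'site': [56, 55, 56, 55, 946],
    'timestamp': pd.to_datetime(['2013-01-12 08:05:57', '2013-01-12 08:05:57',
                                 '2013-01-12 08:37:23', '2013-01-12 08:37:23',
                                 '2013-01-12 09:10:00'])
})

sessions, current, session_start = [], [], None
for site, ts in zip(visits['site'], visits['timestamp']):
    # close the current session once it has 10 sites or has lasted over 30 minutes
    if current and (len(current) == 10 or ts - session_start > pd.Timedelta(minutes=30)):
        sessions.append(current)
        current = []
    if not current:
        session_start = ts
    current.append(site)
if current:
    sessions.append(current)

print(sessions)  # [[56, 55], [56, 55], [946]]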

There are some empty values in the table, which means that some sessions contain fewer than ten websites. Replace the empty values with 0 and change the column types to integer. Also load the website dictionary and take a look at it:

# Change site1, ..., site10 columns type to integer and fill NA-values with zeros
sites = ['site%s' % i for i in range(1, 11)]
train_df[sites] = train_df[sites].fillna(0).astype('int')
test_df[sites] = test_df[sites].fillna(0).astype('int')

# Load websites dictionary
with open(r"../../data/site_dic.pkl", "rb") as input_file:
    site_dict = pickle.load(input_file)

# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
print(u'Websites total:', sites_dict.shape[0])
sites_dict.head()
Websites total: 48371
site
25075 www.abmecatronique.com
13997 groups.live.com
42436 majeureliguefootball.wordpress.com
30911 cdt46.media.tourinsoft.eu
8104 www.hdwallpapers.eu

In order to train our first model we need to prepare the data. First, exclude the target variable from the training set. Now the training and test sets have the same number of columns, so we can concatenate them into one dataframe. This way, all transformations will be performed on the training and test sets simultaneously.

On the one hand, this guarantees that both data sets share one feature space (you don't have to worry about forgetting to transform a feature in one of them). On the other hand, processing time increases, and for enormously large data sets it may even be impossible to transform them together (sometimes transformations have to be split into several stages, separately for the training and test sets). In our case, with this particular data set, we will perform all transformations on the combined dataframe at once and, before training the model or making predictions, simply take the appropriate part of it.

# Our target variable
y_train = train_df['target']

# United dataframe of the initial data 
full_df = pd.concat([train_df.drop('target', axis=1), test_df])

# Index to split the training and test data sets
idx_split = train_df.shape[0]

For the very first basic model, we will use only the websites visited in the session (and ignore the timestamp features for now). The reasoning behind this selection is: Alice has her favorite sites, and the more often they appear in a session, the higher the probability that the session is Alice's, and vice versa.

Let us prepare the data: we will take only the features site1, site2, ..., site10 from the combined dataframe. Keep in mind that missing values have been replaced with zero. Here is what the first rows of the dataframe look like:

# Dataframe with indices of visited websites in session
full_sites = full_df[sites]
full_sites.head()
site1 site2 site3 site4 site5 site6 site7 site8 site9 site10
session_id
21669 56 55 0 0 0 0 0 0 0 0
54843 56 55 56 55 0 0 0 0 0 0
77292 946 946 951 946 946 945 948 784 949 946
114021 945 948 949 948 945 946 947 945 946 946
146670 947 950 948 947 950 952 946 951 946 947

Sessions are sequences of website indices, and data in this representation is inconvenient for linear methods. According to our hypothesis (Alice has favorite websites), we need to transform this dataframe so that each website has a corresponding feature (column) whose value equals the number of visits to that website within the session. This can be done in two lines:

# sequence of indices
sites_flatten = full_sites.values.flatten()

# and the matrix we are looking for
full_sites_sparse = csr_matrix(([1] * sites_flatten.shape[0],
                                sites_flatten,
                                range(0, sites_flatten.shape[0]  + 10, 10)))[:, 1:]

If you understand what just happened here, then you can skip the next section (and perhaps you can handle logistic regression on your own too). If not, let us figure it out.

Sparse Matrices

Let us estimate how much memory it would take to store our data from the example above. Our combined dataframe contains about 336 thousand samples with 48 thousand integer features each. It's easy to roughly calculate the required amount of memory:

$$336\text{K} \times 48\text{K} \times 8 \text{ bytes} \approx 16\text{G} \times 8 \text{ bytes} = 128 \text{ GB},$$

which is a rough estimate. Obviously, ordinary mortals don't have that much memory (strictly speaking, Python may let you create such a matrix, but it will not be easy to do anything with it). An interesting fact is that most elements of our matrix are zeros: counting the non-zero elements gives about 1.8 million, i.e. only around 0.01% of all matrix elements. A matrix in which most elements are zeros is called sparse, and the ratio of the number of zero elements to the total number of elements is called the sparsity of the matrix.
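
As a quick sanity check, here is the same back-of-the-envelope calculation in code, contrasting the dense estimate with a rough estimate of what a CSR matrix needs for about 1.8 million non-zero values (the exact layout depends on dtypes, so treat these as order-of-magnitude figures):

n_samples, n_features = 336_000, 48_000
n_nonzero = 1_866_898  # taken from full_sites_sparse.count_nonzero() below

# Dense storage: one 8-byte value per cell
dense_bytes = n_samples * n_features * 8
print('dense:  ~%.0f GB' % (dense_bytes / 1e9))   # ~129 GB, i.e. the rough 128 GB above

# CSR storage: the non-zero values plus a column index per value
# and one row pointer per row (assuming 8-byte values and 4-byte indices)
sparse_bytes = n_nonzero * 8 + n_nonzero * 4 + (n_samples + 1) * 4
print('sparse: ~%.0f MB' % (sparse_bytes / 1e6))  # ~24 MB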

To work with such matrices you can use the scipy.sparse library; check the documentation to see what types of sparse matrices exist, how to work with them, and in which cases they are most effective. You can learn how they are laid out, for example, by reading the Wikipedia article on sparse matrices. Note that a sparse matrix stores only the non-zero elements, and you can get the allocated memory size like this (the memory savings are obvious):

# How much memory does a sparse matrix occupy?
print('{0} elements * {1} bytes = {2} bytes'.format(full_sites_sparse.count_nonzero(), 8, 
                                                    full_sites_sparse.count_nonzero() * 8))
# Or just like this:
print('sparse_matrix_size = {0} bytes'.format(full_sites_sparse.data.nbytes))
1866898 elements * 8 bytes = 14935184 bytes
sparse_matrix_size = 14935184 bytes
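
It may also help to peek at how a CSR matrix stores its contents internally. Here is a tiny illustration; the three arrays printed are exactly the ones we will construct by hand in the mini example that follows:

# A small 2x3 matrix and its internal CSR representation
m = csr_matrix([[1, 0, 2],
                [0, 0, 3]])
print(m.data)     # [1 2 3]   the non-zero values
print(m.indices)  # [0 2 2]   the column index of each non-zero value
print(m.indptr)   # [0 2 3]   row i occupies data[indptr[i]:indptr[i+1]]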

Let us explore how the matrix with the websites has been formed using a mini example. Suppose we have the following table with user sessions:

id site1 site2 site3
1 1 0 0
2 1 3 1
3 2 3 4

There are 3 sessions, with no more than 3 websites in each. Users visited four different sites in total (the table cells contain numbers from 1 to 4). Let us assume that these site ids correspond to:

  1. vk.com
  2. habrahabr.ru
  3. yandex.ru
  4. ods.ai

If the user visited fewer than 3 websites during the session, the last values are zero. We want to convert the original dataframe so that each session corresponds to a row showing the number of visits to each particular site, i.e. we want to transform the previous table into the following form:

id vk.com habrahabr.ru yandex.ru ods.ai
1 1 0 0 0
2 2 0 1 0
3 0 1 1 1

To do this, use the csr_matrix((data, indices, indptr)) constructor to create a frequency table (see the examples, code, and comments at the links above to understand how it works). Here we set all the parameters explicitly for greater clarity:

# data: a list of ones whose length equals the number of elements in the initial dataframe (9).
# Summing the ones that end up in the same cell gives the frequency,
# i.e. the number of visits to a particular site within a session
data = [1] * 9

# For that, the ones have to be distributed into the right cells.
# indices: website ids, i.e. columns of the new matrix; the ones are summed up within each session (row)
indices = [1, 0, 0, 1, 3, 1, 2, 3, 4]

# indptr: indices that split the data into rows (sessions).
# For example, row 0 holds the elements between indices [0; 3) - the right boundary is not included,
# row 1 holds the elements between indices [3; 6),
# row 2 holds the elements between indices [6; 9)
indptr = [0, 3, 6, 9]

# Pass these three variables as a tuple to compose the matrix
# To display the matrix on the screen, convert it to an ordinary "dense" matrix
csr_matrix((data, indices, indptr)).todense()
matrix([[2, 1, 0, 0, 0],
        [0, 2, 0, 1, 0],
        [0, 0, 1, 1, 1]])

As you might have noticed, the resulting matrix has not four columns (the number of distinct websites) but five. A column for site id 0 has been added; it counts how many sites short of the maximum session length (three in our mini example) each session was. This column is redundant and should be removed from the matrix.
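
In the mini example, dropping this redundant column is just a matter of slicing it off, which is exactly what the [:, 1:] did in the two-line transformation above:

# Drop the column for site id 0 (the padding counter) from the mini example
csr_matrix((data, indices, indptr))[:, 1:].todense()
matrix([[1, 0, 0, 0],
        [2, 0, 1, 0],
        [0, 1, 1, 1]])

The result matches the table above with the columns vk.com, habrahabr.ru, yandex.ru, and ods.ai.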

Another benefit of sparse matrices is that there are specialized implementations of both matrix operations and machine learning algorithms for them, which can significantly speed things up thanks to the peculiarities of the data structure. This applies to logistic regression as well. Now everything is ready to build our first model.

Training the First Model

So, we have an algorithm and data for it. Let us build our first model, using the logistic regression implementation from sklearn with default parameters. We will use the first 90% of the data for training (the training set is sorted by time) and the remaining 10% for validation. Let's write a simple function that returns the quality of the model and then train our first classifier:

def get_auc_lr_valid(X, y, C=1.0, seed=17, ratio=0.9):
    # Split the data into the training and validation sets
    idx = int(round(X.shape[0] * ratio))
    # Classifier training
    lr = LogisticRegression(C=C, random_state=seed, 
                            solver='lbfgs', n_jobs=-1).fit(X[:idx, :], y[:idx])
    # Prediction for validation set
    y_pred = lr.predict_proba(X[idx:, :])[:, 1]
    # Calculate the quality
    score = roc_auc_score(y[idx:], y_pred)
    
    return score
%%time
# Select the training set from the united dataframe (where we have the answers)
X_train = full_sites_sparse[:idx_split, :]

# Calculate metric on the validation set
print(get_auc_lr_valid(X_train, y_train))
0.9198622553850315
CPU times: user 138 ms, sys: 77.1 ms, total: 216 ms
Wall time: 2.74 s

The first model achieved a ROC AUC of approximately 0.92 on the validation set. Let's take it as the first baseline and a starting point. To make predictions on the test set, **we need to train the model again on the entire training set** (until this moment our model used only part of the data for training), which will increase its generalization ability:

# Function for writing predictions to a file
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)
# Train the model on the whole training data set
# Use random_state=17 for repeatability
# Parameter C=1 by default, but here we set it explicitly
lr = LogisticRegression(C=1.0, solver='lbfgs', random_state=17).fit(X_train, y_train)

# Make a prediction for test data set
X_test = full_sites_sparse[idx_split:,:]
y_test = lr.predict_proba(X_test)[:, 1]

# Write the predictions to a file that can be submitted to the competition
write_to_submission_file(y_test, 'baseline_1.csv')

If you follow these steps and upload the result to the competition page, you should get ROC AUC = 0.91707 on the public leaderboard.

Assignment #4

In the assignment, your task will be to further improve the model through feature engineering, feature scaling, and regularization. Feature engineering is one of the most interesting parts of a Data Scientist's work, and it can often boost performance on your ML task. You'll first try adding some obvious features (the hour and day of the visit, the number of sites in a session, etc.). We encourage you to try new ideas and models throughout the course and to participate in the competition - it's fun!
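
To give a flavor of such features, here is a minimal sketch (the feature names are just illustrative) that derives the session start hour, the day of the week, and the number of sites in a session from the data prepared above; scaling these and combining them with the sparse site matrix is part of the assignment:

# A sketch of "obvious" time-based features (an illustration, not the assignment solution)
new_features = pd.DataFrame(index=full_df.index)
new_features['start_hour'] = full_df['time1'].dt.hour        # hour the session started
new_features['day_of_week'] = full_df['time1'].dt.dayofweek  # 0 = Monday, ..., 6 = Sunday
new_features['n_sites'] = (full_df[sites] != 0).sum(axis=1)  # number of sites in the session

# Dense features like these can be scaled and stacked with the sparse site matrix, e.g.:
# X_full = hstack([full_sites_sparse, StandardScaler().fit_transform(new_features)])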
