devforfu/xx_final.ipynb

## xx_final.ipynb
{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# An Unconscious Kaggler's Notebook\n\nA successful participation in a data science competition requires carry on some EDA, setting up a good cross-validation strategy, and careful feature engineering. However, if you don't have too much time and skills to work on the data in a structured way but still want to take part, what you do?\n\nRight! You encode the data, generate a bunch of additional features, apply some models and see how it comes.\n\n![img](https://imgs.xkcd.com/comics/machine_learning.png)\n\nThis notebook is an author's attempt to get some skills in the competitive Data Science, apply various Machine Learning techniques and practice with a non-trivial dataset.\n\nIt is organized in such a way that allows to keep separate paragraphs more or less independent from each other. Therefore, the data is often saved to the persistent storage and some lines of the code are repeted many times. It helps to safe a lot of time when reloading the kernel because you can easily continue work from any step.\n\n> This notebook has lots of borrowings from the public kernels and discussions. (Especially, from this one). Please let me know if you find a fragment of your work here and would like to add a proper attribution/reference. Another great source of knowledge is [How to Win a Data Science Competition](https://www.coursera.org/learn/competitive-data-science) Coursera's course. Very helpful resource for anyone who only starts participate in the competitions.\n\n[Link to the competition page](https://www.kaggle.com/c/microsoft-malware-prediction).\n"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# Imports and Global State"
  },
  {
   "metadata": {
    "trusted": true
   },
   "cell_type": "code",
   "source": "import gc\nfrom itertools import combinations, chain",
   "execution_count": 1,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true
   },
   "cell_type": "code",
   "source": "import catboost as cb\nimport category_encoders as ce\nimport feather\nimport lightgbm as lgb\nimport numpy as np\nimport pandas as pd\nfrom scipy.spatial.distance import hamming\nfrom sklearn.ensemble import BaggingClassifier, RandomForestClassifier\nfrom sklearn.externals import joblib\nfrom sklearn.linear_model import SGDClassifier, LogisticRegression\nfrom sklearn.preprocessing import LabelEncoder, PolynomialFeatures\nfrom sklearn.model_selection import train_test_split, StratifiedKFold\nfrom sklearn.metrics import roc_auc_score\nfrom tqdm import tqdm_notebook as tqdm",
   "execution_count": 2,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true
   },
   "cell_type": "code",
   "source": "from basedir import DATA, TRAIN, TEST\nfrom info import efficient_types, id_feature, target_feature\nfrom utils import Timer",
   "execution_count": 3,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true
   },
   "cell_type": "code",
   "source": "seed = 1\nnp.random.seed(seed)",
   "execution_count": 4,
   "outputs": []
  },
  {
   "metadata": {
    "heading_collapsed": true
   },
   "cell_type": "markdown",
   "source": "# Converting Into Binary Format\n\nThe original dataset is provided in the CSV format which is very inefficient in the terms of reading/writing speed. Also, this format doesn't allow to store any information about data types. Therefore, to speed up and simplify the data loading process, let's convert the dataset into binary format. \n\nAlso, it could be convenient to concatenate the training and testing datasets together, and drop the `MachineIdentifier` column. (We can use `pandas.DataFrame` index to uniquely identify the observations)."
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "def default_saver(df, out, **params):\n    cols = df.select_dtypes(include=[np.float16]).columns\n    df[cols] = df[cols].astype(np.float32)\n    outfile = f'{out}.feather'\n    df.to_feather(outfile)\n    return outfile",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "class Converter:\n    \"\"\"Reads dataset in CSV format and converts into binary file.\"\"\"\n    \n    def __init__(self, id_col, target_col, types):\n        assert id_col in types and target_col in types\n        self.id_col = id_col\n        self.target_col = target_col\n        self.types = types\n        \n    def __call__(self, *args, **kwargs):\n        self.convert(*args, **kwargs)\n        \n    def convert(self, train, test, out, saver_fn=default_saver, **params):\n        types = self.types.copy()\n        trn_df = pd.read_csv(train, usecols=types.keys(), dtype=types)\n        del types[self.target_col]\n        tst_df = pd.read_csv(test, usecols=types.keys(), dtype=types)\n        data = pd.concat([trn_df, tst_df], axis=0, sort=False)\n        data.reset_index(drop=True, inplace=True)\n        del data[self.id_col]\n        return saver_fn(data, out, **params)",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "def to_feather(train, test, out):\n    conv = Converter(id_feature, target_feature, efficient_types)\n    conv(train, test, out)",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "to_feather(TRAIN, TEST, DATA/'data')",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "heading_collapsed": true
   },
   "cell_type": "markdown",
   "source": "# Features Encoding\n\nThe next step is to encode the categorical features. The author takes approach from [this great kernel](https://www.kaggle.com/bogorodvo/lightgbm-baseline-model-using-sparse-matrix) and treats all features as categorical ones. Also, all rare values are merged into a new single category."
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "data = feather.read_dataframe(DATA/'data.feather')",
   "execution_count": 5,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "def get_categorical(df):\n    cols = df.columns.tolist()\n    if target_feature in cols:\n        cols.remove(target_feature)\n    return cols",
   "execution_count": 6,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "print('Encoding features:')\ncat_features = get_categorical(data)\ntotal = len(cat_features)\neps = int(1e-4 * len(data))\n\nfor i, feature in enumerate(cat_features, 1):\n    print(f'[{i:2d}/{total:2d}] {feature}')\n    col = data[feature].astype(str)\n    encoder = LabelEncoder().fit(col.unique())\n    data[feature] = encoder.transform(col) + 1\n    rare = {val: 0 if cnt <= eps else val\n            for val, cnt in data[feature].value_counts().items()}\n    data[feature] = data[feature].map(rare).astype('category')",
   "execution_count": 7,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": "Encoding features:\n[ 1/81] ProductName\n[ 2/81] EngineVersion\n[ 3/81] AppVersion\n[ 4/81] AvSigVersion\n[ 5/81] IsBeta\n[ 6/81] RtpStateBitfield\n[ 7/81] IsSxsPassiveMode\n[ 8/81] DefaultBrowsersIdentifier\n[ 9/81] AVProductStatesIdentifier\n[10/81] AVProductsInstalled\n[11/81] AVProductsEnabled\n[12/81] HasTpm\n[13/81] CountryIdentifier\n[14/81] CityIdentifier\n[15/81] OrganizationIdentifier\n[16/81] GeoNameIdentifier\n[17/81] LocaleEnglishNameIdentifier\n[18/81] Platform\n[19/81] Processor\n[20/81] OsVer\n[21/81] OsBuild\n[22/81] OsSuite\n[23/81] OsPlatformSubRelease\n[24/81] OsBuildLab\n[25/81] SkuEdition\n[26/81] IsProtected\n[27/81] AutoSampleOptIn\n[28/81] PuaMode\n[29/81] SMode\n[30/81] IeVerIdentifier\n[31/81] SmartScreen\n[32/81] Firewall\n[33/81] UacLuaenable\n[34/81] Census_MDC2FormFactor\n[35/81] Census_DeviceFamily\n[36/81] Census_OEMNameIdentifier\n[37/81] Census_OEMModelIdentifier\n[38/81] Census_ProcessorCoreCount\n[39/81] Census_ProcessorManufacturerIdentifier\n[40/81] Census_ProcessorModelIdentifier\n[41/81] Census_ProcessorClass\n[42/81] Census_PrimaryDiskTotalCapacity\n[43/81] Census_PrimaryDiskTypeName\n[44/81] Census_SystemVolumeTotalCapacity\n[45/81] Census_HasOpticalDiskDrive\n[46/81] Census_TotalPhysicalRAM\n[47/81] Census_ChassisTypeName\n[48/81] Census_InternalPrimaryDiagonalDisplaySizeInInches\n[49/81] Census_InternalPrimaryDisplayResolutionHorizontal\n[50/81] Census_InternalPrimaryDisplayResolutionVertical\n[51/81] Census_PowerPlatformRoleName\n[52/81] Census_InternalBatteryType\n[53/81] Census_InternalBatteryNumberOfCharges\n[54/81] Census_OSVersion\n[55/81] Census_OSArchitecture\n[56/81] Census_OSBranch\n[57/81] Census_OSBuildNumber\n[58/81] Census_OSBuildRevision\n[59/81] Census_OSEdition\n[60/81] Census_OSSkuName\n[61/81] Census_OSInstallTypeName\n[62/81] Census_OSInstallLanguageIdentifier\n[63/81] Census_OSUILocaleIdentifier\n[64/81] Census_OSWUAutoUpdateOptionsName\n[65/81] Census_IsPortableOperatingSystem\n[66/81] Census_GenuineStateName\n[67/81] Census_ActivationChannel\n[68/81] Census_IsFlightingInternal\n[69/81] Census_IsFlightsDisabled\n[70/81] Census_FlightRing\n[71/81] Census_ThresholdOptIn\n[72/81] Census_FirmwareManufacturerIdentifier\n[73/81] Census_FirmwareVersionIdentifier\n[74/81] Census_IsSecureBootEnabled\n[75/81] Census_IsWIMBootEnabled\n[76/81] Census_IsVirtualDevice\n[77/81] Census_IsTouchEnabled\n[78/81] Census_IsPenCapable\n[79/81] Census_IsAlwaysOnAlwaysConnectedCapable\n[80/81] Wdft_IsGamer\n[81/81] Wdft_RegionIdentifier\n"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "data.to_feather(DATA/'enc.feather')",
   "execution_count": 8,
   "outputs": []
  },
  {
   "metadata": {
    "heading_collapsed": true
   },
   "cell_type": "markdown",
   "source": "# Add More Features\n\nNow it is time to perform some feature engineering. There are a few straightforward methods which don't require any sophisticated preprocessing. These include:\n1. values counting\n2. using decision tree leafs indicies\n3. mean encoding"
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "data = feather.read_dataframe(DATA/'enc.feather')",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "heading_collapsed": true,
    "hidden": true
   },
   "cell_type": "markdown",
   "source": "## Counting \n\nLet's count how often a specific value is encountered, including interacations between pairs of features also. The listed features are chosen somewhat arbitrary so there should be some better choices."
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "features =[ \n    'ProductName',\n    'EngineVersion',\n    'AppVersion',\n    'AvSigVersion',\n    'Platform',\n    'Processor',\n    'OsBuildLab',\n    'SmartScreen']",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "feature_groups = [list(g) for g in chain(*[combinations(features, i) for i in range(1, 3)])]",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "cnt_df = pd.DataFrame()",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "for keys in feature_groups:\n    new_col = '_'.join(keys) + '_Freq'\n    print('Creating new feature:', new_col)\n    cnt = data.groupby(keys).size().to_frame(new_col).reset_index()\n    cnt_df[new_col] = pd.merge(data[keys], cnt, how='left', on=keys, suffixes=('', '.cnt'))[new_col]\n    cnt_df[new_col] = (cnt_df[new_col] / len(cnt_df)).astype(np.float32)",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "cnt_df.to_feather(DATA/'cnt.feather')",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "heading_collapsed": true,
    "hidden": true
   },
   "cell_type": "markdown",
   "source": "## Tree Leafs\n\nI didn't know about this method before entering the comptetion. It is about training a decision trees classifier, and use its trees structure to derive new features. Here we train 100 trees and use leaf indicies assigned to the observations as the new features."
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "X = data[~data[target_feature].isna()]\ny = X[target_feature].copy()\ndel X[target_feature]",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "trn_idx, val_idx = train_test_split(X.index, test_size=0.2, random_state=seed)",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "x_trn = X[X.index.isin(trn_idx)]\nx_val = X[X.index.isin(val_idx)]\ny_trn = y[y.index.isin(trn_idx)]\ny_val = y[y.index.isin(val_idx)]",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "trees = lgb.LGBMClassifier(n_estimators=100, colsample_bytree=0.3, learning_rate=0.05)",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "trees.fit(x_trn, y_trn,\n          eval_metric='auc',\n          eval_set=[(x_val, y_val)], \n          verbose=20,\n          early_stopping_rounds=20)",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "def chunks(data, chunk_size=10000):\n    n = len(data)\n    n_chunks = n//chunk_size + int(n % chunk_size != 0)\n    for i in range(n_chunks):\n        yield data.iloc[i*chunk_size:(i + 1)*chunk_size]",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "del data[target_feature]",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "leafs = np.row_stack([trees.predict(chunk, pred_leaf=True) for chunk in chunks(data)])",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "leafs_df = pd.DataFrame(leafs, columns=[f'Leaf_Tree{i:d}' for i in range(100)])",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "leafs_df.to_feather(DATA/'leaf.feather')",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "heading_collapsed": true,
    "hidden": true
   },
   "cell_type": "markdown",
   "source": "## Mean Encoding\n\nActually, I am not sure how to properly use this method for this competition's dataset. Or more precisely, how to extend the derived train subset features to the test set. So here we're going to  use random sampling to create test set mean encoding features."
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "X = data[~data[target_feature].isna()].copy()",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "features =[ \n    'ProductName',\n    'EngineVersion',\n    'AppVersion',\n    'AvSigVersion',\n    'Platform',\n    'Processor',\n    'OsBuildLab',\n    'SmartScreen']",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "n, rounds = len(X), 3\nglobal_mean = X[target_feature].mean()",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "for feat in features:\n    name = f'{feat}_MeanTarget'\n    X[name] = 0\n    print(f'Mean encoding feature:', name)\n    print('\\tround:', end=' ')\n    for i in range(rounds):\n        print(f'{i}..', end=' ')\n        perm = X[X.index.isin(np.random.choice(n, size=n, replace=False))]\n        cumsum = perm.groupby(feat)[target_feature].cumsum() - perm[target_feature]\n        cumcnt = perm.groupby(feat)[target_feature].cumcount()\n        X[name] += (cumsum/cumcnt).fillna(global_mean)\n    X[name] /= rounds\n    print()",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "X_test = data[~data.index.isin(X.index)].copy()",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "for feat in features:\n    name = f'{feat}_MeanTarget'\n    print('Generating test set values for the mean feature:', name)\n    train_groups = X.groupby(feat).groups\n    train_keys = list(train_groups)\n    test_keys = list(X_test.groupby(feat).groups)\n    for key in test_keys:\n        subset = X_test[feat] == key\n        if key not in train_keys or len(train_groups[key]) == 0:\n            X_test.loc[subset, name] = global_mean\n        else:\n            sample = X[name][train_groups[key]].sample(subset.sum(), replace=True)\n            X_test.loc[subset, name] = sample.values",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "data = pd.concat([X, X_test], axis=0, sort=False)",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "mean_df = data[data.columns[data.columns.str.endswith('MeanTarget')]]",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "mean_df = mean_df.astype(np.float32)",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "mean_df.to_feather('mean.feather')",
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# Model's Zoo\n\nNow it is time to train some models! Each model is trained with K-fold validation scheme. Also, the intermediate results and models are saved into persistent memory to build a stacked model at the end."
  },
  {
   "metadata": {
    "heading_collapsed": true
   },
   "cell_type": "markdown",
   "source": "## Prepare Data\n\nIt is a good idea to work with `pandas.DataFrame` objects during EDA and feature engineering process. It allows to compactly represent the categorical features. However, when the time comes to train a model, you'll probably need to convert your categories into floating-point numbers. \nAnd, it could be problem if you don't have enough memory. (The kernel was killed several times on my machine when I've tried to fit a model with data frame, and it didn't not enough memory to generate `np.float32` array).\n\nIn the next cells, the data frames with the original and derived features are converted into `numpy` arrays of appropriate type. Also note that the `np.float16` format can be too small for your data and lead to `np.inf` values due to numerical overflow."
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "enc = feather.read_dataframe(DATA/'enc.feather')\ncnt = feather.read_dataframe(DATA/'cnt.feather').astype(np.float16)\nmean = feather.read_dataframe(DATA/'mean.feather').astype(np.float16)\nleafs = feather.read_dataframe(DATA/'leaf.feather').astype('category')\ndata = pd.concat([enc, cnt, mean, leafs], axis=1)",
   "execution_count": 6,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "del enc, cnt, mean, leafs",
   "execution_count": 6,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "trn_df = data.loc[~data[target_feature].isna()]\ntst_df = data.loc[ data[target_feature].isna()]\ntrn_target = trn_df[target_feature].copy()",
   "execution_count": 7,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "del data, trn_df[target_feature], tst_df[target_feature]",
   "execution_count": 8,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "gc.collect()",
   "execution_count": 9,
   "outputs": [
    {
     "data": {
      "text/plain": "14"
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "np.save(DATA/'x_train.npy', trn_df.astype(np.float32).values)\ndel trn_df\ngc.collect()",
   "execution_count": 10,
   "outputs": [
    {
     "data": {
      "text/plain": "0"
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "np.save(DATA/'y_train.npy', trn_target.astype(np.uint8).values)\ndel trn_target\ngc.collect()",
   "execution_count": 11,
   "outputs": [
    {
     "data": {
      "text/plain": "0"
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "np.save(DATA/'x_test.npy', tst_df.astype(np.float32).values)\ndel tst_df\ngc.collect()",
   "execution_count": 12,
   "outputs": [
    {
     "data": {
      "text/plain": "1267"
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## Utils\n"
  },
  {
   "metadata": {
    "trusted": true
   },
   "cell_type": "code",
   "source": "def split(data, target, n_splits=5, seed=seed):\n    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)\n    idx = np.arange(len(data))\n    for i, (trn_idx, val_idx) in enumerate(kfold.split(idx, target), 1):\n        print(f'Running {i:d} of {kfold.get_n_splits():d} folds')\n        yield trn_idx, val_idx",
   "execution_count": 9,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true
   },
   "cell_type": "code",
   "source": "def chunks(arr, chunk_size=100000):\n    n = len(X)\n    n_chunks = n//chunk_size + int(n % chunk_size != 0)\n    for i in range(n_chunks):\n        start, end = i*chunk_size, (i + 1)*chunk_size\n        yield arr[start:end]",
   "execution_count": 10,
   "outputs": []
  },
  {
   "metadata": {
    "heading_collapsed": true
   },
   "cell_type": "markdown",
   "source": "## LightGBM\n\nThe favourite model of this competition. Very fast, high accuracy score, simple to use. What can be better?"
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "X = np.load(DATA/'x_train.npy')\ny = np.load(DATA/'y_train.npy')",
   "execution_count": 7,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "params = {\n    'colsample_bytree': 0.25,\n    'learning_rate': 0.10, # 0.05\n    'max_depth': -1,\n    'num_leaves': 500,\n    'n_estimators': 10000,\n    'objective': 'binary',\n    'random_state': seed,\n    'n_jobs': -1}",
   "execution_count": 8,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "val_lgb = np.zeros(len(X), dtype=np.float16)\nensemble = []\ntotal_time = 0\nfor trn_idx, val_idx in split(X, y):\n    with Timer() as timer:\n        model = lgb.LGBMClassifier(**params)\n        model.fit(\n            X[trn_idx], y[trn_idx],\n            eval_metric='auc',\n            eval_set=[(X[val_idx], y[val_idx])],\n            verbose=125, early_stopping_rounds=125)\n        print('Predicting validation fold...')\n        val_lgb[val_idx] = model.predict_proba(X[val_idx])[:, 1]\n    print(f'Fold time: {timer}')\n    ensemble.append(model)\n    total_time += float(timer)\nprint(f'Total amount of training time: {Timer.format_elapsed_time(total_time)}')",
   "execution_count": 9,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": "Running 1 of 5 folds\nTraining until validation scores don't improve for 125 rounds.\n[125]\tvalid_0's auc: 0.737329\tvalid_0's binary_logloss: 0.596901\n[250]\tvalid_0's auc: 0.740404\tvalid_0's binary_logloss: 0.594357\n[375]\tvalid_0's auc: 0.74095\tvalid_0's binary_logloss: 0.593876\n[500]\tvalid_0's auc: 0.741204\tvalid_0's binary_logloss: 0.593647\n[625]\tvalid_0's auc: 0.741238\tvalid_0's binary_logloss: 0.593615\nEarly stopping, best iteration is:\n[589]\tvalid_0's auc: 0.741252\tvalid_0's binary_logloss: 0.593605\nPredicting validation fold...\nFold time: 00:10:17\nRunning 2 of 5 folds\nTraining until validation scores don't improve for 125 rounds.\n[125]\tvalid_0's auc: 0.73815\tvalid_0's binary_logloss: 0.596237\n[250]\tvalid_0's auc: 0.741152\tvalid_0's binary_logloss: 0.593703\n[375]\tvalid_0's auc: 0.741617\tvalid_0's binary_logloss: 0.593294\n[500]\tvalid_0's auc: 0.741804\tvalid_0's binary_logloss: 0.593111\n[625]\tvalid_0's auc: 0.741963\tvalid_0's binary_logloss: 0.592972\n[750]\tvalid_0's auc: 0.741993\tvalid_0's binary_logloss: 0.592966\nEarly stopping, best iteration is:\n[662]\tvalid_0's auc: 0.741998\tvalid_0's binary_logloss: 0.592944\nPredicting validation fold...\nFold time: 00:10:52\nRunning 3 of 5 folds\nTraining until validation scores don't improve for 125 rounds.\n[125]\tvalid_0's auc: 0.737608\tvalid_0's binary_logloss: 0.596651\n[250]\tvalid_0's auc: 0.740687\tvalid_0's binary_logloss: 0.594073\n[375]\tvalid_0's auc: 0.741005\tvalid_0's binary_logloss: 0.593788\n[500]\tvalid_0's auc: 0.741328\tvalid_0's binary_logloss: 0.593524\n[625]\tvalid_0's auc: 0.741488\tvalid_0's binary_logloss: 0.593371\n[750]\tvalid_0's auc: 0.741556\tvalid_0's binary_logloss: 0.593314\n[875]\tvalid_0's auc: 0.741559\tvalid_0's binary_logloss: 0.593318\nEarly stopping, best iteration is:\n[763]\tvalid_0's auc: 0.741578\tvalid_0's binary_logloss: 0.593296\nPredicting validation fold...\nFold time: 00:11:41\nRunning 4 of 5 folds\nTraining until validation scores don't improve for 125 rounds.\n[125]\tvalid_0's auc: 0.73794\tvalid_0's binary_logloss: 0.596407\n[250]\tvalid_0's auc: 0.740907\tvalid_0's binary_logloss: 0.59391\n[375]\tvalid_0's auc: 0.741523\tvalid_0's binary_logloss: 0.593386\n[500]\tvalid_0's auc: 0.741727\tvalid_0's binary_logloss: 0.59319\n[625]\tvalid_0's auc: 0.741813\tvalid_0's binary_logloss: 0.593109\n[750]\tvalid_0's auc: 0.741873\tvalid_0's binary_logloss: 0.593051\n[875]\tvalid_0's auc: 0.74192\tvalid_0's binary_logloss: 0.593021\nEarly stopping, best iteration is:\n[864]\tvalid_0's auc: 0.741943\tvalid_0's binary_logloss: 0.593004\nPredicting validation fold...\nFold time: 00:12:34\nRunning 5 of 5 folds\nTraining until validation scores don't improve for 125 rounds.\n[125]\tvalid_0's auc: 0.737498\tvalid_0's binary_logloss: 0.596724\n[250]\tvalid_0's auc: 0.740422\tvalid_0's binary_logloss: 0.594257\n[375]\tvalid_0's auc: 0.741021\tvalid_0's binary_logloss: 0.593757\n[500]\tvalid_0's auc: 0.74127\tvalid_0's binary_logloss: 0.593547\n[625]\tvalid_0's auc: 0.741351\tvalid_0's binary_logloss: 0.593469\n[750]\tvalid_0's auc: 0.741372\tvalid_0's binary_logloss: 0.593465\nEarly stopping, best iteration is:\n[720]\tvalid_0's auc: 0.741387\tvalid_0's binary_logloss: 0.593445\nPredicting validation fold...\nFold time: 00:11:33\nTotal amount of training time: 00:56:59\n"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "joblib.dump(ensemble, DATA/'lgb_ensemble.pickle')",
   "execution_count": 10,
   "outputs": [
    {
     "data": {
      "text/plain": "['/home/ck/data/microsoft/lgb_ensemble.pickle']"
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "np.save(DATA/'val_lgb.npy', val_lgb)",
   "execution_count": 11,
   "outputs": []
  },
  {
   "metadata": {
    "heading_collapsed": true
   },
   "cell_type": "markdown",
   "source": "## CatBoost\n\nAnother promising trees boosting library. Have never used it before. Supports GPU computations and  can work with several GPUs with different volumes of memory."
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "X = np.load(DATA/'x_train.npy')\ny = np.load(DATA/'y_train.npy')",
   "execution_count": 11,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "params = {\n    'bagging_temperature': 1.8,\n    'l2_leaf_reg': 1,\n    'leaf_estimation_method': 'Gradient',\n    'learning_rate': 0.1,\n    'max_depth': 8,  # 5\n    'subsample': 0.6,\n    'iterations': 30000,\n    'bootstrap_type': 'Poisson',\n    'eval_metric': 'AUC',\n    'task_type': 'GPU',\n    'devices': '0:1',\n    'loss_function': 'CrossEntropy',\n    'logging_level': 'Verbose',\n    'random_seed': seed}",
   "execution_count": 12,
   "outputs": []
  },
  {
   "metadata": {
    "scrolled": true,
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "val_cb = np.zeros(len(X), dtype=np.float16)\nensemble = []\ntotal_time = 0\nfor trn_idx, val_idx in split(X, y):\n    with Timer() as timer:\n        model = cb.CatBoostClassifier(**params)\n        model.fit(\n            X[trn_idx], y[trn_idx],\n            eval_set=[(X[val_idx], y[val_idx])],\n            metric_period=250, early_stopping_rounds=250)\n        print('Predicting validation fold...')\n        val_cb[val_idx] = model.predict_proba(X[val_idx])[:, 1]\n    print(f'Fold time: {timer}')\n    ensemble.append(model)\n    total_time += float(timer)\nprint(f'Total amount of training time: {Timer.format_elapsed_time(total_time)}')",
   "execution_count": 13,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": "Running 1 of 5 folds\n0:\tlearn: 0.6851266\ttest: 0.6844047\tbest: 0.6844047 (0)\ttotal: 131ms\tremaining: 1h 5m 15s\n250:\tlearn: 0.7297741\ttest: 0.7283556\tbest: 0.7283556 (250)\ttotal: 19.5s\tremaining: 38m 37s\n500:\tlearn: 0.7355618\ttest: 0.7331157\tbest: 0.7331157 (500)\ttotal: 39.4s\tremaining: 38m 40s\n750:\tlearn: 0.7390429\ttest: 0.7354908\tbest: 0.7354908 (750)\ttotal: 59.5s\tremaining: 38m 36s\n1000:\tlearn: 0.7416191\ttest: 0.7369830\tbest: 0.7369830 (1000)\ttotal: 1m 19s\tremaining: 38m 31s\n1250:\tlearn: 0.7437746\ttest: 0.7380123\tbest: 0.7380123 (1250)\ttotal: 1m 40s\tremaining: 38m 21s\n1500:\tlearn: 0.7456559\ttest: 0.7387539\tbest: 0.7387539 (1500)\ttotal: 2m\tremaining: 38m 8s\n1750:\tlearn: 0.7474428\ttest: 0.7393755\tbest: 0.7393755 (1750)\ttotal: 2m 21s\tremaining: 37m 56s\n2000:\tlearn: 0.7490658\ttest: 0.7398450\tbest: 0.7398450 (2000)\ttotal: 2m 41s\tremaining: 37m 38s\n2250:\tlearn: 0.7506031\ttest: 0.7402359\tbest: 0.7402359 (2250)\ttotal: 3m 1s\tremaining: 37m 21s\n2500:\tlearn: 0.7520553\ttest: 0.7405825\tbest: 0.7405825 (2500)\ttotal: 3m 22s\tremaining: 37m 4s\n2750:\tlearn: 0.7534900\ttest: 0.7408695\tbest: 0.7408695 (2750)\ttotal: 3m 43s\tremaining: 36m 49s\n3000:\tlearn: 0.7548683\ttest: 0.7411410\tbest: 0.7411410 (3000)\ttotal: 4m 3s\tremaining: 36m 33s\n3250:\tlearn: 0.7562250\ttest: 0.7413552\tbest: 0.7413552 (3250)\ttotal: 4m 24s\tremaining: 36m 16s\n3500:\tlearn: 0.7575406\ttest: 0.7415373\tbest: 0.7415373 (3500)\ttotal: 4m 45s\tremaining: 35m 58s\n3750:\tlearn: 0.7588010\ttest: 0.7417109\tbest: 0.7417115 (3746)\ttotal: 5m 6s\tremaining: 35m 41s\n4000:\tlearn: 0.7600622\ttest: 0.7418751\tbest: 0.7418751 (4000)\ttotal: 5m 27s\tremaining: 35m 24s\n4250:\tlearn: 0.7612717\ttest: 0.7419896\tbest: 0.7419896 (4250)\ttotal: 5m 47s\tremaining: 35m 7s\n4500:\tlearn: 0.7624764\ttest: 0.7421108\tbest: 0.7421108 (4499)\ttotal: 6m 8s\tremaining: 34m 49s\n4750:\tlearn: 0.7636789\ttest: 0.7422251\tbest: 0.7422251 (4750)\ttotal: 6m 29s\tremaining: 34m 30s\n5000:\tlearn: 0.7648291\ttest: 0.7423050\tbest: 0.7423062 (4994)\ttotal: 6m 50s\tremaining: 34m 11s\n5250:\tlearn: 0.7659769\ttest: 0.7423753\tbest: 0.7423753 (5250)\ttotal: 7m 11s\tremaining: 33m 52s\n5500:\tlearn: 0.7671116\ttest: 0.7424427\tbest: 0.7424427 (5500)\ttotal: 7m 32s\tremaining: 33m 33s\n5750:\tlearn: 0.7682374\ttest: 0.7425076\tbest: 0.7425076 (5750)\ttotal: 7m 53s\tremaining: 33m 14s\n6000:\tlearn: 0.7693331\ttest: 0.7425613\tbest: 0.7425613 (6000)\ttotal: 8m 13s\tremaining: 32m 55s\n6250:\tlearn: 0.7704284\ttest: 0.7426014\tbest: 0.7426031 (6239)\ttotal: 8m 34s\tremaining: 32m 35s\n6500:\tlearn: 0.7715001\ttest: 0.7426521\tbest: 0.7426521 (6500)\ttotal: 8m 55s\tremaining: 32m 15s\n6750:\tlearn: 0.7725867\ttest: 0.7427104\tbest: 0.7427106 (6748)\ttotal: 9m 16s\tremaining: 31m 56s\n7000:\tlearn: 0.7736357\ttest: 0.7427511\tbest: 0.7427511 (7000)\ttotal: 9m 37s\tremaining: 31m 37s\n7250:\tlearn: 0.7747117\ttest: 0.7427766\tbest: 0.7427775 (7183)\ttotal: 9m 58s\tremaining: 31m 18s\n7500:\tlearn: 0.7757379\ttest: 0.7427909\tbest: 0.7427937 (7460)\ttotal: 10m 19s\tremaining: 30m 59s\n7750:\tlearn: 0.7767831\ttest: 0.7427952\tbest: 0.7428030 (7705)\ttotal: 10m 41s\tremaining: 30m 40s\n8000:\tlearn: 0.7778163\ttest: 0.7428082\tbest: 0.7428097 (7995)\ttotal: 11m 2s\tremaining: 30m 20s\n8250:\tlearn: 0.7788331\ttest: 0.7428332\tbest: 0.7428340 (8239)\ttotal: 11m 23s\tremaining: 30m 1s\n8500:\tlearn: 0.7798406\ttest: 0.7428669\tbest: 0.7428669 (8500)\ttotal: 11m 44s\tremaining: 29m 41s\n8750:\tlearn: 0.7808563\ttest: 0.7428851\tbest: 0.7428904 (8699)\ttotal: 12m 5s\tremaining: 29m 21s\n9000:\tlearn: 0.7818410\ttest: 0.7428959\tbest: 0.7429007 (8958)\ttotal: 12m 26s\tremaining: 29m 1s\n9250:\tlearn: 0.7828320\ttest: 0.7428994\tbest: 0.7429056 (9032)\ttotal: 12m 47s\tremaining: 28m 41s\n9500:\tlearn: 0.7837998\ttest: 0.7429189\tbest: 0.7429227 (9415)\ttotal: 13m 8s\tremaining: 28m 21s\n9750:\tlearn: 0.7847793\ttest: 0.7429212\tbest: 0.7429241 (9735)\ttotal: 13m 29s\tremaining: 28m 1s\nbestTest = 0.7429240942\nbestIteration = 9735\nShrink model to first 9736 iterations.\nPredicting validation fold...\nFold time: 00:15:21\nRunning 2 of 5 folds\n0:\tlearn: 0.6848314\ttest: 0.6855975\tbest: 0.6855975 (0)\ttotal: 124ms\tremaining: 1h 1m 49s\n250:\tlearn: 0.7296661\ttest: 0.7293133\tbest: 0.7293133 (250)\ttotal: 19.4s\tremaining: 38m 23s\n500:\tlearn: 0.7353851\ttest: 0.7340461\tbest: 0.7340461 (500)\ttotal: 39s\tremaining: 38m 16s\n750:\tlearn: 0.7389217\ttest: 0.7364740\tbest: 0.7364740 (750)\ttotal: 59.2s\tremaining: 38m 24s\n1000:\tlearn: 0.7414839\ttest: 0.7378700\tbest: 0.7378700 (1000)\ttotal: 1m 19s\tremaining: 38m 20s\n1250:\tlearn: 0.7436312\ttest: 0.7388944\tbest: 0.7388944 (1250)\ttotal: 1m 39s\tremaining: 38m 9s\n1500:\tlearn: 0.7455142\ttest: 0.7396414\tbest: 0.7396414 (1500)\ttotal: 2m\tremaining: 37m 58s\n1750:\tlearn: 0.7472605\ttest: 0.7402188\tbest: 0.7402188 (1750)\ttotal: 2m 20s\tremaining: 37m 46s\n2000:\tlearn: 0.7488767\ttest: 0.7406901\tbest: 0.7406901 (2000)\ttotal: 2m 40s\tremaining: 37m 32s\n2250:\tlearn: 0.7504321\ttest: 0.7410704\tbest: 0.7410704 (2250)\ttotal: 3m 1s\tremaining: 37m 18s\n2500:\tlearn: 0.7519293\ttest: 0.7414368\tbest: 0.7414372 (2499)\ttotal: 3m 22s\tremaining: 37m 3s\n2750:\tlearn: 0.7533427\ttest: 0.7417213\tbest: 0.7417213 (2750)\ttotal: 3m 42s\tremaining: 36m 45s\n3000:\tlearn: 0.7546988\ttest: 0.7419631\tbest: 0.7419640 (2998)\ttotal: 4m 3s\tremaining: 36m 27s\n3250:\tlearn: 0.7560247\ttest: 0.7421651\tbest: 0.7421651 (3250)\ttotal: 4m 23s\tremaining: 36m 10s\n3500:\tlearn: 0.7573494\ttest: 0.7423383\tbest: 0.7423383 (3500)\ttotal: 4m 44s\tremaining: 35m 52s\n3750:\tlearn: 0.7586054\ttest: 0.7424997\tbest: 0.7424997 (3750)\ttotal: 5m 4s\tremaining: 35m 34s\n4000:\tlearn: 0.7598609\ttest: 0.7426500\tbest: 0.7426500 (4000)\ttotal: 5m 25s\tremaining: 35m 17s\n4250:\tlearn: 0.7610719\ttest: 0.7427893\tbest: 0.7427894 (4249)\ttotal: 5m 46s\tremaining: 34m 57s\n4500:\tlearn: 0.7622871\ttest: 0.7428899\tbest: 0.7428899 (4500)\ttotal: 6m 6s\tremaining: 34m 38s\n4750:\tlearn: 0.7634523\ttest: 0.7429807\tbest: 0.7429808 (4749)\ttotal: 6m 27s\tremaining: 34m 20s\n5000:\tlearn: 0.7646332\ttest: 0.7430608\tbest: 0.7430608 (5000)\ttotal: 6m 48s\tremaining: 34m 1s\n5250:\tlearn: 0.7657889\ttest: 0.7431650\tbest: 0.7431653 (5241)\ttotal: 7m 9s\tremaining: 33m 42s\n5500:\tlearn: 0.7669226\ttest: 0.7432202\tbest: 0.7432202 (5500)\ttotal: 7m 29s\tremaining: 33m 23s\n5750:\tlearn: 0.7680573\ttest: 0.7432972\tbest: 0.7432972 (5750)\ttotal: 7m 50s\tremaining: 33m 3s\n6000:\tlearn: 0.7691827\ttest: 0.7433477\tbest: 0.7433481 (5999)\ttotal: 8m 11s\tremaining: 32m 44s\n6250:\tlearn: 0.7702753\ttest: 0.7434064\tbest: 0.7434067 (6247)\ttotal: 8m 31s\tremaining: 32m 24s\n6500:\tlearn: 0.7713563\ttest: 0.7434359\tbest: 0.7434370 (6496)\ttotal: 8m 52s\tremaining: 32m 5s\n6750:\tlearn: 0.7724329\ttest: 0.7434914\tbest: 0.7434914 (6750)\ttotal: 9m 13s\tremaining: 31m 45s\n7000:\tlearn: 0.7735010\ttest: 0.7435163\tbest: 0.7435167 (6990)\ttotal: 9m 34s\tremaining: 31m 26s\n7250:\tlearn: 0.7745351\ttest: 0.7435403\tbest: 0.7435408 (7228)\ttotal: 9m 54s\tremaining: 31m 6s\n7500:\tlearn: 0.7755905\ttest: 0.7435873\tbest: 0.7435883 (7499)\ttotal: 10m 15s\tremaining: 30m 46s\n7750:\tlearn: 0.7766078\ttest: 0.7436039\tbest: 0.7436039 (7750)\ttotal: 10m 36s\tremaining: 30m 26s\n8000:\tlearn: 0.7776453\ttest: 0.7436253\tbest: 0.7436275 (7976)\ttotal: 10m 57s\tremaining: 30m 6s\n8250:\tlearn: 0.7786550\ttest: 0.7436531\tbest: 0.7436538 (8236)\ttotal: 11m 17s\tremaining: 29m 46s\n8500:\tlearn: 0.7796691\ttest: 0.7436568\tbest: 0.7436578 (8282)\ttotal: 11m 38s\tremaining: 29m 26s\n8750:\tlearn: 0.7806778\ttest: 0.7436817\tbest: 0.7436851 (8742)\ttotal: 11m 59s\tremaining: 29m 6s\n9000:\tlearn: 0.7816508\ttest: 0.7436990\tbest: 0.7436991 (8999)\ttotal: 12m 19s\tremaining: 28m 46s\n9250:\tlearn: 0.7826423\ttest: 0.7436992\tbest: 0.7437031 (9243)\ttotal: 12m 40s\tremaining: 28m 26s\n9500:\tlearn: 0.7836209\ttest: 0.7437122\tbest: 0.7437143 (9492)\ttotal: 13m 1s\tremaining: 28m 6s\nbestTest = 0.743714273\nbestIteration = 9492\nShrink model to first 9493 iterations.\nPredicting validation fold...\nFold time: 00:14:52\nRunning 3 of 5 folds\n0:\tlearn: 0.6850999\ttest: 0.6845070\tbest: 0.6845070 (0)\ttotal: 125ms\tremaining: 1h 2m 20s\n250:\tlearn: 0.7296294\ttest: 0.7284850\tbest: 0.7284850 (250)\ttotal: 19.6s\tremaining: 38m 43s\n500:\tlearn: 0.7354756\ttest: 0.7333815\tbest: 0.7333815 (500)\ttotal: 39.3s\tremaining: 38m 32s\n750:\tlearn: 0.7389518\ttest: 0.7357793\tbest: 0.7357793 (750)\ttotal: 59.5s\tremaining: 38m 35s\n1000:\tlearn: 0.7416003\ttest: 0.7373117\tbest: 0.7373117 (1000)\ttotal: 1m 19s\tremaining: 38m 29s\n"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": "1250:\tlearn: 0.7438034\ttest: 0.7383720\tbest: 0.7383720 (1250)\ttotal: 1m 40s\tremaining: 38m 19s\n1500:\tlearn: 0.7457413\ttest: 0.7391279\tbest: 0.7391279 (1500)\ttotal: 2m\tremaining: 38m 9s\n1750:\tlearn: 0.7474782\ttest: 0.7397085\tbest: 0.7397085 (1750)\ttotal: 2m 21s\tremaining: 37m 57s\n2000:\tlearn: 0.7490694\ttest: 0.7401717\tbest: 0.7401717 (2000)\ttotal: 2m 41s\tremaining: 37m 41s\n2250:\tlearn: 0.7506141\ttest: 0.7405697\tbest: 0.7405697 (2250)\ttotal: 3m 2s\tremaining: 37m 25s\n2500:\tlearn: 0.7521148\ttest: 0.7409177\tbest: 0.7409177 (2500)\ttotal: 3m 22s\tremaining: 37m 8s\n2750:\tlearn: 0.7535221\ttest: 0.7411814\tbest: 0.7411823 (2749)\ttotal: 3m 43s\tremaining: 36m 51s\n3000:\tlearn: 0.7548911\ttest: 0.7414254\tbest: 0.7414262 (2994)\ttotal: 4m 3s\tremaining: 36m 34s\n3250:\tlearn: 0.7562345\ttest: 0.7416208\tbest: 0.7416219 (3248)\ttotal: 4m 24s\tremaining: 36m 15s\n3500:\tlearn: 0.7575342\ttest: 0.7417935\tbest: 0.7417941 (3499)\ttotal: 4m 45s\tremaining: 35m 59s\n3750:\tlearn: 0.7587993\ttest: 0.7419619\tbest: 0.7419619 (3750)\ttotal: 5m 5s\tremaining: 35m 41s\n4000:\tlearn: 0.7600444\ttest: 0.7420942\tbest: 0.7420949 (3999)\ttotal: 5m 26s\tremaining: 35m 22s\n4250:\tlearn: 0.7612798\ttest: 0.7422073\tbest: 0.7422074 (4249)\ttotal: 5m 47s\tremaining: 35m 2s\n4500:\tlearn: 0.7624865\ttest: 0.7422993\tbest: 0.7422993 (4500)\ttotal: 6m 7s\tremaining: 34m 43s\n4750:\tlearn: 0.7636806\ttest: 0.7424092\tbest: 0.7424092 (4750)\ttotal: 6m 28s\tremaining: 34m 24s\n5000:\tlearn: 0.7648394\ttest: 0.7424819\tbest: 0.7424819 (5000)\ttotal: 6m 49s\tremaining: 34m 4s\n5250:\tlearn: 0.7659918\ttest: 0.7425707\tbest: 0.7425725 (5241)\ttotal: 7m 9s\tremaining: 33m 45s\n5500:\tlearn: 0.7671323\ttest: 0.7426252\tbest: 0.7426281 (5486)\ttotal: 7m 30s\tremaining: 33m 25s\n5750:\tlearn: 0.7682531\ttest: 0.7426993\tbest: 0.7426993 (5750)\ttotal: 7m 51s\tremaining: 33m 6s\n6000:\tlearn: 0.7693505\ttest: 0.7427369\tbest: 0.7427382 (5998)\ttotal: 8m 11s\tremaining: 32m 47s\n6250:\tlearn: 0.7704578\ttest: 0.7427803\tbest: 0.7427803 (6250)\ttotal: 8m 32s\tremaining: 32m 27s\n6500:\tlearn: 0.7715418\ttest: 0.7428173\tbest: 0.7428177 (6498)\ttotal: 8m 53s\tremaining: 32m 8s\n6750:\tlearn: 0.7726044\ttest: 0.7428693\tbest: 0.7428694 (6748)\ttotal: 9m 14s\tremaining: 31m 48s\n7000:\tlearn: 0.7736835\ttest: 0.7429143\tbest: 0.7429171 (6990)\ttotal: 9m 35s\tremaining: 31m 28s\n7250:\tlearn: 0.7747476\ttest: 0.7429455\tbest: 0.7429455 (7250)\ttotal: 9m 55s\tremaining: 31m 8s\n7500:\tlearn: 0.7757769\ttest: 0.7429571\tbest: 0.7429702 (7396)\ttotal: 10m 16s\tremaining: 30m 49s\nbestTest = 0.7429702282\nbestIteration = 7396\nShrink model to first 7397 iterations.\nPredicting validation fold...\nFold time: 00:11:58\nRunning 4 of 5 folds\n0:\tlearn: 0.6840279\ttest: 0.6837019\tbest: 0.6837019 (0)\ttotal: 124ms\tremaining: 1h 2m 14s\n250:\tlearn: 0.7295898\ttest: 0.7288651\tbest: 0.7288651 (250)\ttotal: 19.7s\tremaining: 38m 51s\n500:\tlearn: 0.7354265\ttest: 0.7336980\tbest: 0.7336980 (500)\ttotal: 39.5s\tremaining: 38m 45s\n750:\tlearn: 0.7388861\ttest: 0.7360826\tbest: 0.7360826 (750)\ttotal: 59.6s\tremaining: 38m 39s\n1000:\tlearn: 0.7415335\ttest: 0.7375904\tbest: 0.7375904 (1000)\ttotal: 1m 19s\tremaining: 38m 35s\n1250:\tlearn: 0.7436752\ttest: 0.7386098\tbest: 0.7386098 (1250)\ttotal: 1m 40s\tremaining: 38m 26s\n1500:\tlearn: 0.7456069\ttest: 0.7393535\tbest: 0.7393535 (1500)\ttotal: 2m\tremaining: 38m 10s\n1750:\tlearn: 0.7473272\ttest: 0.7399135\tbest: 0.7399135 (1750)\ttotal: 2m 21s\tremaining: 37m 56s\n2000:\tlearn: 0.7489631\ttest: 0.7403584\tbest: 0.7403584 (2000)\ttotal: 2m 41s\tremaining: 37m 41s\n2250:\tlearn: 0.7505025\ttest: 0.7407418\tbest: 0.7407418 (2250)\ttotal: 3m 2s\tremaining: 37m 27s\n2500:\tlearn: 0.7519901\ttest: 0.7410769\tbest: 0.7410769 (2500)\ttotal: 3m 23s\tremaining: 37m 12s\n2750:\tlearn: 0.7534010\ttest: 0.7413916\tbest: 0.7413928 (2749)\ttotal: 3m 43s\tremaining: 36m 55s\n3000:\tlearn: 0.7547644\ttest: 0.7416371\tbest: 0.7416371 (3000)\ttotal: 4m 4s\tremaining: 36m 37s\n3250:\tlearn: 0.7560982\ttest: 0.7418520\tbest: 0.7418520 (3250)\ttotal: 4m 25s\tremaining: 36m 20s\n3500:\tlearn: 0.7574009\ttest: 0.7420366\tbest: 0.7420366 (3499)\ttotal: 4m 45s\tremaining: 36m 2s\n3750:\tlearn: 0.7587007\ttest: 0.7422026\tbest: 0.7422026 (3750)\ttotal: 5m 6s\tremaining: 35m 43s\n4000:\tlearn: 0.7599286\ttest: 0.7423490\tbest: 0.7423490 (4000)\ttotal: 5m 26s\tremaining: 35m 24s\n4250:\tlearn: 0.7611579\ttest: 0.7424964\tbest: 0.7424964 (4248)\ttotal: 5m 47s\tremaining: 35m 4s\n4500:\tlearn: 0.7623374\ttest: 0.7426186\tbest: 0.7426186 (4493)\ttotal: 6m 8s\tremaining: 34m 45s\n4750:\tlearn: 0.7635263\ttest: 0.7427064\tbest: 0.7427072 (4744)\ttotal: 6m 28s\tremaining: 34m 26s\n5000:\tlearn: 0.7646803\ttest: 0.7427904\tbest: 0.7427904 (5000)\ttotal: 6m 49s\tremaining: 34m 7s\n5250:\tlearn: 0.7658509\ttest: 0.7428794\tbest: 0.7428794 (5250)\ttotal: 7m 10s\tremaining: 33m 48s\n5500:\tlearn: 0.7669673\ttest: 0.7429584\tbest: 0.7429588 (5497)\ttotal: 7m 31s\tremaining: 33m 29s\n5750:\tlearn: 0.7680994\ttest: 0.7430549\tbest: 0.7430549 (5750)\ttotal: 7m 52s\tremaining: 33m 10s\n6000:\tlearn: 0.7692231\ttest: 0.7431363\tbest: 0.7431367 (5998)\ttotal: 8m 12s\tremaining: 32m 50s\n6250:\tlearn: 0.7703366\ttest: 0.7431835\tbest: 0.7431835 (6250)\ttotal: 8m 33s\tremaining: 32m 31s\n6500:\tlearn: 0.7714197\ttest: 0.7432296\tbest: 0.7432296 (6500)\ttotal: 8m 54s\tremaining: 32m 12s\n6750:\tlearn: 0.7724985\ttest: 0.7432509\tbest: 0.7432511 (6747)\ttotal: 9m 15s\tremaining: 31m 52s\n7000:\tlearn: 0.7735813\ttest: 0.7432857\tbest: 0.7432857 (7000)\ttotal: 9m 36s\tremaining: 31m 32s\n7250:\tlearn: 0.7746302\ttest: 0.7433014\tbest: 0.7433034 (7245)\ttotal: 9m 57s\tremaining: 31m 13s\n7500:\tlearn: 0.7756832\ttest: 0.7433423\tbest: 0.7433453 (7476)\ttotal: 10m 17s\tremaining: 30m 53s\n7750:\tlearn: 0.7767162\ttest: 0.7433698\tbest: 0.7433729 (7733)\ttotal: 10m 38s\tremaining: 30m 33s\n8000:\tlearn: 0.7777387\ttest: 0.7433999\tbest: 0.7434005 (7983)\ttotal: 10m 59s\tremaining: 30m 12s\n8250:\tlearn: 0.7787530\ttest: 0.7434103\tbest: 0.7434158 (8179)\ttotal: 11m 20s\tremaining: 29m 52s\n8500:\tlearn: 0.7797645\ttest: 0.7434147\tbest: 0.7434197 (8308)\ttotal: 11m 40s\tremaining: 29m 32s\nbestTest = 0.7434197068\nbestIteration = 8308\nShrink model to first 8309 iterations.\nPredicting validation fold...\nFold time: 00:13:15\nRunning 5 of 5 folds\n0:\tlearn: 0.6848926\ttest: 0.6853259\tbest: 0.6853259 (0)\ttotal: 124ms\tremaining: 1h 2m 12s\n250:\tlearn: 0.7296306\ttest: 0.7283516\tbest: 0.7283516 (250)\ttotal: 19.5s\tremaining: 38m 36s\n500:\tlearn: 0.7355935\ttest: 0.7332525\tbest: 0.7332525 (500)\ttotal: 39.4s\tremaining: 38m 38s\n750:\tlearn: 0.7390787\ttest: 0.7356053\tbest: 0.7356053 (750)\ttotal: 59.5s\tremaining: 38m 38s\n1000:\tlearn: 0.7416857\ttest: 0.7370375\tbest: 0.7370375 (1000)\ttotal: 1m 19s\tremaining: 38m 37s\n1250:\tlearn: 0.7438401\ttest: 0.7380364\tbest: 0.7380364 (1250)\ttotal: 1m 40s\tremaining: 38m 27s\n1500:\tlearn: 0.7457320\ttest: 0.7387847\tbest: 0.7387847 (1500)\ttotal: 2m\tremaining: 38m 17s\n1750:\tlearn: 0.7474715\ttest: 0.7393531\tbest: 0.7393531 (1750)\ttotal: 2m 21s\tremaining: 38m 2s\n2000:\tlearn: 0.7491094\ttest: 0.7398145\tbest: 0.7398145 (2000)\ttotal: 2m 42s\tremaining: 37m 49s\n2250:\tlearn: 0.7506503\ttest: 0.7401853\tbest: 0.7401853 (2250)\ttotal: 3m 2s\tremaining: 37m 34s\n2500:\tlearn: 0.7521340\ttest: 0.7404997\tbest: 0.7404997 (2500)\ttotal: 3m 23s\tremaining: 37m 17s\n2750:\tlearn: 0.7535403\ttest: 0.7407767\tbest: 0.7407767 (2750)\ttotal: 3m 44s\tremaining: 37m 1s\n3000:\tlearn: 0.7548900\ttest: 0.7409991\tbest: 0.7409991 (3000)\ttotal: 4m 4s\tremaining: 36m 43s\n3250:\tlearn: 0.7562026\ttest: 0.7411880\tbest: 0.7411880 (3250)\ttotal: 4m 25s\tremaining: 36m 24s\n3500:\tlearn: 0.7575147\ttest: 0.7413676\tbest: 0.7413676 (3500)\ttotal: 4m 46s\tremaining: 36m 5s\n3750:\tlearn: 0.7587857\ttest: 0.7415340\tbest: 0.7415340 (3750)\ttotal: 5m 6s\tremaining: 35m 47s\n4000:\tlearn: 0.7600163\ttest: 0.7416680\tbest: 0.7416680 (4000)\ttotal: 5m 27s\tremaining: 35m 28s\n4250:\tlearn: 0.7612514\ttest: 0.7418015\tbest: 0.7418021 (4249)\ttotal: 5m 48s\tremaining: 35m 10s\n4500:\tlearn: 0.7624537\ttest: 0.7419295\tbest: 0.7419295 (4500)\ttotal: 6m 9s\tremaining: 34m 51s\n4750:\tlearn: 0.7636208\ttest: 0.7420292\tbest: 0.7420292 (4750)\ttotal: 6m 29s\tremaining: 34m 31s\n5000:\tlearn: 0.7647878\ttest: 0.7421130\tbest: 0.7421130 (5000)\ttotal: 6m 50s\tremaining: 34m 12s\n5250:\tlearn: 0.7659357\ttest: 0.7421789\tbest: 0.7421789 (5244)\ttotal: 7m 11s\tremaining: 33m 52s\n5500:\tlearn: 0.7671063\ttest: 0.7422670\tbest: 0.7422670 (5500)\ttotal: 7m 31s\tremaining: 33m 32s\n"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": "5750:\tlearn: 0.7682310\ttest: 0.7423315\tbest: 0.7423343 (5742)\ttotal: 7m 52s\tremaining: 33m 13s\n6000:\tlearn: 0.7693284\ttest: 0.7424052\tbest: 0.7424055 (5999)\ttotal: 8m 13s\tremaining: 32m 53s\n6250:\tlearn: 0.7704502\ttest: 0.7424735\tbest: 0.7424735 (6250)\ttotal: 8m 34s\tremaining: 32m 33s\n6500:\tlearn: 0.7715443\ttest: 0.7425126\tbest: 0.7425132 (6490)\ttotal: 8m 54s\tremaining: 32m 13s\n6750:\tlearn: 0.7726235\ttest: 0.7425506\tbest: 0.7425506 (6750)\ttotal: 9m 15s\tremaining: 31m 53s\n7000:\tlearn: 0.7736748\ttest: 0.7425961\tbest: 0.7425979 (6994)\ttotal: 9m 36s\tremaining: 31m 33s\n7250:\tlearn: 0.7747220\ttest: 0.7426147\tbest: 0.7426147 (7250)\ttotal: 9m 57s\tremaining: 31m 13s\n7500:\tlearn: 0.7757820\ttest: 0.7426503\tbest: 0.7426568 (7454)\ttotal: 10m 17s\tremaining: 30m 53s\n7750:\tlearn: 0.7768281\ttest: 0.7426544\tbest: 0.7426600 (7598)\ttotal: 10m 38s\tremaining: 30m 33s\n8000:\tlearn: 0.7778530\ttest: 0.7426778\tbest: 0.7426866 (7953)\ttotal: 10m 59s\tremaining: 30m 13s\n8250:\tlearn: 0.7788802\ttest: 0.7426885\tbest: 0.7426907 (8237)\ttotal: 11m 20s\tremaining: 29m 53s\n8500:\tlearn: 0.7798839\ttest: 0.7427056\tbest: 0.7427127 (8445)\ttotal: 11m 41s\tremaining: 29m 33s\nbestTest = 0.7427127361\nbestIteration = 8445\nShrink model to first 8446 iterations.\nPredicting validation fold...\nFold time: 00:13:27\nTotal amount of training time: 01:08:54\n"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "joblib.dump(ensemble, DATA/'cb_ensemble.pickle')",
   "execution_count": 14,
   "outputs": [
    {
     "data": {
      "text/plain": "['/home/ck/data/microsoft/cb_ensemble.pickle']"
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "np.save(DATA/'val_cb.npy', val_cb)",
   "execution_count": 15,
   "outputs": []
  },
  {
   "metadata": {
    "heading_collapsed": true
   },
   "cell_type": "markdown",
   "source": "## SGD \n\nThis one was added to somehow diversify the ensemble. All previous models use tree-based boosting method so probably SGD could bring some additional information. As a standalone model, it shows quite low accuracy compared to the previous solutions. But maybe it can add some value to the stacked model."
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "X = np.load(DATA/'x_train.npy')\ny = np.load(DATA/'y_train.npy')",
   "execution_count": 7,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "val_sgd = np.zeros(len(X), dtype=np.float16)\nensemble = []\ntotal_time = 0\nfor trn_idx, val_idx in split(X, y):\n    with Timer() as timer:\n        sgd = SGDClassifier(\n            loss='log', early_stopping=True, \n            tol=0.001, alpha=1./len(trn_idx),\n            fit_intercept=False, n_jobs=-1)\n        model = BaggingClassifier(\n            base_estimator=sgd, n_estimators=10, \n            bootstrap_features=True, max_features=0.5)\n        model.fit(X[trn_idx], y[trn_idx])\n        print('Predicting validation fold...', end=' ')\n        preds = model.predict_proba(X[val_idx])[:, 1]\n        val_sgd[val_idx] = preds\n        score = roc_auc_score(y[val_idx], preds)\n        print(f'AUC score: {score:2.2f}')\n    print(f'Fold time: {timer}')\n    ensemble.append(model)\n    total_time += float(timer)\nprint(f'Total amount of training time: {Timer.format_elapsed_time(total_time)}')",
   "execution_count": 8,
   "outputs": [
    {
     "output_type": "stream",
     "text": "Running 1 of 5 folds\nPredicting validation fold... AUC score: 0.58\nFold time: 00:09:42\nRunning 2 of 5 folds\nPredicting validation fold... AUC score: 0.56\nFold time: 00:08:59\nRunning 3 of 5 folds\nPredicting validation fold... AUC score: 0.59\nFold time: 00:09:57\nRunning 4 of 5 folds\nPredicting validation fold... AUC score: 0.56\nFold time: 00:09:29\nRunning 5 of 5 folds\nPredicting validation fold... AUC score: 0.60\nFold time: 00:08:54\nTotal amount of training time: 00:47:04\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "joblib.dump(ensemble, DATA/'sgd_ensemble.pickle')",
   "execution_count": 10,
   "outputs": [
    {
     "output_type": "execute_result",
     "execution_count": 10,
     "data": {
      "text/plain": "['/home/ck/data/microsoft/sgd_ensemble.pickle']"
     },
     "metadata": {}
    }
   ]
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "np.save(DATA/'val_sgd.npy', val_sgd)",
   "execution_count": 11,
   "outputs": []
  },
  {
   "metadata": {
    "heading_collapsed": true
   },
   "cell_type": "markdown",
   "source": "## Vowpal Wabbit\n\nOne more linear learner here. In the next cells, the data we have is converted into format expected by VW. Note that these files could occupy a lot of disk space in uncompressed format. Also, we don't use K-fold cross-validation here. Of course, you can generate several files, or try to play with VW options to enable this and be more consistent with the previous models."
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "X = np.load(DATA/'x_train.npy')\ny = np.load(DATA/'y_train.npy')\nX_test = np.load(DATA/'x_test.npy')",
   "execution_count": 7,
   "outputs": []
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "col_groups = 0, 81, 81+36, 81+36+8, 81+36+8+100",
   "execution_count": 8,
   "outputs": []
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "def pairs(xs):\n    for a, b in zip(xs[:-1], xs[1:]):\n        yield a, b",
   "execution_count": 9,
   "outputs": []
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "base, cnts, mean, leaf = pairs(col_groups)\ngroups = [('base', base, True), \n          ('cnts', cnts, False),\n          ('mean', mean, False), \n          ('leaf', leaf, True)]",
   "execution_count": 10,
   "outputs": []
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "def join(s): return ' '.join(s)",
   "execution_count": 11,
   "outputs": []
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "def convert_to_vw(data, targets, filename, groups, logistic=True):\n    print('Preparing file:', filename)\n    print('Number of samples in the dataset:', len(data))\n    with open(filename, 'w') as file:\n        for (i, row), target in zip(enumerate(data), targets):\n            if (i+1) % 500_000 == 0:\n                print(f'{i+1:d} samples prepared')\n            sample = []\n            for name, (start, end), categorical in groups:\n                sep = '_' if categorical else ':'\n                prefix = name[0]\n                group = [f'{prefix}{j}{sep}{int(x) if categorical else f\"{x:.4f}\"}' \n                         for j, x in enumerate(row[start:end])]\n                sample.append((name, group))\n            if logistic:\n                target = -1 if not target else 1\n            string = f\"{target} 'index={i}\"\n            for name, group in sample:\n                string += f' |{name} {join(group)}'\n            string += '\\n'\n            file.write(string)\n    print('Done!')",
   "execution_count": 16,
   "outputs": []
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "convert_to_vw(X, y, DATA/'train.vw', groups)",
   "execution_count": 17,
   "outputs": [
    {
     "output_type": "stream",
     "text": "Preparing file: /home/ck/data/microsoft/train.vw\nNumber of samples in the dataset: 8921483\n500000 samples prepared\n1000000 samples prepared\n1500000 samples prepared\n2000000 samples prepared\n2500000 samples prepared\n3000000 samples prepared\n3500000 samples prepared\n4000000 samples prepared\n4500000 samples prepared\n5000000 samples prepared\n5500000 samples prepared\n6000000 samples prepared\n6500000 samples prepared\n7000000 samples prepared\n7500000 samples prepared\n8000000 samples prepared\n8500000 samples prepared\nDone!\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "convert_to_vw(X_test, np.zeros(len(X_test)), DATA/'test.vw', groups)",
   "execution_count": 18,
   "outputs": [
    {
     "output_type": "stream",
     "text": "Preparing file: /home/ck/data/microsoft/test.vw\nNumber of samples in the dataset: 7853253\n500000 samples prepared\n1000000 samples prepared\n1500000 samples prepared\n2000000 samples prepared\n2500000 samples prepared\n3000000 samples prepared\n3500000 samples prepared\n4000000 samples prepared\n4500000 samples prepared\n5000000 samples prepared\n5500000 samples prepared\n6000000 samples prepared\n6500000 samples prepared\n7000000 samples prepared\n7500000 samples prepared\nDone!\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "train = DATA/'train.vw'\ntest = DATA/'test.vw'\nmodel = DATA/'vw.model'",
   "execution_count": 23,
   "outputs": []
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true,
    "scrolled": true
   },
   "cell_type": "code",
   "source": "!vw -d \"{train}\" --loss_function=logistic --passes=1 --l1=1e-8 -f \"{model}\" --threads -c",
   "execution_count": 11,
   "outputs": [
    {
     "output_type": "stream",
     "text": "using l1 regularization = 1e-08\nfinal_regressor = /home/ck/data/microsoft/vw.model\nNum weight bits = 18\nlearning rate = 0.5\ninitial_t = 0\npower_t = 0.5\nusing cache_file = /home/ck/data/microsoft/train.vw.cache\nignoring text input in favor of cache input\nnum sources = 1\naverage  since         example        example  current  current  current\nloss     last          counter         weight    label  predict features\n0.693147 0.693147            1            1.0  -1.0000   0.0000      226\n0.385178 0.077210            2            2.0  -1.0000  -2.5224      221\n0.779407 1.173636            4            4.0   1.0000  -2.0642      220\n0.886432 0.993456            8            8.0  -1.0000   1.6948      225\n0.844876 0.803321           16           16.0   1.0000   0.2158      225\n0.765214 0.685552           32           32.0  -1.0000   0.1218      225\n0.732735 0.700256           64           64.0  -1.0000  -0.9867      223\n0.712478 0.692220          128          128.0  -1.0000  -1.9928      224\n0.688964 0.665450          256          256.0   1.0000   1.0081      226\n0.703236 0.717509          512          512.0  -1.0000  -0.0780      223\n0.687916 0.672595         1024         1024.0  -1.0000  -1.0880      226\n0.673310 0.658704         2048         2048.0  -1.0000  -1.9166      225\n0.663663 0.654015         4096         4096.0  -1.0000  -2.0242      224\n0.652972 0.642281         8192         8192.0  -1.0000   0.1751      225\n0.646093 0.639215        16384        16384.0  -1.0000  -0.7576      226\n0.637280 0.628467        32768        32768.0  -1.0000  -0.3366      225\n0.628735 0.620189        65536        65536.0   1.0000   1.7307      225\n0.620995 0.613256       131072       131072.0  -1.0000  -0.6446      225\n0.615783 0.610570       262144       262144.0  -1.0000  -0.3538      224\n0.611487 0.607191       524288       524288.0  -1.0000  -0.5350      226\n0.608158 0.604829      1048576      1048576.0   1.0000   0.4412      225\n0.605723 0.603287      2097152      2097152.0   1.0000  -0.4425      226\n0.604186 0.602649      4194304      4194304.0  -1.0000  -1.8421      226\n0.602900 0.601615      8388608      8388608.0   1.0000   1.9680      224\n\nfinished run\nnumber of examples per pass = 8921483\npasses used = 1\nweighted example sum = 8921483.000000\nweighted label sum = -3699.000000\naverage loss = 0.602805\nbest constant = -0.000829\nbest constant's loss = 0.693147\ntotal feature number = 2009866018\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true,
    "scrolled": true
   },
   "cell_type": "code",
   "source": "!vw -d \"{train}\" -i \"{model}\" -t --loss_function=logistic --link=logistic -p \"{DATA}/train_preds.vw\"",
   "execution_count": 24,
   "outputs": [
    {
     "output_type": "stream",
     "text": "only testing\npredictions = /home/ck/data/microsoft/train_preds.vw\nNum weight bits = 18\nlearning rate = 0.5\ninitial_t = 0\npower_t = 0.5\nusing no cache\nReading datafile = /home/ck/data/microsoft/train.vw\nnum sources = 1\naverage  since         example        example  current  current  current\nloss     last          counter         weight    label  predict features\n0.810212 0.810212            1            1.0  -1.0000   0.5552      226\n0.885578 0.960945            2            2.0  -1.0000   0.6175      221\n0.630610 0.375642            4            4.0   1.0000   0.9909      220\n0.653711 0.676812            8            8.0  -1.0000   0.4736      225\n0.574857 0.496003           16           16.0   1.0000   0.9100      225\n0.613110 0.651364           32           32.0  -1.0000   0.2503      225\n0.601791 0.590472           64           64.0  -1.0000   0.3089      223\n0.619216 0.636641          128          128.0  -1.0000   0.3140      224\n0.606463 0.593710          256          256.0   1.0000   0.4752      226\n0.590612 0.574761          512          512.0  -1.0000   0.4016      223\n0.596969 0.603326         1024         1024.0  -1.0000   0.3293      226\n0.592711 0.588453         2048         2048.0  -1.0000   0.2576      225\n0.596289 0.599867         4096         4096.0  -1.0000   0.0842      224\n0.597137 0.597985         8192         8192.0  -1.0000   0.5245      225\n0.600479 0.603822        16384        16384.0  -1.0000   0.2062      226\n0.601834 0.603188        32768        32768.0  -1.0000   0.3038      225\n0.600483 0.599133        65536        65536.0   1.0000   0.8231      225\n0.599772 0.599060       131072       131072.0  -1.0000   0.4382      225\n0.600129 0.600487       262144       262144.0  -1.0000   0.4013      224\n0.600324 0.600520       524288       524288.0  -1.0000   0.3935      226\n0.600431 0.600538      1048576      1048576.0   1.0000   0.5249      225\n0.600565 0.600700      2097152      2097152.0   1.0000   0.3649      226\n0.600901 0.601236      4194304      4194304.0  -1.0000   0.1249      226\n0.600789 0.600677      8388608      8388608.0   1.0000   0.8713      224\n\nfinished run\nnumber of examples per pass = 8921483\npasses used = 1\nweighted example sum = 8921483.000000\nweighted label sum = -3699.000000\naverage loss = 0.600727\nbest constant = -0.000829\nbest constant's loss = 0.693147\ntotal feature number = 2009866018\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "!vw -d \"{test}\" -i \"{model}\" -t --loss_function=logistic --link=logistic -p \"{DATA}/test_preds.vw\"",
   "execution_count": 25,
   "outputs": [
    {
     "output_type": "stream",
     "text": "only testing\npredictions = /home/ck/data/microsoft/test_preds.vw\nNum weight bits = 18\nlearning rate = 0.5\ninitial_t = 0\npower_t = 0.5\nusing no cache\nReading datafile = /home/ck/data/microsoft/test.vw\nnum sources = 1\naverage  since         example        example  current  current  current\nloss     last          counter         weight    label  predict features\n0.719431 0.719431            1            1.0  -1.0000   0.5130      225\n0.754351 0.789272            2            2.0  -1.0000   0.5458      226\n0.596922 0.439494            4            4.0  -1.0000   0.3611      225\n0.559359 0.521797            8            8.0  -1.0000   0.2534      223\n0.528392 0.497425           16           16.0  -1.0000   0.3468      226\n0.567522 0.606651           32           32.0  -1.0000   0.5032      226\n0.562386 0.557250           64           64.0  -1.0000   0.4206      226\n0.629696 0.697005          128          128.0  -1.0000   0.5881      226\n0.655898 0.682101          256          256.0  -1.0000   0.6589      226\n0.649914 0.643930          512          512.0  -1.0000   0.4845      226\n0.647522 0.645130         1024         1024.0  -1.0000   0.4133      225\n0.663348 0.679174         2048         2048.0  -1.0000   0.5276      226\n0.670636 0.677924         4096         4096.0  -1.0000   0.2393      226\n0.670951 0.671266         8192         8192.0  -1.0000   0.4425      224\n0.674848 0.678746        16384        16384.0  -1.0000   0.7229      224\n0.670143 0.665438        32768        32768.0  -1.0000   0.5403      226\n0.668797 0.667450        65536        65536.0  -1.0000   0.6339      226\n0.668300 0.667802       131072       131072.0  -1.0000   0.7732      226\n0.669747 0.671195       262144       262144.0  -1.0000   0.5273      226\n0.669958 0.670168       524288       524288.0  -1.0000   0.5058      226\n0.670054 0.670150      1048576      1048576.0  -1.0000   0.3149      226\n0.669620 0.669186      2097152      2097152.0  -1.0000   0.6468      226\n0.669303 0.668987      4194304      4194304.0  -1.0000   0.4008      225\n\nfinished run\nnumber of examples per pass = 7853253\npasses used = 1\nweighted example sum = 7853253.000000\nweighted label sum = -7853253.000000\naverage loss = 0.669313\nbest constant = -1.000000\nbest constant's loss = 0.313262\ntotal feature number = 1768544335\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "for filename in ('train_preds.vw', 'test_preds.vw'):\n    with open(DATA/filename) as file:\n        prefix = filename.strip('.vw')\n        values = [(line.strip().split()[0]) for line in file]\n        out = DATA/f'vw_{prefix}.npy'\n        np.save(out, np.array(values, dtype=np.float16))\n        print('Saved file:', out)",
   "execution_count": 26,
   "outputs": [
    {
     "output_type": "stream",
     "text": "Saved file: /home/ck/data/microsoft/vw_train_preds.npy\nSaved file: /home/ck/data/microsoft/vw_test_preds.npy\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "preds = np.load(DATA/'vw_test_preds.npy')",
   "execution_count": 27,
   "outputs": []
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "submit = pd.read_csv(DATA/'sample_submission.csv')\nsubmit['HasDetections'] = preds\nsubmit.to_csv('vw.csv', index=None)",
   "execution_count": 28,
   "outputs": []
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "!kaggle c submit -c microsoft-malware-prediction -f \"vw.csv\" -m \"VW\"",
   "execution_count": 29,
   "outputs": [
    {
     "output_type": "stream",
     "text": "100%|████████████████████████████████████████| 297M/297M [00:38<00:00, 8.08MB/s]\nSuccessfully submitted to Microsoft Malware Prediction",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# Stack Them All\n\nHaving several trained classifiers, we can try to blend their predictions to improve the overall quality of the solution. For this purpose, we need to generate new datasets where the predictions from previous stages become features."
  },
  {
   "metadata": {
    "heading_collapsed": true
   },
   "cell_type": "markdown",
   "source": "## Concatenate the Predictions\n\nThe previous stages saved the training predictions into numpy arrays so now we only need to restore them. Also, we need to apply the trained models to the test set and build a new testing set from these predictions as well."
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "names = ['lgb', 'cb', 'sgd']",
   "execution_count": 5,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "train_cols, test_cols = [], []",
   "execution_count": 6,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "X_test = np.load(DATA/'x_test.npy')",
   "execution_count": 7,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "for name in names:\n    print('Preparing model:', name)\n    ensemble = joblib.load(DATA/f'{name}_ensemble.pickle')\n    test_result = np.zeros(len(X_test), dtype=np.float16)\n    for model in ensemble:\n        test_result += model.predict_proba(X_test)[:, 1]\n    test_result /= len(ensemble)\n    test_cols.append(test_result)\n    train_cols.append(np.load(DATA/f'val_{name}.npy'))",
   "execution_count": 9,
   "outputs": [
    {
     "output_type": "stream",
     "text": "Preparing model: lgb\nPreparing model: cb\nPreparing model: sgd\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "vw_train = np.load(DATA/'vw_train_preds.npy')\ntrain_cols.append(vw_train)",
   "execution_count": 10,
   "outputs": []
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "vw_test = np.load(DATA/'vw_test_preds.npy')\ntest_cols.append(vw_test)",
   "execution_count": 16,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "hidden": true
   },
   "cell_type": "code",
   "source": "X = np.column_stack(train_cols)\nX_test = np.column_stack(test_cols)",
   "execution_count": 18,
   "outputs": []
  },
  {
   "metadata": {
    "hidden": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "np.save(DATA/'x_train_stacked.npy', X)\nnp.save(DATA/'x_test_stacked.npy', X_test)",
   "execution_count": 20,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## Fit Model\n\nWe put a very simple classifier on top of the our meta-dataset, an instance of `LogisticRegression` class from `sklearn` with interaction features added."
  },
  {
   "metadata": {
    "trusted": true
   },
   "cell_type": "code",
   "source": "X = np.load(DATA/'x_train_stacked.npy').astype(np.float32)\nX_test = np.load(DATA/'x_test_stacked.npy').astype(np.float32)\ny = np.load(DATA/'y_train.npy')",
   "execution_count": 5,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true
   },
   "cell_type": "code",
   "source": "poly = PolynomialFeatures(interaction_only=True, include_bias=False)",
   "execution_count": 6,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true
   },
   "cell_type": "code",
   "source": "X = poly.fit_transform(X)\nX_test = poly.transform(X_test)",
   "execution_count": 7,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true
   },
   "cell_type": "code",
   "source": "test_result = np.zeros(len(X_test), dtype=np.float16)\nfor trn_idx, val_idx in split(X, y):\n    with Timer() as timer:\n        model = LogisticRegression(C=1, fit_intercept=True, penalty='l1', solver='saga')\n        model.fit(X[trn_idx], y[trn_idx])\n        print('Predicting validation fold...', end=' ')\n        preds = model.predict_proba(X[val_idx])[:, 1]\n        score = roc_auc_score(y[val_idx], preds)\n        print(f'Fold AUC score: {score:2.2f}')\n        print('Predicting testing dataset...')\n        test_result += model.predict_proba(X_test)[:, 1]\ntest_result /= 5",
   "execution_count": 11,
   "outputs": [
    {
     "output_type": "stream",
     "text": "Running 1 of 5 folds\nPredicting validation fold... Fold AUC score: 0.74\nPredicting testing dataset...\nRunning 2 of 5 folds\nPredicting validation fold... Fold AUC score: 0.74\nPredicting testing dataset...\nRunning 3 of 5 folds\nPredicting validation fold... Fold AUC score: 0.74\nPredicting testing dataset...\nRunning 4 of 5 folds\nPredicting validation fold... Fold AUC score: 0.74\nPredicting testing dataset...\nRunning 5 of 5 folds\nPredicting validation fold... Fold AUC score: 0.74\nPredicting testing dataset...\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true
   },
   "cell_type": "code",
   "source": "submit = pd.read_csv(DATA/'sample_submission.csv')\nsubmit['HasDetections'] = np.clip(test_result, 0.05, 0.95)\nsubmit.to_csv('stacked.csv', index=None)",
   "execution_count": 14,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true
   },
   "cell_type": "code",
   "source": "!kaggle c submit -c microsoft-malware-prediction -f \"stacked.csv\" -m \"Stacked classifier\"",
   "execution_count": 15,
   "outputs": [
    {
     "output_type": "stream",
     "text": "100%|████████████████████████████████████████| 297M/297M [00:22<00:00, 13.6MB/s]\nSuccessfully submitted to Microsoft Malware Prediction",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# Conculsion"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "Though the author didn't achieve an outstanding result in this competition, it was a very interesting experience. Practice always brings new aspects that are not always obvious from theory. Things like memory errors, corrupted data and floatings overflow bring additional complexity to the process. Therefore, it is essential to practice not only in getting a well-performing solution but also make it robust and computationally efficient."
  },
  {
   "metadata": {
    "trusted": true
   },
   "cell_type": "code",
   "source": "",
   "execution_count": null,
   "outputs": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3",
   "language": "python"
  },
  "language_info": {
   "name": "python",
   "version": "3.7.1",
   "mimetype": "text/x-python",
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "pygments_lexer": "ipython3",
   "nbconvert_exporter": "python",
   "file_extension": ".py"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "base_numbering": 1,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "gist_id": "400b2c37858143201a62b8f59189352d"
 },
 "nbformat": 4,
 "nbformat_minor": 2
}