@devforfu
Last active March 13, 2019 10:34
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# An Unconscious Kaggler's Notebook\n\nSuccessful participation in a data science competition requires carrying out some EDA, setting up a good cross-validation strategy, and careful feature engineering. However, if you don't have much time or skill to work on the data in a structured way but still want to take part, what do you do?\n\nRight! You encode the data, generate a bunch of additional features, apply some models, and see how it goes.\n\n![img](https://imgs.xkcd.com/comics/machine_learning.png)\n\nThis notebook is the author's attempt to build some skills in competitive data science, apply various machine learning techniques, and practice with a non-trivial dataset.\n\nIt is organized so that separate sections stay more or less independent of each other. Therefore, the data is often saved to persistent storage and some lines of code are repeated many times. This saves a lot of time when reloading the kernel because you can easily continue working from any step.\n\n> This notebook borrows a lot from public kernels and discussions. (Especially, from this one). Please let me know if you find a fragment of your work here and would like me to add a proper attribution/reference. Another great source of knowledge is the [How to Win a Data Science Competition](https://www.coursera.org/learn/competitive-data-science) course on Coursera. It is a very helpful resource for anyone who is just starting to participate in competitions.\n\n[Link to the competition page](https://www.kaggle.com/c/microsoft-malware-prediction).\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Imports and Global State"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import gc\nfrom itertools import combinations, chain",
"execution_count": 1,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import catboost as cb\nimport category_encoders as ce\nimport feather\nimport joblib  # `sklearn.externals.joblib` was removed in newer scikit-learn releases\nimport lightgbm as lgb\nimport numpy as np\nimport pandas as pd\nfrom scipy.spatial.distance import hamming\nfrom sklearn.ensemble import BaggingClassifier, RandomForestClassifier\nfrom sklearn.linear_model import SGDClassifier, LogisticRegression\nfrom sklearn.preprocessing import LabelEncoder, PolynomialFeatures\nfrom sklearn.model_selection import train_test_split, StratifiedKFold\nfrom sklearn.metrics import roc_auc_score\nfrom tqdm import tqdm_notebook as tqdm",
"execution_count": 2,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from basedir import DATA, TRAIN, TEST\nfrom info import efficient_types, id_feature, target_feature\nfrom utils import Timer",
"execution_count": 3,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "seed = 1\nnp.random.seed(seed)",
"execution_count": 4,
"outputs": []
},
{
"metadata": {
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "# Converting Into Binary Format\n\nThe original dataset is provided in CSV format, which is very inefficient in terms of reading/writing speed. Also, this format cannot store any information about data types. Therefore, to speed up and simplify the data loading process, let's convert the dataset into a binary format. \n\nAlso, it is convenient to concatenate the training and testing datasets and drop the `MachineIdentifier` column. (We can use the `pandas.DataFrame` index to uniquely identify the observations)."
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "def default_saver(df, out, **params):\n    cols = df.select_dtypes(include=[np.float16]).columns\n    df[cols] = df[cols].astype(np.float32)\n    outfile = f'{out}.feather'\n    df.to_feather(outfile)\n    return outfile",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "class Converter:\n    \"\"\"Reads dataset in CSV format and converts into binary file.\"\"\"\n\n    def __init__(self, id_col, target_col, types):\n        assert id_col in types and target_col in types\n        self.id_col = id_col\n        self.target_col = target_col\n        self.types = types\n\n    def __call__(self, *args, **kwargs):\n        # Return the output path so that callers can use it.\n        return self.convert(*args, **kwargs)\n\n    def convert(self, train, test, out, saver_fn=default_saver, **params):\n        types = self.types.copy()\n        trn_df = pd.read_csv(train, usecols=types.keys(), dtype=types)\n        # The test set has no target column.\n        del types[self.target_col]\n        tst_df = pd.read_csv(test, usecols=types.keys(), dtype=types)\n        data = pd.concat([trn_df, tst_df], axis=0, sort=False)\n        data.reset_index(drop=True, inplace=True)\n        del data[self.id_col]\n        return saver_fn(data, out, **params)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "def to_feather(train, test, out):\n    conv = Converter(id_feature, target_feature, efficient_types)\n    conv(train, test, out)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "to_feather(TRAIN, TEST, DATA/'data')",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "# Features Encoding\n\nThe next step is to encode the categorical features. The author takes the approach from [this great kernel](https://www.kaggle.com/bogorodvo/lightgbm-baseline-model-using-sparse-matrix) and treats all features as categorical ones. Also, all rare values are merged into a single new category."
},
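{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "A minimal sketch of the rare-value merging idea on a toy column. The column values and the `threshold` name are made up for illustration and are not part of the original pipeline: values that occur at most `threshold` times are mapped to the reserved code `0`, while the remaining values keep their label-encoded codes."
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "import pandas as pd\n\n# Toy column; the values and `threshold` are illustrative assumptions.\ntoy = pd.Series(['a', 'a', 'a', 'b', 'b', 'c'])\nthreshold = 1\n\ncounts = toy.value_counts()\n# Label-encode starting from 1 so that 0 stays free for the rare bucket.\ncodes = {val: i for i, val in enumerate(sorted(counts.index), 1)}\n# Values seen at most `threshold` times collapse into the reserved code 0.\nmerged = {val: 0 if counts[val] <= threshold else code for val, code in codes.items()}\nprint(toy.map(merged).astype('category').tolist())",
"execution_count": null,
"outputs": []
},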
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "data = feather.read_dataframe(DATA/'data.feather')",
"execution_count": 5,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "def get_categorical(df):\n    cols = df.columns.tolist()\n    if target_feature in cols:\n        cols.remove(target_feature)\n    return cols",
"execution_count": 6,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "print('Encoding features:')\ncat_features = get_categorical(data)\ntotal = len(cat_features)\neps = int(1e-4 * len(data))\n\nfor i, feature in enumerate(cat_features, 1):\n    print(f'[{i:2d}/{total:2d}] {feature}')\n    col = data[feature].astype(str)\n    encoder = LabelEncoder().fit(col.unique())\n    data[feature] = encoder.transform(col) + 1\n    rare = {val: 0 if cnt <= eps else val\n            for val, cnt in data[feature].value_counts().items()}\n    data[feature] = data[feature].map(rare).astype('category')",
"execution_count": 7,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Encoding features:\n[ 1/81] ProductName\n[ 2/81] EngineVersion\n[ 3/81] AppVersion\n[ 4/81] AvSigVersion\n[ 5/81] IsBeta\n[ 6/81] RtpStateBitfield\n[ 7/81] IsSxsPassiveMode\n[ 8/81] DefaultBrowsersIdentifier\n[ 9/81] AVProductStatesIdentifier\n[10/81] AVProductsInstalled\n[11/81] AVProductsEnabled\n[12/81] HasTpm\n[13/81] CountryIdentifier\n[14/81] CityIdentifier\n[15/81] OrganizationIdentifier\n[16/81] GeoNameIdentifier\n[17/81] LocaleEnglishNameIdentifier\n[18/81] Platform\n[19/81] Processor\n[20/81] OsVer\n[21/81] OsBuild\n[22/81] OsSuite\n[23/81] OsPlatformSubRelease\n[24/81] OsBuildLab\n[25/81] SkuEdition\n[26/81] IsProtected\n[27/81] AutoSampleOptIn\n[28/81] PuaMode\n[29/81] SMode\n[30/81] IeVerIdentifier\n[31/81] SmartScreen\n[32/81] Firewall\n[33/81] UacLuaenable\n[34/81] Census_MDC2FormFactor\n[35/81] Census_DeviceFamily\n[36/81] Census_OEMNameIdentifier\n[37/81] Census_OEMModelIdentifier\n[38/81] Census_ProcessorCoreCount\n[39/81] Census_ProcessorManufacturerIdentifier\n[40/81] Census_ProcessorModelIdentifier\n[41/81] Census_ProcessorClass\n[42/81] Census_PrimaryDiskTotalCapacity\n[43/81] Census_PrimaryDiskTypeName\n[44/81] Census_SystemVolumeTotalCapacity\n[45/81] Census_HasOpticalDiskDrive\n[46/81] Census_TotalPhysicalRAM\n[47/81] Census_ChassisTypeName\n[48/81] Census_InternalPrimaryDiagonalDisplaySizeInInches\n[49/81] Census_InternalPrimaryDisplayResolutionHorizontal\n[50/81] Census_InternalPrimaryDisplayResolutionVertical\n[51/81] Census_PowerPlatformRoleName\n[52/81] Census_InternalBatteryType\n[53/81] Census_InternalBatteryNumberOfCharges\n[54/81] Census_OSVersion\n[55/81] Census_OSArchitecture\n[56/81] Census_OSBranch\n[57/81] Census_OSBuildNumber\n[58/81] Census_OSBuildRevision\n[59/81] Census_OSEdition\n[60/81] Census_OSSkuName\n[61/81] Census_OSInstallTypeName\n[62/81] Census_OSInstallLanguageIdentifier\n[63/81] Census_OSUILocaleIdentifier\n[64/81] Census_OSWUAutoUpdateOptionsName\n[65/81] Census_IsPortableOperatingSystem\n[66/81] Census_GenuineStateName\n[67/81] Census_ActivationChannel\n[68/81] Census_IsFlightingInternal\n[69/81] Census_IsFlightsDisabled\n[70/81] Census_FlightRing\n[71/81] Census_ThresholdOptIn\n[72/81] Census_FirmwareManufacturerIdentifier\n[73/81] Census_FirmwareVersionIdentifier\n[74/81] Census_IsSecureBootEnabled\n[75/81] Census_IsWIMBootEnabled\n[76/81] Census_IsVirtualDevice\n[77/81] Census_IsTouchEnabled\n[78/81] Census_IsPenCapable\n[79/81] Census_IsAlwaysOnAlwaysConnectedCapable\n[80/81] Wdft_IsGamer\n[81/81] Wdft_RegionIdentifier\n"
}
]
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "data.to_feather(DATA/'enc.feather')",
"execution_count": 8,
"outputs": []
},
{
"metadata": {
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "# Add More Features\n\nNow it is time to perform some feature engineering. There are a few straightforward methods which don't require any sophisticated preprocessing. These include:\n1. value counting\n2. using decision tree leaf indices\n3. mean encoding"
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "data = feather.read_dataframe(DATA/'enc.feather')",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "## Counting \n\nLet's count how often a specific value is encountered, including interactions between pairs of features. The listed features are chosen somewhat arbitrarily, so there are probably better choices."
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "features =[ \n 'ProductName',\n 'EngineVersion',\n 'AppVersion',\n 'AvSigVersion',\n 'Platform',\n 'Processor',\n 'OsBuildLab',\n 'SmartScreen']",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "feature_groups = [list(g) for g in chain(*[combinations(features, i) for i in range(1, 3)])]",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "cnt_df = pd.DataFrame()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "for keys in feature_groups:\n    new_col = '_'.join(keys) + '_Freq'\n    print('Creating new feature:', new_col)\n    cnt = data.groupby(keys).size().to_frame(new_col).reset_index()\n    cnt_df[new_col] = pd.merge(data[keys], cnt, how='left', on=keys, suffixes=('', '.cnt'))[new_col]\n    cnt_df[new_col] = (cnt_df[new_col] / len(cnt_df)).astype(np.float32)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "cnt_df.to_feather(DATA/'cnt.feather')",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "## Tree Leafs\n\nI didn't know about this method before entering the competition. The idea is to train a decision tree ensemble and use the structure of its trees to derive new features. Here we train up to 100 trees and use the leaf indices assigned to the observations as the new features."
},
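{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "To see what the leaf indices look like, here is a small sketch on synthetic data. The dataset, its size, and the hyperparameters are arbitrary assumptions chosen only to keep the output short; they are not the settings used for the competition data."
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "import lightgbm as lgb\nfrom sklearn.datasets import make_classification\n\n# Synthetic data; sizes and hyperparameters are arbitrary illustrative choices.\nX_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=1)\nmodel = lgb.LGBMClassifier(n_estimators=3, num_leaves=4).fit(X_toy, y_toy)\n\n# With pred_leaf=True the model returns, for every row, the index of the leaf\n# the row falls into in each tree: one column per tree.\nleaf_idx = model.predict(X_toy, pred_leaf=True)\nprint(leaf_idx.shape)\nprint(leaf_idx[:5])",
"execution_count": null,
"outputs": []
},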
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "X = data[~data[target_feature].isna()]\ny = X[target_feature].copy()\ndel X[target_feature]",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "trn_idx, val_idx = train_test_split(X.index, test_size=0.2, random_state=seed)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "x_trn = X[X.index.isin(trn_idx)]\nx_val = X[X.index.isin(val_idx)]\ny_trn = y[y.index.isin(trn_idx)]\ny_val = y[y.index.isin(val_idx)]",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "trees = lgb.LGBMClassifier(n_estimators=100, colsample_bytree=0.3, learning_rate=0.05)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "trees.fit(x_trn, y_trn,\n eval_metric='auc',\n eval_set=[(x_val, y_val)], \n verbose=20,\n early_stopping_rounds=20)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "def chunks(data, chunk_size=10000):\n    n = len(data)\n    n_chunks = n//chunk_size + int(n % chunk_size != 0)\n    for i in range(n_chunks):\n        yield data.iloc[i*chunk_size:(i + 1)*chunk_size]",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "del data[target_feature]",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "leafs = np.row_stack([trees.predict(chunk, pred_leaf=True) for chunk in chunks(data)])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
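"source": "# Early stopping can leave fewer than 100 trees, so take the column count from the array.\nleafs_df = pd.DataFrame(leafs, columns=[f'Leaf_Tree{i:d}' for i in range(leafs.shape[1])])",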
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "leafs_df.to_feather(DATA/'leaf.feather')",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "## Mean Encoding\n\nActually, I am not sure how to properly use this method for this competition's dataset. Or more precisely, how to extend the derived train subset features to the test set. So here we're going to use random sampling to create test set mean encoding features."
},
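{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "To make the idea concrete, here is a small sketch of the expanding (cumulative) mean on a toy frame. The column names `cat` and `y` are hypothetical. Each row only sees the targets of the *previous* rows from the same category, which keeps the row's own target out of its feature."
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "import pandas as pd\n\n# Toy data; `cat` and `y` are hypothetical column names.\ntoy = pd.DataFrame({'cat': ['a', 'a', 'b', 'a', 'b'], 'y': [1, 0, 1, 1, 0]})\nglobal_mean = toy['y'].mean()\n\n# Sum and count of the targets seen *before* each row within its category.\ncumsum = toy.groupby('cat')['y'].cumsum() - toy['y']\ncumcnt = toy.groupby('cat').cumcount()\n# The first row of every category has no history (0/0 gives NaN),\n# so it falls back to the global mean.\ntoy['cat_mean'] = (cumsum / cumcnt).fillna(global_mean)\nprint(toy)",
"execution_count": null,
"outputs": []
},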
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
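"source": "# Reload the encoded data: the target column was deleted from `data` in the previous section.\ndata = feather.read_dataframe(DATA/'enc.feather')\nX = data[~data[target_feature].isna()].copy()",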
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "features =[ \n 'ProductName',\n 'EngineVersion',\n 'AppVersion',\n 'AvSigVersion',\n 'Platform',\n 'Processor',\n 'OsBuildLab',\n 'SmartScreen']",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "n, rounds = len(X), 3\nglobal_mean = X[target_feature].mean()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "for feat in features:\n    name = f'{feat}_MeanTarget'\n    X[name] = 0.0\n    print('Mean encoding feature:', name)\n    print('\\tround:', end=' ')\n    for i in range(rounds):\n        print(f'{i}..', end=' ')\n        # Shuffle the rows so that every round uses a different ordering;\n        # the results are aligned back to X by index.\n        perm = X.iloc[np.random.permutation(n)]\n        cumsum = perm.groupby(feat)[target_feature].cumsum() - perm[target_feature]\n        cumcnt = perm.groupby(feat)[target_feature].cumcount()\n        X[name] += (cumsum/cumcnt).fillna(global_mean)\n    X[name] /= rounds\n    print()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "X_test = data[~data.index.isin(X.index)].copy()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "for feat in features:\n    name = f'{feat}_MeanTarget'\n    print('Generating test set values for the mean feature:', name)\n    train_groups = X.groupby(feat).groups\n    train_keys = list(train_groups)\n    test_keys = list(X_test.groupby(feat).groups)\n    for key in test_keys:\n        subset = X_test[feat] == key\n        if key not in train_keys or len(train_groups[key]) == 0:\n            X_test.loc[subset, name] = global_mean\n        else:\n            sample = X[name][train_groups[key]].sample(subset.sum(), replace=True)\n            X_test.loc[subset, name] = sample.values",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "data = pd.concat([X, X_test], axis=0, sort=False)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "mean_df = data[data.columns[data.columns.str.endswith('MeanTarget')]]",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "mean_df = mean_df.astype(np.float32)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "mean_df.to_feather(DATA/'mean.feather')",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Model Zoo\n\nNow it is time to train some models! Each model is trained with a K-fold validation scheme. Also, the intermediate results and models are saved to persistent storage in order to build a stacked model at the end."
},
{
"metadata": {
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "## Prepare Data\n\nIt is a good idea to work with `pandas.DataFrame` objects during the EDA and feature engineering process, because they represent categorical features compactly. However, when the time comes to train a model, you'll probably need to convert your categories into floating-point numbers. \nThis can be a problem if you don't have enough memory. (The kernel was killed several times on my machine when I tried to fit a model with a data frame, because there was not enough memory to generate an `np.float32` array).\n\nIn the next cells, the data frames with the original and derived features are converted into `numpy` arrays of the appropriate type. Also note that the `np.float16` format can be too small for your data and lead to `np.inf` values due to numerical overflow."
},
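{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "A quick illustration of the `np.float16` overflow (the value 70000 is an arbitrary example): the largest finite `np.float16` is 65504, so anything bigger becomes `inf` after the cast. NumPy may also emit an overflow warning here."
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "import numpy as np\n\n# Arbitrary example values; only the last one is outside the float16 range.\nvalues = np.array([1.5, 65504.0, 70000.0], dtype=np.float32)\nprint(np.finfo(np.float16).max)\n\nhalf = values.astype(np.float16)\nprint(half)\n# Flags the elements that overflowed during the cast.\nprint(np.isinf(half))",
"execution_count": null,
"outputs": []
},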
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "enc = feather.read_dataframe(DATA/'enc.feather')\ncnt = feather.read_dataframe(DATA/'cnt.feather').astype(np.float16)\nmean = feather.read_dataframe(DATA/'mean.feather').astype(np.float16)\nleafs = feather.read_dataframe(DATA/'leaf.feather').astype('category')\ndata = pd.concat([enc, cnt, mean, leafs], axis=1)",
"execution_count": 6,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "del enc, cnt, mean, leafs",
"execution_count": 6,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "trn_df = data.loc[~data[target_feature].isna()]\ntst_df = data.loc[ data[target_feature].isna()]\ntrn_target = trn_df[target_feature].copy()",
"execution_count": 7,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "del data, trn_df[target_feature], tst_df[target_feature]",
"execution_count": 8,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "gc.collect()",
"execution_count": 9,
"outputs": [
{
"data": {
"text/plain": "14"
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "np.save(DATA/'x_train.npy', trn_df.astype(np.float32).values)\ndel trn_df\ngc.collect()",
"execution_count": 10,
"outputs": [
{
"data": {
"text/plain": "0"
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "np.save(DATA/'y_train.npy', trn_target.astype(np.uint8).values)\ndel trn_target\ngc.collect()",
"execution_count": 11,
"outputs": [
{
"data": {
"text/plain": "0"
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "np.save(DATA/'x_test.npy', tst_df.astype(np.float32).values)\ndel tst_df\ngc.collect()",
"execution_count": 12,
"outputs": [
{
"data": {
"text/plain": "1267"
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Utils\n"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "def split(data, target, n_splits=5, seed=seed):\n    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)\n    idx = np.arange(len(data))\n    for i, (trn_idx, val_idx) in enumerate(kfold.split(idx, target), 1):\n        print(f'Running {i:d} of {kfold.get_n_splits():d} folds')\n        yield trn_idx, val_idx",
"execution_count": 9,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "def chunks(arr, chunk_size=100000):\n    n = len(arr)  # use the argument rather than the global X\n    n_chunks = n//chunk_size + int(n % chunk_size != 0)\n    for i in range(n_chunks):\n        start, end = i*chunk_size, (i + 1)*chunk_size\n        yield arr[start:end]",
"execution_count": 10,
"outputs": []
},
{
"metadata": {
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "## LightGBM\n\nThe favourite model of this competition. It is very fast, scores well, and is simple to use. What could be better?"
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "X = np.load(DATA/'x_train.npy')\ny = np.load(DATA/'y_train.npy')",
"execution_count": 7,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "params = {\n 'colsample_bytree': 0.25,\n 'learning_rate': 0.10, # 0.05\n 'max_depth': -1,\n 'num_leaves': 500,\n 'n_estimators': 10000,\n 'objective': 'binary',\n 'random_state': seed,\n 'n_jobs': -1}",
"execution_count": 8,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "val_lgb = np.zeros(len(X), dtype=np.float16)\nensemble = []\ntotal_time = 0\nfor trn_idx, val_idx in split(X, y):\n    with Timer() as timer:\n        model = lgb.LGBMClassifier(**params)\n        model.fit(\n            X[trn_idx], y[trn_idx],\n            eval_metric='auc',\n            eval_set=[(X[val_idx], y[val_idx])],\n            verbose=125, early_stopping_rounds=125)\n        print('Predicting validation fold...')\n        val_lgb[val_idx] = model.predict_proba(X[val_idx])[:, 1]\n    print(f'Fold time: {timer}')\n    ensemble.append(model)\n    total_time += float(timer)\nprint(f'Total amount of training time: {Timer.format_elapsed_time(total_time)}')",
"execution_count": 9,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Running 1 of 5 folds\nTraining until validation scores don't improve for 125 rounds.\n[125]\tvalid_0's auc: 0.737329\tvalid_0's binary_logloss: 0.596901\n[250]\tvalid_0's auc: 0.740404\tvalid_0's binary_logloss: 0.594357\n[375]\tvalid_0's auc: 0.74095\tvalid_0's binary_logloss: 0.593876\n[500]\tvalid_0's auc: 0.741204\tvalid_0's binary_logloss: 0.593647\n[625]\tvalid_0's auc: 0.741238\tvalid_0's binary_logloss: 0.593615\nEarly stopping, best iteration is:\n[589]\tvalid_0's auc: 0.741252\tvalid_0's binary_logloss: 0.593605\nPredicting validation fold...\nFold time: 00:10:17\nRunning 2 of 5 folds\nTraining until validation scores don't improve for 125 rounds.\n[125]\tvalid_0's auc: 0.73815\tvalid_0's binary_logloss: 0.596237\n[250]\tvalid_0's auc: 0.741152\tvalid_0's binary_logloss: 0.593703\n[375]\tvalid_0's auc: 0.741617\tvalid_0's binary_logloss: 0.593294\n[500]\tvalid_0's auc: 0.741804\tvalid_0's binary_logloss: 0.593111\n[625]\tvalid_0's auc: 0.741963\tvalid_0's binary_logloss: 0.592972\n[750]\tvalid_0's auc: 0.741993\tvalid_0's binary_logloss: 0.592966\nEarly stopping, best iteration is:\n[662]\tvalid_0's auc: 0.741998\tvalid_0's binary_logloss: 0.592944\nPredicting validation fold...\nFold time: 00:10:52\nRunning 3 of 5 folds\nTraining until validation scores don't improve for 125 rounds.\n[125]\tvalid_0's auc: 0.737608\tvalid_0's binary_logloss: 0.596651\n[250]\tvalid_0's auc: 0.740687\tvalid_0's binary_logloss: 0.594073\n[375]\tvalid_0's auc: 0.741005\tvalid_0's binary_logloss: 0.593788\n[500]\tvalid_0's auc: 0.741328\tvalid_0's binary_logloss: 0.593524\n[625]\tvalid_0's auc: 0.741488\tvalid_0's binary_logloss: 0.593371\n[750]\tvalid_0's auc: 0.741556\tvalid_0's binary_logloss: 0.593314\n[875]\tvalid_0's auc: 0.741559\tvalid_0's binary_logloss: 0.593318\nEarly stopping, best iteration is:\n[763]\tvalid_0's auc: 0.741578\tvalid_0's binary_logloss: 0.593296\nPredicting validation fold...\nFold time: 00:11:41\nRunning 4 of 5 folds\nTraining until validation scores don't improve for 125 rounds.\n[125]\tvalid_0's auc: 0.73794\tvalid_0's binary_logloss: 0.596407\n[250]\tvalid_0's auc: 0.740907\tvalid_0's binary_logloss: 0.59391\n[375]\tvalid_0's auc: 0.741523\tvalid_0's binary_logloss: 0.593386\n[500]\tvalid_0's auc: 0.741727\tvalid_0's binary_logloss: 0.59319\n[625]\tvalid_0's auc: 0.741813\tvalid_0's binary_logloss: 0.593109\n[750]\tvalid_0's auc: 0.741873\tvalid_0's binary_logloss: 0.593051\n[875]\tvalid_0's auc: 0.74192\tvalid_0's binary_logloss: 0.593021\nEarly stopping, best iteration is:\n[864]\tvalid_0's auc: 0.741943\tvalid_0's binary_logloss: 0.593004\nPredicting validation fold...\nFold time: 00:12:34\nRunning 5 of 5 folds\nTraining until validation scores don't improve for 125 rounds.\n[125]\tvalid_0's auc: 0.737498\tvalid_0's binary_logloss: 0.596724\n[250]\tvalid_0's auc: 0.740422\tvalid_0's binary_logloss: 0.594257\n[375]\tvalid_0's auc: 0.741021\tvalid_0's binary_logloss: 0.593757\n[500]\tvalid_0's auc: 0.74127\tvalid_0's binary_logloss: 0.593547\n[625]\tvalid_0's auc: 0.741351\tvalid_0's binary_logloss: 0.593469\n[750]\tvalid_0's auc: 0.741372\tvalid_0's binary_logloss: 0.593465\nEarly stopping, best iteration is:\n[720]\tvalid_0's auc: 0.741387\tvalid_0's binary_logloss: 0.593445\nPredicting validation fold...\nFold time: 00:11:33\nTotal amount of training time: 00:56:59\n"
}
]
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "joblib.dump(ensemble, DATA/'lgb_ensemble.pickle')",
"execution_count": 10,
"outputs": [
{
"data": {
"text/plain": "['/home/ck/data/microsoft/lgb_ensemble.pickle']"
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "np.save(DATA/'val_lgb.npy', val_lgb)",
"execution_count": 11,
"outputs": []
},
{
"metadata": {
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "## CatBoost\n\nAnother promising tree boosting library. I have never used it before. It supports GPU computations and can work with several GPUs that have different amounts of memory."
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "X = np.load(DATA/'x_train.npy')\ny = np.load(DATA/'y_train.npy')",
"execution_count": 11,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "params = {\n 'bagging_temperature': 1.8,\n 'l2_leaf_reg': 1,\n 'leaf_estimation_method': 'Gradient',\n 'learning_rate': 0.1,\n 'max_depth': 8, # 5\n 'subsample': 0.6,\n 'iterations': 30000,\n 'bootstrap_type': 'Poisson',\n 'eval_metric': 'AUC',\n 'task_type': 'GPU',\n 'devices': '0:1',\n 'loss_function': 'CrossEntropy',\n 'logging_level': 'Verbose',\n 'random_seed': seed}",
"execution_count": 12,
"outputs": []
},
{
"metadata": {
"scrolled": true,
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "val_cb = np.zeros(len(X), dtype=np.float16)\nensemble = []\ntotal_time = 0\nfor trn_idx, val_idx in split(X, y):\n    with Timer() as timer:\n        model = cb.CatBoostClassifier(**params)\n        model.fit(\n            X[trn_idx], y[trn_idx],\n            eval_set=[(X[val_idx], y[val_idx])],\n            metric_period=250, early_stopping_rounds=250)\n        print('Predicting validation fold...')\n        val_cb[val_idx] = model.predict_proba(X[val_idx])[:, 1]\n    print(f'Fold time: {timer}')\n    ensemble.append(model)\n    total_time += float(timer)\nprint(f'Total amount of training time: {Timer.format_elapsed_time(total_time)}')",
"execution_count": 13,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Running 1 of 5 folds\n0:\tlearn: 0.6851266\ttest: 0.6844047\tbest: 0.6844047 (0)\ttotal: 131ms\tremaining: 1h 5m 15s\n250:\tlearn: 0.7297741\ttest: 0.7283556\tbest: 0.7283556 (250)\ttotal: 19.5s\tremaining: 38m 37s\n500:\tlearn: 0.7355618\ttest: 0.7331157\tbest: 0.7331157 (500)\ttotal: 39.4s\tremaining: 38m 40s\n750:\tlearn: 0.7390429\ttest: 0.7354908\tbest: 0.7354908 (750)\ttotal: 59.5s\tremaining: 38m 36s\n1000:\tlearn: 0.7416191\ttest: 0.7369830\tbest: 0.7369830 (1000)\ttotal: 1m 19s\tremaining: 38m 31s\n1250:\tlearn: 0.7437746\ttest: 0.7380123\tbest: 0.7380123 (1250)\ttotal: 1m 40s\tremaining: 38m 21s\n1500:\tlearn: 0.7456559\ttest: 0.7387539\tbest: 0.7387539 (1500)\ttotal: 2m\tremaining: 38m 8s\n1750:\tlearn: 0.7474428\ttest: 0.7393755\tbest: 0.7393755 (1750)\ttotal: 2m 21s\tremaining: 37m 56s\n2000:\tlearn: 0.7490658\ttest: 0.7398450\tbest: 0.7398450 (2000)\ttotal: 2m 41s\tremaining: 37m 38s\n2250:\tlearn: 0.7506031\ttest: 0.7402359\tbest: 0.7402359 (2250)\ttotal: 3m 1s\tremaining: 37m 21s\n2500:\tlearn: 0.7520553\ttest: 0.7405825\tbest: 0.7405825 (2500)\ttotal: 3m 22s\tremaining: 37m 4s\n2750:\tlearn: 0.7534900\ttest: 0.7408695\tbest: 0.7408695 (2750)\ttotal: 3m 43s\tremaining: 36m 49s\n3000:\tlearn: 0.7548683\ttest: 0.7411410\tbest: 0.7411410 (3000)\ttotal: 4m 3s\tremaining: 36m 33s\n3250:\tlearn: 0.7562250\ttest: 0.7413552\tbest: 0.7413552 (3250)\ttotal: 4m 24s\tremaining: 36m 16s\n3500:\tlearn: 0.7575406\ttest: 0.7415373\tbest: 0.7415373 (3500)\ttotal: 4m 45s\tremaining: 35m 58s\n3750:\tlearn: 0.7588010\ttest: 0.7417109\tbest: 0.7417115 (3746)\ttotal: 5m 6s\tremaining: 35m 41s\n4000:\tlearn: 0.7600622\ttest: 0.7418751\tbest: 0.7418751 (4000)\ttotal: 5m 27s\tremaining: 35m 24s\n4250:\tlearn: 0.7612717\ttest: 0.7419896\tbest: 0.7419896 (4250)\ttotal: 5m 47s\tremaining: 35m 7s\n4500:\tlearn: 0.7624764\ttest: 0.7421108\tbest: 0.7421108 (4499)\ttotal: 6m 8s\tremaining: 34m 49s\n4750:\tlearn: 0.7636789\ttest: 0.7422251\tbest: 0.7422251 (4750)\ttotal: 
6m 29s\tremaining: 34m 30s\n5000:\tlearn: 0.7648291\ttest: 0.7423050\tbest: 0.7423062 (4994)\ttotal: 6m 50s\tremaining: 34m 11s\n5250:\tlearn: 0.7659769\ttest: 0.7423753\tbest: 0.7423753 (5250)\ttotal: 7m 11s\tremaining: 33m 52s\n5500:\tlearn: 0.7671116\ttest: 0.7424427\tbest: 0.7424427 (5500)\ttotal: 7m 32s\tremaining: 33m 33s\n5750:\tlearn: 0.7682374\ttest: 0.7425076\tbest: 0.7425076 (5750)\ttotal: 7m 53s\tremaining: 33m 14s\n6000:\tlearn: 0.7693331\ttest: 0.7425613\tbest: 0.7425613 (6000)\ttotal: 8m 13s\tremaining: 32m 55s\n6250:\tlearn: 0.7704284\ttest: 0.7426014\tbest: 0.7426031 (6239)\ttotal: 8m 34s\tremaining: 32m 35s\n6500:\tlearn: 0.7715001\ttest: 0.7426521\tbest: 0.7426521 (6500)\ttotal: 8m 55s\tremaining: 32m 15s\n6750:\tlearn: 0.7725867\ttest: 0.7427104\tbest: 0.7427106 (6748)\ttotal: 9m 16s\tremaining: 31m 56s\n7000:\tlearn: 0.7736357\ttest: 0.7427511\tbest: 0.7427511 (7000)\ttotal: 9m 37s\tremaining: 31m 37s\n7250:\tlearn: 0.7747117\ttest: 0.7427766\tbest: 0.7427775 (7183)\ttotal: 9m 58s\tremaining: 31m 18s\n7500:\tlearn: 0.7757379\ttest: 0.7427909\tbest: 0.7427937 (7460)\ttotal: 10m 19s\tremaining: 30m 59s\n7750:\tlearn: 0.7767831\ttest: 0.7427952\tbest: 0.7428030 (7705)\ttotal: 10m 41s\tremaining: 30m 40s\n8000:\tlearn: 0.7778163\ttest: 0.7428082\tbest: 0.7428097 (7995)\ttotal: 11m 2s\tremaining: 30m 20s\n8250:\tlearn: 0.7788331\ttest: 0.7428332\tbest: 0.7428340 (8239)\ttotal: 11m 23s\tremaining: 30m 1s\n8500:\tlearn: 0.7798406\ttest: 0.7428669\tbest: 0.7428669 (8500)\ttotal: 11m 44s\tremaining: 29m 41s\n8750:\tlearn: 0.7808563\ttest: 0.7428851\tbest: 0.7428904 (8699)\ttotal: 12m 5s\tremaining: 29m 21s\n9000:\tlearn: 0.7818410\ttest: 0.7428959\tbest: 0.7429007 (8958)\ttotal: 12m 26s\tremaining: 29m 1s\n9250:\tlearn: 0.7828320\ttest: 0.7428994\tbest: 0.7429056 (9032)\ttotal: 12m 47s\tremaining: 28m 41s\n9500:\tlearn: 0.7837998\ttest: 0.7429189\tbest: 0.7429227 (9415)\ttotal: 13m 8s\tremaining: 28m 21s\n9750:\tlearn: 0.7847793\ttest: 0.7429212\tbest: 
0.7429241 (9735)\ttotal: 13m 29s\tremaining: 28m 1s\nbestTest = 0.7429240942\nbestIteration = 9735\nShrink model to first 9736 iterations.\nPredicting validation fold...\nFold time: 00:15:21\nRunning 2 of 5 folds\n0:\tlearn: 0.6848314\ttest: 0.6855975\tbest: 0.6855975 (0)\ttotal: 124ms\tremaining: 1h 1m 49s\n250:\tlearn: 0.7296661\ttest: 0.7293133\tbest: 0.7293133 (250)\ttotal: 19.4s\tremaining: 38m 23s\n500:\tlearn: 0.7353851\ttest: 0.7340461\tbest: 0.7340461 (500)\ttotal: 39s\tremaining: 38m 16s\n750:\tlearn: 0.7389217\ttest: 0.7364740\tbest: 0.7364740 (750)\ttotal: 59.2s\tremaining: 38m 24s\n1000:\tlearn: 0.7414839\ttest: 0.7378700\tbest: 0.7378700 (1000)\ttotal: 1m 19s\tremaining: 38m 20s\n1250:\tlearn: 0.7436312\ttest: 0.7388944\tbest: 0.7388944 (1250)\ttotal: 1m 39s\tremaining: 38m 9s\n1500:\tlearn: 0.7455142\ttest: 0.7396414\tbest: 0.7396414 (1500)\ttotal: 2m\tremaining: 37m 58s\n1750:\tlearn: 0.7472605\ttest: 0.7402188\tbest: 0.7402188 (1750)\ttotal: 2m 20s\tremaining: 37m 46s\n2000:\tlearn: 0.7488767\ttest: 0.7406901\tbest: 0.7406901 (2000)\ttotal: 2m 40s\tremaining: 37m 32s\n2250:\tlearn: 0.7504321\ttest: 0.7410704\tbest: 0.7410704 (2250)\ttotal: 3m 1s\tremaining: 37m 18s\n2500:\tlearn: 0.7519293\ttest: 0.7414368\tbest: 0.7414372 (2499)\ttotal: 3m 22s\tremaining: 37m 3s\n2750:\tlearn: 0.7533427\ttest: 0.7417213\tbest: 0.7417213 (2750)\ttotal: 3m 42s\tremaining: 36m 45s\n3000:\tlearn: 0.7546988\ttest: 0.7419631\tbest: 0.7419640 (2998)\ttotal: 4m 3s\tremaining: 36m 27s\n3250:\tlearn: 0.7560247\ttest: 0.7421651\tbest: 0.7421651 (3250)\ttotal: 4m 23s\tremaining: 36m 10s\n3500:\tlearn: 0.7573494\ttest: 0.7423383\tbest: 0.7423383 (3500)\ttotal: 4m 44s\tremaining: 35m 52s\n3750:\tlearn: 0.7586054\ttest: 0.7424997\tbest: 0.7424997 (3750)\ttotal: 5m 4s\tremaining: 35m 34s\n4000:\tlearn: 0.7598609\ttest: 0.7426500\tbest: 0.7426500 (4000)\ttotal: 5m 25s\tremaining: 35m 17s\n4250:\tlearn: 0.7610719\ttest: 0.7427893\tbest: 0.7427894 (4249)\ttotal: 5m 46s\tremaining: 
34m 57s\n4500:\tlearn: 0.7622871\ttest: 0.7428899\tbest: 0.7428899 (4500)\ttotal: 6m 6s\tremaining: 34m 38s\n4750:\tlearn: 0.7634523\ttest: 0.7429807\tbest: 0.7429808 (4749)\ttotal: 6m 27s\tremaining: 34m 20s\n5000:\tlearn: 0.7646332\ttest: 0.7430608\tbest: 0.7430608 (5000)\ttotal: 6m 48s\tremaining: 34m 1s\n5250:\tlearn: 0.7657889\ttest: 0.7431650\tbest: 0.7431653 (5241)\ttotal: 7m 9s\tremaining: 33m 42s\n5500:\tlearn: 0.7669226\ttest: 0.7432202\tbest: 0.7432202 (5500)\ttotal: 7m 29s\tremaining: 33m 23s\n5750:\tlearn: 0.7680573\ttest: 0.7432972\tbest: 0.7432972 (5750)\ttotal: 7m 50s\tremaining: 33m 3s\n6000:\tlearn: 0.7691827\ttest: 0.7433477\tbest: 0.7433481 (5999)\ttotal: 8m 11s\tremaining: 32m 44s\n6250:\tlearn: 0.7702753\ttest: 0.7434064\tbest: 0.7434067 (6247)\ttotal: 8m 31s\tremaining: 32m 24s\n6500:\tlearn: 0.7713563\ttest: 0.7434359\tbest: 0.7434370 (6496)\ttotal: 8m 52s\tremaining: 32m 5s\n6750:\tlearn: 0.7724329\ttest: 0.7434914\tbest: 0.7434914 (6750)\ttotal: 9m 13s\tremaining: 31m 45s\n7000:\tlearn: 0.7735010\ttest: 0.7435163\tbest: 0.7435167 (6990)\ttotal: 9m 34s\tremaining: 31m 26s\n7250:\tlearn: 0.7745351\ttest: 0.7435403\tbest: 0.7435408 (7228)\ttotal: 9m 54s\tremaining: 31m 6s\n7500:\tlearn: 0.7755905\ttest: 0.7435873\tbest: 0.7435883 (7499)\ttotal: 10m 15s\tremaining: 30m 46s\n7750:\tlearn: 0.7766078\ttest: 0.7436039\tbest: 0.7436039 (7750)\ttotal: 10m 36s\tremaining: 30m 26s\n8000:\tlearn: 0.7776453\ttest: 0.7436253\tbest: 0.7436275 (7976)\ttotal: 10m 57s\tremaining: 30m 6s\n8250:\tlearn: 0.7786550\ttest: 0.7436531\tbest: 0.7436538 (8236)\ttotal: 11m 17s\tremaining: 29m 46s\n8500:\tlearn: 0.7796691\ttest: 0.7436568\tbest: 0.7436578 (8282)\ttotal: 11m 38s\tremaining: 29m 26s\n8750:\tlearn: 0.7806778\ttest: 0.7436817\tbest: 0.7436851 (8742)\ttotal: 11m 59s\tremaining: 29m 6s\n9000:\tlearn: 0.7816508\ttest: 0.7436990\tbest: 0.7436991 (8999)\ttotal: 12m 19s\tremaining: 28m 46s\n9250:\tlearn: 0.7826423\ttest: 0.7436992\tbest: 0.7437031 (9243)\ttotal: 
12m 40s\tremaining: 28m 26s\n9500:\tlearn: 0.7836209\ttest: 0.7437122\tbest: 0.7437143 (9492)\ttotal: 13m 1s\tremaining: 28m 6s\nbestTest = 0.743714273\nbestIteration = 9492\nShrink model to first 9493 iterations.\nPredicting validation fold...\nFold time: 00:14:52\nRunning 3 of 5 folds\n0:\tlearn: 0.6850999\ttest: 0.6845070\tbest: 0.6845070 (0)\ttotal: 125ms\tremaining: 1h 2m 20s\n250:\tlearn: 0.7296294\ttest: 0.7284850\tbest: 0.7284850 (250)\ttotal: 19.6s\tremaining: 38m 43s\n500:\tlearn: 0.7354756\ttest: 0.7333815\tbest: 0.7333815 (500)\ttotal: 39.3s\tremaining: 38m 32s\n750:\tlearn: 0.7389518\ttest: 0.7357793\tbest: 0.7357793 (750)\ttotal: 59.5s\tremaining: 38m 35s\n1000:\tlearn: 0.7416003\ttest: 0.7373117\tbest: 0.7373117 (1000)\ttotal: 1m 19s\tremaining: 38m 29s\n"
},
{
"name": "stdout",
"output_type": "stream",
"text": "1250:\tlearn: 0.7438034\ttest: 0.7383720\tbest: 0.7383720 (1250)\ttotal: 1m 40s\tremaining: 38m 19s\n1500:\tlearn: 0.7457413\ttest: 0.7391279\tbest: 0.7391279 (1500)\ttotal: 2m\tremaining: 38m 9s\n1750:\tlearn: 0.7474782\ttest: 0.7397085\tbest: 0.7397085 (1750)\ttotal: 2m 21s\tremaining: 37m 57s\n2000:\tlearn: 0.7490694\ttest: 0.7401717\tbest: 0.7401717 (2000)\ttotal: 2m 41s\tremaining: 37m 41s\n2250:\tlearn: 0.7506141\ttest: 0.7405697\tbest: 0.7405697 (2250)\ttotal: 3m 2s\tremaining: 37m 25s\n2500:\tlearn: 0.7521148\ttest: 0.7409177\tbest: 0.7409177 (2500)\ttotal: 3m 22s\tremaining: 37m 8s\n2750:\tlearn: 0.7535221\ttest: 0.7411814\tbest: 0.7411823 (2749)\ttotal: 3m 43s\tremaining: 36m 51s\n3000:\tlearn: 0.7548911\ttest: 0.7414254\tbest: 0.7414262 (2994)\ttotal: 4m 3s\tremaining: 36m 34s\n3250:\tlearn: 0.7562345\ttest: 0.7416208\tbest: 0.7416219 (3248)\ttotal: 4m 24s\tremaining: 36m 15s\n3500:\tlearn: 0.7575342\ttest: 0.7417935\tbest: 0.7417941 (3499)\ttotal: 4m 45s\tremaining: 35m 59s\n3750:\tlearn: 0.7587993\ttest: 0.7419619\tbest: 0.7419619 (3750)\ttotal: 5m 5s\tremaining: 35m 41s\n4000:\tlearn: 0.7600444\ttest: 0.7420942\tbest: 0.7420949 (3999)\ttotal: 5m 26s\tremaining: 35m 22s\n4250:\tlearn: 0.7612798\ttest: 0.7422073\tbest: 0.7422074 (4249)\ttotal: 5m 47s\tremaining: 35m 2s\n4500:\tlearn: 0.7624865\ttest: 0.7422993\tbest: 0.7422993 (4500)\ttotal: 6m 7s\tremaining: 34m 43s\n4750:\tlearn: 0.7636806\ttest: 0.7424092\tbest: 0.7424092 (4750)\ttotal: 6m 28s\tremaining: 34m 24s\n5000:\tlearn: 0.7648394\ttest: 0.7424819\tbest: 0.7424819 (5000)\ttotal: 6m 49s\tremaining: 34m 4s\n5250:\tlearn: 0.7659918\ttest: 0.7425707\tbest: 0.7425725 (5241)\ttotal: 7m 9s\tremaining: 33m 45s\n5500:\tlearn: 0.7671323\ttest: 0.7426252\tbest: 0.7426281 (5486)\ttotal: 7m 30s\tremaining: 33m 25s\n5750:\tlearn: 0.7682531\ttest: 0.7426993\tbest: 0.7426993 (5750)\ttotal: 7m 51s\tremaining: 33m 6s\n6000:\tlearn: 0.7693505\ttest: 0.7427369\tbest: 0.7427382 (5998)\ttotal: 8m 
11s\tremaining: 32m 47s\n6250:\tlearn: 0.7704578\ttest: 0.7427803\tbest: 0.7427803 (6250)\ttotal: 8m 32s\tremaining: 32m 27s\n6500:\tlearn: 0.7715418\ttest: 0.7428173\tbest: 0.7428177 (6498)\ttotal: 8m 53s\tremaining: 32m 8s\n6750:\tlearn: 0.7726044\ttest: 0.7428693\tbest: 0.7428694 (6748)\ttotal: 9m 14s\tremaining: 31m 48s\n7000:\tlearn: 0.7736835\ttest: 0.7429143\tbest: 0.7429171 (6990)\ttotal: 9m 35s\tremaining: 31m 28s\n7250:\tlearn: 0.7747476\ttest: 0.7429455\tbest: 0.7429455 (7250)\ttotal: 9m 55s\tremaining: 31m 8s\n7500:\tlearn: 0.7757769\ttest: 0.7429571\tbest: 0.7429702 (7396)\ttotal: 10m 16s\tremaining: 30m 49s\nbestTest = 0.7429702282\nbestIteration = 7396\nShrink model to first 7397 iterations.\nPredicting validation fold...\nFold time: 00:11:58\nRunning 4 of 5 folds\n0:\tlearn: 0.6840279\ttest: 0.6837019\tbest: 0.6837019 (0)\ttotal: 124ms\tremaining: 1h 2m 14s\n250:\tlearn: 0.7295898\ttest: 0.7288651\tbest: 0.7288651 (250)\ttotal: 19.7s\tremaining: 38m 51s\n500:\tlearn: 0.7354265\ttest: 0.7336980\tbest: 0.7336980 (500)\ttotal: 39.5s\tremaining: 38m 45s\n750:\tlearn: 0.7388861\ttest: 0.7360826\tbest: 0.7360826 (750)\ttotal: 59.6s\tremaining: 38m 39s\n1000:\tlearn: 0.7415335\ttest: 0.7375904\tbest: 0.7375904 (1000)\ttotal: 1m 19s\tremaining: 38m 35s\n1250:\tlearn: 0.7436752\ttest: 0.7386098\tbest: 0.7386098 (1250)\ttotal: 1m 40s\tremaining: 38m 26s\n1500:\tlearn: 0.7456069\ttest: 0.7393535\tbest: 0.7393535 (1500)\ttotal: 2m\tremaining: 38m 10s\n1750:\tlearn: 0.7473272\ttest: 0.7399135\tbest: 0.7399135 (1750)\ttotal: 2m 21s\tremaining: 37m 56s\n2000:\tlearn: 0.7489631\ttest: 0.7403584\tbest: 0.7403584 (2000)\ttotal: 2m 41s\tremaining: 37m 41s\n2250:\tlearn: 0.7505025\ttest: 0.7407418\tbest: 0.7407418 (2250)\ttotal: 3m 2s\tremaining: 37m 27s\n2500:\tlearn: 0.7519901\ttest: 0.7410769\tbest: 0.7410769 (2500)\ttotal: 3m 23s\tremaining: 37m 12s\n2750:\tlearn: 0.7534010\ttest: 0.7413916\tbest: 0.7413928 (2749)\ttotal: 3m 43s\tremaining: 36m 55s\n3000:\tlearn: 
0.7547644\ttest: 0.7416371\tbest: 0.7416371 (3000)\ttotal: 4m 4s\tremaining: 36m 37s\n3250:\tlearn: 0.7560982\ttest: 0.7418520\tbest: 0.7418520 (3250)\ttotal: 4m 25s\tremaining: 36m 20s\n3500:\tlearn: 0.7574009\ttest: 0.7420366\tbest: 0.7420366 (3499)\ttotal: 4m 45s\tremaining: 36m 2s\n3750:\tlearn: 0.7587007\ttest: 0.7422026\tbest: 0.7422026 (3750)\ttotal: 5m 6s\tremaining: 35m 43s\n4000:\tlearn: 0.7599286\ttest: 0.7423490\tbest: 0.7423490 (4000)\ttotal: 5m 26s\tremaining: 35m 24s\n4250:\tlearn: 0.7611579\ttest: 0.7424964\tbest: 0.7424964 (4248)\ttotal: 5m 47s\tremaining: 35m 4s\n4500:\tlearn: 0.7623374\ttest: 0.7426186\tbest: 0.7426186 (4493)\ttotal: 6m 8s\tremaining: 34m 45s\n4750:\tlearn: 0.7635263\ttest: 0.7427064\tbest: 0.7427072 (4744)\ttotal: 6m 28s\tremaining: 34m 26s\n5000:\tlearn: 0.7646803\ttest: 0.7427904\tbest: 0.7427904 (5000)\ttotal: 6m 49s\tremaining: 34m 7s\n5250:\tlearn: 0.7658509\ttest: 0.7428794\tbest: 0.7428794 (5250)\ttotal: 7m 10s\tremaining: 33m 48s\n5500:\tlearn: 0.7669673\ttest: 0.7429584\tbest: 0.7429588 (5497)\ttotal: 7m 31s\tremaining: 33m 29s\n5750:\tlearn: 0.7680994\ttest: 0.7430549\tbest: 0.7430549 (5750)\ttotal: 7m 52s\tremaining: 33m 10s\n6000:\tlearn: 0.7692231\ttest: 0.7431363\tbest: 0.7431367 (5998)\ttotal: 8m 12s\tremaining: 32m 50s\n6250:\tlearn: 0.7703366\ttest: 0.7431835\tbest: 0.7431835 (6250)\ttotal: 8m 33s\tremaining: 32m 31s\n6500:\tlearn: 0.7714197\ttest: 0.7432296\tbest: 0.7432296 (6500)\ttotal: 8m 54s\tremaining: 32m 12s\n6750:\tlearn: 0.7724985\ttest: 0.7432509\tbest: 0.7432511 (6747)\ttotal: 9m 15s\tremaining: 31m 52s\n7000:\tlearn: 0.7735813\ttest: 0.7432857\tbest: 0.7432857 (7000)\ttotal: 9m 36s\tremaining: 31m 32s\n7250:\tlearn: 0.7746302\ttest: 0.7433014\tbest: 0.7433034 (7245)\ttotal: 9m 57s\tremaining: 31m 13s\n7500:\tlearn: 0.7756832\ttest: 0.7433423\tbest: 0.7433453 (7476)\ttotal: 10m 17s\tremaining: 30m 53s\n7750:\tlearn: 0.7767162\ttest: 0.7433698\tbest: 0.7433729 (7733)\ttotal: 10m 38s\tremaining: 30m 
33s\n8000:\tlearn: 0.7777387\ttest: 0.7433999\tbest: 0.7434005 (7983)\ttotal: 10m 59s\tremaining: 30m 12s\n8250:\tlearn: 0.7787530\ttest: 0.7434103\tbest: 0.7434158 (8179)\ttotal: 11m 20s\tremaining: 29m 52s\n8500:\tlearn: 0.7797645\ttest: 0.7434147\tbest: 0.7434197 (8308)\ttotal: 11m 40s\tremaining: 29m 32s\nbestTest = 0.7434197068\nbestIteration = 8308\nShrink model to first 8309 iterations.\nPredicting validation fold...\nFold time: 00:13:15\nRunning 5 of 5 folds\n0:\tlearn: 0.6848926\ttest: 0.6853259\tbest: 0.6853259 (0)\ttotal: 124ms\tremaining: 1h 2m 12s\n250:\tlearn: 0.7296306\ttest: 0.7283516\tbest: 0.7283516 (250)\ttotal: 19.5s\tremaining: 38m 36s\n500:\tlearn: 0.7355935\ttest: 0.7332525\tbest: 0.7332525 (500)\ttotal: 39.4s\tremaining: 38m 38s\n750:\tlearn: 0.7390787\ttest: 0.7356053\tbest: 0.7356053 (750)\ttotal: 59.5s\tremaining: 38m 38s\n1000:\tlearn: 0.7416857\ttest: 0.7370375\tbest: 0.7370375 (1000)\ttotal: 1m 19s\tremaining: 38m 37s\n1250:\tlearn: 0.7438401\ttest: 0.7380364\tbest: 0.7380364 (1250)\ttotal: 1m 40s\tremaining: 38m 27s\n1500:\tlearn: 0.7457320\ttest: 0.7387847\tbest: 0.7387847 (1500)\ttotal: 2m\tremaining: 38m 17s\n1750:\tlearn: 0.7474715\ttest: 0.7393531\tbest: 0.7393531 (1750)\ttotal: 2m 21s\tremaining: 38m 2s\n2000:\tlearn: 0.7491094\ttest: 0.7398145\tbest: 0.7398145 (2000)\ttotal: 2m 42s\tremaining: 37m 49s\n2250:\tlearn: 0.7506503\ttest: 0.7401853\tbest: 0.7401853 (2250)\ttotal: 3m 2s\tremaining: 37m 34s\n2500:\tlearn: 0.7521340\ttest: 0.7404997\tbest: 0.7404997 (2500)\ttotal: 3m 23s\tremaining: 37m 17s\n2750:\tlearn: 0.7535403\ttest: 0.7407767\tbest: 0.7407767 (2750)\ttotal: 3m 44s\tremaining: 37m 1s\n3000:\tlearn: 0.7548900\ttest: 0.7409991\tbest: 0.7409991 (3000)\ttotal: 4m 4s\tremaining: 36m 43s\n3250:\tlearn: 0.7562026\ttest: 0.7411880\tbest: 0.7411880 (3250)\ttotal: 4m 25s\tremaining: 36m 24s\n3500:\tlearn: 0.7575147\ttest: 0.7413676\tbest: 0.7413676 (3500)\ttotal: 4m 46s\tremaining: 36m 5s\n3750:\tlearn: 0.7587857\ttest: 
0.7415340\tbest: 0.7415340 (3750)\ttotal: 5m 6s\tremaining: 35m 47s\n4000:\tlearn: 0.7600163\ttest: 0.7416680\tbest: 0.7416680 (4000)\ttotal: 5m 27s\tremaining: 35m 28s\n4250:\tlearn: 0.7612514\ttest: 0.7418015\tbest: 0.7418021 (4249)\ttotal: 5m 48s\tremaining: 35m 10s\n4500:\tlearn: 0.7624537\ttest: 0.7419295\tbest: 0.7419295 (4500)\ttotal: 6m 9s\tremaining: 34m 51s\n4750:\tlearn: 0.7636208\ttest: 0.7420292\tbest: 0.7420292 (4750)\ttotal: 6m 29s\tremaining: 34m 31s\n5000:\tlearn: 0.7647878\ttest: 0.7421130\tbest: 0.7421130 (5000)\ttotal: 6m 50s\tremaining: 34m 12s\n5250:\tlearn: 0.7659357\ttest: 0.7421789\tbest: 0.7421789 (5244)\ttotal: 7m 11s\tremaining: 33m 52s\n5500:\tlearn: 0.7671063\ttest: 0.7422670\tbest: 0.7422670 (5500)\ttotal: 7m 31s\tremaining: 33m 32s\n"
},
{
"name": "stdout",
"output_type": "stream",
"text": "5750:\tlearn: 0.7682310\ttest: 0.7423315\tbest: 0.7423343 (5742)\ttotal: 7m 52s\tremaining: 33m 13s\n6000:\tlearn: 0.7693284\ttest: 0.7424052\tbest: 0.7424055 (5999)\ttotal: 8m 13s\tremaining: 32m 53s\n6250:\tlearn: 0.7704502\ttest: 0.7424735\tbest: 0.7424735 (6250)\ttotal: 8m 34s\tremaining: 32m 33s\n6500:\tlearn: 0.7715443\ttest: 0.7425126\tbest: 0.7425132 (6490)\ttotal: 8m 54s\tremaining: 32m 13s\n6750:\tlearn: 0.7726235\ttest: 0.7425506\tbest: 0.7425506 (6750)\ttotal: 9m 15s\tremaining: 31m 53s\n7000:\tlearn: 0.7736748\ttest: 0.7425961\tbest: 0.7425979 (6994)\ttotal: 9m 36s\tremaining: 31m 33s\n7250:\tlearn: 0.7747220\ttest: 0.7426147\tbest: 0.7426147 (7250)\ttotal: 9m 57s\tremaining: 31m 13s\n7500:\tlearn: 0.7757820\ttest: 0.7426503\tbest: 0.7426568 (7454)\ttotal: 10m 17s\tremaining: 30m 53s\n7750:\tlearn: 0.7768281\ttest: 0.7426544\tbest: 0.7426600 (7598)\ttotal: 10m 38s\tremaining: 30m 33s\n8000:\tlearn: 0.7778530\ttest: 0.7426778\tbest: 0.7426866 (7953)\ttotal: 10m 59s\tremaining: 30m 13s\n8250:\tlearn: 0.7788802\ttest: 0.7426885\tbest: 0.7426907 (8237)\ttotal: 11m 20s\tremaining: 29m 53s\n8500:\tlearn: 0.7798839\ttest: 0.7427056\tbest: 0.7427127 (8445)\ttotal: 11m 41s\tremaining: 29m 33s\nbestTest = 0.7427127361\nbestIteration = 8445\nShrink model to first 8446 iterations.\nPredicting validation fold...\nFold time: 00:13:27\nTotal amount of training time: 01:08:54\n"
}
]
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "joblib.dump(ensemble, DATA/'cb_ensemble.pickle')",
"execution_count": 14,
"outputs": [
{
"data": {
"text/plain": "['/home/ck/data/microsoft/cb_ensemble.pickle']"
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "np.save(DATA/'val_cb.npy', val_cb)",
"execution_count": 15,
"outputs": []
},
{
"metadata": {
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "## SGD \n\nThis one was added to somehow diversify the ensemble. All previous models use tree-based boosting method so probably SGD could bring some additional information. As a standalone model, it shows quite low accuracy compared to the previous solutions. But maybe it can add some value to the stacked model."
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "X = np.load(DATA/'x_train.npy')\ny = np.load(DATA/'y_train.npy')",
"execution_count": 7,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "val_sgd = np.zeros(len(X), dtype=np.float16)\nensemble = []\ntotal_time = 0\nfor trn_idx, val_idx in split(X, y):\n with Timer() as timer:\n sgd = SGDClassifier(\n loss='log', early_stopping=True, \n tol=0.001, alpha=1./len(trn_idx),\n fit_intercept=False, n_jobs=-1)\n model = BaggingClassifier(\n base_estimator=sgd, n_estimators=10, \n bootstrap_features=True, max_features=0.5)\n model.fit(X[trn_idx], y[trn_idx])\n print('Predicting validation fold...', end=' ')\n preds = model.predict_proba(X[val_idx])[:, 1]\n val_sgd[val_idx] = preds\n score = roc_auc_score(y[val_idx], preds)\n print(f'AUC score: {score:2.2f}')\n print(f'Fold time: {timer}')\n ensemble.append(model)\n total_time += float(timer)\nprint(f'Total amount of training time: {Timer.format_elapsed_time(total_time)}')",
"execution_count": 8,
"outputs": [
{
"output_type": "stream",
"text": "Running 1 of 5 folds\nPredicting validation fold... AUC score: 0.58\nFold time: 00:09:42\nRunning 2 of 5 folds\nPredicting validation fold... AUC score: 0.56\nFold time: 00:08:59\nRunning 3 of 5 folds\nPredicting validation fold... AUC score: 0.59\nFold time: 00:09:57\nRunning 4 of 5 folds\nPredicting validation fold... AUC score: 0.56\nFold time: 00:09:29\nRunning 5 of 5 folds\nPredicting validation fold... AUC score: 0.60\nFold time: 00:08:54\nTotal amount of training time: 00:47:04\n",
"name": "stdout"
}
]
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "joblib.dump(ensemble, DATA/'sgd_ensemble.pickle')",
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 10,
"data": {
"text/plain": "['/home/ck/data/microsoft/sgd_ensemble.pickle']"
},
"metadata": {}
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "np.save(DATA/'val_sgd.npy', val_sgd)",
"execution_count": 11,
"outputs": []
},
{
"metadata": {
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "## Vowpal Wabbit\n\nOne more linear learner here. In the next cells, the data we have is converted into format expected by VW. Note that these files could occupy a lot of disk space in uncompressed format. Also, we don't use K-fold cross-validation here. Of course, you can generate several files, or try to play with VW options to enable this and be more consistent with the previous models."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "X = np.load(DATA/'x_train.npy')\ny = np.load(DATA/'y_train.npy')\nX_test = np.load(DATA/'x_test.npy')",
"execution_count": 7,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "col_groups = 0, 81, 81+36, 81+36+8, 81+36+8+100",
"execution_count": 8,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "def pairs(xs):\n for a, b in zip(xs[:-1], xs[1:]):\n yield a, b",
"execution_count": 9,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "base, cnts, mean, leaf = pairs(col_groups)\ngroups = [('base', base, True), \n ('cnts', cnts, False),\n ('mean', mean, False), \n ('leaf', leaf, True)]",
"execution_count": 10,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "def join(s): return ' '.join(s)",
"execution_count": 11,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "def convert_to_vw(data, targets, filename, groups, logistic=True):\n print('Preparing file:', filename)\n print('Number of samples in the dataset:', len(data))\n with open(filename, 'w') as file:\n for (i, row), target in zip(enumerate(data), targets):\n if (i+1) % 500_000 == 0:\n print(f'{i+1:d} samples prepared')\n sample = []\n for name, (start, end), categorical in groups:\n sep = '_' if categorical else ':'\n prefix = name[0]\n group = [f'{prefix}{j}{sep}{int(x) if categorical else f\"{x:.4f}\"}' \n for j, x in enumerate(row[start:end])]\n sample.append((name, group))\n if logistic:\n target = -1 if not target else 1\n string = f\"{target} 'index={i}\"\n for name, group in sample:\n string += f' |{name} {join(group)}'\n string += '\\n'\n file.write(string)\n print('Done!')",
"execution_count": 16,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "convert_to_vw(X, y, DATA/'train.vw', groups)",
"execution_count": 17,
"outputs": [
{
"output_type": "stream",
"text": "Preparing file: /home/ck/data/microsoft/train.vw\nNumber of samples in the dataset: 8921483\n500000 samples prepared\n1000000 samples prepared\n1500000 samples prepared\n2000000 samples prepared\n2500000 samples prepared\n3000000 samples prepared\n3500000 samples prepared\n4000000 samples prepared\n4500000 samples prepared\n5000000 samples prepared\n5500000 samples prepared\n6000000 samples prepared\n6500000 samples prepared\n7000000 samples prepared\n7500000 samples prepared\n8000000 samples prepared\n8500000 samples prepared\nDone!\n",
"name": "stdout"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "convert_to_vw(X_test, np.zeros(len(X_test)), DATA/'test.vw', groups)",
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"text": "Preparing file: /home/ck/data/microsoft/test.vw\nNumber of samples in the dataset: 7853253\n500000 samples prepared\n1000000 samples prepared\n1500000 samples prepared\n2000000 samples prepared\n2500000 samples prepared\n3000000 samples prepared\n3500000 samples prepared\n4000000 samples prepared\n4500000 samples prepared\n5000000 samples prepared\n5500000 samples prepared\n6000000 samples prepared\n6500000 samples prepared\n7000000 samples prepared\n7500000 samples prepared\nDone!\n",
"name": "stdout"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "train = DATA/'train.vw'\ntest = DATA/'test.vw'\nmodel = DATA/'vw.model'",
"execution_count": 23,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true,
"scrolled": true
},
"cell_type": "code",
"source": "!vw -d \"{train}\" --loss_function=logistic --passes=1 --l1=1e-8 -f \"{model}\" --threads -c",
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"text": "using l1 regularization = 1e-08\nfinal_regressor = /home/ck/data/microsoft/vw.model\nNum weight bits = 18\nlearning rate = 0.5\ninitial_t = 0\npower_t = 0.5\nusing cache_file = /home/ck/data/microsoft/train.vw.cache\nignoring text input in favor of cache input\nnum sources = 1\naverage since example example current current current\nloss last counter weight label predict features\n0.693147 0.693147 1 1.0 -1.0000 0.0000 226\n0.385178 0.077210 2 2.0 -1.0000 -2.5224 221\n0.779407 1.173636 4 4.0 1.0000 -2.0642 220\n0.886432 0.993456 8 8.0 -1.0000 1.6948 225\n0.844876 0.803321 16 16.0 1.0000 0.2158 225\n0.765214 0.685552 32 32.0 -1.0000 0.1218 225\n0.732735 0.700256 64 64.0 -1.0000 -0.9867 223\n0.712478 0.692220 128 128.0 -1.0000 -1.9928 224\n0.688964 0.665450 256 256.0 1.0000 1.0081 226\n0.703236 0.717509 512 512.0 -1.0000 -0.0780 223\n0.687916 0.672595 1024 1024.0 -1.0000 -1.0880 226\n0.673310 0.658704 2048 2048.0 -1.0000 -1.9166 225\n0.663663 0.654015 4096 4096.0 -1.0000 -2.0242 224\n0.652972 0.642281 8192 8192.0 -1.0000 0.1751 225\n0.646093 0.639215 16384 16384.0 -1.0000 -0.7576 226\n0.637280 0.628467 32768 32768.0 -1.0000 -0.3366 225\n0.628735 0.620189 65536 65536.0 1.0000 1.7307 225\n0.620995 0.613256 131072 131072.0 -1.0000 -0.6446 225\n0.615783 0.610570 262144 262144.0 -1.0000 -0.3538 224\n0.611487 0.607191 524288 524288.0 -1.0000 -0.5350 226\n0.608158 0.604829 1048576 1048576.0 1.0000 0.4412 225\n0.605723 0.603287 2097152 2097152.0 1.0000 -0.4425 226\n0.604186 0.602649 4194304 4194304.0 -1.0000 -1.8421 226\n0.602900 0.601615 8388608 8388608.0 1.0000 1.9680 224\n\nfinished run\nnumber of examples per pass = 8921483\npasses used = 1\nweighted example sum = 8921483.000000\nweighted label sum = -3699.000000\naverage loss = 0.602805\nbest constant = -0.000829\nbest constant's loss = 0.693147\ntotal feature number = 2009866018\n",
"name": "stdout"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true,
"scrolled": true
},
"cell_type": "code",
"source": "!vw -d \"{train}\" -i \"{model}\" -t --loss_function=logistic --link=logistic -p \"{DATA}/train_preds.vw\"",
"execution_count": 24,
"outputs": [
{
"output_type": "stream",
"text": "only testing\npredictions = /home/ck/data/microsoft/train_preds.vw\nNum weight bits = 18\nlearning rate = 0.5\ninitial_t = 0\npower_t = 0.5\nusing no cache\nReading datafile = /home/ck/data/microsoft/train.vw\nnum sources = 1\naverage since example example current current current\nloss last counter weight label predict features\n0.810212 0.810212 1 1.0 -1.0000 0.5552 226\n0.885578 0.960945 2 2.0 -1.0000 0.6175 221\n0.630610 0.375642 4 4.0 1.0000 0.9909 220\n0.653711 0.676812 8 8.0 -1.0000 0.4736 225\n0.574857 0.496003 16 16.0 1.0000 0.9100 225\n0.613110 0.651364 32 32.0 -1.0000 0.2503 225\n0.601791 0.590472 64 64.0 -1.0000 0.3089 223\n0.619216 0.636641 128 128.0 -1.0000 0.3140 224\n0.606463 0.593710 256 256.0 1.0000 0.4752 226\n0.590612 0.574761 512 512.0 -1.0000 0.4016 223\n0.596969 0.603326 1024 1024.0 -1.0000 0.3293 226\n0.592711 0.588453 2048 2048.0 -1.0000 0.2576 225\n0.596289 0.599867 4096 4096.0 -1.0000 0.0842 224\n0.597137 0.597985 8192 8192.0 -1.0000 0.5245 225\n0.600479 0.603822 16384 16384.0 -1.0000 0.2062 226\n0.601834 0.603188 32768 32768.0 -1.0000 0.3038 225\n0.600483 0.599133 65536 65536.0 1.0000 0.8231 225\n0.599772 0.599060 131072 131072.0 -1.0000 0.4382 225\n0.600129 0.600487 262144 262144.0 -1.0000 0.4013 224\n0.600324 0.600520 524288 524288.0 -1.0000 0.3935 226\n0.600431 0.600538 1048576 1048576.0 1.0000 0.5249 225\n0.600565 0.600700 2097152 2097152.0 1.0000 0.3649 226\n0.600901 0.601236 4194304 4194304.0 -1.0000 0.1249 226\n0.600789 0.600677 8388608 8388608.0 1.0000 0.8713 224\n\nfinished run\nnumber of examples per pass = 8921483\npasses used = 1\nweighted example sum = 8921483.000000\nweighted label sum = -3699.000000\naverage loss = 0.600727\nbest constant = -0.000829\nbest constant's loss = 0.693147\ntotal feature number = 2009866018\n",
"name": "stdout"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "!vw -d \"{test}\" -i \"{model}\" -t --loss_function=logistic --link=logistic -p \"{DATA}/test_preds.vw\"",
"execution_count": 25,
"outputs": [
{
"output_type": "stream",
"text": "only testing\npredictions = /home/ck/data/microsoft/test_preds.vw\nNum weight bits = 18\nlearning rate = 0.5\ninitial_t = 0\npower_t = 0.5\nusing no cache\nReading datafile = /home/ck/data/microsoft/test.vw\nnum sources = 1\naverage since example example current current current\nloss last counter weight label predict features\n0.719431 0.719431 1 1.0 -1.0000 0.5130 225\n0.754351 0.789272 2 2.0 -1.0000 0.5458 226\n0.596922 0.439494 4 4.0 -1.0000 0.3611 225\n0.559359 0.521797 8 8.0 -1.0000 0.2534 223\n0.528392 0.497425 16 16.0 -1.0000 0.3468 226\n0.567522 0.606651 32 32.0 -1.0000 0.5032 226\n0.562386 0.557250 64 64.0 -1.0000 0.4206 226\n0.629696 0.697005 128 128.0 -1.0000 0.5881 226\n0.655898 0.682101 256 256.0 -1.0000 0.6589 226\n0.649914 0.643930 512 512.0 -1.0000 0.4845 226\n0.647522 0.645130 1024 1024.0 -1.0000 0.4133 225\n0.663348 0.679174 2048 2048.0 -1.0000 0.5276 226\n0.670636 0.677924 4096 4096.0 -1.0000 0.2393 226\n0.670951 0.671266 8192 8192.0 -1.0000 0.4425 224\n0.674848 0.678746 16384 16384.0 -1.0000 0.7229 224\n0.670143 0.665438 32768 32768.0 -1.0000 0.5403 226\n0.668797 0.667450 65536 65536.0 -1.0000 0.6339 226\n0.668300 0.667802 131072 131072.0 -1.0000 0.7732 226\n0.669747 0.671195 262144 262144.0 -1.0000 0.5273 226\n0.669958 0.670168 524288 524288.0 -1.0000 0.5058 226\n0.670054 0.670150 1048576 1048576.0 -1.0000 0.3149 226\n0.669620 0.669186 2097152 2097152.0 -1.0000 0.6468 226\n0.669303 0.668987 4194304 4194304.0 -1.0000 0.4008 225\n\nfinished run\nnumber of examples per pass = 7853253\npasses used = 1\nweighted example sum = 7853253.000000\nweighted label sum = -7853253.000000\naverage loss = 0.669313\nbest constant = -1.000000\nbest constant's loss = 0.313262\ntotal feature number = 1768544335\n",
"name": "stdout"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "for filename in ('train_preds.vw', 'test_preds.vw'):\n with open(DATA/filename) as file:\n prefix = filename.strip('.vw')\n values = [(line.strip().split()[0]) for line in file]\n out = DATA/f'vw_{prefix}.npy'\n np.save(out, np.array(values, dtype=np.float16))\n print('Saved file:', out)",
"execution_count": 26,
"outputs": [
{
"output_type": "stream",
"text": "Saved file: /home/ck/data/microsoft/vw_train_preds.npy\nSaved file: /home/ck/data/microsoft/vw_test_preds.npy\n",
"name": "stdout"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "preds = np.load(DATA/'vw_test_preds.npy')",
"execution_count": 27,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "submit = pd.read_csv(DATA/'sample_submission.csv')\nsubmit['HasDetections'] = preds\nsubmit.to_csv('vw.csv', index=None)",
"execution_count": 28,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "!kaggle c submit -c microsoft-malware-prediction -f \"vw.csv\" -m \"VW\"",
"execution_count": 29,
"outputs": [
{
"output_type": "stream",
"text": "100%|████████████████████████████████████████| 297M/297M [00:38<00:00, 8.08MB/s]\nSuccessfully submitted to Microsoft Malware Prediction",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Stack Them All\n\nHaving several trained classifiers, we can try to blend their predictions to improve the overall quality of the solution. For this purpose, we need to generate new datasets where the predictions from previous stages become features."
},
{
"metadata": {
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "## Concatenate the Predictions\n\nThe previous stages saved the training predictions into numpy arrays so now we only need to restore them. Also, we need to apply the trained models to the test set and build a new testing set from these predictions as well."
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "names = ['lgb', 'cb', 'sgd']",
"execution_count": 5,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "train_cols, test_cols = [], []",
"execution_count": 6,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "X_test = np.load(DATA/'x_test.npy')",
"execution_count": 7,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "for name in names:\n print('Preparing model:', name)\n ensemble = joblib.load(DATA/f'{name}_ensemble.pickle')\n test_result = np.zeros(len(X_test), dtype=np.float16)\n for model in ensemble:\n test_result += model.predict_proba(X_test)[:, 1]\n test_result /= len(ensemble)\n test_cols.append(test_result)\n train_cols.append(np.load(DATA/f'val_{name}.npy'))",
"execution_count": 9,
"outputs": [
{
"output_type": "stream",
"text": "Preparing model: lgb\nPreparing model: cb\nPreparing model: sgd\n",
"name": "stdout"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "vw_train = np.load(DATA/'vw_train_preds.npy')\ntrain_cols.append(vw_train)",
"execution_count": 10,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "vw_test = np.load(DATA/'vw_test_preds.npy')\ntest_cols.append(vw_test)",
"execution_count": 16,
"outputs": []
},
{
"metadata": {
"trusted": true,
"hidden": true
},
"cell_type": "code",
"source": "X = np.column_stack(train_cols)\nX_test = np.column_stack(test_cols)",
"execution_count": 18,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "np.save(DATA/'x_train_stacked.npy', X)\nnp.save(DATA/'x_test_stacked.npy', X_test)",
"execution_count": 20,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Fit Model\n\nWe put a very simple classifier on top of the our meta-dataset, an instance of `LogisticRegression` class from `sklearn` with interaction features added."
},
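{
"metadata": {},
"cell_type": "markdown",
"source": "To make the interaction features concrete, here is a small sketch on made-up numbers. The array `X_demo` and its values are hypothetical; only the `PolynomialFeatures` settings match the ones used in this notebook.\n\n```python\nimport numpy as np\nfrom sklearn.preprocessing import PolynomialFeatures\n\n# Two toy rows with three base-model predictions each (made-up numbers)\nX_demo = np.array([[0.2, 0.5, 0.4],\n                   [0.9, 0.1, 0.3]])\n\npoly_demo = PolynomialFeatures(interaction_only=True, include_bias=False)\nX_inter = poly_demo.fit_transform(X_demo)\n\n# 3 original columns + 3 pairwise products = 6 columns\nprint(X_inter.shape)\n```\n\nWith the default degree of 2, the transform keeps the original columns and appends the product of every pair of columns, so the meta-model can also weigh the agreement between two base models."
},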
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "X = np.load(DATA/'x_train_stacked.npy').astype(np.float32)\nX_test = np.load(DATA/'x_test_stacked.npy').astype(np.float32)\ny = np.load(DATA/'y_train.npy')",
"execution_count": 5,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "poly = PolynomialFeatures(interaction_only=True, include_bias=False)",
"execution_count": 6,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "X = poly.fit_transform(X)\nX_test = poly.transform(X_test)",
"execution_count": 7,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "test_result = np.zeros(len(X_test), dtype=np.float16)\nfor trn_idx, val_idx in split(X, y):\n with Timer() as timer:\n model = LogisticRegression(C=1, fit_intercept=True, penalty='l1', solver='saga')\n model.fit(X[trn_idx], y[trn_idx])\n print('Predicting validation fold...', end=' ')\n preds = model.predict_proba(X[val_idx])[:, 1]\n score = roc_auc_score(y[val_idx], preds)\n print(f'Fold AUC score: {score:2.2f}')\n print('Predicting testing dataset...')\n test_result += model.predict_proba(X_test)[:, 1]\ntest_result /= 5",
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"text": "Running 1 of 5 folds\nPredicting validation fold... Fold AUC score: 0.74\nPredicting testing dataset...\nRunning 2 of 5 folds\nPredicting validation fold... Fold AUC score: 0.74\nPredicting testing dataset...\nRunning 3 of 5 folds\nPredicting validation fold... Fold AUC score: 0.74\nPredicting testing dataset...\nRunning 4 of 5 folds\nPredicting validation fold... Fold AUC score: 0.74\nPredicting testing dataset...\nRunning 5 of 5 folds\nPredicting validation fold... Fold AUC score: 0.74\nPredicting testing dataset...\n",
"name": "stdout"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "submit = pd.read_csv(DATA/'sample_submission.csv')\nsubmit['HasDetections'] = np.clip(test_result, 0.05, 0.95)\nsubmit.to_csv('stacked.csv', index=None)",
"execution_count": 14,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "!kaggle c submit -c microsoft-malware-prediction -f \"stacked.csv\" -m \"Stacked classifier\"",
"execution_count": 15,
"outputs": [
{
"output_type": "stream",
"text": "100%|████████████████████████████████████████| 297M/297M [00:22<00:00, 13.6MB/s]\nSuccessfully submitted to Microsoft Malware Prediction",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Conculsion"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Though the author didn't achieve an outstanding result in this competition, it was a very interesting experience. Practice always brings new aspects that are not always obvious from theory. Things like memory errors, corrupted data and floatings overflow bring additional complexity to the process. Therefore, it is essential to practice not only in getting a well-performing solution but also make it robust and computationally efficient."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.7.1",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"base_numbering": 1,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
},
"gist_id": "400b2c37858143201a62b8f59189352d"
},
"nbformat": 4,
"nbformat_minor": 2
}