h5li/kernel.ipynb

## kernel.ipynb
{
  "cells": [
    {
      "metadata": {
        "_uuid": "67c1e7ee2f47765065b02f74ec692ab572801a94"
      },
      "cell_type": "markdown",
      "source": "# This solution mostly comes from the kernel of the user jiegeng\nReference: https://www.kaggle.com/jiegeng94/everyone-do-this-at-the-beginning"
    },
    {
      "metadata": {
        "_uuid": "27e9d5488926d6dc6bc070b9b0921418f02188d5"
      },
      "cell_type": "markdown",
      "source": " # You can remove 17 columns at the Beginning!!\n\nAs the data is highly dimensional in this competition, it is really difficult to do even a little thing. So, before you begin any work, read this kernel and save your time!\n\nI have tried to reduce the column dimension by eliminating less useful columns and selected 17 columns which you can remove just after loading the data sets.\n\n* Selected `mostly-missing feaures` which have more than 99% of missing values.\n* Selected `too-skewed features` whose majority categories cover more than 99% of occurences.\n* Selected `hightly-correlated features`. Tested correlations between columns, picked up pairs whose corr is greater than 0.99, compared the distribution of the features in the pairs and corr with `HasDetections`,  and selected the minor column for elimination.\n\n**You can eliminate 17 columns without worry:**\n1. (M) PuaMode\n1. (M) Census_ProcessorClass\n1. (S) Census_IsWIMBootEnabled\n1. (S) IsBeta\n1. (S) Census_IsFlightsDisabled\n1. (S) Census_IsFlightingInternal\n1. (S) AutoSampleOptIn\n1. (S) Census_ThresholdOptIn\n1. (S) SMode\n1. (S) Census_IsPortableOperatingSystem\n1. (S) Census_DeviceFamily\n1. (S) UacLuaenable\n1. (S) Census_IsVirtualDevice\n1. (C) Processor\n1. (C) Census_OSSkuName\n1. (C) Census_OSInstallLanguageIdentifier\n1. (C) Processor\n\nHere, (M) denotes `mostly-missing feaures`, (S) means  `too-skewed features`, and (C) indicates `hightly-correlated features`.\n\n\n\n**Use this code:**\n\n> remove_cols = ['PuaMode', 'Census_ProcessorClass', 'Census_IsWIMBootEnabled', 'IsBeta', 'Census_IsFlightsDisabled', 'Census_IsFlightingInternal', 'AutoSampleOptIn', 'Census_ThresholdOptIn', 'SMode', 'Census_IsPortableOperatingSystem',  'Census_DeviceFamily', 'UacLuaenable', 'Census_IsVirtualDevice', 'Platform', 'Census_OSSkuName', 'Census_OSInstallLanguageIdentifier', 'Processor']\n>\n> train.drop(remove_cols, axis=1, inplace=True)\n>\n> test.drop(remove_cols, axis=1, inplace=True)\n\n\n## If you want to see how I got this:\nIn this kernel, I used only train dataset but the result was the same when I used train+test dataset."
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "9ab5367b38f084495f2a939271de7a9c3aa18f79",
        "_kg_hide-input": true
      },
      "cell_type": "code",
      "source": "import pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "_uuid": "a4c19c382f82484424153d6fceb02f66d66fb410"
      },
      "cell_type": "markdown",
      "source": "# 1. Load Data"
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "3793ee996ed933d12db8ef151ce78611e99423d5",
        "_kg_hide-input": true
      },
      "cell_type": "code",
      "source": "# referred https://www.kaggle.com/theoviel/load-the-totality-of-the-data\ndtypes = {\n        'MachineIdentifier':                                    'category',\n        'ProductName':                                          'category',\n        'EngineVersion':                                        'category',\n        'AppVersion':                                           'category',\n        'AvSigVersion':                                         'category',\n        'IsBeta':                                               'int8',\n        'RtpStateBitfield':                                     'float16',\n        'IsSxsPassiveMode':                                     'int8',\n        'DefaultBrowsersIdentifier':                            'float32',\n        'AVProductStatesIdentifier':                            'float32',\n        'AVProductsInstalled':                                  'float16',\n        'AVProductsEnabled':                                    'float16',\n        'HasTpm':                                               'int8',\n        'CountryIdentifier':                                    'int16',\n        'CityIdentifier':                                       'float32',\n        'OrganizationIdentifier':                               'float16',\n        'GeoNameIdentifier':                                    'float16',\n        'LocaleEnglishNameIdentifier':                          'int16',\n        'Platform':                                             'category',\n        'Processor':                                            'category',\n        'OsVer':                                                'category',\n        'OsBuild':                                              'int16',\n        'OsSuite':                                              'int16',\n        'OsPlatformSubRelease':                                 'category',\n        'OsBuildLab':                                           'category',\n        'SkuEdition':                                           'category',\n        'IsProtected':                                          'float16',\n        'AutoSampleOptIn':                                      'int8',\n        'PuaMode':                                              'category',\n        'SMode':                                                'float16',\n        'IeVerIdentifier':                                      'float16',\n        'SmartScreen':                                          'category',\n        'Firewall':                                             'float16',\n        'UacLuaenable':                                         'float32',\n        'UacLuaenable':                                         'float64', # was 'float32'\n        'Census_MDC2FormFactor':                                'category',\n        'Census_DeviceFamily':                                  'category',\n        'Census_OEMNameIdentifier':                             'float32', # was 'float16'\n        'Census_OEMModelIdentifier':                            'float32',\n        'Census_ProcessorCoreCount':                            'float16',\n        'Census_ProcessorManufacturerIdentifier':               'float16',\n        'Census_ProcessorModelIdentifier':                      'float32', # was 'float16'\n        'Census_ProcessorClass':                                'category',\n        'Census_PrimaryDiskTotalCapacity':                      'float64', # was 'float32'\n        'Census_PrimaryDiskTypeName':                           'category',\n        'Census_SystemVolumeTotalCapacity':                     'float64', # was 'float32'\n        'Census_HasOpticalDiskDrive':                           'int8',\n        'Census_TotalPhysicalRAM':                              'float32',\n        'Census_ChassisTypeName':                               'category',\n        'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float32', # was 'float16'\n        'Census_InternalPrimaryDisplayResolutionHorizontal':    'float32', # was 'float16'\n        'Census_InternalPrimaryDisplayResolutionVertical':      'float32', # was 'float16'\n        'Census_PowerPlatformRoleName':                         'category',\n        'Census_InternalBatteryType':                           'category',\n        'Census_InternalBatteryNumberOfCharges':                'float64', # was 'float32'\n        'Census_OSVersion':                                     'category',\n        'Census_OSArchitecture':                                'category',\n        'Census_OSBranch':                                      'category',\n        'Census_OSBuildNumber':                                 'int16',\n        'Census_OSBuildRevision':                               'int32',\n        'Census_OSEdition':                                     'category',\n        'Census_OSSkuName':                                     'category',\n        'Census_OSInstallTypeName':                             'category',\n        'Census_OSInstallLanguageIdentifier':                   'float16',\n        'Census_OSUILocaleIdentifier':                          'int16',\n        'Census_OSWUAutoUpdateOptionsName':                     'category',\n        'Census_IsPortableOperatingSystem':                     'int8',\n        'Census_GenuineStateName':                              'category',\n        'Census_ActivationChannel':                             'category',\n        'Census_IsFlightingInternal':                           'float16',\n        'Census_IsFlightsDisabled':                             'float16',\n        'Census_FlightRing':                                    'category',\n        'Census_ThresholdOptIn':                                'float16',\n        'Census_FirmwareManufacturerIdentifier':                'float16',\n        'Census_FirmwareVersionIdentifier':                     'float32',\n        'Census_IsSecureBootEnabled':                           'int8',\n        'Census_IsWIMBootEnabled':                              'float16',\n        'Census_IsVirtualDevice':                               'float16',\n        'Census_IsTouchEnabled':                                'int8',\n        'Census_IsPenCapable':                                  'int8',\n        'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',\n        'Wdft_IsGamer':                                         'float16',\n        'Wdft_RegionIdentifier':                                'float16',\n        'HasDetections':                                        'int8'\n        }\ntrain = pd.DataFrame(pd.read_csv('../input/train.csv', dtype=dtypes, chunksize=4000).get_chunk(4000))\ntrain.shape",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "9ba6d8da4bb54da6e9897b9d8ad614b62005b588"
      },
      "cell_type": "code",
      "source": "train.head()",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "dd37df0ad632353401873cf685849bc198ccda5a"
      },
      "cell_type": "code",
      "source": "droppable_features = []",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "_uuid": "aafaa9403a695048fcc017f2b264ea94ef8113d2"
      },
      "cell_type": "markdown",
      "source": "# 2. Feature Engineering\n\n## 2.1 mostly-missing Columns"
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "07ae5fbbda6511f31d365de2505cd7babb17ac38"
      },
      "cell_type": "code",
      "source": "# missing_cols = (train.isnull().sum()/train.shape[0]).sort_values(ascending=False)\nmissing_df = pd.DataFrame([{'column': c, 'missing': train[c].isnull().sum()/train.shape[0]} for c in train.columns])",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "13a2f1edfc9588873f9fa1c3f64e58ed137fe202"
      },
      "cell_type": "code",
      "source": "missing_df = missing_df.sort_values('missing', ascending=False)\nmissing_df",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "b483db2afd369ee710c1a130e7b84d2ec446382b"
      },
      "cell_type": "code",
      "source": "# droppable_features.append('PuaMode')\n# droppable_features.append('Census_ProcessorClass')\ndroppable_features = []\ndroppable_features.extend(missing_df[missing_df.missing > 0.6].column.tolist())",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "_uuid": "7e30bf688867ca1f28b36d5933c3befef8da3b69"
      },
      "cell_type": "markdown",
      "source": "## 2.2 Too skewed columns"
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "3b1dde51fd356d09b5889df06e482dcd5b6d5546"
      },
      "cell_type": "code",
      "source": "pd.options.display.float_format = '{:,.4f}'.format\nsk_df = pd.DataFrame([{'column': c, 'uniq': train[c].nunique(), 'skewness': train[c].value_counts(normalize=True).values[0]} for c in train.columns])\nsk_df = sk_df.sort_values('skewness', ascending=False)\nsk_df",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "scrolled": false,
        "_uuid": "cf0d4755c096d163feb3057413d2672ba8ad6611"
      },
      "cell_type": "code",
      "source": "droppable_features.extend(sk_df[sk_df.skewness > 0.99].column.tolist())\ndroppable_features",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "be8ac68316e3f8f0bec385e9bda898e6fe8b9de6"
      },
      "cell_type": "code",
      "source": "# PuaMode is duplicated in the two categories.\n# droppable_features.remove('PuaMode')\n\n# Drop these columns.\ntrain.drop(droppable_features, axis=1, inplace=True)",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "_uuid": "c031676c0fcd764802220194e80e48c262c3f695"
      },
      "cell_type": "markdown",
      "source": "Replace missing values with 0."
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "36a8a4f39b8542a8abbe27a28f6ea7ddd13c5502"
      },
      "cell_type": "code",
      "source": "train.OrganizationIdentifier.fillna(0, inplace=True)",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "387d09cda54626371a89b8d53622059138db2eb2"
      },
      "cell_type": "code",
      "source": "train.shape",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "31b50e5e4c8ccb94f9ebf62a8c54f1edef417904"
      },
      "cell_type": "code",
      "source": "train.dropna(inplace=True)",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false,
        "_uuid": "4a01ea4ef465ee3e9532311db194b9613eb06cd8"
      },
      "cell_type": "markdown",
      "source": "MachineIdentifier is not useful for prediction of malware detection."
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "68ee71d0f0fe22f061974f6b114a88ea971a5ae3"
      },
      "cell_type": "code",
      "source": "train.drop('MachineIdentifier', axis=1, inplace=True)",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "455a81e0eced560d2df7ff05690cc887450e2146"
      },
      "cell_type": "code",
      "source": "train.shape",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "_uuid": "eccfab14ceb4346faea6c86743baf052ce2715ad"
      },
      "cell_type": "markdown",
      "source": "## 2.3 Highly correlated features.\n\nAs there are still too many features, it is bad to calculate and look at all the correlations at once. So, I grouped them by 10 columns and considered their correlations, and finally calculated all the correlation of remaining features."
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "a97cd0788772f2679bba1154fd1ab9d32a7b294e"
      },
      "cell_type": "code",
      "source": "cols = train.columns.tolist()",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "5ede06a95a1a79cca3240350112f826ac9378266"
      },
      "cell_type": "code",
      "source": "corr_remove = []",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "scrolled": false,
        "trusted": true,
        "_uuid": "70a9156d88a393726f972d5ec4835413561f03ff",
        "_kg_hide-input": true
      },
      "cell_type": "code",
      "source": "import seaborn as sns\nco_cols = cols[50:60]\nco_cols.append('HasDetections')\nplt.figure(figsize=(10,10))\nsns.heatmap(train[co_cols].corr(), annot=True)\nplt.title('Correlation between 51 ~ 60th columns')\nplt.show()",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "742bdc22441abbd35e4e59cdb361da63525c6548"
      },
      "cell_type": "code",
      "source": "print(train.Census_OSInstallLanguageIdentifier.nunique())\nprint(train.Census_OSUILocaleIdentifier.nunique())\n",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "_uuid": "f1e0271c5d3275d1d5fa51ea170aaac137bda485"
      },
      "cell_type": "markdown",
      "source": "* `Census_OSInstallLanguageIdentifier` vs `Census_OSUILocaleIdentifier`: remove **`Census_OSInstallLanguageIdentifier`**"
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "7edc4de6e743f5f45b886d112a4043e6d03fdb10"
      },
      "cell_type": "code",
      "source": "corr_remove.append('Census_OSInstallLanguageIdentifier')",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "2f8b8c5d7dbba20bd64ad325e0cf3f87cba83592"
      },
      "cell_type": "code",
      "source": "corr_remove",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "ec86d12a0cec83fdf6223778a57e42308cd0f46e"
      },
      "cell_type": "code",
      "source": "droppable_features.extend(corr_remove)\nprint(len(droppable_features))\ndroppable_features",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "74fc73ea9b94e539166f0d3598355fc7f3aa6fec"
      },
      "cell_type": "code",
      "source": "train.drop(corr_remove,axis=1,inplace=True)",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "7d04c44a8183c10c6d7ad79440b758264db1cdca"
      },
      "cell_type": "code",
      "source": "train.shape",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "3c8f6a1545b22f886b55d9252e98b9109e46a4ea"
      },
      "cell_type": "code",
      "source": "train.dtypes",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "001196d2cf0fc74ebe2f4c2b3c1121ee22b70f3f"
      },
      "cell_type": "code",
      "source": "feats = train.select_dtypes(exclude=['category'])\nprint(feats.shape)\nfeats.head()",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "d583bda6107b741a442ffb86d67f35a33f1f8be6"
      },
      "cell_type": "code",
      "source": "from sklearn.linear_model import LogisticRegression\nfrom sklearn.svm import LinearSVC\nfrom sklearn.ensemble import RandomForestClassifier",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "29b6a58c173a57466525267eab656862409c12c8"
      },
      "cell_type": "code",
      "source": "test = pd.read_csv('../input/test.csv', dtype=dtypes)\ntest.shape",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "f54319b9fbe9458d26bdebb251114fc46cd2463b"
      },
      "cell_type": "code",
      "source": "test.drop(columns=droppable_features,inplace=True)\ntest_feats = test.select_dtypes(exclude=['category'])",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "4193953f873448790b26d0e0d41594a6703d97cf"
      },
      "cell_type": "code",
      "source": "print(test.shape)\ntest.head()",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "61ca09f3bc19b1e284997f60b3f50a2a970c0158"
      },
      "cell_type": "code",
      "source": "feats.head()",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "624cdcbac026ecdb8e530ed294941366300a4229"
      },
      "cell_type": "code",
      "source": "apply_fun = lambda x: (x - x.mean())/(x.std() + 10**(-9))\nnorm_feats = feats.astype(float).iloc[:,:-1].apply(apply_fun, axis=0)\ntest_feats = test_feats.astype(float).apply(apply_fun, axis=0)\ntest_feats.fillna(0,inplace=True)\ntest_feats.head()",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "59092b1c8f8b0b984b223ff38d477b568f31e69f"
      },
      "cell_type": "code",
      "source": "print(norm_feats.head())\nnorm_feats.shape",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "3477eb9047f6c84d1abbdcc37256260e6b6f5c8e"
      },
      "cell_type": "code",
      "source": "clf = LogisticRegression()\nclf.fit(norm_feats,train.iloc[:,-1])",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "9d01ee10e8b60e5fd70361939445ca95e5e1cdaf"
      },
      "cell_type": "code",
      "source": "preds = clf.predict(test_feats)",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "0d15c21f7ea506fff64351b3b57cd42cc8013472"
      },
      "cell_type": "code",
      "source": "res = {test.columns[0]: test.iloc[:,0], train.columns[-1]: preds}\nresults = pd.DataFrame(res)",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "b33fe5e201eb9f6d33863b524285ca15e9129710"
      },
      "cell_type": "code",
      "source": "results.to_csv('predictions.csv',index=False)",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "_uuid": "f95bdca711a3622beb5113009f8ec6002d644e02"
      },
      "cell_type": "code",
      "source": "",
      "execution_count": null,
      "outputs": []
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.6.6",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 1
}
	{
	"cells": [
	{
	"metadata": {
	"_uuid": "67c1e7ee2f47765065b02f74ec692ab572801a94"
	},
	"cell_type": "markdown",
	"source": "# This solution mostly comes from the kernel of the user jiegeng\nReference: https://www.kaggle.com/jiegeng94/everyone-do-this-at-the-beginning"
	},
	{
	"metadata": {
	"_uuid": "27e9d5488926d6dc6bc070b9b0921418f02188d5"
	},
	"cell_type": "markdown",
	"source": " # You can remove 17 columns at the Beginning!!\n\nAs the data is highly dimensional in this competition, it is really difficult to do even a little thing. So, before you begin any work, read this kernel and save your time!\n\nI have tried to reduce the column dimension by eliminating less useful columns and selected 17 columns which you can remove just after loading the data sets.\n\n* Selected `mostly-missing feaures` which have more than 99% of missing values.\n* Selected `too-skewed features` whose majority categories cover more than 99% of occurences.\n* Selected `hightly-correlated features`. Tested correlations between columns, picked up pairs whose corr is greater than 0.99, compared the distribution of the features in the pairs and corr with `HasDetections`, and selected the minor column for elimination.\n\nYou can eliminate 17 columns without worry:\n1. (M) PuaMode\n1. (M) Census_ProcessorClass\n1. (S) Census_IsWIMBootEnabled\n1. (S) IsBeta\n1. (S) Census_IsFlightsDisabled\n1. (S) Census_IsFlightingInternal\n1. (S) AutoSampleOptIn\n1. (S) Census_ThresholdOptIn\n1. (S) SMode\n1. (S) Census_IsPortableOperatingSystem\n1. (S) Census_DeviceFamily\n1. (S) UacLuaenable\n1. (S) Census_IsVirtualDevice\n1. (C) Processor\n1. (C) Census_OSSkuName\n1. (C) Census_OSInstallLanguageIdentifier\n1. (C) Processor\n\nHere, (M) denotes `mostly-missing feaures`, (S) means `too-skewed features`, and (C) indicates `hightly-correlated features`.\n\n\n\nUse this code:\n\n> remove_cols = ['PuaMode', 'Census_ProcessorClass', 'Census_IsWIMBootEnabled', 'IsBeta', 'Census_IsFlightsDisabled', 'Census_IsFlightingInternal', 'AutoSampleOptIn', 'Census_ThresholdOptIn', 'SMode', 'Census_IsPortableOperatingSystem', 'Census_DeviceFamily', 'UacLuaenable', 'Census_IsVirtualDevice', 'Platform', 'Census_OSSkuName', 'Census_OSInstallLanguageIdentifier', 'Processor']\n>\n> train.drop(remove_cols, axis=1, inplace=True)\n>\n> test.drop(remove_cols, axis=1, inplace=True)\n\n\n## If you want to see how I got this:\nIn this kernel, I used only train dataset but the result was the same when I used train+test dataset."
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "9ab5367b38f084495f2a939271de7a9c3aa18f79",
	"_kg_hide-input": true
	},
	"cell_type": "code",
	"source": "import pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"_uuid": "a4c19c382f82484424153d6fceb02f66d66fb410"
	},
	"cell_type": "markdown",
	"source": "# 1. Load Data"
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "3793ee996ed933d12db8ef151ce78611e99423d5",
	"_kg_hide-input": true
	},
	"cell_type": "code",
	"source": "# referred https://www.kaggle.com/theoviel/load-the-totality-of-the-data\ndtypes = {\n 'MachineIdentifier': 'category',\n 'ProductName': 'category',\n 'EngineVersion': 'category',\n 'AppVersion': 'category',\n 'AvSigVersion': 'category',\n 'IsBeta': 'int8',\n 'RtpStateBitfield': 'float16',\n 'IsSxsPassiveMode': 'int8',\n 'DefaultBrowsersIdentifier': 'float32',\n 'AVProductStatesIdentifier': 'float32',\n 'AVProductsInstalled': 'float16',\n 'AVProductsEnabled': 'float16',\n 'HasTpm': 'int8',\n 'CountryIdentifier': 'int16',\n 'CityIdentifier': 'float32',\n 'OrganizationIdentifier': 'float16',\n 'GeoNameIdentifier': 'float16',\n 'LocaleEnglishNameIdentifier': 'int16',\n 'Platform': 'category',\n 'Processor': 'category',\n 'OsVer': 'category',\n 'OsBuild': 'int16',\n 'OsSuite': 'int16',\n 'OsPlatformSubRelease': 'category',\n 'OsBuildLab': 'category',\n 'SkuEdition': 'category',\n 'IsProtected': 'float16',\n 'AutoSampleOptIn': 'int8',\n 'PuaMode': 'category',\n 'SMode': 'float16',\n 'IeVerIdentifier': 'float16',\n 'SmartScreen': 'category',\n 'Firewall': 'float16',\n 'UacLuaenable': 'float32',\n 'UacLuaenable': 'float64', # was 'float32'\n 'Census_MDC2FormFactor': 'category',\n 'Census_DeviceFamily': 'category',\n 'Census_OEMNameIdentifier': 'float32', # was 'float16'\n 'Census_OEMModelIdentifier': 'float32',\n 'Census_ProcessorCoreCount': 'float16',\n 'Census_ProcessorManufacturerIdentifier': 'float16',\n 'Census_ProcessorModelIdentifier': 'float32', # was 'float16'\n 'Census_ProcessorClass': 'category',\n 'Census_PrimaryDiskTotalCapacity': 'float64', # was 'float32'\n 'Census_PrimaryDiskTypeName': 'category',\n 'Census_SystemVolumeTotalCapacity': 'float64', # was 'float32'\n 'Census_HasOpticalDiskDrive': 'int8',\n 'Census_TotalPhysicalRAM': 'float32',\n 'Census_ChassisTypeName': 'category',\n 'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float32', # was 'float16'\n 'Census_InternalPrimaryDisplayResolutionHorizontal': 'float32', # was 'float16'\n 'Census_InternalPrimaryDisplayResolutionVertical': 'float32', # was 'float16'\n 'Census_PowerPlatformRoleName': 'category',\n 'Census_InternalBatteryType': 'category',\n 'Census_InternalBatteryNumberOfCharges': 'float64', # was 'float32'\n 'Census_OSVersion': 'category',\n 'Census_OSArchitecture': 'category',\n 'Census_OSBranch': 'category',\n 'Census_OSBuildNumber': 'int16',\n 'Census_OSBuildRevision': 'int32',\n 'Census_OSEdition': 'category',\n 'Census_OSSkuName': 'category',\n 'Census_OSInstallTypeName': 'category',\n 'Census_OSInstallLanguageIdentifier': 'float16',\n 'Census_OSUILocaleIdentifier': 'int16',\n 'Census_OSWUAutoUpdateOptionsName': 'category',\n 'Census_IsPortableOperatingSystem': 'int8',\n 'Census_GenuineStateName': 'category',\n 'Census_ActivationChannel': 'category',\n 'Census_IsFlightingInternal': 'float16',\n 'Census_IsFlightsDisabled': 'float16',\n 'Census_FlightRing': 'category',\n 'Census_ThresholdOptIn': 'float16',\n 'Census_FirmwareManufacturerIdentifier': 'float16',\n 'Census_FirmwareVersionIdentifier': 'float32',\n 'Census_IsSecureBootEnabled': 'int8',\n 'Census_IsWIMBootEnabled': 'float16',\n 'Census_IsVirtualDevice': 'float16',\n 'Census_IsTouchEnabled': 'int8',\n 'Census_IsPenCapable': 'int8',\n 'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',\n 'Wdft_IsGamer': 'float16',\n 'Wdft_RegionIdentifier': 'float16',\n 'HasDetections': 'int8'\n }\ntrain = pd.DataFrame(pd.read_csv('../input/train.csv', dtype=dtypes, chunksize=4000).get_chunk(4000))\ntrain.shape",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "9ba6d8da4bb54da6e9897b9d8ad614b62005b588"
	},
	"cell_type": "code",
	"source": "train.head()",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "dd37df0ad632353401873cf685849bc198ccda5a"
	},
	"cell_type": "code",
	"source": "droppable_features = []",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"_uuid": "aafaa9403a695048fcc017f2b264ea94ef8113d2"
	},
	"cell_type": "markdown",
	"source": "# 2. Feature Engineering\n\n## 2.1 mostly-missing Columns"
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "07ae5fbbda6511f31d365de2505cd7babb17ac38"
	},
	"cell_type": "code",
	"source": "# missing_cols = (train.isnull().sum()/train.shape[0]).sort_values(ascending=False)\nmissing_df = pd.DataFrame([{'column': c, 'missing': train[c].isnull().sum()/train.shape[0]} for c in train.columns])",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "13a2f1edfc9588873f9fa1c3f64e58ed137fe202"
	},
	"cell_type": "code",
	"source": "missing_df = missing_df.sort_values('missing', ascending=False)\nmissing_df",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "b483db2afd369ee710c1a130e7b84d2ec446382b"
	},
	"cell_type": "code",
	"source": "# droppable_features.append('PuaMode')\n# droppable_features.append('Census_ProcessorClass')\ndroppable_features = []\ndroppable_features.extend(missing_df[missing_df.missing > 0.6].column.tolist())",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"_uuid": "7e30bf688867ca1f28b36d5933c3befef8da3b69"
	},
	"cell_type": "markdown",
	"source": "## 2.2 Too skewed columns"
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "3b1dde51fd356d09b5889df06e482dcd5b6d5546"
	},
	"cell_type": "code",
	"source": "pd.options.display.float_format = '{:,.4f}'.format\nsk_df = pd.DataFrame([{'column': c, 'uniq': train[c].nunique(), 'skewness': train[c].value_counts(normalize=True).values[0]} for c in train.columns])\nsk_df = sk_df.sort_values('skewness', ascending=False)\nsk_df",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"scrolled": false,
	"_uuid": "cf0d4755c096d163feb3057413d2672ba8ad6611"
	},
	"cell_type": "code",
	"source": "droppable_features.extend(sk_df[sk_df.skewness > 0.99].column.tolist())\ndroppable_features",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "be8ac68316e3f8f0bec385e9bda898e6fe8b9de6"
	},
	"cell_type": "code",
	"source": "# PuaMode is duplicated in the two categories.\n# droppable_features.remove('PuaMode')\n\n# Drop these columns.\ntrain.drop(droppable_features, axis=1, inplace=True)",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"_uuid": "c031676c0fcd764802220194e80e48c262c3f695"
	},
	"cell_type": "markdown",
	"source": "Replace missing values with 0."
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "36a8a4f39b8542a8abbe27a28f6ea7ddd13c5502"
	},
	"cell_type": "code",
	"source": "train.OrganizationIdentifier.fillna(0, inplace=True)",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "387d09cda54626371a89b8d53622059138db2eb2"
	},
	"cell_type": "code",
	"source": "train.shape",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "31b50e5e4c8ccb94f9ebf62a8c54f1edef417904"
	},
	"cell_type": "code",
	"source": "train.dropna(inplace=True)",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": false,
	"_uuid": "4a01ea4ef465ee3e9532311db194b9613eb06cd8"
	},
	"cell_type": "markdown",
	"source": "MachineIdentifier is not useful for prediction of malware detection."
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "68ee71d0f0fe22f061974f6b114a88ea971a5ae3"
	},
	"cell_type": "code",
	"source": "train.drop('MachineIdentifier', axis=1, inplace=True)",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "455a81e0eced560d2df7ff05690cc887450e2146"
	},
	"cell_type": "code",
	"source": "train.shape",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"_uuid": "eccfab14ceb4346faea6c86743baf052ce2715ad"
	},
	"cell_type": "markdown",
	"source": "## 2.3 Highly correlated features.\n\nAs there are still too many features, it is bad to calculate and look at all the correlations at once. So, I grouped them by 10 columns and considered their correlations, and finally calculated all the correlation of remaining features."
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "a97cd0788772f2679bba1154fd1ab9d32a7b294e"
	},
	"cell_type": "code",
	"source": "cols = train.columns.tolist()",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "5ede06a95a1a79cca3240350112f826ac9378266"
	},
	"cell_type": "code",
	"source": "corr_remove = []",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"scrolled": false,
	"trusted": true,
	"_uuid": "70a9156d88a393726f972d5ec4835413561f03ff",
	"_kg_hide-input": true
	},
	"cell_type": "code",
	"source": "import seaborn as sns\nco_cols = cols[50:60]\nco_cols.append('HasDetections')\nplt.figure(figsize=(10,10))\nsns.heatmap(train[co_cols].corr(), annot=True)\nplt.title('Correlation between 51 ~ 60th columns')\nplt.show()",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "742bdc22441abbd35e4e59cdb361da63525c6548"
	},
	"cell_type": "code",
	"source": "print(train.Census_OSInstallLanguageIdentifier.nunique())\nprint(train.Census_OSUILocaleIdentifier.nunique())\n",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"_uuid": "f1e0271c5d3275d1d5fa51ea170aaac137bda485"
	},
	"cell_type": "markdown",
	"source": "* `Census_OSInstallLanguageIdentifier` vs `Census_OSUILocaleIdentifier`: remove `Census_OSInstallLanguageIdentifier`"
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "7edc4de6e743f5f45b886d112a4043e6d03fdb10"
	},
	"cell_type": "code",
	"source": "corr_remove.append('Census_OSInstallLanguageIdentifier')",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "2f8b8c5d7dbba20bd64ad325e0cf3f87cba83592"
	},
	"cell_type": "code",
	"source": "corr_remove",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "ec86d12a0cec83fdf6223778a57e42308cd0f46e"
	},
	"cell_type": "code",
	"source": "droppable_features.extend(corr_remove)\nprint(len(droppable_features))\ndroppable_features",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "74fc73ea9b94e539166f0d3598355fc7f3aa6fec"
	},
	"cell_type": "code",
	"source": "train.drop(corr_remove,axis=1,inplace=True)",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "7d04c44a8183c10c6d7ad79440b758264db1cdca"
	},
	"cell_type": "code",
	"source": "train.shape",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "3c8f6a1545b22f886b55d9252e98b9109e46a4ea"
	},
	"cell_type": "code",
	"source": "train.dtypes",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "001196d2cf0fc74ebe2f4c2b3c1121ee22b70f3f"
	},
	"cell_type": "code",
	"source": "feats = train.select_dtypes(exclude=['category'])\nprint(feats.shape)\nfeats.head()",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "d583bda6107b741a442ffb86d67f35a33f1f8be6"
	},
	"cell_type": "code",
	"source": "from sklearn.linear_model import LogisticRegression\nfrom sklearn.svm import LinearSVC\nfrom sklearn.ensemble import RandomForestClassifier",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "29b6a58c173a57466525267eab656862409c12c8"
	},
	"cell_type": "code",
	"source": "test = pd.read_csv('../input/test.csv', dtype=dtypes)\ntest.shape",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "f54319b9fbe9458d26bdebb251114fc46cd2463b"
	},
	"cell_type": "code",
	"source": "test.drop(columns=droppable_features,inplace=True)\ntest_feats = test.select_dtypes(exclude=['category'])",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "4193953f873448790b26d0e0d41594a6703d97cf"
	},
	"cell_type": "code",
	"source": "print(test.shape)\ntest.head()",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "61ca09f3bc19b1e284997f60b3f50a2a970c0158"
	},
	"cell_type": "code",
	"source": "feats.head()",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "624cdcbac026ecdb8e530ed294941366300a4229"
	},
	"cell_type": "code",
	"source": "apply_fun = lambda x: (x - x.mean())/(x.std() + 10**(-9))\nnorm_feats = feats.astype(float).iloc[:,:-1].apply(apply_fun, axis=0)\ntest_feats = test_feats.astype(float).apply(apply_fun, axis=0)\ntest_feats.fillna(0,inplace=True)\ntest_feats.head()",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "59092b1c8f8b0b984b223ff38d477b568f31e69f"
	},
	"cell_type": "code",
	"source": "print(norm_feats.head())\nnorm_feats.shape",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "3477eb9047f6c84d1abbdcc37256260e6b6f5c8e"
	},
	"cell_type": "code",
	"source": "clf = LogisticRegression()\nclf.fit(norm_feats,train.iloc[:,-1])",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "9d01ee10e8b60e5fd70361939445ca95e5e1cdaf"
	},
	"cell_type": "code",
	"source": "preds = clf.predict(test_feats)",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "0d15c21f7ea506fff64351b3b57cd42cc8013472"
	},
	"cell_type": "code",
	"source": "res = {test.columns[0]: test.iloc[:,0], train.columns[-1]: preds}\nresults = pd.DataFrame(res)",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "b33fe5e201eb9f6d33863b524285ca15e9129710"
	},
	"cell_type": "code",
	"source": "results.to_csv('predictions.csv',index=False)",
	"execution_count": null,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true,
	"_uuid": "f95bdca711a3622beb5113009f8ec6002d644e02"
	},
	"cell_type": "code",
	"source": "",
	"execution_count": null,
	"outputs": []
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"name": "python",
	"version": "3.6.6",
	"mimetype": "text/x-python",
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"pygments_lexer": "ipython3",
	"nbconvert_exporter": "python",
	"file_extension": ".py"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 1
	}