Skip to content

Instantly share code, notes, and snippets.

@h5li
Created March 13, 2019 20:53
Show Gist options
  • Save h5li/a2c38f9cc25a374f4dcd440b5d4129a4 to your computer and use it in GitHub Desktop.
Save h5li/a2c38f9cc25a374f4dcd440b5d4129a4 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"metadata": {
"_uuid": "67c1e7ee2f47765065b02f74ec692ab572801a94"
},
"cell_type": "markdown",
"source": "# This solution mostly comes from the kernel of the user jiegeng\nReference: https://www.kaggle.com/jiegeng94/everyone-do-this-at-the-beginning"
},
{
"metadata": {
"_uuid": "27e9d5488926d6dc6bc070b9b0921418f02188d5"
},
"cell_type": "markdown",
"source": " # You can remove 17 columns at the Beginning!!\n\nAs the data is highly dimensional in this competition, it is really difficult to do even a little thing. So, before you begin any work, read this kernel and save your time!\n\nI have tried to reduce the column dimension by eliminating less useful columns and selected 17 columns which you can remove just after loading the data sets.\n\n* Selected `mostly-missing feaures` which have more than 99% of missing values.\n* Selected `too-skewed features` whose majority categories cover more than 99% of occurences.\n* Selected `hightly-correlated features`. Tested correlations between columns, picked up pairs whose corr is greater than 0.99, compared the distribution of the features in the pairs and corr with `HasDetections`, and selected the minor column for elimination.\n\n**You can eliminate 17 columns without worry:**\n1. (M) PuaMode\n1. (M) Census_ProcessorClass\n1. (S) Census_IsWIMBootEnabled\n1. (S) IsBeta\n1. (S) Census_IsFlightsDisabled\n1. (S) Census_IsFlightingInternal\n1. (S) AutoSampleOptIn\n1. (S) Census_ThresholdOptIn\n1. (S) SMode\n1. (S) Census_IsPortableOperatingSystem\n1. (S) Census_DeviceFamily\n1. (S) UacLuaenable\n1. (S) Census_IsVirtualDevice\n1. (C) Processor\n1. (C) Census_OSSkuName\n1. (C) Census_OSInstallLanguageIdentifier\n1. (C) Processor\n\nHere, (M) denotes `mostly-missing feaures`, (S) means `too-skewed features`, and (C) indicates `hightly-correlated features`.\n\n\n\n**Use this code:**\n\n> remove_cols = ['PuaMode', 'Census_ProcessorClass', 'Census_IsWIMBootEnabled', 'IsBeta', 'Census_IsFlightsDisabled', 'Census_IsFlightingInternal', 'AutoSampleOptIn', 'Census_ThresholdOptIn', 'SMode', 'Census_IsPortableOperatingSystem', 'Census_DeviceFamily', 'UacLuaenable', 'Census_IsVirtualDevice', 'Platform', 'Census_OSSkuName', 'Census_OSInstallLanguageIdentifier', 'Processor']\n>\n> train.drop(remove_cols, axis=1, inplace=True)\n>\n> test.drop(remove_cols, axis=1, inplace=True)\n\n\n## If you want to see how I got this:\nIn this kernel, I used only train dataset but the result was the same when I used train+test dataset."
},
{
"metadata": {
"trusted": true,
"_uuid": "9ab5367b38f084495f2a939271de7a9c3aa18f79",
"_kg_hide-input": true
},
"cell_type": "code",
"source": "import pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_uuid": "a4c19c382f82484424153d6fceb02f66d66fb410"
},
"cell_type": "markdown",
"source": "# 1. Load Data"
},
{
"metadata": {
"trusted": true,
"_uuid": "3793ee996ed933d12db8ef151ce78611e99423d5",
"_kg_hide-input": true
},
"cell_type": "code",
"source": "# referred https://www.kaggle.com/theoviel/load-the-totality-of-the-data\ndtypes = {\n 'MachineIdentifier': 'category',\n 'ProductName': 'category',\n 'EngineVersion': 'category',\n 'AppVersion': 'category',\n 'AvSigVersion': 'category',\n 'IsBeta': 'int8',\n 'RtpStateBitfield': 'float16',\n 'IsSxsPassiveMode': 'int8',\n 'DefaultBrowsersIdentifier': 'float32',\n 'AVProductStatesIdentifier': 'float32',\n 'AVProductsInstalled': 'float16',\n 'AVProductsEnabled': 'float16',\n 'HasTpm': 'int8',\n 'CountryIdentifier': 'int16',\n 'CityIdentifier': 'float32',\n 'OrganizationIdentifier': 'float16',\n 'GeoNameIdentifier': 'float16',\n 'LocaleEnglishNameIdentifier': 'int16',\n 'Platform': 'category',\n 'Processor': 'category',\n 'OsVer': 'category',\n 'OsBuild': 'int16',\n 'OsSuite': 'int16',\n 'OsPlatformSubRelease': 'category',\n 'OsBuildLab': 'category',\n 'SkuEdition': 'category',\n 'IsProtected': 'float16',\n 'AutoSampleOptIn': 'int8',\n 'PuaMode': 'category',\n 'SMode': 'float16',\n 'IeVerIdentifier': 'float16',\n 'SmartScreen': 'category',\n 'Firewall': 'float16',\n 'UacLuaenable': 'float32',\n 'UacLuaenable': 'float64', # was 'float32'\n 'Census_MDC2FormFactor': 'category',\n 'Census_DeviceFamily': 'category',\n 'Census_OEMNameIdentifier': 'float32', # was 'float16'\n 'Census_OEMModelIdentifier': 'float32',\n 'Census_ProcessorCoreCount': 'float16',\n 'Census_ProcessorManufacturerIdentifier': 'float16',\n 'Census_ProcessorModelIdentifier': 'float32', # was 'float16'\n 'Census_ProcessorClass': 'category',\n 'Census_PrimaryDiskTotalCapacity': 'float64', # was 'float32'\n 'Census_PrimaryDiskTypeName': 'category',\n 'Census_SystemVolumeTotalCapacity': 'float64', # was 'float32'\n 'Census_HasOpticalDiskDrive': 'int8',\n 'Census_TotalPhysicalRAM': 'float32',\n 'Census_ChassisTypeName': 'category',\n 'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float32', # was 'float16'\n 'Census_InternalPrimaryDisplayResolutionHorizontal': 'float32', # was 'float16'\n 'Census_InternalPrimaryDisplayResolutionVertical': 'float32', # was 'float16'\n 'Census_PowerPlatformRoleName': 'category',\n 'Census_InternalBatteryType': 'category',\n 'Census_InternalBatteryNumberOfCharges': 'float64', # was 'float32'\n 'Census_OSVersion': 'category',\n 'Census_OSArchitecture': 'category',\n 'Census_OSBranch': 'category',\n 'Census_OSBuildNumber': 'int16',\n 'Census_OSBuildRevision': 'int32',\n 'Census_OSEdition': 'category',\n 'Census_OSSkuName': 'category',\n 'Census_OSInstallTypeName': 'category',\n 'Census_OSInstallLanguageIdentifier': 'float16',\n 'Census_OSUILocaleIdentifier': 'int16',\n 'Census_OSWUAutoUpdateOptionsName': 'category',\n 'Census_IsPortableOperatingSystem': 'int8',\n 'Census_GenuineStateName': 'category',\n 'Census_ActivationChannel': 'category',\n 'Census_IsFlightingInternal': 'float16',\n 'Census_IsFlightsDisabled': 'float16',\n 'Census_FlightRing': 'category',\n 'Census_ThresholdOptIn': 'float16',\n 'Census_FirmwareManufacturerIdentifier': 'float16',\n 'Census_FirmwareVersionIdentifier': 'float32',\n 'Census_IsSecureBootEnabled': 'int8',\n 'Census_IsWIMBootEnabled': 'float16',\n 'Census_IsVirtualDevice': 'float16',\n 'Census_IsTouchEnabled': 'int8',\n 'Census_IsPenCapable': 'int8',\n 'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',\n 'Wdft_IsGamer': 'float16',\n 'Wdft_RegionIdentifier': 'float16',\n 'HasDetections': 'int8'\n }\ntrain = pd.DataFrame(pd.read_csv('../input/train.csv', dtype=dtypes, chunksize=4000).get_chunk(4000))\ntrain.shape",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "9ba6d8da4bb54da6e9897b9d8ad614b62005b588"
},
"cell_type": "code",
"source": "train.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "dd37df0ad632353401873cf685849bc198ccda5a"
},
"cell_type": "code",
"source": "droppable_features = []",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_uuid": "aafaa9403a695048fcc017f2b264ea94ef8113d2"
},
"cell_type": "markdown",
"source": "# 2. Feature Engineering\n\n## 2.1 mostly-missing Columns"
},
{
"metadata": {
"trusted": true,
"_uuid": "07ae5fbbda6511f31d365de2505cd7babb17ac38"
},
"cell_type": "code",
"source": "# missing_cols = (train.isnull().sum()/train.shape[0]).sort_values(ascending=False)\nmissing_df = pd.DataFrame([{'column': c, 'missing': train[c].isnull().sum()/train.shape[0]} for c in train.columns])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "13a2f1edfc9588873f9fa1c3f64e58ed137fe202"
},
"cell_type": "code",
"source": "missing_df = missing_df.sort_values('missing', ascending=False)\nmissing_df",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "b483db2afd369ee710c1a130e7b84d2ec446382b"
},
"cell_type": "code",
"source": "# droppable_features.append('PuaMode')\n# droppable_features.append('Census_ProcessorClass')\ndroppable_features = []\ndroppable_features.extend(missing_df[missing_df.missing > 0.6].column.tolist())",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_uuid": "7e30bf688867ca1f28b36d5933c3befef8da3b69"
},
"cell_type": "markdown",
"source": "## 2.2 Too skewed columns"
},
{
"metadata": {
"trusted": true,
"_uuid": "3b1dde51fd356d09b5889df06e482dcd5b6d5546"
},
"cell_type": "code",
"source": "pd.options.display.float_format = '{:,.4f}'.format\nsk_df = pd.DataFrame([{'column': c, 'uniq': train[c].nunique(), 'skewness': train[c].value_counts(normalize=True).values[0]} for c in train.columns])\nsk_df = sk_df.sort_values('skewness', ascending=False)\nsk_df",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"scrolled": false,
"_uuid": "cf0d4755c096d163feb3057413d2672ba8ad6611"
},
"cell_type": "code",
"source": "droppable_features.extend(sk_df[sk_df.skewness > 0.99].column.tolist())\ndroppable_features",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "be8ac68316e3f8f0bec385e9bda898e6fe8b9de6"
},
"cell_type": "code",
"source": "# PuaMode is duplicated in the two categories.\n# droppable_features.remove('PuaMode')\n\n# Drop these columns.\ntrain.drop(droppable_features, axis=1, inplace=True)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_uuid": "c031676c0fcd764802220194e80e48c262c3f695"
},
"cell_type": "markdown",
"source": "Replace missing values with 0."
},
{
"metadata": {
"trusted": true,
"_uuid": "36a8a4f39b8542a8abbe27a28f6ea7ddd13c5502"
},
"cell_type": "code",
"source": "train.OrganizationIdentifier.fillna(0, inplace=True)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "387d09cda54626371a89b8d53622059138db2eb2"
},
"cell_type": "code",
"source": "train.shape",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "31b50e5e4c8ccb94f9ebf62a8c54f1edef417904"
},
"cell_type": "code",
"source": "train.dropna(inplace=True)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": false,
"_uuid": "4a01ea4ef465ee3e9532311db194b9613eb06cd8"
},
"cell_type": "markdown",
"source": "MachineIdentifier is not useful for prediction of malware detection."
},
{
"metadata": {
"trusted": true,
"_uuid": "68ee71d0f0fe22f061974f6b114a88ea971a5ae3"
},
"cell_type": "code",
"source": "train.drop('MachineIdentifier', axis=1, inplace=True)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "455a81e0eced560d2df7ff05690cc887450e2146"
},
"cell_type": "code",
"source": "train.shape",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_uuid": "eccfab14ceb4346faea6c86743baf052ce2715ad"
},
"cell_type": "markdown",
"source": "## 2.3 Highly correlated features.\n\nAs there are still too many features, it is bad to calculate and look at all the correlations at once. So, I grouped them by 10 columns and considered their correlations, and finally calculated all the correlation of remaining features."
},
{
"metadata": {
"trusted": true,
"_uuid": "a97cd0788772f2679bba1154fd1ab9d32a7b294e"
},
"cell_type": "code",
"source": "cols = train.columns.tolist()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "5ede06a95a1a79cca3240350112f826ac9378266"
},
"cell_type": "code",
"source": "corr_remove = []",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"scrolled": false,
"trusted": true,
"_uuid": "70a9156d88a393726f972d5ec4835413561f03ff",
"_kg_hide-input": true
},
"cell_type": "code",
"source": "import seaborn as sns\nco_cols = cols[50:60]\nco_cols.append('HasDetections')\nplt.figure(figsize=(10,10))\nsns.heatmap(train[co_cols].corr(), annot=True)\nplt.title('Correlation between 51 ~ 60th columns')\nplt.show()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "742bdc22441abbd35e4e59cdb361da63525c6548"
},
"cell_type": "code",
"source": "print(train.Census_OSInstallLanguageIdentifier.nunique())\nprint(train.Census_OSUILocaleIdentifier.nunique())\n",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_uuid": "f1e0271c5d3275d1d5fa51ea170aaac137bda485"
},
"cell_type": "markdown",
"source": "* `Census_OSInstallLanguageIdentifier` vs `Census_OSUILocaleIdentifier`: remove **`Census_OSInstallLanguageIdentifier`**"
},
{
"metadata": {
"trusted": true,
"_uuid": "7edc4de6e743f5f45b886d112a4043e6d03fdb10"
},
"cell_type": "code",
"source": "corr_remove.append('Census_OSInstallLanguageIdentifier')",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "2f8b8c5d7dbba20bd64ad325e0cf3f87cba83592"
},
"cell_type": "code",
"source": "corr_remove",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "ec86d12a0cec83fdf6223778a57e42308cd0f46e"
},
"cell_type": "code",
"source": "droppable_features.extend(corr_remove)\nprint(len(droppable_features))\ndroppable_features",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "74fc73ea9b94e539166f0d3598355fc7f3aa6fec"
},
"cell_type": "code",
"source": "train.drop(corr_remove,axis=1,inplace=True)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "7d04c44a8183c10c6d7ad79440b758264db1cdca"
},
"cell_type": "code",
"source": "train.shape",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "3c8f6a1545b22f886b55d9252e98b9109e46a4ea"
},
"cell_type": "code",
"source": "train.dtypes",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "001196d2cf0fc74ebe2f4c2b3c1121ee22b70f3f"
},
"cell_type": "code",
"source": "feats = train.select_dtypes(exclude=['category'])\nprint(feats.shape)\nfeats.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "d583bda6107b741a442ffb86d67f35a33f1f8be6"
},
"cell_type": "code",
"source": "from sklearn.linear_model import LogisticRegression\nfrom sklearn.svm import LinearSVC\nfrom sklearn.ensemble import RandomForestClassifier",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "29b6a58c173a57466525267eab656862409c12c8"
},
"cell_type": "code",
"source": "test = pd.read_csv('../input/test.csv', dtype=dtypes)\ntest.shape",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "f54319b9fbe9458d26bdebb251114fc46cd2463b"
},
"cell_type": "code",
"source": "test.drop(columns=droppable_features,inplace=True)\ntest_feats = test.select_dtypes(exclude=['category'])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "4193953f873448790b26d0e0d41594a6703d97cf"
},
"cell_type": "code",
"source": "print(test.shape)\ntest.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "61ca09f3bc19b1e284997f60b3f50a2a970c0158"
},
"cell_type": "code",
"source": "feats.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "624cdcbac026ecdb8e530ed294941366300a4229"
},
"cell_type": "code",
"source": "apply_fun = lambda x: (x - x.mean())/(x.std() + 10**(-9))\nnorm_feats = feats.astype(float).iloc[:,:-1].apply(apply_fun, axis=0)\ntest_feats = test_feats.astype(float).apply(apply_fun, axis=0)\ntest_feats.fillna(0,inplace=True)\ntest_feats.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "59092b1c8f8b0b984b223ff38d477b568f31e69f"
},
"cell_type": "code",
"source": "print(norm_feats.head())\nnorm_feats.shape",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "3477eb9047f6c84d1abbdcc37256260e6b6f5c8e"
},
"cell_type": "code",
"source": "clf = LogisticRegression()\nclf.fit(norm_feats,train.iloc[:,-1])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "9d01ee10e8b60e5fd70361939445ca95e5e1cdaf"
},
"cell_type": "code",
"source": "preds = clf.predict(test_feats)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "0d15c21f7ea506fff64351b3b57cd42cc8013472"
},
"cell_type": "code",
"source": "res = {test.columns[0]: test.iloc[:,0], train.columns[-1]: preds}\nresults = pd.DataFrame(res)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "b33fe5e201eb9f6d33863b524285ca15e9129710"
},
"cell_type": "code",
"source": "results.to_csv('predictions.csv',index=False)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "f95bdca711a3622beb5113009f8ec6002d644e02"
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.6.6",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment