Sustainable Industry: Rinse Over Run competition. Ranked 58th out of 1200+ participants. Link: https://www.drivendata.org/competitions/56/predict-cleaning-time-series/
{
"cells": [
{
"metadata": {
"trusted": true,
"_uuid": "b778c78e38c014d8ba67b03d3884199f8f82e507"
},
"cell_type": "code",
"source": "!ls",
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"text": "__notebook_source__.ipynb\r\n",
"name": "stdout"
}
]
},
{
"metadata": {
"id": "ldX6xHXQUXip",
"colab_type": "code",
"outputId": "c2325f9f-3601-48da-83c7-5a3d5e5af44e",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"trusted": true,
"_uuid": "672962fb990c848adcf502d762c660006ab71711"
},
"cell_type": "code",
"source": "import pandas as pd\nimport numpy as np\nfrom matplotlib import pyplot as plt\nimport seaborn as sns\nfrom tqdm import tqdm\nimport gc\ntqdm.pandas()\ngc.collect()",
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 4,
"data": {
"text/plain": "11"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "N_uO6NYwUu6z",
"colab_type": "text",
"_uuid": "c43fdafc5668da60601b3619b84b4adefc433194"
},
"cell_type": "markdown",
"source": "## Getting Data"
},
{
"metadata": {
"id": "uAudXj1GUx77",
"colab_type": "code",
"outputId": "3763ec6c-c566-4de4-802d-09c2e436a185",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 765
},
"trusted": true,
"_uuid": "0e0b1c43bbeadee6f0f3b74db8659b609434146d"
},
"cell_type": "code",
"source": "!wget https://s3.amazonaws.com/drivendata/data/56/public/train_values.zip\n!wget https://s3.amazonaws.com/drivendata/data/56/public/test_values.zip\n!wget https://s3.amazonaws.com/drivendata/data/56/public/submission_format.csv\n!wget https://s3.amazonaws.com/drivendata/data/56/public/train_labels.csv",
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": "--2019-01-29 04:10:21-- https://s3.amazonaws.com/drivendata/data/56/public/train_values.zip\nResolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.177.109\nConnecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.177.109|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 334639180 (319M) [application/zip]\nSaving to: ‘train_values.zip’\n\ntrain_values.zip 100%[===================>] 319.14M 80.5MB/s in 3.9s \n\n2019-01-29 04:10:25 (82.0 MB/s) - ‘train_values.zip’ saved [334639180/334639180]\n\n--2019-01-29 04:10:25-- https://s3.amazonaws.com/drivendata/data/56/public/test_values.zip\nResolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.81.43\nConnecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.81.43|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 106633113 (102M) [application/zip]\nSaving to: ‘test_values.zip’\n\ntest_values.zip 100%[===================>] 101.69M 86.5MB/s in 1.2s \n\n2019-01-29 04:10:27 (86.5 MB/s) - ‘test_values.zip’ saved [106633113/106633113]\n\n--2019-01-29 04:10:27-- https://s3.amazonaws.com/drivendata/data/56/public/submission_format.csv\nResolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.101.93\nConnecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.101.93|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 29715 (29K) [text/csv]\nSaving to: ‘submission_format.csv’\n\nsubmission_format.c 100%[===================>] 29.02K --.-KB/s in 0.02s \n\n2019-01-29 04:10:27 (1.57 MB/s) - ‘submission_format.csv’ saved [29715/29715]\n\n--2019-01-29 04:10:28-- https://s3.amazonaws.com/drivendata/data/56/public/train_labels.csv\nResolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.170.61\nConnecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.170.61|:443... connected.\nHTTP request sent, awaiting response... 
200 OK\nLength: 122192 (119K) [text/csv]\nSaving to: ‘train_labels.csv’\n\ntrain_labels.csv 100%[===================>] 119.33K --.-KB/s in 0.04s \n\n2019-01-29 04:10:28 (2.82 MB/s) - ‘train_labels.csv’ saved [122192/122192]\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"id": "pGmeEpy0U3mf",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "9df58bfc359434662a820cc538946fa51f98e19a"
},
"cell_type": "code",
"source": "!mkdir data",
"execution_count": 6,
"outputs": []
},
{
"metadata": {
"id": "oXgGSzb5U9w-",
"colab_type": "code",
"outputId": "120088c0-5a87-4f88-f6b8-b9f2944ff359",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 85
},
"trusted": true,
"_uuid": "73a5379f050e9710c7cc3c0a296655c22ca6a97d"
},
"cell_type": "code",
"source": "!unzip train_values.zip -d data/\n!unzip test_values.zip -d data/\n!mv train_labels.csv data/\n!mv submission_format.csv data/",
"execution_count": 7,
"outputs": [
{
"output_type": "stream",
"text": "Archive: train_values.zip\n inflating: data/train_values.csv \nArchive: test_values.zip\n inflating: data/test_values.csv \n",
"name": "stdout"
}
]
},
{
"metadata": {
"id": "Envm7L7YVOye",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "7b3e4eef5b155888faf42dedcc55f16dad5145c7"
},
"cell_type": "code",
"source": "!rm train_values.zip\n!rm test_values.zip",
"execution_count": 8,
"outputs": []
},
{
"metadata": {
"id": "cqKvAzDMV8Xl",
"colab_type": "text",
"_uuid": "35368c2062a156d2133dfd6dd8d54a49422ff8b0"
},
"cell_type": "markdown",
"source": "## EDA"
},
{
"metadata": {
"id": "9vyBgnibV7ZN",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "226bdebe5096408d18562c8e3b63abcc984d8578"
},
"cell_type": "code",
"source": "df_train = pd.read_csv('data/train_values.csv')\ndf_test = pd.read_csv('data/test_values.csv')\ndf_labels = pd.read_csv('data/train_labels.csv')\ndf_sub = pd.read_csv('data/submission_format.csv')",
"execution_count": 9,
"outputs": []
},
{
"metadata": {
"id": "LHgpEdNyWN9b",
"colab_type": "code",
"outputId": "bf5d0d64-1aea-4234-d9cf-d8e032542594",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 423
},
"trusted": true,
"_uuid": "ea528886d5fec2b316676466b74de6a3e01ee8b6"
},
"cell_type": "code",
"source": "df_train.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "kTo6BzErWUEy",
"colab_type": "code",
"outputId": "a309b441-2090-4515-84e3-10978250fc91",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 714
},
"trusted": true,
"_uuid": "bbf585c1a6d13b968bb88c5bda1dbdb1f3e4a7a1"
},
"cell_type": "code",
"source": "df_train.info()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "DDwkKT5wWXIk",
"colab_type": "code",
"outputId": "57047c70-a3da-42e4-e5b5-0ab67c5117cb",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 317
},
"trusted": true,
"_uuid": "deaf7dc2bc026a33772200363adc1065fce0bef1"
},
"cell_type": "code",
"source": "df_train.describe()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "7kxVbnBUWbPI",
"colab_type": "code",
"outputId": "0ac0716f-7b0e-4aed-9f7a-a3bcc2cd091d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"trusted": true,
"_uuid": "c71cabf8280026e25d2bb1255a386b6467ac07ad"
},
"cell_type": "code",
"source": "df_train.process_id.unique().shape",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "CWN-osf_Wi6m",
"colab_type": "code",
"outputId": "a750888a-4b89-4611-e1d0-36d8b02e2c51",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"trusted": false,
"_uuid": "734b0308a15ce2008c2706cc8b206b62e49aad95"
},
"cell_type": "code",
"source": "df_test.process_id.unique().shape",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "Pilo2uFEXs4A",
"colab_type": "code",
"outputId": "7cbb0e2a-1067-41f1-a2d4-207b2ba16b89",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"trusted": false,
"_uuid": "95138830b448308ca88c0ac4723dda2246a5c609"
},
"cell_type": "code",
"source": "np.intersect1d(df_train.process_id.unique(), df_test.process_id.unique())",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "OBj-as5PX9Kv",
"colab_type": "code",
"colab": {},
"trusted": false,
"_uuid": "b7bd24ea34447fff0f9811d63ef922ca461be3d1"
},
"cell_type": "code",
"source": "process_sample = df_train[df_train.process_id == 20001]",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "cRuvGFakfAvc",
"colab_type": "code",
"outputId": "9d4ef450-0dc4-42a0-c5ee-47d1b8da6e90",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 423
},
"trusted": false,
"_uuid": "62b4097bd05581ea92f8af7bda8066850da2beb4"
},
"cell_type": "code",
"source": "process_sample.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "_t-tCHYkfp3s",
"colab_type": "code",
"outputId": "2a74ea2e-a863-472e-b910-38c41bdc2edf",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 102
},
"trusted": false,
"_uuid": "4b44d94275418fbf0ef8092a39e1e3deb1264a2a"
},
"cell_type": "code",
"source": "process_sample.drop(['row_id', 'process_id', 'object_id'],axis=1, inplace=True)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "KaMbopYUf5F1",
"colab_type": "code",
"outputId": "ab04a521-786c-4313-b9f5-dd29bc42ab7b",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 423
},
"trusted": false,
"_uuid": "f9f41de98f417b7454c1196e419b377a2c8354f7"
},
"cell_type": "code",
"source": "process_sample.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "SxxlnIw4gC9b",
"colab_type": "code",
"outputId": "25b5dfc3-5da0-4932-e6d2-a8130eacc8e9",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"trusted": false,
"_uuid": "9a5e906c40c76eaab94b6bd62c7370fc29f99768"
},
"cell_type": "code",
"source": "process_sample.timestamp = (pd.to_datetime(process_sample.timestamp, format=\"%Y-%m-%d %H:%M:%S\") - pd.datetime.now()).dt.total_seconds()\nprocess_sample.timestamp = process_sample.timestamp-process_sample.timestamp.min()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "ihztnfrogMVT",
"colab_type": "code",
"outputId": "a29df3fe-2a88-426a-fcf8-69dc6f29544f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 253
},
"trusted": false,
"_uuid": "c08ca5aeb483143bcfd01c67b4ff82b54ed7cd77"
},
"cell_type": "code",
"source": "process_sample.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "EEBhCmbQgyv3",
"colab_type": "code",
"outputId": "db000e13-4acb-4171-d9a2-49b8b666b2f9",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"trusted": false,
"_uuid": "5d079ee96ab44c8e51b0894453aa3a6e6903f24c"
},
"cell_type": "code",
"source": "process_sample.supply_flow.isna().any()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "Xxh70o3Fi5Xr",
"colab_type": "code",
"outputId": "202c3c8a-60da-4b6f-b355-26273f8bec95",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "d4bb0c9967ccf48e24bdea8cea6daca7bd0e42a2"
},
"cell_type": "code",
"source": "sns.distplot(process_sample.supply_flow)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "2PwjAV8ujc3n",
"colab_type": "code",
"outputId": "9f779bd4-d19f-4535-b480-fcb9b203b271",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "cfd261afbfbda7b8b8a276fc8b087c6fb31e79c1"
},
"cell_type": "code",
"source": "sns.distplot(process_sample.supply_pressure)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "_FzDQwsMjyik",
"colab_type": "code",
"outputId": "9f201562-ddad-4f19-8b4c-781a12cc4002",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "9fda2b8c52cd3c6fe2c983bcc09727abc97488d9"
},
"cell_type": "code",
"source": "sns.distplot(process_sample.return_temperature)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "PKL3i_Hjj3US",
"colab_type": "code",
"outputId": "913480ef-4e82-4a01-c853-95e67e831b45",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "bacb972235cbfa812bcfc3cae9d0eeef2dba9f73"
},
"cell_type": "code",
"source": "sns.distplot(np.square(process_sample.return_temperature))",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "eyMC2XQckga2",
"colab_type": "code",
"outputId": "c6ca2043-49d6-4b3c-afc0-c7154ee531d9",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "2f25412ea1b156230bdf7fe76715ef3d9974bed9"
},
"cell_type": "code",
"source": "sns.distplot(process_sample.return_conductivity)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "Zs9TeFFQlHbx",
"colab_type": "code",
"outputId": "1ff464d9-ae35-4f5a-d2b0-1b36758331f3",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "7bec4d2d3414c270251350c43fb0eff769b25cd7"
},
"cell_type": "code",
"source": "sns.distplot(process_sample.return_turbidity)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "gdMuNaCUlOYY",
"colab_type": "code",
"outputId": "d82b5a64-2d17-43ef-ea3b-a8d914bbeead",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 431
},
"trusted": false,
"_uuid": "9e67097bd59999f30754b62159773f67a82815ab"
},
"cell_type": "code",
"source": "sns.distplot(np.log(process_sample.return_turbidity))",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "k6CMzPZnlSIK",
"colab_type": "code",
"outputId": "9f7e5013-8f35-475f-e6f9-60a895354a26",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "86fe3f9f56e07932966a804c77f8e2bad6ee7257"
},
"cell_type": "code",
"source": "sns.distplot(process_sample.return_flow)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "GJiAn_n7lb3L",
"colab_type": "code",
"outputId": "ec8d530e-f24f-4ba7-8ed1-0e0953c7c3ab",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 440
},
"trusted": false,
"_uuid": "c4097031ce8c4801d146f164fa9feeed33509881"
},
"cell_type": "code",
"source": "sns.distplot(np.square(process_sample.return_flow))",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "MfEKKAlBlg5o",
"colab_type": "code",
"outputId": "cd4784e9-73ff-4011-c5c0-25bddff9cda5",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1185
},
"trusted": false,
"_uuid": "1a7ae83cb17a12b0370ef7437576f9e46003b5d0"
},
"cell_type": "code",
"source": "fig, ax = plt.subplots(nrows=5, ncols=2, figsize=(12,20))\nsns.countplot(process_sample.supply_pump, ax=ax[0][0])\nsns.countplot(process_sample.supply_pre_rinse, ax=ax[0][1])\nsns.countplot(process_sample.supply_caustic, ax=ax[1][0])\nsns.countplot(process_sample.return_caustic, ax=ax[1][1])\nsns.countplot(process_sample.supply_acid, ax=ax[2][0])\nsns.countplot(process_sample.return_acid, ax=ax[2][1])\nsns.countplot(process_sample.supply_clean_water, ax=ax[3][0])\nsns.countplot(process_sample.return_recovery_water, ax=ax[3][1])\nsns.countplot(process_sample.return_drain, ax=ax[4][0])\nsns.countplot(process_sample.object_low_level, ax=ax[4][1])\nplt.show()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "VsN4ZhyGmcLe",
"colab_type": "code",
"outputId": "5b77933d-19d8-4558-ce42-5104e618411b",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 442
},
"trusted": false,
"_uuid": "c6665d10b1115c04d9a0cf26ec59ee01022b5f45"
},
"cell_type": "code",
"source": "fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(18,6))\nsns.countplot(process_sample.supply_acid, hue=process_sample.return_acid, ax=ax[0])\nsns.countplot(process_sample.supply_caustic,hue=process_sample.return_caustic,ax=ax[1])\nsns.countplot(process_sample.supply_clean_water,hue=process_sample.return_recovery_water, ax=ax[2])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "8U8UUKbG5dIg",
"colab_type": "code",
"outputId": "dc1154fe-a76e-4b89-cbf8-6d1a2139ee12",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "c03682cc1b647eaaafea6cb184b71a7516a62c6c"
},
"cell_type": "code",
"source": "sns.distplot(process_sample['tank_level_pre_rinse'])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "pQqt51ys7UJi",
"colab_type": "code",
"outputId": "479f1217-ee9a-4412-a8a2-81965003dd3e",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "59aff32940235dd6e02c37a753cf32c6c0c40d26"
},
"cell_type": "code",
"source": "sns.distplot(process_sample['tank_level_caustic'])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "-PvwVE-07gG1",
"colab_type": "code",
"outputId": "72e1ec1e-04ee-4f22-9fbf-6283b427c883",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "b5e0eb04cfdd2b9b7b747c59cf7d02402218c3a9"
},
"cell_type": "code",
"source": "sns.distplot(process_sample['tank_level_acid'])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "MK2d7FGL7yTQ",
"colab_type": "code",
"outputId": "88eb25fb-807b-4723-b6ca-66eee518b882",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "c511cbad1534067b4f3ce70b9fae4a019f743242"
},
"cell_type": "code",
"source": "sns.distplot(process_sample['tank_level_clean_water'])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "uR3DRHRV7yvZ",
"colab_type": "code",
"outputId": "e5d77e15-5ffd-49e9-c21f-12737b6a7c25",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 415
},
"trusted": false,
"_uuid": "626be6db510b2d0873202d18aad0c1a8224d0795"
},
"cell_type": "code",
"source": "sns.distplot(process_sample['tank_level_pre_rinse'] - process_sample['tank_level_caustic'])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "ObQqhDTn7zCY",
"colab_type": "code",
"outputId": "62175722-4570-4a78-d952-1889fe32afba",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"trusted": false,
"_uuid": "e595736918e424e7a68ec7aa9fe928e3f6a42abe"
},
"cell_type": "code",
"source": "sns.distplot(process_sample['tank_level_caustic'] - process_sample['tank_level_acid'])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "h0n6DOpu-QgV",
"colab_type": "code",
"outputId": "b0be6b9a-7e64-40a2-b011-fbab380aa009",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 415
},
"trusted": false,
"_uuid": "d52ac0eaa00a4a684a940b60d680b2598bbf3e89"
},
"cell_type": "code",
"source": "sns.distplot(process_sample['tank_level_acid'] - process_sample['tank_level_clean_water'])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "oExLUVyf7y9M",
"colab_type": "code",
"outputId": "1c5f722f-e90f-4b13-9df6-858658d4fa29",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 439
},
"trusted": false,
"_uuid": "ec3caad08e6a1dc0962d8253db00140bc31636ad"
},
"cell_type": "code",
"source": "fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(18,6))\nsns.distplot(process_sample.tank_temperature_pre_rinse, ax=ax[0])\nsns.distplot(process_sample.tank_temperature_caustic,ax=ax[1])\nsns.distplot(process_sample.tank_temperature_acid,ax=ax[2])\nplt.show()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "L_vZdJRO7y6k",
"colab_type": "code",
"outputId": "e31a8263-937d-402d-dda9-d0596c3fb461",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "af64e84d844de2ecfb90b7e6f107f6b1c17e4dd5"
},
"cell_type": "code",
"source": "fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(18,6))\nsns.distplot(process_sample.tank_temperature_pre_rinse - process_sample.tank_temperature_caustic, ax=ax[0])\nsns.distplot(process_sample.tank_temperature_caustic - process_sample.tank_temperature_acid,ax=ax[1])\nplt.show()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "MZ83FoOw7y3i",
"colab_type": "code",
"outputId": "aef66784-56e7-4b87-ba7d-2c35b5d419bb",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 439
},
"trusted": false,
"_uuid": "940b8749ad2d6b772ade97194deead67c237e112"
},
"cell_type": "code",
"source": "fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(18,6))\nsns.distplot(process_sample.tank_concentration_caustic, ax=ax[0])\nsns.distplot(process_sample.tank_concentration_acid,ax=ax[1])\nsns.distplot(process_sample.tank_concentration_caustic - process_sample.tank_concentration_acid,ax=ax[2])\nplt.show()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "soU6BVY97y1P",
"colab_type": "code",
"outputId": "1f5a0e34-b455-4f08-afa6-b216f5e8acc7",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 531
},
"trusted": false,
"_uuid": "197d9a0804e0e1cb0c3d93d2b148d561d3d1f0bd"
},
"cell_type": "code",
"source": "fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(12,8))\nsns.countplot(process_sample.tank_lsh_pre_rinse, ax=ax[0][0])\nsns.countplot(process_sample.tank_lsh_caustic, ax=ax[0][1])\nsns.countplot(process_sample.tank_lsh_acid, ax=ax[1][0])\nsns.countplot(process_sample.tank_lsh_clean_water, ax=ax[1][1])\nplt.show()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "xZH6bEOZBGSH",
"colab_type": "code",
"outputId": "20c712cd-d541-4e3e-b062-1c7df593c388",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 412
},
"trusted": false,
"_uuid": "a91593c82ac2caf5f243cbd640b8fa469d5e7d64"
},
"cell_type": "code",
"source": "sns.countplot(process_sample.target_time_period)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "JIl5f5Qq1iNU",
"colab_type": "code",
"outputId": "df851ff1-a579-4799-ad13-95512f0545a8",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1258
},
"trusted": false,
"_uuid": "9fe9ca7e7d02a2ea10fac2e998f45efa652bc585"
},
"cell_type": "code",
"source": "print(df_train.isna().any(), df_test.isna().any())",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "fTs6FODd1kDG",
"colab_type": "code",
"outputId": "f1c34fda-6637-467c-d63f-8a2ff3b3b42f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"trusted": false,
"_uuid": "f3acca3d3a780224e8510e77e499de7f7144f932"
},
"cell_type": "code",
"source": "df_labels.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "fBIv-Lxk2zM0",
"colab_type": "code",
"outputId": "fe1f1d47-26bd-4aa8-e807-7c77bfcba126",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 440
},
"trusted": false,
"_uuid": "fea36174ce3f1a5938340516e05d5bf82a795dbc"
},
"cell_type": "code",
"source": "sns.distplot(df_labels.final_rinse_total_turbidity_liter)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "eXHW32EW3Bg6",
"colab_type": "code",
"outputId": "8266b707-582a-4b39-bb11-5c06fa0acf1c",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "f1fffc6716af7bec2f16da32b3c2b76e59bfe726"
},
"cell_type": "code",
"source": "sns.distplot(np.log(df_labels.final_rinse_total_turbidity_liter))",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "2b11Acp53NAv",
"colab_type": "code",
"outputId": "155833ce-1b97-439f-c888-8c451689db28",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 423
},
"trusted": false,
"_uuid": "e1dbbed862406e62489d905634c963b991209c7a"
},
"cell_type": "code",
"source": "df_train.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "DLsE-LRk3tPu",
"colab_type": "code",
"outputId": "1ca3fcbf-a3e2-49b8-b147-e32f89939f87",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"trusted": false,
"_uuid": "3936c97cc354f9532af0db378908cafd50f40d94"
},
"cell_type": "code",
"source": "sns.distplot(df_train.groupby('process_id').count().row_id)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "9_yPiyJ-316O",
"colab_type": "code",
"outputId": "147ff747-9dc7-46b2-becd-4f46128daf6b",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"trusted": false,
"_uuid": "4d04a48345f13db48d6e573360c039956594bab5"
},
"cell_type": "code",
"source": "df_train.groupby('process_id').count().max()[0], df_train.groupby('process_id').count().min()[0], df_train.groupby('process_id').count().mean()[0]",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "1tAjPP4hrY4V",
"colab_type": "text",
"_uuid": "26a4f12df532d266ed0bfc0ac24a9ef516bb4e4e"
},
"cell_type": "markdown",
"source": "## Feature Engineering"
},
{
"metadata": {
"id": "K5E0PN8O3qUt",
"colab_type": "code",
"outputId": "d53bf399-ad7c-4f72-9d15-9746c9f41330",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"trusted": true,
"_uuid": "ea38a076f9141e681fad95b43451a81de4d7d180"
},
"cell_type": "code",
"source": "gc.collect()",
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 10,
"data": {
"text/plain": "0"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "T4ORaJ6i6bX_",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "86b28b978507391e3f9234a54402ed6f72e48442"
},
"cell_type": "code",
"source": "#df_labels.final_rinse_total_turbidity_liter = np.log(df_labels.final_rinse_total_turbidity_liter)",
"execution_count": 11,
"outputs": []
},
{
"metadata": {
"id": "K-SNw9EKcdWC",
"colab_type": "code",
"outputId": "d1a5fd07-9d16-42e3-9317-d24c79707d27",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"trusted": true,
"_uuid": "9d0df8b4d0ab8254f6a1e0c3b3d506e50a91f9e7"
},
"cell_type": "code",
"source": "df_labels.head()",
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 12,
"data": {
"text/plain": " process_id final_rinse_total_turbidity_liter\n0 20001 4.318275e+06\n1 20002 4.375286e+05\n2 20003 4.271977e+05\n3 20004 7.197830e+05\n4 20005 4.133107e+05",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>process_id</th>\n <th>final_rinse_total_turbidity_liter</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>20001</td>\n <td>4.318275e+06</td>\n </tr>\n <tr>\n <th>1</th>\n <td>20002</td>\n <td>4.375286e+05</td>\n </tr>\n <tr>\n <th>2</th>\n <td>20003</td>\n <td>4.271977e+05</td>\n </tr>\n <tr>\n <th>3</th>\n <td>20004</td>\n <td>7.197830e+05</td>\n </tr>\n <tr>\n <th>4</th>\n <td>20005</td>\n <td>4.133107e+05</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "I8vJtlOpxTcR",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "156a9ba974d55c07d2d4e8d14f29bf63cee7a555"
},
"cell_type": "code",
"source": "df_train.return_temperature = np.square(df_train.return_temperature)\ndf_test.return_temperature = np.square(df_test.return_temperature)",
"execution_count": 13,
"outputs": []
},
{
"metadata": {
"id": "5V_7N4qIzhO2",
"colab_type": "code",
"outputId": "5053c840-e200-47b5-d8f2-199d0512ddaf",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"trusted": true,
"_uuid": "45b11fa955d66ae7cec55642b1687acf7d6723c3"
},
"cell_type": "code",
"source": "df_test.return_turbidity.min(), df_train.return_turbidity.min()",
"execution_count": 14,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 14,
"data": {
"text/plain": "(-0.06872106, -0.36168979999999995)"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "F8xBwKCAx8zk",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "0966c68eb83eeb287e393f7159da3160692c03d3"
},
"cell_type": "code",
"source": "df_train.return_turbidity = np.log(df_train.return_turbidity + 1)\ndf_test.return_turbidity = np.log(df_test.return_turbidity + 1)",
"execution_count": 15,
"outputs": []
},
{
"metadata": {
"id": "9nuwEYocyAuo",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "5db479c627270e982067678f736254fb2c79c2ce"
},
"cell_type": "code",
"source": "df_train['tank_level_diff12'] = df_train['tank_level_pre_rinse'] - df_train['tank_level_caustic']\ndf_train['tank_level_diff23'] = df_train['tank_level_caustic'] - df_train['tank_level_acid']\ndf_train['tank_level_diff34'] = df_train['tank_level_acid'] - df_train['tank_level_clean_water']\ndf_test['tank_level_diff12'] = df_test['tank_level_pre_rinse'] - df_test['tank_level_caustic']\ndf_test['tank_level_diff23'] = df_test['tank_level_caustic'] - df_test['tank_level_acid']\ndf_test['tank_level_diff34'] = df_test['tank_level_acid'] - df_test['tank_level_clean_water']",
"execution_count": 16,
"outputs": []
},
{
"metadata": {
"id": "-44WEH5l2HuN",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "caefb89e2cfb4bb3e2cf1ebdc3a29347f597cbbc"
},
"cell_type": "code",
"source": "df_train.drop(['tank_level_pre_rinse', 'tank_level_caustic', 'tank_level_acid', 'tank_level_clean_water'], axis=1, inplace=True)\ndf_test.drop(['tank_level_pre_rinse', 'tank_level_caustic', 'tank_level_acid', 'tank_level_clean_water'], axis=1, inplace=True)",
"execution_count": 17,
"outputs": []
},
{
"metadata": {
"id": "nZj9eU981GsW",
"colab_type": "code",
"outputId": "66e383a3-e330-4105-f189-b49a4db1b260",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 423
},
"trusted": true,
"_uuid": "11664523eec52be6fb6f4f19447f9104b08f9d12"
},
"cell_type": "code",
"source": "df_test.head()",
"execution_count": 18,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 18,
"data": {
"text/plain": " row_id process_id ... tank_level_diff23 tank_level_diff34\n0 0 20000 ... -3.693394 2.790615\n1 1 20000 ... -3.719260 2.774164\n2 2 20000 ... -3.695744 2.788265\n3 3 20000 ... -3.665190 2.807077\n4 4 20000 ... -3.599354 2.783566\n\n[5 rows x 35 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>row_id</th>\n <th>process_id</th>\n <th>object_id</th>\n <th>phase</th>\n <th>timestamp</th>\n <th>pipeline</th>\n <th>supply_flow</th>\n <th>supply_pressure</th>\n <th>return_temperature</th>\n <th>return_conductivity</th>\n <th>return_turbidity</th>\n <th>return_flow</th>\n <th>supply_pump</th>\n <th>supply_pre_rinse</th>\n <th>supply_caustic</th>\n <th>return_caustic</th>\n <th>supply_acid</th>\n <th>return_acid</th>\n <th>supply_clean_water</th>\n <th>return_recovery_water</th>\n <th>return_drain</th>\n <th>object_low_level</th>\n <th>tank_temperature_pre_rinse</th>\n <th>tank_temperature_caustic</th>\n <th>tank_temperature_acid</th>\n <th>tank_concentration_caustic</th>\n <th>tank_concentration_acid</th>\n <th>tank_lsh_caustic</th>\n <th>tank_lsh_acid</th>\n <th>tank_lsh_clean_water</th>\n <th>tank_lsh_pre_rinse</th>\n <th>target_time_period</th>\n <th>tank_level_diff12</th>\n <th>tank_level_diff23</th>\n <th>tank_level_diff34</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>20000</td>\n <td>427</td>\n <td>pre_rinse</td>\n <td>2018-04-30 21:39:21</td>\n <td>L4</td>\n <td>17039.207</td>\n <td>0.480035</td>\n <td>188.406617</td>\n <td>0.337567</td>\n <td>0.017923</td>\n <td>1580.58440</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>29.922598</td>\n <td>82.94994</td>\n <td>72.526050</td>\n <td>45.378080</td>\n <td>45.124700</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>12.939816</td>\n <td>-3.693394</td>\n <td>2.790615</td>\n </tr>\n <tr>\n 
<th>1</th>\n <td>1</td>\n <td>20000</td>\n <td>427</td>\n <td>pre_rinse</td>\n <td>2018-04-30 21:39:23</td>\n <td>L4</td>\n <td>29390.912</td>\n <td>0.554253</td>\n <td>191.897851</td>\n <td>0.335876</td>\n <td>0.025003</td>\n <td>846.35420</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>29.922598</td>\n <td>83.01143</td>\n <td>72.526050</td>\n <td>45.385216</td>\n <td>45.125390</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>12.960977</td>\n <td>-3.719260</td>\n <td>2.774164</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2</td>\n <td>20000</td>\n <td>427</td>\n <td>pre_rinse</td>\n <td>2018-04-30 21:39:25</td>\n <td>L4</td>\n <td>24323.640</td>\n <td>0.657118</td>\n <td>194.310385</td>\n <td>0.335706</td>\n <td>0.025003</td>\n <td>455.72916</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>29.922598</td>\n <td>83.07292</td>\n <td>72.529655</td>\n <td>45.383460</td>\n <td>45.125343</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>12.918656</td>\n <td>-3.695744</td>\n <td>2.788265</td>\n </tr>\n <tr>\n <th>3</th>\n <td>3</td>\n <td>20000</td>\n <td>427</td>\n <td>pre_rinse</td>\n <td>2018-04-30 21:39:27</td>\n <td>L4</td>\n <td>17180.266</td>\n <td>0.749132</td>\n <td>196.028393</td>\n <td>0.335571</td>\n <td>0.025003</td>\n <td>217.01390</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>29.944302</td>\n <td>83.07292</td>\n <td>72.526050</td>\n <td>45.375385</td>\n <td>45.125360</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>12.859887</td>\n <td>-3.665190</td>\n 
<td>2.807077</td>\n </tr>\n <tr>\n <th>4</th>\n <td>4</td>\n <td>20000</td>\n <td>427</td>\n <td>pre_rinse</td>\n <td>2018-04-30 21:39:29</td>\n <td>L4</td>\n <td>11754.919</td>\n <td>0.795139</td>\n <td>196.028393</td>\n <td>0.335509</td>\n <td>0.028524</td>\n <td>115.74074</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>29.922598</td>\n <td>83.08739</td>\n <td>72.529655</td>\n <td>45.374237</td>\n <td>45.121326</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>12.808159</td>\n <td>-3.599354</td>\n <td>2.783566</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "W7-JNx4X196C",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "a75a4ac72f9b34f9fe9c2b5dc7258a3150171806"
},
"cell_type": "code",
"source": "df_train['tank_temp_diff12'] = df_train['tank_temperature_pre_rinse'] - df_train['tank_temperature_caustic']\ndf_train['tank_temp_diff23'] = df_train['tank_temperature_caustic'] - df_train['tank_temperature_acid']\ndf_test['tank_temp_diff12'] = df_test['tank_temperature_pre_rinse'] - df_test['tank_temperature_caustic']\ndf_test['tank_temp_diff23'] = df_test['tank_temperature_caustic'] - df_test['tank_temperature_caustic']",
"execution_count": 19,
"outputs": []
},
{
"metadata": {
"id": "n8BGdaBKhn3c",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "aeef202b6857974278117fca7463db685a7d5899"
},
"cell_type": "code",
"source": "df_train.drop(['tank_temperature_pre_rinse', 'tank_temperature_caustic', 'tank_temperature_acid'], axis=1, inplace=True)\ndf_test.drop(['tank_temperature_pre_rinse', 'tank_temperature_caustic', 'tank_temperature_acid'], axis=1, inplace=True)",
"execution_count": 20,
"outputs": []
},
{
"metadata": {
"id": "Fn5PMgWUhp7f",
"colab_type": "code",
"outputId": "a082fdff-b186-4f8b-d9cb-c00209bbb469",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 423
},
"trusted": true,
"_uuid": "3dda70a287c0a7275115ccb7f8544c349d71e1e7"
},
"cell_type": "code",
"source": "df_train.head()",
"execution_count": 21,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 21,
"data": {
"text/plain": " row_id process_id ... tank_temp_diff12 tank_temp_diff23\n0 0 20001 ... -50.651042 10.004340\n1 1 20001 ... -50.629337 9.982635\n2 2 20001 ... -50.629337 9.982635\n3 3 20001 ... -50.651042 10.004340\n4 4 20001 ... -50.629337 9.982635\n\n[5 rows x 34 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>row_id</th>\n <th>process_id</th>\n <th>object_id</th>\n <th>phase</th>\n <th>timestamp</th>\n <th>pipeline</th>\n <th>supply_flow</th>\n <th>supply_pressure</th>\n <th>return_temperature</th>\n <th>return_conductivity</th>\n <th>return_turbidity</th>\n <th>return_flow</th>\n <th>supply_pump</th>\n <th>supply_pre_rinse</th>\n <th>supply_caustic</th>\n <th>return_caustic</th>\n <th>supply_acid</th>\n <th>return_acid</th>\n <th>supply_clean_water</th>\n <th>return_recovery_water</th>\n <th>return_drain</th>\n <th>object_low_level</th>\n <th>tank_concentration_caustic</th>\n <th>tank_concentration_acid</th>\n <th>tank_lsh_caustic</th>\n <th>tank_lsh_acid</th>\n <th>tank_lsh_clean_water</th>\n <th>tank_lsh_pre_rinse</th>\n <th>target_time_period</th>\n <th>tank_level_diff12</th>\n <th>tank_level_diff23</th>\n <th>tank_level_diff34</th>\n <th>tank_temp_diff12</th>\n <th>tank_temp_diff23</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n <td>2018-04-15 04:20:47</td>\n <td>L4</td>\n <td>8550.348</td>\n <td>0.615451</td>\n <td>325.611342</td>\n <td>4.990765</td>\n <td>0.163163</td>\n <td>15776.9100</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.394646</td>\n <td>44.340126</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>13.943680</td>\n <td>-2.470883</td>\n <td>-5.447227</td>\n <td>-50.651042</td>\n <td>10.004340</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>20001</td>\n <td>405</td>\n 
<td>pre_rinse</td>\n <td>2018-04-15 04:20:49</td>\n <td>L4</td>\n <td>11364.294</td>\n <td>0.654297</td>\n <td>332.302566</td>\n <td>3.749680</td>\n <td>0.115981</td>\n <td>13241.4640</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.394447</td>\n <td>44.339380</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>13.863750</td>\n <td>-2.421515</td>\n <td>-5.411960</td>\n <td>-50.629337</td>\n <td>9.982635</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n <td>2018-04-15 04:20:51</td>\n <td>L4</td>\n <td>12174.479</td>\n <td>0.699870</td>\n <td>338.396039</td>\n <td>2.783954</td>\n <td>0.327149</td>\n <td>10698.7850</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.396280</td>\n <td>44.336735</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>13.837891</td>\n <td>-2.407410</td>\n <td>-5.416665</td>\n <td>-50.629337</td>\n <td>9.982635</td>\n </tr>\n <tr>\n <th>3</th>\n <td>3</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n <td>2018-04-15 04:20:53</td>\n <td>L4</td>\n <td>13436.776</td>\n <td>0.761502</td>\n <td>345.351007</td>\n <td>1.769353</td>\n <td>0.193424</td>\n <td>8007.8125</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.401875</td>\n <td>44.333110</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>13.823791</td>\n <td>-2.400355</td>\n <td>-5.414320</td>\n <td>-50.651042</td>\n <td>10.004340</td>\n </tr>\n <tr>\n <th>4</th>\n <td>4</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n 
<td>2018-04-15 04:20:55</td>\n <td>L4</td>\n <td>13776.766</td>\n <td>0.837240</td>\n <td>346.966098</td>\n <td>0.904020</td>\n <td>0.138276</td>\n <td>6004.0510</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.398197</td>\n <td>44.334373</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>13.804975</td>\n <td>-2.393300</td>\n <td>-5.414320</td>\n <td>-50.629337</td>\n <td>9.982635</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "4g_KMx0Ajfnv",
"colab_type": "text",
"_uuid": "f70a0ab2f187821490ed835d94200f49a1fb538d"
},
"cell_type": "markdown",
"source": "## Data Generation"
},
{
"metadata": {
"id": "Jxn3jFCDi0BZ",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "f74e42ede2379ae7a03ccfae15641d3c6227177a"
},
"cell_type": "code",
"source": "trainy = df_train['target_time_period']\ntrainx = df_train.drop(['target_time_period'], axis=1)\ntesty = df_test['target_time_period']\ntestx = df_test.drop(['target_time_period'], axis=1)",
"execution_count": 22,
"outputs": []
},
{
"metadata": {
"id": "w_BkVYDnjxdF",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "9df730f93dd228b4b481ee2f9953b3b2572fa4a6"
},
"cell_type": "code",
"source": "trainy = trainy.values * 1",
"execution_count": 23,
"outputs": []
},
{
"metadata": {
"id": "5BdfRPBIEoVe",
"colab_type": "code",
"outputId": "689ece2a-1d88-43fc-abd5-9bb4c92953e1",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"trusted": true,
"_uuid": "8ee967b8d8ffd61d04506746781ca7382cd15097"
},
"cell_type": "code",
"source": "del df_train\ndel df_test\ngc.collect()",
"execution_count": 24,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 24,
"data": {
"text/plain": "278"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "-YT_H_2-m3Wf",
"colab_type": "code",
"outputId": "4e0b1f58-b0ad-499a-de28-e2c2f15e4a68",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 423
},
"trusted": true,
"_uuid": "7497aee8668b61a2b1afbdb05e31942f8a90f192"
},
"cell_type": "code",
"source": "trainx.head()",
"execution_count": 25,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 25,
"data": {
"text/plain": " row_id process_id ... tank_temp_diff12 tank_temp_diff23\n0 0 20001 ... -50.651042 10.004340\n1 1 20001 ... -50.629337 9.982635\n2 2 20001 ... -50.629337 9.982635\n3 3 20001 ... -50.651042 10.004340\n4 4 20001 ... -50.629337 9.982635\n\n[5 rows x 33 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>row_id</th>\n <th>process_id</th>\n <th>object_id</th>\n <th>phase</th>\n <th>timestamp</th>\n <th>pipeline</th>\n <th>supply_flow</th>\n <th>supply_pressure</th>\n <th>return_temperature</th>\n <th>return_conductivity</th>\n <th>return_turbidity</th>\n <th>return_flow</th>\n <th>supply_pump</th>\n <th>supply_pre_rinse</th>\n <th>supply_caustic</th>\n <th>return_caustic</th>\n <th>supply_acid</th>\n <th>return_acid</th>\n <th>supply_clean_water</th>\n <th>return_recovery_water</th>\n <th>return_drain</th>\n <th>object_low_level</th>\n <th>tank_concentration_caustic</th>\n <th>tank_concentration_acid</th>\n <th>tank_lsh_caustic</th>\n <th>tank_lsh_acid</th>\n <th>tank_lsh_clean_water</th>\n <th>tank_lsh_pre_rinse</th>\n <th>tank_level_diff12</th>\n <th>tank_level_diff23</th>\n <th>tank_level_diff34</th>\n <th>tank_temp_diff12</th>\n <th>tank_temp_diff23</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n <td>2018-04-15 04:20:47</td>\n <td>L4</td>\n <td>8550.348</td>\n <td>0.615451</td>\n <td>325.611342</td>\n <td>4.990765</td>\n <td>0.163163</td>\n <td>15776.9100</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.394646</td>\n <td>44.340126</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>13.943680</td>\n <td>-2.470883</td>\n <td>-5.447227</td>\n <td>-50.651042</td>\n <td>10.004340</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n <td>2018-04-15 04:20:49</td>\n 
<td>L4</td>\n <td>11364.294</td>\n <td>0.654297</td>\n <td>332.302566</td>\n <td>3.749680</td>\n <td>0.115981</td>\n <td>13241.4640</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.394447</td>\n <td>44.339380</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>13.863750</td>\n <td>-2.421515</td>\n <td>-5.411960</td>\n <td>-50.629337</td>\n <td>9.982635</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n <td>2018-04-15 04:20:51</td>\n <td>L4</td>\n <td>12174.479</td>\n <td>0.699870</td>\n <td>338.396039</td>\n <td>2.783954</td>\n <td>0.327149</td>\n <td>10698.7850</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.396280</td>\n <td>44.336735</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>13.837891</td>\n <td>-2.407410</td>\n <td>-5.416665</td>\n <td>-50.629337</td>\n <td>9.982635</td>\n </tr>\n <tr>\n <th>3</th>\n <td>3</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n <td>2018-04-15 04:20:53</td>\n <td>L4</td>\n <td>13436.776</td>\n <td>0.761502</td>\n <td>345.351007</td>\n <td>1.769353</td>\n <td>0.193424</td>\n <td>8007.8125</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.401875</td>\n <td>44.333110</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>13.823791</td>\n <td>-2.400355</td>\n <td>-5.414320</td>\n <td>-50.651042</td>\n <td>10.004340</td>\n </tr>\n <tr>\n <th>4</th>\n <td>4</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n <td>2018-04-15 04:20:55</td>\n <td>L4</td>\n <td>13776.766</td>\n <td>0.837240</td>\n <td>346.966098</td>\n 
<td>0.904020</td>\n <td>0.138276</td>\n <td>6004.0510</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.398197</td>\n <td>44.334373</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>13.804975</td>\n <td>-2.393300</td>\n <td>-5.414320</td>\n <td>-50.629337</td>\n <td>9.982635</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "CKSIZBqCq7Af",
"colab_type": "code",
"outputId": "a60a0383-6344-4ba1-a359-a43931dbaa27",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 663
},
"trusted": true,
"_uuid": "b1b0eada99cca464098414c2ce08371b1d92e003"
},
"cell_type": "code",
"source": "trainx.info()",
"execution_count": 26,
"outputs": [
{
"output_type": "stream",
"text": "<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 5987820 entries, 0 to 5987819\nData columns (total 33 columns):\nrow_id int64\nprocess_id int64\nobject_id int64\nphase object\ntimestamp object\npipeline object\nsupply_flow float64\nsupply_pressure float64\nreturn_temperature float64\nreturn_conductivity float64\nreturn_turbidity float64\nreturn_flow float64\nsupply_pump bool\nsupply_pre_rinse bool\nsupply_caustic bool\nreturn_caustic bool\nsupply_acid bool\nreturn_acid bool\nsupply_clean_water bool\nreturn_recovery_water bool\nreturn_drain bool\nobject_low_level bool\ntank_concentration_caustic float64\ntank_concentration_acid float64\ntank_lsh_caustic bool\ntank_lsh_acid float64\ntank_lsh_clean_water bool\ntank_lsh_pre_rinse float64\ntank_level_diff12 float64\ntank_level_diff23 float64\ntank_level_diff34 float64\ntank_temp_diff12 float64\ntank_temp_diff23 float64\ndtypes: bool(12), float64(15), int64(3), object(3)\nmemory usage: 1.0+ GB\n",
"name": "stdout"
}
]
},
{
"metadata": {
"id": "0aa0IJaoq69j",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "61fadb15211fc34b23f3b679a346480d9d70d462"
},
"cell_type": "code",
"source": "trainx.loc[:, trainx.dtypes == bool] = trainx.loc[:, trainx.dtypes == bool].astype('int')",
"execution_count": 27,
"outputs": []
},
{
"metadata": {
"id": "V_Ho2yrXq660",
"colab_type": "code",
"outputId": "437a5caf-34ba-4f06-83b1-f74983ef741e",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 663
},
"trusted": true,
"_uuid": "c86234ec9c50450785f30852c877234badab90ff"
},
"cell_type": "code",
"source": "trainx.info()",
"execution_count": 28,
"outputs": [
{
"output_type": "stream",
"text": "<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 5987820 entries, 0 to 5987819\nData columns (total 33 columns):\nrow_id int64\nprocess_id int64\nobject_id int64\nphase object\ntimestamp object\npipeline object\nsupply_flow float64\nsupply_pressure float64\nreturn_temperature float64\nreturn_conductivity float64\nreturn_turbidity float64\nreturn_flow float64\nsupply_pump int64\nsupply_pre_rinse int64\nsupply_caustic int64\nreturn_caustic int64\nsupply_acid int64\nreturn_acid int64\nsupply_clean_water int64\nreturn_recovery_water int64\nreturn_drain int64\nobject_low_level int64\ntank_concentration_caustic float64\ntank_concentration_acid float64\ntank_lsh_caustic int64\ntank_lsh_acid float64\ntank_lsh_clean_water int64\ntank_lsh_pre_rinse float64\ntank_level_diff12 float64\ntank_level_diff23 float64\ntank_level_diff34 float64\ntank_temp_diff12 float64\ntank_temp_diff23 float64\ndtypes: float64(15), int64(15), object(3)\nmemory usage: 1.5+ GB\n",
"name": "stdout"
}
]
},
{
"metadata": {
"id": "GVirhdwS1yHZ",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "8370608b615f1edf10095d5a04b00867b870116f"
},
"cell_type": "code",
"source": "trainx.timestamp = (pd.to_datetime(trainx.timestamp, format=\"%Y-%m-%d %H:%M:%S\") - pd.datetime.now()).dt.total_seconds()",
"execution_count": 29,
"outputs": []
},
{
"metadata": {
"id": "V29mS62kq635",
"colab_type": "code",
"outputId": "2452bac1-b6d8-44d5-8348-56145d8674e8",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"trusted": true,
"_uuid": "74a22a80c65753dd1169b85eb1332736335c9540"
},
"cell_type": "code",
"source": "pids = trainx.process_id.unique()\nfor pid in tqdm(pids):\n process_sample = trainx[trainx.process_id == pid].timestamp\n process_sample = process_sample -process_sample.min()\n trainx.loc[trainx.process_id==pid, 'timestamp'] = process_sample",
"execution_count": 30,
"outputs": [
{
"output_type": "stream",
"text": "100%|██████████| 5021/5021 [06:04<00:00, 13.79it/s]\n",
"name": "stderr"
}
]
},
{
"metadata": {
"id": "zS8dKgPiwF0U",
"colab_type": "code",
"outputId": "4edc6ca6-7e66-410d-e62b-e6f2916b166f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 253
},
"trusted": true,
"_uuid": "53f1492c58003e94b4047922774a03e560dbff51"
},
"cell_type": "code",
"source": "trainx.head()",
"execution_count": 31,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 31,
"data": {
"text/plain": " row_id process_id ... tank_temp_diff12 tank_temp_diff23\n0 0 20001 ... -50.651042 10.004340\n1 1 20001 ... -50.629337 9.982635\n2 2 20001 ... -50.629337 9.982635\n3 3 20001 ... -50.651042 10.004340\n4 4 20001 ... -50.629337 9.982635\n\n[5 rows x 33 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>row_id</th>\n <th>process_id</th>\n <th>object_id</th>\n <th>phase</th>\n <th>timestamp</th>\n <th>pipeline</th>\n <th>supply_flow</th>\n <th>supply_pressure</th>\n <th>return_temperature</th>\n <th>return_conductivity</th>\n <th>return_turbidity</th>\n <th>return_flow</th>\n <th>supply_pump</th>\n <th>supply_pre_rinse</th>\n <th>supply_caustic</th>\n <th>return_caustic</th>\n <th>supply_acid</th>\n <th>return_acid</th>\n <th>supply_clean_water</th>\n <th>return_recovery_water</th>\n <th>return_drain</th>\n <th>object_low_level</th>\n <th>tank_concentration_caustic</th>\n <th>tank_concentration_acid</th>\n <th>tank_lsh_caustic</th>\n <th>tank_lsh_acid</th>\n <th>tank_lsh_clean_water</th>\n <th>tank_lsh_pre_rinse</th>\n <th>tank_level_diff12</th>\n <th>tank_level_diff23</th>\n <th>tank_level_diff34</th>\n <th>tank_temp_diff12</th>\n <th>tank_temp_diff23</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n <td>0.0</td>\n <td>L4</td>\n <td>8550.348</td>\n <td>0.615451</td>\n <td>325.611342</td>\n <td>4.990765</td>\n <td>0.163163</td>\n <td>15776.9100</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>45.394646</td>\n <td>44.340126</td>\n <td>0</td>\n <td>0.0</td>\n <td>0</td>\n <td>0.0</td>\n <td>13.943680</td>\n <td>-2.470883</td>\n <td>-5.447227</td>\n <td>-50.651042</td>\n <td>10.004340</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n <td>2.0</td>\n <td>L4</td>\n <td>11364.294</td>\n <td>0.654297</td>\n <td>332.302566</td>\n 
<td>3.749680</td>\n <td>0.115981</td>\n <td>13241.4640</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>45.394447</td>\n <td>44.339380</td>\n <td>0</td>\n <td>0.0</td>\n <td>0</td>\n <td>0.0</td>\n <td>13.863750</td>\n <td>-2.421515</td>\n <td>-5.411960</td>\n <td>-50.629337</td>\n <td>9.982635</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n <td>4.0</td>\n <td>L4</td>\n <td>12174.479</td>\n <td>0.699870</td>\n <td>338.396039</td>\n <td>2.783954</td>\n <td>0.327149</td>\n <td>10698.7850</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>45.396280</td>\n <td>44.336735</td>\n <td>0</td>\n <td>0.0</td>\n <td>0</td>\n <td>0.0</td>\n <td>13.837891</td>\n <td>-2.407410</td>\n <td>-5.416665</td>\n <td>-50.629337</td>\n <td>9.982635</td>\n </tr>\n <tr>\n <th>3</th>\n <td>3</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n <td>6.0</td>\n <td>L4</td>\n <td>13436.776</td>\n <td>0.761502</td>\n <td>345.351007</td>\n <td>1.769353</td>\n <td>0.193424</td>\n <td>8007.8125</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>45.401875</td>\n <td>44.333110</td>\n <td>0</td>\n <td>0.0</td>\n <td>0</td>\n <td>0.0</td>\n <td>13.823791</td>\n <td>-2.400355</td>\n <td>-5.414320</td>\n <td>-50.651042</td>\n <td>10.004340</td>\n </tr>\n <tr>\n <th>4</th>\n <td>4</td>\n <td>20001</td>\n <td>405</td>\n <td>pre_rinse</td>\n <td>8.0</td>\n <td>L4</td>\n <td>13776.766</td>\n <td>0.837240</td>\n <td>346.966098</td>\n <td>0.904020</td>\n <td>0.138276</td>\n <td>6004.0510</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>45.398197</td>\n <td>44.334373</td>\n <td>0</td>\n 
<td>0.0</td>\n <td>0</td>\n <td>0.0</td>\n <td>13.804975</td>\n <td>-2.393300</td>\n <td>-5.414320</td>\n <td>-50.629337</td>\n <td>9.982635</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "SdkG9c31xJVn",
"colab_type": "code",
"outputId": "63b09a6f-819e-436d-e5a6-900407b47f9d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"trusted": true,
"_uuid": "9d62387eb6314c582291d2a5388575a3f44eb6da"
},
"cell_type": "code",
"source": "pids = testx.process_id.unique()\ntestx.timestamp = (pd.to_datetime(testx.timestamp, format=\"%Y-%m-%d %H:%M:%S\") - pd.datetime.now()).dt.total_seconds()\nfor pid in tqdm(pids):\n process_sample = testx[testx.process_id == pid].timestamp\n process_sample = process_sample - process_sample.min()\n testx.loc[testx.process_id==pid, 'timestamp'] = process_sample",
"execution_count": 32,
"outputs": [
{
"output_type": "stream",
"text": "100%|██████████| 2967/2967 [01:21<00:00, 36.57it/s]\n",
"name": "stderr"
}
]
},
{
"metadata": {
"id": "76vmXSWRzQz8",
"colab_type": "code",
"outputId": "b1e3d669-6596-40b7-97c7-3639fedcc558",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 253
},
"trusted": true,
"_uuid": "3302ba234933fd86472163722920747617c2a134"
},
"cell_type": "code",
"source": "testx.head()",
"execution_count": 33,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 33,
"data": {
"text/plain": " row_id process_id ... tank_temp_diff12 tank_temp_diff23\n0 0 20000 ... -53.027342 0.0\n1 1 20000 ... -53.088832 0.0\n2 2 20000 ... -53.150322 0.0\n3 3 20000 ... -53.128618 0.0\n4 4 20000 ... -53.164792 0.0\n\n[5 rows x 33 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>row_id</th>\n <th>process_id</th>\n <th>object_id</th>\n <th>phase</th>\n <th>timestamp</th>\n <th>pipeline</th>\n <th>supply_flow</th>\n <th>supply_pressure</th>\n <th>return_temperature</th>\n <th>return_conductivity</th>\n <th>return_turbidity</th>\n <th>return_flow</th>\n <th>supply_pump</th>\n <th>supply_pre_rinse</th>\n <th>supply_caustic</th>\n <th>return_caustic</th>\n <th>supply_acid</th>\n <th>return_acid</th>\n <th>supply_clean_water</th>\n <th>return_recovery_water</th>\n <th>return_drain</th>\n <th>object_low_level</th>\n <th>tank_concentration_caustic</th>\n <th>tank_concentration_acid</th>\n <th>tank_lsh_caustic</th>\n <th>tank_lsh_acid</th>\n <th>tank_lsh_clean_water</th>\n <th>tank_lsh_pre_rinse</th>\n <th>tank_level_diff12</th>\n <th>tank_level_diff23</th>\n <th>tank_level_diff34</th>\n <th>tank_temp_diff12</th>\n <th>tank_temp_diff23</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>20000</td>\n <td>427</td>\n <td>pre_rinse</td>\n <td>0.0</td>\n <td>L4</td>\n <td>17039.207</td>\n <td>0.480035</td>\n <td>188.406617</td>\n <td>0.337567</td>\n <td>0.017923</td>\n <td>1580.58440</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.378080</td>\n <td>45.124700</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>12.939816</td>\n <td>-3.693394</td>\n <td>2.790615</td>\n <td>-53.027342</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>20000</td>\n <td>427</td>\n <td>pre_rinse</td>\n <td>2.0</td>\n <td>L4</td>\n <td>29390.912</td>\n 
<td>0.554253</td>\n <td>191.897851</td>\n <td>0.335876</td>\n <td>0.025003</td>\n <td>846.35420</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.385216</td>\n <td>45.125390</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>12.960977</td>\n <td>-3.719260</td>\n <td>2.774164</td>\n <td>-53.088832</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2</td>\n <td>20000</td>\n <td>427</td>\n <td>pre_rinse</td>\n <td>4.0</td>\n <td>L4</td>\n <td>24323.640</td>\n <td>0.657118</td>\n <td>194.310385</td>\n <td>0.335706</td>\n <td>0.025003</td>\n <td>455.72916</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.383460</td>\n <td>45.125343</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>12.918656</td>\n <td>-3.695744</td>\n <td>2.788265</td>\n <td>-53.150322</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>3</td>\n <td>20000</td>\n <td>427</td>\n <td>pre_rinse</td>\n <td>6.0</td>\n <td>L4</td>\n <td>17180.266</td>\n <td>0.749132</td>\n <td>196.028393</td>\n <td>0.335571</td>\n <td>0.025003</td>\n <td>217.01390</td>\n <td>True</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.375385</td>\n <td>45.125360</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>12.859887</td>\n <td>-3.665190</td>\n <td>2.807077</td>\n <td>-53.128618</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>4</td>\n <td>20000</td>\n <td>427</td>\n <td>pre_rinse</td>\n <td>8.0</td>\n <td>L4</td>\n <td>11754.919</td>\n <td>0.795139</td>\n <td>196.028393</td>\n <td>0.335509</td>\n <td>0.028524</td>\n <td>115.74074</td>\n <td>True</td>\n <td>True</td>\n 
<td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>True</td>\n <td>45.374237</td>\n <td>45.121326</td>\n <td>False</td>\n <td>0.0</td>\n <td>False</td>\n <td>0.0</td>\n <td>12.808159</td>\n <td>-3.599354</td>\n <td>2.783566</td>\n <td>-53.164792</td>\n <td>0.0</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "Ws2E22vV1Pu4",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "0e50d315e4e34a2f9cee97350129bf3a6e3794ee"
},
"cell_type": "code",
"source": "from sklearn.preprocessing import OneHotEncoder, LabelEncoder\nfrom sklearn.pipeline import Pipeline",
"execution_count": 34,
"outputs": []
},
{
"metadata": {
"id": "InXy5HA26-bJ",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "00f4d6810924818c9fafcd19f8594c8e4508fd77"
},
"cell_type": "code",
"source": "le = LabelEncoder()\nenc = OneHotEncoder()",
"execution_count": 35,
"outputs": []
},
{
"metadata": {
"id": "Jp17q_c06-1z",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "19b5c86ce039dc295e53deed2d63ac4298202694"
},
"cell_type": "code",
"source": "#le_encoded = trainx[['phase', 'pipeline']].apply(le.fit_transform)\n#enc.fit(le_encoded)\n#onehotlabels = enc.transform(le_encode).toarray()\n#onehotlabels.shape",
"execution_count": 36,
"outputs": []
},
{
"metadata": {
"id": "3t3YNFeZ6-8z",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "cfda21025effef3542c0745e7f4ce2cb6b4c086a"
},
"cell_type": "code",
"source": "trainx[['phase']] = trainx[['phase']].apply(le.fit_transform)\ntestx[['phase']] = testx[['phase']].apply(le.transform)\ntrainx[['pipeline']] = trainx[['pipeline']].apply(le.fit_transform)\ntestx[['pipeline']] = testx[['pipeline']].apply(le.transform)",
"execution_count": 37,
"outputs": []
},
{
"metadata": {
"id": "hLBJ1l9i6_Cg",
"colab_type": "code",
"outputId": "bc7f13f2-22c0-4697-efa1-585047e12f90",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 253
},
"trusted": true,
"_uuid": "6d36096d35b81bdcd86bee81b5715de949a2ae13"
},
"cell_type": "code",
"source": "trainx.head()",
"execution_count": 38,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 38,
"data": {
"text/plain": " row_id process_id ... tank_temp_diff12 tank_temp_diff23\n0 0 20001 ... -50.651042 10.004340\n1 1 20001 ... -50.629337 9.982635\n2 2 20001 ... -50.629337 9.982635\n3 3 20001 ... -50.651042 10.004340\n4 4 20001 ... -50.629337 9.982635\n\n[5 rows x 33 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>row_id</th>\n <th>process_id</th>\n <th>object_id</th>\n <th>phase</th>\n <th>timestamp</th>\n <th>pipeline</th>\n <th>supply_flow</th>\n <th>supply_pressure</th>\n <th>return_temperature</th>\n <th>return_conductivity</th>\n <th>return_turbidity</th>\n <th>return_flow</th>\n <th>supply_pump</th>\n <th>supply_pre_rinse</th>\n <th>supply_caustic</th>\n <th>return_caustic</th>\n <th>supply_acid</th>\n <th>return_acid</th>\n <th>supply_clean_water</th>\n <th>return_recovery_water</th>\n <th>return_drain</th>\n <th>object_low_level</th>\n <th>tank_concentration_caustic</th>\n <th>tank_concentration_acid</th>\n <th>tank_lsh_caustic</th>\n <th>tank_lsh_acid</th>\n <th>tank_lsh_clean_water</th>\n <th>tank_lsh_pre_rinse</th>\n <th>tank_level_diff12</th>\n <th>tank_level_diff23</th>\n <th>tank_level_diff34</th>\n <th>tank_temp_diff12</th>\n <th>tank_temp_diff23</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>20001</td>\n <td>405</td>\n <td>4</td>\n <td>0.0</td>\n <td>6</td>\n <td>8550.348</td>\n <td>0.615451</td>\n <td>325.611342</td>\n <td>4.990765</td>\n <td>0.163163</td>\n <td>15776.9100</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>45.394646</td>\n <td>44.340126</td>\n <td>0</td>\n <td>0.0</td>\n <td>0</td>\n <td>0.0</td>\n <td>13.943680</td>\n <td>-2.470883</td>\n <td>-5.447227</td>\n <td>-50.651042</td>\n <td>10.004340</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>20001</td>\n <td>405</td>\n <td>4</td>\n <td>2.0</td>\n <td>6</td>\n <td>11364.294</td>\n <td>0.654297</td>\n <td>332.302566</td>\n <td>3.749680</td>\n 
<td>0.115981</td>\n <td>13241.4640</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>45.394447</td>\n <td>44.339380</td>\n <td>0</td>\n <td>0.0</td>\n <td>0</td>\n <td>0.0</td>\n <td>13.863750</td>\n <td>-2.421515</td>\n <td>-5.411960</td>\n <td>-50.629337</td>\n <td>9.982635</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2</td>\n <td>20001</td>\n <td>405</td>\n <td>4</td>\n <td>4.0</td>\n <td>6</td>\n <td>12174.479</td>\n <td>0.699870</td>\n <td>338.396039</td>\n <td>2.783954</td>\n <td>0.327149</td>\n <td>10698.7850</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>45.396280</td>\n <td>44.336735</td>\n <td>0</td>\n <td>0.0</td>\n <td>0</td>\n <td>0.0</td>\n <td>13.837891</td>\n <td>-2.407410</td>\n <td>-5.416665</td>\n <td>-50.629337</td>\n <td>9.982635</td>\n </tr>\n <tr>\n <th>3</th>\n <td>3</td>\n <td>20001</td>\n <td>405</td>\n <td>4</td>\n <td>6.0</td>\n <td>6</td>\n <td>13436.776</td>\n <td>0.761502</td>\n <td>345.351007</td>\n <td>1.769353</td>\n <td>0.193424</td>\n <td>8007.8125</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>45.401875</td>\n <td>44.333110</td>\n <td>0</td>\n <td>0.0</td>\n <td>0</td>\n <td>0.0</td>\n <td>13.823791</td>\n <td>-2.400355</td>\n <td>-5.414320</td>\n <td>-50.651042</td>\n <td>10.004340</td>\n </tr>\n <tr>\n <th>4</th>\n <td>4</td>\n <td>20001</td>\n <td>405</td>\n <td>4</td>\n <td>8.0</td>\n <td>6</td>\n <td>13776.766</td>\n <td>0.837240</td>\n <td>346.966098</td>\n <td>0.904020</td>\n <td>0.138276</td>\n <td>6004.0510</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>45.398197</td>\n <td>44.334373</td>\n <td>0</td>\n <td>0.0</td>\n <td>0</td>\n <td>0.0</td>\n 
<td>13.804975</td>\n <td>-2.393300</td>\n <td>-5.414320</td>\n <td>-50.629337</td>\n <td>9.982635</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "pE4EDyMwMlja",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "1a8ee748-7dfd-45bd-8de2-fb4924468769",
"trusted": true,
"_uuid": "7537afdb4ec2a55f495abe11d8fcf2d7199902a9"
},
"cell_type": "code",
"source": "trainx_meta = trainx[['row_id', 'process_id', 'object_id', 'return_turbidity', 'return_flow']]\ntrainx.drop(['row_id', 'process_id', 'object_id'], axis=1, inplace=True)\ntestx_meta = testx[['row_id', 'process_id', 'object_id', 'return_turbidity', 'return_flow']]\ntestx.drop(['row_id', 'process_id', 'object_id'], axis=1, inplace=True)\ngc.collect()",
"execution_count": 39,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 39,
"data": {
"text/plain": "222"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "XALfVBtHOC4s",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1241
},
"outputId": "05cecf7d-696c-45b1-b0dd-a536091d1b52",
"trusted": true,
"_uuid": "32f012e217d74ac2e74564ddbff75bcb6bfba8c4"
},
"cell_type": "code",
"source": "print(trainx.head(), testx.head())",
"execution_count": 40,
"outputs": [
{
"output_type": "stream",
"text": " phase timestamp ... tank_temp_diff12 tank_temp_diff23\n0 4 0.0 ... -50.651042 10.004340\n1 4 2.0 ... -50.629337 9.982635\n2 4 4.0 ... -50.629337 9.982635\n3 4 6.0 ... -50.651042 10.004340\n4 4 8.0 ... -50.629337 9.982635\n\n[5 rows x 30 columns] phase timestamp ... tank_temp_diff12 tank_temp_diff23\n0 4 0.0 ... -53.027342 0.0\n1 4 2.0 ... -53.088832 0.0\n2 4 4.0 ... -53.150322 0.0\n3 4 6.0 ... -53.128618 0.0\n4 4 8.0 ... -53.164792 0.0\n\n[5 rows x 30 columns]\n",
"name": "stdout"
}
]
},
{
"metadata": {
"id": "snCGsSUZHYW3",
"colab_type": "text",
"_uuid": "05cc4cb1481e531015bbab6cb2a5568204286c0c"
},
"cell_type": "markdown",
"source": "## Modeling"
},
{
"metadata": {
"id": "fMPonhvwHXsI",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "8bda2fdd94f168c1bb4c024125a6806d5ddf440c"
},
"cell_type": "code",
"source": "import lightgbm as lgb\nfrom sklearn.metrics import f1_score\nfrom sklearn.model_selection import StratifiedKFold",
"execution_count": 41,
"outputs": []
},
{
"metadata": {
"id": "z-Zaa3Sng_SY",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "f5d9c7ac-37cc-4b05-8c45-89308f7fa4c0",
"trusted": true,
"_uuid": "7d34342456f407cc7420a22b082125ffa153bf1f"
},
"cell_type": "code",
"source": "gc.collect()",
"execution_count": 42,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 42,
"data": {
"text/plain": "0"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "RR08NsmY6_An",
"colab_type": "code",
"outputId": "7daf3c9c-e03d-40f4-8434-016c184bc249",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 292
},
"trusted": true,
"_uuid": "29c45c6b6dad6ddf20bc735da7bf5dad7619f627"
},
"cell_type": "code",
"source": "from sklearn.metrics import roc_curve, precision_recall_curve\ndef threshold_search(y_true, y_proba, plot=False):\n    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)\n    # precision/recall have one more entry than thresholds; pad so indices line up\n    thresholds = np.append(thresholds, 1.001)\n    F = 2 / (1/precision + 1/recall)\n    best_score = np.max(F)\n    best_th = thresholds[np.argmax(F)]\n    if plot:\n        plt.plot(thresholds, F, '-b')\n        plt.plot([best_th], [best_score], '*r')\n        plt.show()\n    search_result = {'threshold': best_th , 'f1': best_score}\n    return search_result\n\n\ndef run_cv_model(train, test, target, model_fn, params={}, eval_fn=None, label='model'):\n    kf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)\n    fold_splits = kf.split(train, target)\n    # Out-of-fold predictions for the training rows\n    pred_train = np.zeros((train.shape[0], 1))\n    all_coefficients = np.zeros((5, 4))\n    feature_importance_df = pd.DataFrame()\n    pred_full_test = 0\n    cv_scores = []\n    i = 1\n    for dev_index, val_index in fold_splits:\n        print('Started ' + label + ' fold ' + str(i) + '/5')\n        if isinstance(train, pd.DataFrame):\n            dev_X, val_X = train.iloc[dev_index], train.iloc[val_index]\n            dev_y, val_y = target[dev_index], target[val_index]\n        else:\n            dev_X, val_X = train[dev_index], train[val_index]\n            dev_y, val_y = target[dev_index], target[val_index]\n        params2 = params.copy()\n        pred_val_y, pred_test_y, importances = model_fn(dev_X, dev_y, val_X, val_y, test, params2)\n        gc.collect()\n        pred_full_test = pred_full_test + pred_test_y\n        pred_train[val_index] = pred_val_y\n        if eval_fn is not None:\n            current_f1_result = threshold_search(val_y, pred_val_y)\n            cv_score = current_f1_result['f1']\n            cv_scores.append(cv_score)\n            print(label + ' cv score {}: F1 {}'.format(i, cv_score))\n        fold_importance_df = pd.DataFrame()\n        fold_importance_df['feature'] = train.columns.values\n        fold_importance_df['importance'] = importances\n        fold_importance_df['fold'] = i\n        feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)\n        i += 1\n    print('{} cv F1 scores : {}'.format(label, cv_scores))\n    print('{} cv mean F1 score : {}'.format(label, np.mean(cv_scores)))\n    print('{} cv std F1 score : {}'.format(label, np.std(cv_scores)))\n    # Average the test predictions over the 5 folds\n    pred_full_test = pred_full_test / 5.0\n    results = {'label': label,\n               'train': pred_train, 'test': pred_full_test,\n               'cv': cv_scores,\n               'importance': feature_importance_df}\n    return results\n\nparams = {\n    'objective' :'binary',\n    'learning_rate' : 0.02,\n    'num_leaves' : 76,\n    'feature_fraction': 0.64,\n    'bagging_fraction': 0.8,\n    'bagging_freq':1,\n    'boosting_type' : 'gbdt',\n    'metric': 'binary_logloss',\n    'min_split_gain': 0.01,\n    'min_child_samples': 150,\n    'min_child_weight': 0.1,\n    'verbosity': -1,\n    'data_random_seed': 3,\n    'early_stop': 100,\n    'verbose_eval': 100,\n    'num_rounds': 1000\n}\n\ndef runLGB(train_X, train_y, test_X, test_y, test_X2, params):\n    print('Prep LGB')\n    d_train = lgb.Dataset(train_X, label=train_y)\n    d_valid = lgb.Dataset(test_X, label=test_y)\n    watchlist = [d_train, d_valid]\n    print('Train LGB')\n    # Pop the training-loop settings so only model parameters reach lgb.train\n    num_rounds = params.pop('num_rounds')\n    verbose_eval = params.pop('verbose_eval')\n    early_stop = None\n    if params.get('early_stop'):\n        early_stop = params.pop('early_stop')\n    model = lgb.train(params,\n                      train_set=d_train,\n                      num_boost_round=num_rounds,\n                      valid_sets=watchlist,\n                      verbose_eval=verbose_eval,\n                      early_stopping_rounds=early_stop)\n    print('Predict 1/2')\n    pred_test_y = model.predict(test_X, num_iteration=model.best_iteration)\n    print(test_y, pred_test_y, pred_test_y > 0.33)\n    f1 = f1_score(test_y, pred_test_y > 0.33)\n    print(\"f1 score = \", f1)\n    print('Predict 2/2')\n    pred_test_y2 = model.predict(test_X2, num_iteration=model.best_iteration)\n    return pred_test_y.reshape(-1, 1), pred_test_y2.reshape(-1, 1), model.feature_importance()\n\nresults = run_cv_model(trainx, testx, trainy, runLGB, params, label='lgb', eval_fn=f1_score)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"id": "HPqmtRA4TT8C",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "a786b0e836c3fff441b50a9b3249cfc5b7ad7e02"
},
"cell_type": "code",
"source": "thresh_res = threshold_search(trainy, results['train'])\nthresh_res",
"execution_count": 44,
"outputs": [
{
"output_type": "stream",
"text": "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:5: RuntimeWarning: divide by zero encountered in true_divide\n \"\"\"\n",
"name": "stderr"
},
{
"output_type": "execute_result",
"execution_count": 44,
"data": {
"text/plain": "{'threshold': 0.28166676369629823, 'f1': 0.9999524337833269}"
},
"metadata": {}
}
]
},
{
"metadata": {
"id": "nPMR51ZRTUIQ",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "7711b8cb9921e48dd4fb36572de8d551bc6c0ea9"
},
"cell_type": "code",
"source": "trainx_meta = trainx_meta.join(pd.Series((results['train'] > thresh_res['threshold'])[:,0], name='expected_y'))",
"execution_count": 45,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "7a1a7ec84e5bfc0cf8652c9717f7f4978e45c96b"
},
"cell_type": "code",
"source": "from sklearn.utils import check_array\ndef mean_absolute_percentage_error(y_true, y_pred):\n    #y_true, y_pred = check_array(y_true, y_pred)\n    # Denominator is floored at 290000 so that small targets do not dominate the error\n    return np.mean(np.abs((y_true - y_pred) / np.maximum(y_true, 290000)))",
"execution_count": 75,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "3a6695b783db894ad17e1c2b5b92b02d38a532b6"
},
"cell_type": "code",
"source": "pids = trainx_meta.process_id.unique()\nvalues = np.zeros(pids.shape[0])\nfor idx, pid in tqdm(enumerate(pids)):\n    process_sample = trainx_meta[trainx_meta.process_id==pid]\n    # Rows flagged by the classifier (expected_y) that also have positive return flow\n    selected = process_sample[(process_sample.expected_y == True) & (process_sample.return_flow > 0)]\n    values[idx] = np.sum(selected['return_flow'] * selected['return_turbidity'])",
"execution_count": 53,
"outputs": [
{
"output_type": "stream",
"text": "5021it [01:14, 67.66it/s]\n",
"name": "stderr"
}
]
},
{
"metadata": {
"trusted": true,
"_uuid": "25914df106491cd7fe70571cdd72ba8655a44f67"
},
"cell_type": "code",
"source": "values = values[:,]",
"execution_count": 73,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "b7325b07926ce175369f0739b37d1ef588c2cf9a"
},
"cell_type": "code",
"source": "print(mean_absolute_percentage_error(df_labels.final_rinse_total_turbidity_liter.values, values))",
"execution_count": 76,
"outputs": [
{
"output_type": "stream",
"text": "1.565986418091599\n",
"name": "stdout"
}
]
},
{
"metadata": {
"id": "E0tq80b2TUFJ",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "fcdcf62c0cdaf06341369446ae9646227f922314"
},
"cell_type": "code",
"source": "testx_meta = testx_meta.join(pd.Series((results['test'] > thresh_res['threshold'])[:,0], name='expected_y'))",
"execution_count": 77,
"outputs": []
},
{
"metadata": {
"id": "n4n-Hb0vTUCv",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "75e8b51fa6afe866424416cc395b1764b00883c2"
},
"cell_type": "code",
"source": "pids = testx_meta.process_id.unique()\nvalues = np.zeros(pids.shape[0])\nfor idx, pid in tqdm(enumerate(pids)):\n    process_sample = testx_meta[testx_meta.process_id==pid]\n    # Rows flagged by the classifier (expected_y) that also have positive return flow\n    selected = process_sample[(process_sample.expected_y == True) & (process_sample.return_flow > 0)]\n    values[idx] = np.sum(selected['return_flow'] * selected['return_turbidity'])",
"execution_count": 79,
"outputs": [
{
"output_type": "stream",
"text": "2967it [00:19, 149.05it/s]\n",
"name": "stderr"
}
]
},
{
"metadata": {
"id": "Mw1NraVMTUAa",
"colab_type": "code",
"colab": {},
"trusted": true,
"_uuid": "c8aeabd71f53bc86e9a77c6dd41f740345fd9c2f"
},
"cell_type": "code",
"source": "df_sub.head()",
"execution_count": 80,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 80,
"data": {
"text/plain": " process_id final_rinse_total_turbidity_liter\n0 20000 1.0\n1 20006 1.0\n2 20007 1.0\n3 20009 1.0\n4 20010 1.0",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>process_id</th>\n <th>final_rinse_total_turbidity_liter</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>20000</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>20006</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>20007</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>20009</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>20010</td>\n <td>1.0</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true,
"_uuid": "2fd02e6621299ea54e781bd24a76638624065338"
},
"cell_type": "code",
"source": "submission = pd.DataFrame({'process_id':pids, 'final_rinse_total_turbidity_liter':values})",
"execution_count": 83,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "9c558925c6ba0e40b8b3529c048738a20e91e390"
},
"cell_type": "code",
"source": "submission.to_csv('submission.csv', index=False)",
"execution_count": 84,
"outputs": []
},
{
"metadata": {
"trusted": true,
"_uuid": "b2b80021b4c34a5411b28ba366e28c8fe4a7ff07"
},
"cell_type": "code",
"source": "testx_meta[testx_meta['expected_y'] == True]",
"execution_count": 87,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 87,
"data": {
"text/plain": "Empty DataFrame\nColumns: [row_id, process_id, object_id, return_turbidity, return_flow, expected_y]\nIndex: []",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>row_id</th>\n <th>process_id</th>\n <th>object_id</th>\n <th>return_turbidity</th>\n <th>return_flow</th>\n <th>expected_y</th>\n </tr>\n </thead>\n <tbody>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true,
"_uuid": "1d1b556ec1838fa6e034872ca93bd3b185a43a8f"
},
"cell_type": "code",
"source": "gc.collect()",
"execution_count": 89,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 89,
"data": {
"text/plain": "208"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true,
"_uuid": "541225de7a03311b0ee239e9faa16aec89aa7e84"
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"colab": {
"name": "Sustainable Industry: Rinse Over Run #1.ipynb",
"version": "0.3.2",
"provenance": [],
"collapsed_sections": [
"N_uO6NYwUu6z",
"cqKvAzDMV8Xl"
]
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.6.6",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
}
},
"nbformat": 4,
"nbformat_minor": 1
}