rahman19dec/accident-severity.ipynb

## accident-severity.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.2"
    },
    "colab": {
      "name": "Accident Severity.ipynb",
      "provenance": [],
      "include_colab_link": true
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/muzahid19dec/3c021e2987a14168a3f270df91d99903/accident-severity.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XdRgdfRDykCl",
        "colab_type": "text"
      },
      "source": [
        "## Table of contents\n",
        "* [Introduction: Business Problem](#introduction)\n",
        "* [Data](#data)\n",
        "* [Data Preprocessing](#data_pre)\n",
        "* [Methodology](#Methodology)\n",
        "* [Logistic Regression](#regression)\n",
        "* [KNN](#knn)\n",
        "* [Decision Tree](#tree)\n",
        "* [Conclusion](#concllusion)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IioDagOfykCm",
        "colab_type": "text"
      },
      "source": [
        "## Introduction <a name=\"introduction\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F4iJiC0rykCn",
        "colab_type": "text"
      },
      "source": [
        "In this project, we look for the conditions under which severe traffic accidents are most likely. Knowing them can help us be more cautious in those situations. \n",
        "Humans are prone to mistakes and accidents happen often; some are severe, costing lives and property, while others are not.\n",
        "Using data science, we will explore the data and look for patterns linking conditions to accident severity. The aim is to express the likelihood of a severe accident clearly, so that the conditions most associated with severe accidents can be avoided. Everyone can benefit from these findings by taking caution accordingly. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "b519ciy8ykCn",
        "colab_type": "text"
      },
      "source": [
        "## Data <a name=\"data\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Xik8Y18pykCo",
        "colab_type": "text"
      },
      "source": [
        "Based on the definition of our problem, the factors that will influence our decision are:\n",
        "Location (whether the road is crowded), date and time of the accident, collision type, number of people involved in the accident, number of pedestrians involved, number and type of vehicles involved, whether a parked car was hit, whether the collision was intentional, whether the driver was drunk, weather condition, road condition, junction type, speed of vehicles, light condition, whether the collision was from the front or behind, severity rank (how severe the accident is), and the number and condition of injuries and fatalities.\n",
        "\n",
        "We collected the dataset from https://www.arcgis.com/. It also contains further information, such as the collision code and whether the accident occurred at a crosswalk.  \n",
        "\n",
        "The following data sources will be needed to extract/generate the required information:\n",
        "centers of candidate areas will be generated algorithmically and approximate addresses of centers of those areas will be obtained using Google Maps API.\n",
        "\n",
        "On first inspection, night-time collisions tend to be less severe, but we need to explore and analyze the whole dataset to be sure which conditions favor severe accidents. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aAvEi0kkykCo",
        "colab_type": "code",
        "colab": {},
        "outputId": "fb5d10d6-c120-48bc-fa36-5c32b11fcf50"
      },
      "source": [
        "import pandas as pd\n",
        "from sklearn.utils import resample\n",
        "pre_df = pd.read_csv('Data-Collisions.csv')"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "c:\\users\\muzah\\appdata\\local\\programs\\python\\python38\\lib\\site-packages\\IPython\\core\\interactiveshell.py:3062: DtypeWarning: Columns (33) have mixed types.Specify dtype option on import or set low_memory=False.\n",
            "  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BFK6d0SRykCt",
        "colab_type": "text"
      },
      "source": [
        "## Data Preprocessing <a name=\"data_pre\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "m8V3kTEVykCt",
        "colab_type": "text"
      },
      "source": [
        "To prepare the data for analysis, we need to identify the relevant columns and, for now, drop the others. We also need to convert the non-numeric values to numeric ones. \n",
        "After examining the dataset, we restrict the analysis to four columns: severity, weather condition, road condition and light condition.\n",
        "Our target variable is the severity code, which expresses how severe the accident is. Its two classes are imbalanced, so let us balance the dataset first."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4lL8NWehykCu",
        "colab_type": "code",
        "colab": {},
        "outputId": "2ed75cd5-efce-490b-b2cf-2b264fd2e7c7"
      },
      "source": [
        "pre_df['SEVERITYCODE'].value_counts()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "1    105184\n",
              "2     43617\n",
              "Name: SEVERITYCODE, dtype: int64"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 2
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ast-1_sLykCy",
        "colab_type": "code",
        "colab": {},
        "outputId": "3d894262-a9f2-4bc9-f54b-cf45808502ea"
      },
      "source": [
        "pre_df_1 = pre_df[pre_df.SEVERITYCODE==1]\n",
        "pre_df_2 = pre_df[pre_df.SEVERITYCODE==2]\n",
        "\n",
        "pre_df_dsample = resample(pre_df_1, replace =False,\n",
        "                         n_samples = 43617,\n",
        "                         random_state = 321)\n",
        "b_df = pd.concat([pre_df_dsample, pre_df_2])\n",
        "b_df.SEVERITYCODE.value_counts()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "2    43617\n",
              "1    43617\n",
              "Name: SEVERITYCODE, dtype: int64"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 3
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "awLAJfleykC0",
        "colab_type": "text"
      },
      "source": [
        "Now the dataset is balanced and we can start working on it."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3O8-VyveykC5",
        "colab_type": "code",
        "colab": {},
        "outputId": "e8802f55-87de-4a96-ef29-88348c1a9d27"
      },
      "source": [
        "b_df['ROADCOND'].value_counts()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "Dry               57422\n",
              "Wet               21853\n",
              "Unknown            5818\n",
              "Ice                 570\n",
              "Snow/Slush          403\n",
              "Other                57\n",
              "Standing Water       52\n",
              "Sand/Mud/Dirt        36\n",
              "Oil                  34\n",
              "Name: ROADCOND, dtype: int64"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 4
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "LJARrPQDykC-",
        "colab_type": "code",
        "colab": {},
        "outputId": "5ecc9ea5-a7e3-4e66-80b2-2cb294766f1e"
      },
      "source": [
        "b_df['LIGHTCOND'].value_counts()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "Daylight                    54067\n",
              "Dark - Street Lights On     21792\n",
              "Unknown                      5305\n",
              "Dusk                         2732\n",
              "Dawn                         1068\n",
              "Dark - No Street Lights       615\n",
              "Dark - Street Lights Off      536\n",
              "Other                          84\n",
              "Name: LIGHTCOND, dtype: int64"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 5
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "bbuDRZS8ykDE",
        "colab_type": "code",
        "colab": {},
        "outputId": "bcdc7edd-272b-482a-eb4b-c19a210a018a"
      },
      "source": [
        "b_df['WEATHER'].value_counts()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "Clear                       51009\n",
              "Raining                     15438\n",
              "Overcast                    12866\n",
              "Unknown                      5920\n",
              "Snowing                       356\n",
              "Other                         315\n",
              "Fog/Smog/Smoke                241\n",
              "Sleet/Hail/Freezing Rain       58\n",
              "Blowing Sand/Dirt              22\n",
              "Severe Crosswind               11\n",
              "Name: WEATHER, dtype: int64"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 6
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "KbBgXopIykDJ",
        "colab_type": "code",
        "colab": {},
        "outputId": "403736ca-1a0b-4910-8af0-cb2694bddbce"
      },
      "source": [
        "# Select the target and the three categorical features\n",
        "y = b_df['SEVERITYCODE']  # target: 1 = property damage, 2 = injury\n",
        "X = b_df[['WEATHER','ROADCOND','LIGHTCOND']]\n",
        "X['WEATHER']"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "96637     Overcast\n",
              "41663        Clear\n",
              "9545         Clear\n",
              "57918        Clear\n",
              "63370      Unknown\n",
              "            ...   \n",
              "148783     Raining\n",
              "148787       Clear\n",
              "148791       Clear\n",
              "148799    Overcast\n",
              "148800    Overcast\n",
              "Name: WEATHER, Length: 87234, dtype: object"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 7
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "GD2Ded5hykDM",
        "colab_type": "code",
        "colab": {}
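      },
      "source": [
        "# One-hot encode the categorical features; drop_first removes one baseline category per column\n",
        "X = pd.get_dummies(X, columns=['WEATHER', 'LIGHTCOND', 'ROADCOND'], drop_first=True)"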
      ],
      "execution_count": null,
      "outputs": []
    },
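    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The one-hot encoding step above can be illustrated on a tiny hand-made frame. This is only a sketch: the three rows below are made up for illustration and are not taken from the collision data. With `drop_first=True`, the alphabetically first category of each column is dropped and acts as the baseline."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import pandas as pd\n",
        "\n",
        "# Hypothetical toy data (not from Data-Collisions.csv)\n",
        "toy = pd.DataFrame({'WEATHER': ['Clear', 'Raining', 'Overcast'],\n",
        "                    'ROADCOND': ['Dry', 'Wet', 'Dry']})\n",
        "\n",
        "# drop_first=True drops 'Clear' and 'Dry', leaving one column per remaining category\n",
        "toy_encoded = pd.get_dummies(toy, columns=['WEATHER', 'ROADCOND'], drop_first=True)\n",
        "print(list(toy_encoded.columns))"
      ],
      "execution_count": null,
      "outputs": []
    },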
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zC_6NQLtykDO",
        "colab_type": "text"
      },
      "source": [
        "### Splitting train and test set"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "lPB4gjWgykDP",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#Splitting Train Test\n",
        "from sklearn.model_selection import train_test_split\n",
        "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 123)"
      ],
      "execution_count": null,
      "outputs": []
    },
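    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Since no `test_size` is given above, `train_test_split` holds out 25% of the rows by default. The sketch below shows this on made-up toy data; the `stratify` argument is an optional extra that is not used in this notebook and is included only to show how class proportions could be kept equal in both subsets."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "from sklearn.model_selection import train_test_split\n",
        "\n",
        "# Hypothetical toy data: 8 samples with alternating classes 1 and 2\n",
        "toy_X = [[i] for i in range(8)]\n",
        "toy_y = [1, 2, 1, 2, 1, 2, 1, 2]\n",
        "\n",
        "# Default split (25% test); stratify keeps the class proportions in both subsets\n",
        "tr_X, te_X, tr_y, te_y = train_test_split(toy_X, toy_y, random_state=123, stratify=toy_y)\n",
        "print(len(tr_X), len(te_X))"
      ],
      "execution_count": null,
      "outputs": []
    },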
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lDHDVuTWykDR",
        "colab_type": "text"
      },
      "source": [
        "## Methodology <a name=\"Methodology\"></a>\n",
        "To implement the solution, I used GitHub as the repository and Jupyter Notebook to preprocess the data and build the machine learning models. The code is written in Python using popular packages such as pandas, NumPy and scikit-learn."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lUbTIqC9ykDS",
        "colab_type": "text"
      },
      "source": [
        "Once the data was loaded into a pandas DataFrame, I used the `dtypes` attribute to check the feature names and their data types. I then selected the features I considered most relevant for predicting the severity of accidents in Seattle:\n",
        "\n",
        "WEATHER\n",
        "\n",
        "ROADCOND\n",
        "\n",
        "LIGHTCOND\n",
        "\n",
        "Also, as mentioned earlier, `SEVERITYCODE` is the target variable.\n",
        "I ran a value count on road condition (`ROADCOND`) and weather condition (`WEATHER`) to get an idea of the different road and weather conditions, and on light condition (`LIGHTCOND`) to see how accidents break down across light conditions. The results are shown in the Data Preprocessing section above."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZIT7x48YykDS",
        "colab_type": "text"
      },
      "source": [
        "After balancing the SEVERITYCODE target and one-hot encoding the input features, the data was ready for building machine learning models.\n",
        "I employed three machine learning models:\n",
        "\n",
        "Logistic Regression\n",
        "\n",
        "K Nearest Neighbour (KNN)\n",
        "\n",
        "Decision Tree\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nF8CqNZBykDU",
        "colab_type": "text"
      },
      "source": [
        "After importing the necessary packages and splitting the preprocessed data into train and test sets, I built and evaluated each machine learning model. The results are as follows:"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6pREVu5WykDU",
        "colab_type": "text"
      },
      "source": [
        "## Logistic Regression  <a name=\"regression\"></a>"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0_9ChZI9ykDV",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from sklearn.linear_model import LogisticRegression\n",
        "from sklearn.metrics import f1_score, log_loss\n",
        "LR = LogisticRegression(C = 6, solver = 'liblinear').fit(X_train, y_train)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "daf6YEe0ykDX",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "LR_pred = LR.predict(X_test)\n",
        "LR_proba = LR.predict_proba(X_test)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HLnG7crnykDa",
        "colab_type": "text"
      },
      "source": [
        "#### Let us measure the log loss and F1 score"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "iATdrHSqykDa",
        "colab_type": "code",
        "colab": {},
        "outputId": "a35b60ac-5ff0-4a26-cb43-3094d4288d99"
      },
      "source": [
        "print('log loss: ',log_loss(y_test, LR_proba))\n",
        "print('f1_score: ', f1_score(y_test, LR_pred))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "log loss:  0.6619198339076018\n",
            "f1_score:  0.3232180663373324\n"
          ],
          "name": "stdout"
        }
      ]
    },
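    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The `f1_score` call above uses scikit-learn's default binary averaging, which reports the F1 of the class labelled 1 (`pos_label=1`) only. A minimal sketch on made-up labels shows how this differs from the F1 of class 2 and from the macro average; the label lists below are hypothetical and not taken from the model output."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "from sklearn.metrics import f1_score\n",
        "\n",
        "# Hypothetical labels: 1 = property damage, 2 = injury\n",
        "toy_true = [1, 1, 1, 2, 2, 2]\n",
        "toy_pred = [1, 2, 2, 2, 2, 2]\n",
        "\n",
        "# Default: binary F1 for the class labelled 1 only\n",
        "print('binary (pos_label=1):', f1_score(toy_true, toy_pred))\n",
        "# Binary F1 for the class labelled 2\n",
        "print('binary (pos_label=2):', f1_score(toy_true, toy_pred, pos_label=2))\n",
        "# Unweighted mean of the two per-class F1 scores\n",
        "print('macro:', f1_score(toy_true, toy_pred, average='macro'))"
      ],
      "execution_count": null,
      "outputs": []
    },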
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WbV4Hk94ykDd",
        "colab_type": "text"
      },
      "source": [
        "## KNN <a name=\"knn\"></a>"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TeI8qk9JykDd",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from sklearn.neighbors import KNeighborsClassifier\n",
        "from sklearn.metrics import f1_score\n",
        "k = 17\n",
        "knn = KNeighborsClassifier(n_neighbors = k).fit(X_train, y_train)\n",
        "knn_pred = knn.predict(X_test)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qNXOZO2YykDg",
        "colab_type": "text"
      },
      "source": [
        "#### Evaluation"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "d21jysH_ykDg",
        "colab_type": "code",
        "colab": {},
        "outputId": "5dfb8eff-2862-4474-b4cb-682304eb3ef3"
      },
      "source": [
        "f1_score(y_test, knn_pred, average = 'macro')"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0.5426484037355158"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 15
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5dBoAjZKykDj",
        "colab_type": "text"
      },
      "source": [
        "## Decision Tree  <a name=\"tree\"></a>"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4SbgdvL1ykDk",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from sklearn.tree import DecisionTreeClassifier\n",
        "dt = DecisionTreeClassifier(criterion = 'entropy', max_depth = 7).fit(X_train, y_train)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cxUWcdARykDn",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "dt_pred = dt.predict(X_test)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LgoBaPRXykDp",
        "colab_type": "text"
      },
      "source": [
        "#### Evaluation"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zswAa3CjykDp",
        "colab_type": "code",
        "colab": {},
        "outputId": "92c5489e-07ac-4a07-e4ec-348a4c454d22"
      },
      "source": [
        "f1_score(y_test, dt_pred)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0.44739424020315116"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 18
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Tu35OT6LykDr",
        "colab_type": "text"
      },
      "source": [
        "## Conclusion <a name=\"concllusion\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "09egVtj0ykDs",
        "colab_type": "text"
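      },
      "source": [
        "Based on the F1 scores above, KNN is the best model for predicting car accident severity. Note that the KNN score (0.54) is macro-averaged, while the logistic regression (0.32) and decision tree (0.45) scores use the default binary F1, so the comparison is only approximate."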
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "h7aPuHYgykDs",
        "colab_type": "text"
      },
      "source": [
        "Because weather, road and light conditions point toward particular classes, we can conclude that these conditions have some impact on whether a collision results in property damage (class 1) or injury (class 2)."
      ]
    }
  ]
}
	{
	"nbformat": 4,
	"nbformat_minor": 0,
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.8.2"
	},
	"colab": {
	"name": "Accident Severity.ipynb",
	"provenance": [],
	"include_colab_link": true
	}
	},
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "view-in-github",
	"colab_type": "text"
	},
	"source": [
	"<a href=\"https://colab.research.google.com/gist/muzahid19dec/3c021e2987a14168a3f270df91d99903/accident-severity.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "XdRgdfRDykCl",
	"colab_type": "text"
	},
	"source": [
	"## Table of contents\n",
	"* [Introduction: Business Problem](#introduction)\n",
	"* [Data](#data)\n",
	"* [Data Preprocessing](#data_pre)\n",
	"* [Methodology](#Methodology)\n",
	"* [Logistic Regression](#regression)\n",
	"* [KNN](#knn)\n",
	"* [Decision Tree](#tree)\n",
	"* [Conclusion](#concllusion)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "IioDagOfykCm",
	"colab_type": "text"
	},
	"source": [
	"## Introduction <a name=\"introduction\"></a>"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "F4iJiC0rykCn",
	"colab_type": "text"
	},
	"source": [
	"In this project, we look for the conditions under which severe traffic accidents are most likely. Knowing them can help us be more cautious in those situations. \n",
	"Humans are prone to mistakes and accidents happen often; some are severe, costing lives and property, while others are not.\n",
	"Using data science, we will explore the data and look for patterns linking conditions to accident severity. The aim is to express the likelihood of a severe accident clearly, so that the conditions most associated with severe accidents can be avoided. Everyone can benefit from these findings by taking caution accordingly. "
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "b519ciy8ykCn",
	"colab_type": "text"
	},
	"source": [
	"## Data <a name=\"data\"></a>"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "Xik8Y18pykCo",
	"colab_type": "text"
	},
	"source": [
	"Based on the definition of our problem, the factors that will influence our decision are:\n",
	"Location (whether the road is crowded), date and time of the accident, collision type, number of people involved in the accident, number of pedestrians involved, number and type of vehicles involved, whether a parked car was hit, whether the collision was intentional, whether the driver was drunk, weather condition, road condition, junction type, speed of vehicles, light condition, whether the collision was from the front or behind, severity rank (how severe the accident is), and the number and condition of injuries and fatalities.\n",
	"\n",
	"We collected the dataset from https://www.arcgis.com/. It also contains further information, such as the collision code and whether the accident occurred at a crosswalk. \n",
	"\n",
	"The following data sources will be needed to extract/generate the required information:\n",
	"centers of candidate areas will be generated algorithmically and approximate addresses of centers of those areas will be obtained using Google Maps API.\n",
	"\n",
	"On first inspection, night-time collisions tend to be less severe, but we need to explore and analyze the whole dataset to be sure which conditions favor severe accidents. "
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "aAvEi0kkykCo",
	"colab_type": "code",
	"colab": {},
	"outputId": "fb5d10d6-c120-48bc-fa36-5c32b11fcf50"
	},
	"source": [
	"import pandas as pd\n",
	"from sklearn.utils import resample\n",
	"pre_df = pd.read_csv('Data-Collisions.csv')"
	],
	"execution_count": null,
	"outputs": [
	{
	"output_type": "stream",
	"text": [
	"c:\\users\\muzah\\appdata\\local\\programs\\python\\python38\\lib\\site-packages\\IPython\\core\\interactiveshell.py:3062: DtypeWarning: Columns (33) have mixed types.Specify dtype option on import or set low_memory=False.\n",
	" has_raised = await self.run_ast_nodes(code_ast.body, cell_name,\n"
	],
	"name": "stderr"
	}
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "BFK6d0SRykCt",
	"colab_type": "text"
	},
	"source": [
	"## Data Preprocessing <a name=\"data_pre\"></a>"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "m8V3kTEVykCt",
	"colab_type": "text"
	},
	"source": [
	"To prepare the data for analysis, we need to identify the relevant columns and, for now, drop the others. We also need to convert the non-numeric values to numeric ones. \n",
	"After examining the dataset, we restrict the analysis to four columns: severity, weather condition, road condition and light condition.\n",
	"Our target variable is the severity code, which expresses how severe the accident is. Its two classes are imbalanced, so let us balance the dataset first."
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "4lL8NWehykCu",
	"colab_type": "code",
	"colab": {},
	"outputId": "2ed75cd5-efce-490b-b2cf-2b264fd2e7c7"
	},
	"source": [
	"pre_df['SEVERITYCODE'].value_counts()"
	],
	"execution_count": null,
	"outputs": [
	{
	"output_type": "execute_result",
	"data": {
	"text/plain": [
	"1 105184\n",
	"2 43617\n",
	"Name: SEVERITYCODE, dtype: int64"
	]
	},
	"metadata": {
	"tags": []
	},
	"execution_count": 2
	}
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "ast-1_sLykCy",
	"colab_type": "code",
	"colab": {},
	"outputId": "3d894262-a9f2-4bc9-f54b-cf45808502ea"
	},
	"source": [
	"pre_df_1 = pre_df[pre_df.SEVERITYCODE==1]\n",
	"pre_df_2 = pre_df[pre_df.SEVERITYCODE==2]\n",
	"\n",
	"pre_df_dsample = resample(pre_df_1, replace =False,\n",
	" n_samples = 43617,\n",
	" random_state = 321)\n",
	"b_df = pd.concat([pre_df_dsample, pre_df_2])\n",
	"b_df.SEVERITYCODE.value_counts()"
	],
	"execution_count": null,
	"outputs": [
	{
	"output_type": "execute_result",
	"data": {
	"text/plain": [
	"2 43617\n",
	"1 43617\n",
	"Name: SEVERITYCODE, dtype: int64"
	]
	},
	"metadata": {
	"tags": []
	},
	"execution_count": 3
	}
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "awLAJfleykC0",
	"colab_type": "text"
	},
	"source": [
	"Now the dataset is balanced and we can start working on it."
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "3O8-VyveykC5",
	"colab_type": "code",
	"colab": {},
	"outputId": "e8802f55-87de-4a96-ef29-88348c1a9d27"
	},
	"source": [
	"b_df['ROADCOND'].value_counts()"
	],
	"execution_count": null,
	"outputs": [
	{
	"output_type": "execute_result",
	"data": {
	"text/plain": [
	"Dry 57422\n",
	"Wet 21853\n",
	"Unknown 5818\n",
	"Ice 570\n",
	"Snow/Slush 403\n",
	"Other 57\n",
	"Standing Water 52\n",
	"Sand/Mud/Dirt 36\n",
	"Oil 34\n",
	"Name: ROADCOND, dtype: int64"
	]
	},
	"metadata": {
	"tags": []
	},
	"execution_count": 4
	}
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "LJARrPQDykC-",
	"colab_type": "code",
	"colab": {},
	"outputId": "5ecc9ea5-a7e3-4e66-80b2-2cb294766f1e"
	},
	"source": [
	"b_df['LIGHTCOND'].value_counts()"
	],
	"execution_count": null,
	"outputs": [
	{
	"output_type": "execute_result",
	"data": {
	"text/plain": [
	"Daylight 54067\n",
	"Dark - Street Lights On 21792\n",
	"Unknown 5305\n",
	"Dusk 2732\n",
	"Dawn 1068\n",
	"Dark - No Street Lights 615\n",
	"Dark - Street Lights Off 536\n",
	"Other 84\n",
	"Name: LIGHTCOND, dtype: int64"
	]
	},
	"metadata": {
	"tags": []
	},
	"execution_count": 5
	}
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "bbuDRZS8ykDE",
	"colab_type": "code",
	"colab": {},
	"outputId": "bcdc7edd-272b-482a-eb4b-c19a210a018a"
	},
	"source": [
	"b_df['WEATHER'].value_counts()"
	],
	"execution_count": null,
	"outputs": [
	{
	"output_type": "execute_result",
	"data": {
	"text/plain": [
	"Clear 51009\n",
	"Raining 15438\n",
	"Overcast 12866\n",
	"Unknown 5920\n",
	"Snowing 356\n",
	"Other 315\n",
	"Fog/Smog/Smoke 241\n",
	"Sleet/Hail/Freezing Rain 58\n",
	"Blowing Sand/Dirt 22\n",
	"Severe Crosswind 11\n",
	"Name: WEATHER, dtype: int64"
	]
	},
	"metadata": {
	"tags": []
	},
	"execution_count": 6
	}
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "KbBgXopIykDJ",
	"colab_type": "code",
	"colab": {},
	"outputId": "403736ca-1a0b-4910-8af0-cb2694bddbce"
	},
	"source": [
	"# Select the target and the three categorical features\n",
	"y = b_df['SEVERITYCODE']  # target: 1 = property damage, 2 = injury\n",
	"X = b_df[['WEATHER','ROADCOND','LIGHTCOND']]\n",
	"X['WEATHER']"
	],
	"execution_count": null,
	"outputs": [
	{
	"output_type": "execute_result",
	"data": {
	"text/plain": [
	"96637 Overcast\n",
	"41663 Clear\n",
	"9545 Clear\n",
	"57918 Clear\n",
	"63370 Unknown\n",
	" ... \n",
	"148783 Raining\n",
	"148787 Clear\n",
	"148791 Clear\n",
	"148799 Overcast\n",
	"148800 Overcast\n",
	"Name: WEATHER, Length: 87234, dtype: object"
	]
	},
	"metadata": {
	"tags": []
	},
	"execution_count": 7
	}
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "GD2Ded5hykDM",
	"colab_type": "code",
	"colab": {}
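	},
	"source": [
	"# One-hot encode the categorical features; drop_first removes one baseline category per column\n",
	"X = pd.get_dummies(X, columns=['WEATHER', 'LIGHTCOND', 'ROADCOND'], drop_first=True)"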
	],
	"execution_count": null,
	"outputs": []
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "zC_6NQLtykDO",
	"colab_type": "text"
	},
	"source": [
	"### Splitting train and test set"
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "lPB4gjWgykDP",
	"colab_type": "code",
	"colab": {}
	},
	"source": [
	"#Splitting Train Test\n",
	"from sklearn.model_selection import train_test_split\n",
	"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 123)"
	],
	"execution_count": null,
	"outputs": []
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "lDHDVuTWykDR",
	"colab_type": "text"
	},
	"source": [
	"## Methodology <a name=\"Methodology\"></a>\n",
	"To implement the solution, I used GitHub as the repository and Jupyter Notebook to preprocess the data and build the machine learning models. The code is written in Python using popular packages such as pandas, NumPy and scikit-learn."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "lUbTIqC9ykDS",
	"colab_type": "text"
	},
	"source": [
	"Once the data was loaded into a pandas DataFrame, I used the `dtypes` attribute to check the feature names and their data types. I then selected the features I considered most relevant for predicting the severity of accidents in Seattle:\n",
	"\n",
	"WEATHER\n",
	"\n",
	"ROADCOND\n",
	"\n",
	"LIGHTCOND\n",
	"\n",
	"Also, as mentioned earlier, `SEVERITYCODE` is the target variable.\n",
	"I ran a value count on road condition (`ROADCOND`) and weather condition (`WEATHER`) to get an idea of the different road and weather conditions, and on light condition (`LIGHTCOND`) to see how accidents break down across light conditions. The results are shown in the Data Preprocessing section above."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "ZIT7x48YykDS",
	"colab_type": "text"
	},
	"source": [
	"After balancing the SEVERITYCODE target classes and standardizing the input features, the data was ready for building machine learning models.\n",
	"I employed three machine learning models:\n",
	"\n",
	"* Logistic Regression\n",
	"* K-Nearest Neighbors (KNN)\n",
	"* Decision Tree\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "nF8CqNZBykDU",
	"colab_type": "text"
	},
	"source": [
	"After importing the necessary packages and splitting the preprocessed data into training and test sets, I built and evaluated each machine learning model. The results are as follows:"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "6pREVu5WykDU",
	"colab_type": "text"
	},
	"source": [
	"## Logistic Regression <a name=\"regression\"></a>"
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "0_9ChZI9ykDV",
	"colab_type": "code",
	"colab": {}
	},
	"source": [
	"from sklearn.linear_model import LogisticRegression\n",
	"from sklearn.metrics import f1_score, log_loss\n",
	"LR = LogisticRegression(C = 6, solver = 'liblinear').fit(X_train, y_train)"
	],
	"execution_count": null,
	"outputs": []
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "daf6YEe0ykDX",
	"colab_type": "code",
	"colab": {}
	},
	"source": [
	"LR_pred = LR.predict(X_test)\n",
	"LR_proba = LR.predict_proba(X_test)"
	],
	"execution_count": null,
	"outputs": []
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "HLnG7crnykDa",
	"colab_type": "text"
	},
	"source": [
	"#### Evaluation: log loss and F1 score"
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "iATdrHSqykDa",
	"colab_type": "code",
	"colab": {},
	"outputId": "a35b60ac-5ff0-4a26-cb43-3094d4288d99"
	},
	"source": [
	"print('log loss: ',log_loss(y_test, LR_proba))\n",
	"print('f1_score: ', f1_score(y_test, LR_pred))"
	],
	"execution_count": null,
	"outputs": [
	{
	"output_type": "stream",
	"text": [
	"log loss: 0.6619198339076018\n",
	"f1_score: 0.3232180663373324\n"
	],
	"name": "stdout"
	}
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "WbV4Hk94ykDd",
	"colab_type": "text"
	},
	"source": [
	"## KNN <a name=\"knn\"></a>"
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "TeI8qk9JykDd",
	"colab_type": "code",
	"colab": {}
	},
	"source": [
	"from sklearn.neighbors import KNeighborsClassifier\n",
	"from sklearn.metrics import f1_score\n",
	"k = 17\n",
	"knn = KNeighborsClassifier(n_neighbors = k).fit(X_train, y_train)\n",
	"knn_pred = knn.predict(X_test)"
	],
	"execution_count": null,
	"outputs": []
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "qNXOZO2YykDg",
	"colab_type": "text"
	},
	"source": [
	"#### Evaluation"
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "d21jysH_ykDg",
	"colab_type": "code",
	"colab": {},
	"outputId": "5dfb8eff-2862-4474-b4cb-682304eb3ef3"
	},
	"source": [
	"f1_score(y_test, knn_pred, average = 'macro')"
	],
	"execution_count": null,
	"outputs": [
	{
	"output_type": "execute_result",
	"data": {
	"text/plain": [
	"0.5426484037355158"
	]
	},
	"metadata": {
	"tags": []
	},
	"execution_count": 15
	}
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "5dBoAjZKykDj",
	"colab_type": "text"
	},
	"source": [
	"## Decision Tree <a name=\"tree\"></a>"
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "4SbgdvL1ykDk",
	"colab_type": "code",
	"colab": {}
	},
	"source": [
	"from sklearn.tree import DecisionTreeClassifier\n",
	"dt = DecisionTreeClassifier(criterion = 'entropy', max_depth = 7).fit(X_train, y_train)"
	],
	"execution_count": null,
	"outputs": []
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "cxUWcdARykDn",
	"colab_type": "code",
	"colab": {}
	},
	"source": [
	"dt_pred = dt.predict(X_test)"
	],
	"execution_count": null,
	"outputs": []
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "LgoBaPRXykDp",
	"colab_type": "text"
	},
	"source": [
	"#### Evaluation"
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "zswAa3CjykDp",
	"colab_type": "code",
	"colab": {},
	"outputId": "92c5489e-07ac-4a07-e4ec-348a4c454d22"
	},
	"source": [
	"f1_score(y_test, dt_pred)"
	],
	"execution_count": null,
	"outputs": [
	{
	"output_type": "execute_result",
	"data": {
	"text/plain": [
	"0.44739424020315116"
	]
	},
	"metadata": {
	"tags": []
	},
	"execution_count": 18
	}
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "Tu35OT6LykDr",
	"colab_type": "text"
	},
	"source": [
	"## Conclusion <a name=\"concllusion\"></a>"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "09egVtj0ykDs",
	"colab_type": "text"
	},
	"source": [
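	"Based on the F1 scores above, KNN (about 0.54) scores higher than the decision tree (about 0.45) and logistic regression (about 0.32), making it the best of the three models for predicting car accident severity. Note that the KNN score is macro-averaged, while the other two use scikit-learn's default binary averaging, so the scores are not strictly comparable."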
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "h7aPuHYgykDs",
	"colab_type": "text"
	},
	"source": [
	"Because weather, road, and light conditions point to certain classes, we can conclude that particular conditions have some impact on whether travel results in property damage (class 1) or injury (class 2)."
	]
	}
	]
	}