jkuruzovich/xmb.ipynb

## xmb.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "midterm_student.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "display_name": "Python [conda env:cadre]",
      "language": "python",
      "name": "conda-env-cadre-py"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.8"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/jkuruzovich/dae388ffc37389e0b6aba669ad156780/midterm_student.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "vZfe_BrN2782"
      },
      "source": [
        "## Midterm\n",
        "\n",
        "![](https://github.com/rpi-techfundamentals/hm-01-starter/blob/master/notsaved.png?raw=1)\n",
        "\n",
        "**WARNING!!!  If you see this icon at the top of your Colab session, your work is not saved automatically.**\n",
        "\n",
        "**Do not manually upload any files.  Use the `wget` command to retrieve files.**\n",
        "\n",
        "**Save your working file in Google Drive so that all changes will be saved as you work. MAKE SURE that your final version is saved to GitHub.** \n",
        "\n",
        "Before you turn this in, make sure everything runs as expected. First, restart the kernel (in the menu, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). The cells should run completely without intervention. **That is, DO NOT manually upload any files.  Use the `wget` command to retrieve files as necessary.**\n",
        "\n",
        "\n",
        "### This is a 55 point assignment.\n",
        "\n",
        "**You may find it useful to go through the notebooks from the course materials when doing these exercises.**\n",
        "\n",
        "**If you receive assistance from anyone in the class, it will be considered an ethical violation and referred to the associate dean.**"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "wAwXOu5EwrwL",
        "colab": {}
      },
      "source": [
        "#get the data\n",
        "!wget https://www.dropbox.com/s/tg5sq8202u9zcoq/2020_v2.csv\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "DlqW9xLjyRhV",
        "colab": {}
      },
      "source": [
        "import pandas as pd\n",
        "df = pd.read_csv('2020_v2.csv')\n",
        "df.head()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "1rX8YKAq0vfJ"
      },
      "source": [
        "### (15 points) 1. Basics (3 points each)\n",
        "\n",
        "The dataset is a simple simulated dataset representing customers (`id`) and whether they make a purchase (`purchase`) over a period of time (`ymr`).\n",
        "\n",
        "1a. Set `rows` as an *integer* equal to the number of rows in the `df` dataframe. \n",
        "\n",
        "1b. Set `columns` as an *integer* equal to the number of columns in the `df` dataframe. \n",
        "\n",
        "1c. Set `udates` as a *list* with the unique dates of the `df` dataframe. \n",
        "\n",
        "1d. Set `total_nulls` as the total nulls in the `df` dataframe. \n",
        "\n",
        "1e. Set `mean_target` as the mean value for the `purchase` column in the `df` dataframe."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "RM2YgR6V1wC2",
        "colab": {}
      },
      "source": [
        "\n",
        "#1a"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ppjtcZAPMChn",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#1b"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "A1KOB7rCMDy6",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#1c"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qAHvhDQMME4J",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#1d"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zxb_jgQQMFxe",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#1e"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "plPZBhRq-DND"
      },
      "source": [
        "### (10 points) 2. Null/Naive Model\n",
        "Set the null model equivalent to everyone purchasing in every time period by creating a column in the dataframe `df` called `pred_null`.\n",
        "\n",
        "Calculate the accuracy of the null model and assign it to `acc_null`.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "GVxdxFNq8dFN",
        "colab": {}
      },
      "source": [
        "#2\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aSQrbxpmLQ_j",
        "colab_type": "text"
      },
      "source": [
        "### (10 points) 3. Train Test Split\n",
        "Using `random_state=99`, do a 50/50 train-test split with only the variables `v1`-`v11` for X and `purchase` for y.   Your split should create the following: \n",
        "\n",
        "`train_X`, `test_X`, `train_y`, `test_y`\n",
        "\n",
        "**The split should not be stratified. \n",
        "The columns `id` and `yrm` should not be included.** "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "jABNcKvqLQ_k",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#3"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "zBeAGSgcAVGc"
      },
      "source": [
        "### (10 points) 3. Perform Classification Using scikit-learn. \n",
        "\n",
        "Perform classification analysis. Please note that you are not using the training and test sets created above.  This should be for the entire dataset. \n",
        "\n",
        "Classification 1\n",
        "\n",
        "2a. Use logistic regression to predict the `purchase` variable from `v1`-`v11`. \n",
        "\n",
        "2b. Assess the prediction accuracy and set it equal to `acc_log1`.   \n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "h7UmG8NtLQ_q",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#2a"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kSOFBXcIMba0",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#2b"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XXVgKrBnLQ_t",
        "colab_type": "text"
      },
      "source": [
        "### (10 points) 4. Perform Classification Using scikit-learn. \n",
        "\n",
        "Perform classification analysis. Please note that you are not using the training and test sets created above.  This should be for the entire dataset. \n",
        "\n",
        "Classification 2. \n",
        "\n",
        "4a. Transform the variable `myr` into n-1 dummy variables which have the prefix `myr`. \n",
        "\n",
        "4b. Use logistic regression to predict the `purchase` variable from just the dummy variables. \n",
        "\n",
        "4c. Assess the prediction accuracy and set it equal to `acc_log2`.   \n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "x3BdJIB50LEt",
        "colab": {}
      },
      "source": [
        "#4a"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "QW588WFJMk51",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#4b"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "sWcFzSgHMk-8",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#4c"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qt120gz7ORgB",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F0BY7kssOSIf",
        "colab_type": "text"
      },
      "source": [
        "### Submit your Assignment\n",
        "Click the link below to clone a repo to submit your assignment. \n",
        "\n",
        "[https://classroom.github.com/a/DT0sBQbX](https://classroom.github.com/a/DT0sBQbX)"
      ]
    }
  ]
}