vumaasha/python-mrjob-demo.ipynb

## python-mrjob-demo.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/vumaasha/f00d42a8de7a51b461f0f6458a1460e3/python-mrjob-demo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Nr6-ZTWCrmIf"
      },
      "source": [
        "# Midwest Big Data Summer School 2019\n",
        "## Python MRJob Demo - Wed. May 22, 2019\n",
        "**Dr. Robert Dyer**\n",
        "\n",
        "**Assistant Professor, Dept. of Computer Science**\n",
        "\n",
        "**Bowling Green State University**\n",
        "\n",
        "### NOTE: click \"open in playground mode\" in the File menu above so that you can run this notebook!\n",
        "\n",
        "This notebook demonstrates basic use of MRJob, a Python library for writing MapReduce jobs.\n",
        "\n",
        "First, we need to install the mrjob package."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5bxjSggnqiPR",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "7b3df0b6-9097-4d36-ee7b-ca18ad515f89"
      },
      "source": [
        "!pip install --quiet mrjob"
      ],
      "execution_count": null,
      "outputs": [
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aMs7x0rYsXs1"
      },
      "source": [
        "If there are no errors above, then MRJob is installed and ready to use.  Let's create a simple MapReduce program to test it.  The `%%file` cell magic saves the contents of the cell into a file named wordcount.py so that we can execute it later."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "H5ZpJ_NMsn6P",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "b0c10367-e30c-4173-84d9-dfd304cd1ef7"
      },
      "source": [
        "%%file wordcount.py\n",
        "from mrjob.job import MRJob\n",
        "import re\n",
        "\n",
        "class WordCount(MRJob):\n",
        "    def mapper(self, key, value):\n",
        "        # value is one line of input; split it on runs of whitespace\n",
        "        for word in re.split(r'\\s+', value):\n",
        "            if word:  # skip empty strings from leading/trailing whitespace\n",
        "                yield word, 1\n",
        "\n",
        "    def reducer(self, key, values):\n",
        "        # values holds every count emitted for this word\n",
        "        yield key, sum(values)\n",
        "\n",
        "if __name__ == '__main__':\n",
        "    WordCount.run()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Writing wordcount.py\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!wget https://www.gutenberg.org/files/98/98-0.txt"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "7vQ-tlzJTHKx",
        "outputId": "1cbf12ee-8a17-41ae-96cf-6db5bf870e92"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2023-05-03 11:08:24--  https://www.gutenberg.org/files/98/98-0.txt\n",
            "Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47\n",
            "Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 807231 (788K) [text/plain]\n",
            "Saving to: ‘98-0.txt’\n",
            "\n",
            "98-0.txt            100%[===================>] 788.31K  2.38MB/s    in 0.3s    \n",
            "\n",
            "2023-05-03 11:08:25 (2.38 MB/s) - ‘98-0.txt’ saved [807231/807231]\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KzJ3Xe6z3YCu"
      },
      "source": [
        "Now that the code is saved to a file, we can run it.  This runs the job locally (not on Hadoop) and processes the file passed as the first argument, here the text downloaded above.  MRJob prints its log messages to the console, while the results are redirected into word-freq.out."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ygjvuNoMz4Ez",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "4f8675a8-3fce-4b4f-d04f-1c2fe0096321"
      },
      "source": [
        "!python wordcount.py 98-0.txt > word-freq.out"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "No configs found; falling back on auto-configuration\n",
            "No configs specified for inline runner\n",
            "Creating temp directory /tmp/wordcount.root.20230503.110919.841814\n",
            "Running step 1 of 1...\n",
            "job output is in /tmp/wordcount.root.20230503.110919.841814/output\n",
            "Streaming final output from /tmp/wordcount.root.20230503.110919.841814/output...\n",
            "Removing temp directory /tmp/wordcount.root.20230503.110919.841814...\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!head word-freq.out"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "-U-OBOenTXhn",
        "outputId": "0b0d5ffb-e5b9-4535-f26d-176cd758ca69"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\"breath,\"\t7\n",
            "\"breath--\\u201ca\"\t1\n",
            "\"breath.\"\t2\n",
            "\"breathe\"\t3\n",
            "\"breathed!\"\t1\n",
            "\"breathed\"\t2\n",
            "\"breathing\"\t5\n",
            "\"breathing,\"\t3\n",
            "\"breathing.\"\t1\n",
            "\"breathing.\\u201d\"\t1\n"
          ]
        }
      ]
    },
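    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Gq2pN8lv3itA"
      },
      "source": [
        "As you can see, the output lists each unique whitespace-separated word in the input text and how often it occurred.  Because the mapper splits only on whitespace, punctuation stays attached to the words, so \"breath,\" and \"breath.\" are counted separately."
      ]
    }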
  ]
}
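
The notebook above relies on MRJob to group the mapper's (word, 1) pairs by key before the reducer sees them. As a rough illustration of that flow, here is a minimal plain-Python sketch that runs the same three stages (map, group by key, reduce) in a single process. It does not use MRJob at all, and the names `map_line`, `reduce_word`, and `run_local` are hypothetical helpers introduced only for this example.

```python
import re
from collections import defaultdict


def map_line(line):
    """Emit (word, 1) for each whitespace-separated token in one line."""
    for word in re.split(r'\s+', line):
        if word:  # skip empty strings from leading/trailing whitespace
            yield word, 1


def reduce_word(word, counts):
    """Sum all counts collected for a single word."""
    return word, sum(counts)


def run_local(lines):
    """Simulate map -> group by key -> reduce without any framework."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_line(line):
            grouped[word].append(count)
    # Sorting the keys is only done here to make the result order stable.
    return [reduce_word(word, grouped[word]) for word in sorted(grouped)]


if __name__ == '__main__':
    sample = ["it was the best of times,", "it was the worst of times,"]
    for word, total in run_local(sample):
        print(word, total)
```

As in the notebook's output, punctuation stays attached to each token, so `times,` is counted as its own word in this sketch.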