@vumaasha
Created May 3, 2023 11:10
python-mrjob-demo.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/vumaasha/f00d42a8de7a51b461f0f6458a1460e3/python-mrjob-demo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Nr6-ZTWCrmIf"
},
"source": [
"# Midwest Big Data Summer School 2019\n",
"## Python MRJob Demo - Wed. May 22, 2019\n",
"**Dr. Robert Dyer**\n",
"\n",
"**Assistant Professor, Dept. of Computer Science**\n",
"\n",
"**Bowling Green State University**\n",
"\n",
"### NOTE: click \"open in playground mode\" in the File menu above so that you can run this notebook!\n",
"\n",
"In this notebook, I will show basic use of MRJob (MapReduce) inside Python.\n",
"\n",
"First, we need to install the `mrjob` Python package."
]
},
{
"cell_type": "code",
"metadata": {
"id": "5bxjSggnqiPR",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "7b3df0b6-9097-4d36-ee7b-ca18ad515f89"
},
"source": [
"!pip install --quiet mrjob"
],
"execution_count": null,
"outputs": [
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aMs7x0rYsXs1"
},
"source": [
"If there are no errors above, MRJob is installed and ready to use. Let's create a simple MapReduce program to test it. The `%%file` cell magic in the next cell saves the cell's contents to a file named `wordcount.py` so that we can execute it later."
]
},
{
"cell_type": "code",
"metadata": {
"id": "H5ZpJ_NMsn6P",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "b0c10367-e30c-4173-84d9-dfd304cd1ef7"
},
"source": [
"%%file wordcount.py\n",
"from mrjob.job import MRJob\n",
"import re\n",
"\n",
"class WordCount(MRJob):\n",
"    def mapper(self, key, value):\n",
"        # value is one line of input; emit (word, 1) for each whitespace-separated token\n",
"        words = [s for s in re.split(r'\\s+', value) if s]\n",
"        for word in words:\n",
"            yield word, 1\n",
"\n",
"    def reducer(self, key, values):\n",
"        # values iterates over every count emitted for this word\n",
"        yield key, sum(values)\n",
"\n",
"if __name__ == '__main__':\n",
"    WordCount.run()"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Writing wordcount.py\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!wget https://www.gutenberg.org/files/98/98-0.txt"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7vQ-tlzJTHKx",
"outputId": "1cbf12ee-8a17-41ae-96cf-6db5bf870e92"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"--2023-05-03 11:08:24-- https://www.gutenberg.org/files/98/98-0.txt\n",
"Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47\n",
"Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 807231 (788K) [text/plain]\n",
"Saving to: ‘98-0.txt’\n",
"\n",
"98-0.txt 100%[===================>] 788.31K 2.38MB/s in 0.3s \n",
"\n",
"2023-05-03 11:08:25 (2.38 MB/s) - ‘98-0.txt’ saved [807231/807231]\n",
"\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KzJ3Xe6z3YCu"
},
"source": [
"Now that the code is saved to a file, we can run it. This runs the job locally (not on Hadoop) on the file passed as the first argument. MRJob writes the results to standard output, which the command below redirects to `word-freq.out`."
]
},
{
"cell_type": "code",
"metadata": {
"id": "ygjvuNoMz4Ez",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "4f8675a8-3fce-4b4f-d04f-1c2fe0096321"
},
"source": [
"!python wordcount.py 98-0.txt > word-freq.out"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"No configs found; falling back on auto-configuration\n",
"No configs specified for inline runner\n",
"Creating temp directory /tmp/wordcount.root.20230503.110919.841814\n",
"Running step 1 of 1...\n",
"job output is in /tmp/wordcount.root.20230503.110919.841814/output\n",
"Streaming final output from /tmp/wordcount.root.20230503.110919.841814/output...\n",
"Removing temp directory /tmp/wordcount.root.20230503.110919.841814...\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!head word-freq.out"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "-U-OBOenTXhn",
"outputId": "0b0d5ffb-e5b9-4535-f26d-176cd758ca69"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\"breath,\"\t7\n",
"\"breath--\\u201ca\"\t1\n",
"\"breath.\"\t2\n",
"\"breathe\"\t3\n",
"\"breathed!\"\t1\n",
"\"breathed\"\t2\n",
"\"breathing\"\t5\n",
"\"breathing,\"\t3\n",
"\"breathing.\"\t1\n",
"\"breathing.\\u201d\"\t1\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Gq2pN8lv3itA"
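},
"source": [
"As you can see, the output lists each unique word in the input text and how often it occurred. Because the mapper splits only on whitespace, punctuation stays attached, so tokens such as `breath,` and `breath.` are counted as separate words."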
]
}
]
}
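
To make the data flow of the word count job in the notebook above easier to follow, here is a minimal sketch that imitates the map, shuffle, and reduce phases in plain Python, without MRJob. It is an illustration only and is not how MRJob is implemented. The helper name `run_wordcount` and the sample sentences are hypothetical, and sorting the keys before reducing is an assumption chosen to mirror the sorted order seen in `word-freq.out`.

```python
import re
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for each whitespace-separated token in one input line."""
    for word in re.split(r'\s+', line):
        if word:  # skip empty strings produced by leading/trailing whitespace
            yield word, 1


def reducer(key, values):
    """Sum all counts emitted for a single word."""
    yield key, sum(values)


def run_wordcount(lines):
    """Hypothetical driver: map every line, group by key, then reduce each group."""
    # Map + shuffle: collect every emitted value under its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)

    # Reduce: visit keys in sorted order (an assumption, see the text above).
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


counts = run_wordcount([
    "it was the best of times",
    "it was the worst of times",
])
print(counts)
```

The mapper sees one line at a time and never needs the whole file in memory, while the reducer only sees the values for a single key. That separation is what lets a framework such as Hadoop spread the same two functions across many machines.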