Created
May 3, 2023 11:10
python-mrjob-demo.ipynb
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"provenance": [], | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/vumaasha/f00d42a8de7a51b461f0f6458a1460e3/python-mrjob-demo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "Nr6-ZTWCrmIf" | |
}, | |
"source": [ | |
"# Midwest Big Data Summer School 2019\n", | |
"## Python MRJob Demo - Wed. May 22, 2019\n", | |
"**Dr. Robert Dyer**\n", | |
"\n", | |
"**Assistant Professor, Dept. of Computer Science**\n", | |
"\n", | |
"**Bowling Green State University**\n", | |
"\n", | |
"### NOTE: click \"open in playground mode\" in the File menu above so that you can run this notebook!\n", | |
"\n", | |
"In this notebook, I will show basic use of MRJob (MapReduce) inside Python.\n", | |
"\n", | |
"First, we need to install a few Python packages into the system." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "5bxjSggnqiPR", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"outputId": "7b3df0b6-9097-4d36-ee7b-ca18ad515f89" | |
}, | |
"source": [ | |
"!pip install --quiet mrjob" | |
], | |
"execution_count": null, | |
"outputs": [ | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "aMs7x0rYsXs1" | |
}, | |
"source": [ | |
"If there are no errors above, then MRJob is properly installed in the system and ready to use. Let's create a simple MapReduce program to test. This will save the contents of the cell into a file named wordcount.py so that we can execute it later." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "H5ZpJ_NMsn6P", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"outputId": "b0c10367-e30c-4173-84d9-dfd304cd1ef7" | |
}, | |
"source": [ | |
"%%file wordcount.py\n", | |
"from mrjob.job import MRJob\n", | |
"import re\n", | |
"\n", | |
"class WordCount(MRJob):\n", | |
" def mapper(self, key, value):\n", | |
" words = [s.strip() for s in re.split('[\\s]', value) if s]\n", | |
" for word in words:\n", | |
" yield word, 1\n", | |
"\n", | |
" def reducer(self, key, values):\n", | |
" yield key, sum(values)\n", | |
"\n", | |
"if __name__ == '__main__':\n", | |
" WordCount.run()" | |
], | |
"execution_count": null, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"Writing wordcount.py\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"!wget https://www.gutenberg.org/files/98/98-0.txt" | |
], | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "7vQ-tlzJTHKx", | |
"outputId": "1cbf12ee-8a17-41ae-96cf-6db5bf870e92" | |
}, | |
"execution_count": null, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"--2023-05-03 11:08:24-- https://www.gutenberg.org/files/98/98-0.txt\n", | |
"Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47\n", | |
"Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 807231 (788K) [text/plain]\n", | |
"Saving to: β98-0.txtβ\n", | |
"\n", | |
"98-0.txt 100%[===================>] 788.31K 2.38MB/s in 0.3s \n", | |
"\n", | |
"2023-05-03 11:08:25 (2.38 MB/s) - β98-0.txtβ saved [807231/807231]\n", | |
"\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "KzJ3Xe6z3YCu" | |
}, | |
"source": [ | |
"Now that the code is saved to a file, we can run it. This will run it locally (not on Hadoop) and process any file you pass in as the first argument. The result will simply print to the console." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "ygjvuNoMz4Ez", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"outputId": "4f8675a8-3fce-4b4f-d04f-1c2fe0096321" | |
}, | |
"source": [ | |
"!python wordcount.py 98-0.txt > word-freq.out" | |
], | |
"execution_count": null, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"No configs found; falling back on auto-configuration\n", | |
"No configs specified for inline runner\n", | |
"Creating temp directory /tmp/wordcount.root.20230503.110919.841814\n", | |
"Running step 1 of 1...\n", | |
"job output is in /tmp/wordcount.root.20230503.110919.841814/output\n", | |
"Streaming final output from /tmp/wordcount.root.20230503.110919.841814/output...\n", | |
"Removing temp directory /tmp/wordcount.root.20230503.110919.841814...\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"!head word-freq.out" | |
], | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "-U-OBOenTXhn", | |
"outputId": "0b0d5ffb-e5b9-4535-f26d-176cd758ca69" | |
}, | |
"execution_count": null, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"\"breath,\"\t7\n", | |
"\"breath--\\u201ca\"\t1\n", | |
"\"breath.\"\t2\n", | |
"\"breathe\"\t3\n", | |
"\"breathed!\"\t1\n", | |
"\"breathed\"\t2\n", | |
"\"breathing\"\t5\n", | |
"\"breathing,\"\t3\n", | |
"\"breathing.\"\t1\n", | |
"\"breathing.\\u201d\"\t1\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "Gq2pN8lv3itA" | |
}, | |
"source": [ | |
"As you can see, it lists all the unique words in the source code and how often each one occured." | |
] | |
} | |
] | |
} |
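To make the map, shuffle, and reduce phases behind wordcount.py more concrete, here is a minimal pure-Python sketch of the same word count without MRJob. The helper names (`map_line`, `shuffle`, `reduce_word`, `word_count`) are hypothetical and chosen only for illustration. This is not how MRJob is implemented internally; it is the same logic run in a single process, assuming tokens are split on whitespace as in the notebook's mapper.

```python
from collections import defaultdict


def map_line(line):
    """Mirror WordCount.mapper: emit (word, 1) for each whitespace-separated token."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_word(key, values):
    """Mirror WordCount.reducer: sum the counts for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_line(line))
    groups = shuffle(pairs)
    # Sort by key so the result order resembles the sorted job output.
    return dict(reduce_word(k, v) for k, v in sorted(groups.items()))


if __name__ == "__main__":
    sample = ["it was the best of times", "it was the worst of times"]
    for word, count in word_count(sample).items():
        print(word, count)
```

As in the notebook's output, punctuation stays attached to each token, so a real run over a book would still count "breath," and "breath." separately unless the mapper also normalizes words.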