Skip to content

Instantly share code, notes, and snippets.

@vumaasha
Last active May 3, 2023 13:01
Show Gist options
  • Save vumaasha/9bd455881e93473aa5abf044adeab775 to your computer and use it in GitHub Desktop.
Save vumaasha/9bd455881e93473aa5abf044adeab775 to your computer and use it in GitHub Desktop.
python-mrjob-demo.ipynb
Display the source blob
Display the rendered blob
Raw
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/vumaasha/9bd455881e93473aa5abf044adeab775/python-mrjob-demo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Nr6-ZTWCrmIf"
},
"source": [
"# Midwest Big Data Summer School 2019\n",
"## Python MRJob Demo - Wed. May 22, 2019\n",
"**Dr. Robert Dyer**\n",
"\n",
"**Assistant Professor, Dept. of Computer Science**\n",
"\n",
"**Bowling Green State University**\n",
"\n",
"### NOTE: click \"open in playground mode\" in the File menu above so that you can run this notebook!\n",
"\n",
"In this notebook, I will show basic use of MRJob (MapReduce) inside Python.\n",
"\n",
"First, we need to install a few Python packages into the system."
]
},
{
"cell_type": "code",
"metadata": {
"id": "5bxjSggnqiPR"
},
"source": [
"!pip install --quiet mrjob"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"!wget https://www.gutenberg.org/cache/epub/67979/pg67979.txt"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "QujUCXFhimLf",
"outputId": "01a3d1c0-8a63-43fa-8c31-4a2e5636f113"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"--2023-05-03 12:16:07-- https://www.gutenberg.org/cache/epub/67979/pg67979.txt\n",
"Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47\n",
"Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 427836 (418K) [text/plain]\n",
"Saving to: ‘pg67979.txt’\n",
"\n",
"pg67979.txt 100%[===================>] 417.81K 1.16MB/s in 0.4s \n",
"\n",
"2023-05-03 12:16:08 (1.16 MB/s) - ‘pg67979.txt’ saved [427836/427836]\n",
"\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!ls pg67979.txt"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "r3z3_qByiqCt",
"outputId": "fe772be1-3cd9-43f3-8f9c-0b93356b0570"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"pg67979.txt\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!head pg67979.txt"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "pgs9I5icit0b",
"outputId": "861124dd-49db-43b6-a804-9f20d2093b4c"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"The Project Gutenberg eBook of The Blue Castle, by Lucy Maud Montgomery\r\n",
"\r\n",
"This eBook is for the use of anyone anywhere in the United States and\r\n",
"most other parts of the world at no cost and with almost no restrictions\r\n",
"whatsoever. You may copy it, give it away or re-use it under the terms\r\n",
"of the Project Gutenberg License included with this eBook or online at\r\n",
"www.gutenberg.org. If you are not located in the United States, you\r\n",
"will have to check the laws of the country where you are located before\r\n",
"using this eBook.\r\n",
"\r\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"# copying this file to hdfs"
],
"metadata": {
"id": "vEDKkK7umDg2"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "aMs7x0rYsXs1"
},
"source": [
"If there are no errors above, then MRJob is properly installed in the system and ready to use. Let's create a simple MapReduce program to test. This will save the contents of the cell into a file named wordcount.py so that we can execute it later."
]
},
{
"cell_type": "code",
"source": [
"%%file letter_count.py\n",
"from mrjob.job import MRJob\n",
"import re\n",
"\n",
"class LetterCount(MRJob):\n",
" def mapper(self, key, value):\n",
" splits = re.split('[\\s]', value)\n",
" word = splits[0].lower()\n",
" count = splits[1]\n",
" starting_letter = word[1]\n",
" if starting_letter == \"x\":\n",
" yield starting_letter, int(count)\n",
"\n",
" def reducer(self, key, values):\n",
" yield key, sum(values)\n",
"\n",
"if __name__ == '__main__':\n",
" LetterCount.run()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "KWDfXO8lpwXO",
"outputId": "32c9ccf4-2055-4bf8-973f-8bf9c918db4b"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Overwriting letter_count.py\n"
]
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "H5ZpJ_NMsn6P",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "e9831107-f8e1-49b4-c90a-5c901fd6c36e"
},
"source": [
"%%file wordcount.py\n",
"from mrjob.job import MRJob\n",
"import re\n",
"\n",
"class WordCount(MRJob):\n",
" def mapper(self, key, value):\n",
" words = [s.strip() for s in re.split('[\\s]', value) if s]\n",
" for word in words:\n",
" yield word, 1\n",
"\n",
" def reducer(self, key, values):\n",
" yield key, sum(values)\n",
"\n",
"if __name__ == '__main__':\n",
" WordCount.run()"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Overwriting wordcount.py\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KzJ3Xe6z3YCu"
},
"source": [
"Now that the code is saved to a file, we can run it. This will run it locally (not on Hadoop) and process any file you pass in as the first argument. The result will simply print to the console."
]
},
{
"cell_type": "code",
"metadata": {
"id": "ygjvuNoMz4Ez",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "2e215bed-3e60-4f04-c950-52fa426461a1"
},
"source": [
"!python wordcount.py pg67979.txt > word-count.out"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"No configs found; falling back on auto-configuration\n",
"No configs specified for inline runner\n",
"Creating temp directory /tmp/wordcount.root.20230503.124836.656580\n",
"Running step 1 of 1...\n",
"job output is in /tmp/wordcount.root.20230503.124836.656580/output\n",
"Streaming final output from /tmp/wordcount.root.20230503.124836.656580/output...\n",
"Removing temp directory /tmp/wordcount.root.20230503.124836.656580...\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!du -hs *"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Fq0ouEqtsz0w",
"outputId": "6725cd0a-66cd-4818-f981-997e685be5db"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"792K\t98-0.txt\n",
"792K\t98-0.txt.1\n",
"4.0K\talpha-count.out\n",
"4.0K\tletter_count.py\n",
"420K\tpg67979.txt\n",
"55M\tsample_data\n",
"176K\tword-count.out\n",
"4.0K\twordcount.py\n",
"16K\tword-freq.out\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!head -10 word-count.out"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8GgpuygxoMCk",
"outputId": "c5dbbf52-95b6-4d89-9612-958e5ad93d4b"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\"appreciate\"\t1\n",
"\"appreciation\"\t1\n",
"\"apprehension.\"\t1\n",
"\"approach\"\t2\n",
"\"approaching\"\t1\n",
"\"approval.\"\t2\n",
"\"approved\"\t1\n",
"\"approved.\"\t1\n",
"\"apron\"\t2\n",
"\"apt\"\t3\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!python letter_count.py word-count.out"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "6cA0oLKJn03a",
"outputId": "62f7dff2-88d9-43da-faab-60f6c4957042"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"No configs found; falling back on auto-configuration\n",
"No configs specified for inline runner\n",
"Creating temp directory /tmp/letter_count.root.20230503.125940.661568\n",
"Running step 1 of 1...\n",
"job output is in /tmp/letter_count.root.20230503.125940.661568/output\n",
"Streaming final output from /tmp/letter_count.root.20230503.125940.661568/output...\n",
"\"x\"\t72\n",
"Removing temp directory /tmp/letter_count.root.20230503.125940.661568...\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!tail -20 word-count.out"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7liGSTGgkjhe",
"outputId": "7a490eac-e492-4be4-cbb9-532d6f1cc30f"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\"settled\"\t1\n",
"\"settled,\"\t1\n",
"\"settled.\"\t1\n",
"\"seven\"\t1\n",
"\"seven,\"\t1\n",
"\"seven.\"\t2\n",
"\"seventeen\"\t2\n",
"\"seventeen,\\u201d\"\t1\n",
"\"seventy\"\t2\n",
"\"several\"\t11\n",
"\"severe\"\t1\n",
"\"severed\"\t1\n",
"\"severely\"\t1\n",
"\"severely.\"\t2\n",
"\"sew\"\t1\n",
"\"sew.\"\t1\n",
"\"sewing\"\t1\n",
"\"shabby\"\t4\n",
"\"shabby,\"\t2\n",
"\"shabby\\u2014nobody\"\t1\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Gq2pN8lv3itA"
},
"source": [
"As you can see, it lists all the unique words in the source code and how often each one occured."
]
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment