@nicolasazrak
Last active January 16, 2018 21:00
Python example of a MapReduce job for the SO 1C2015 course assignment (TP)

MapReduce job that counts how many times each word occurs in a text.

Usage: cat texto.txt | ./mapper.py | sort | ./reducer.py

To order the results by number of repetitions and write them to a file, append a final sort: cat texto.txt | ./mapper.py | sort | ./reducer.py | sort -r -n -k2 > out.txt
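To make the data flow through the pipeline easier to follow, here is a minimal in-process sketch of the same three stages. It is an illustration under assumptions, not part of the gist: the names `map_phase`, `reduce_phase`, and `word_count` and the sample sentence are hypothetical, and `sorted()` stands in for the shell `sort` command.

```python
# Symbols stripped from each word, matching the list used in mapper.py.
SYMBOLS = ['\n', '\t', ',', '.', '?', '-', '"', "'", '(', ')', '!', ';', ':']


def map_phase(text):
    # Mirror mapper.py: clean each word and emit one "<word> 1" line.
    lines = []
    for word in text.split():
        for symbol in SYMBOLS:
            word = word.replace(symbol, '')
        if word:
            lines.append(word.lower() + " 1")
    return lines


def reduce_phase(lines):
    # Mirror reducer.py: add up the counts for each word.
    counts = {}
    for line in lines:
        word, repetitions = line.split()
        counts[word] = counts.get(word, 0) + int(repetitions)
    return counts


def word_count(text):
    # sorted() plays the role of `sort` between the two scripts.
    return reduce_phase(sorted(map_phase(text)))


if __name__ == "__main__":
    counts = word_count("The cat saw the dog. The dog ran!")
    # Same idea as `sort -r -n -k2`: highest count first.
    for word, total in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        print("%s %d" % (word, total))
```

For the sample sentence, `word_count` reports 3 for "the", 2 for "dog", and 1 each for "cat", "saw", and "ran".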

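#! /usr/bin/python2
# mapper.py: emit "<word> 1" for every word read from stdin.
import sys


def clean(word):
    # Strip punctuation and whitespace symbols, then lowercase the word.
    for symbol in ['\n', '\t', ',', '.', '?', '-', '"', "'", '(', ')', '!', ';', ':']:
        word = word.replace(symbol, '')
    return word.lower()


def print_words(words):
    # Skip tokens that consisted only of punctuation and are now empty.
    for word in words:
        if len(word) > 0:
            sys.stdout.write(word + " 1\n")


content = sys.stdin.read()
words = content.split()
words = map(clean, words)
print_words(words)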
#! /usr/bin/python2
# reducer.py: sum the counts emitted by mapper.py for each word.
import sys

dictionary = {}


def add_word(token, repetitions):
    # Accumulate the count, starting from 0 for a word not seen before.
    dictionary[token] = dictionary.get(token, 0) + int(repetitions)


def print_words():
    for word in dictionary:
        sys.stdout.write(word + " " + str(dictionary[word]) + "\n")


for token in sys.stdin:
    fields = token.split()
    if len(fields) < 2:
        continue  # ignore blank or malformed lines
    add_word(fields[0], fields[1])

print_words()
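The reducer above keeps every word in a dictionary, so it does not strictly depend on the `sort` step in the pipeline. As a sketch of an alternative, assuming the input really does arrive grouped by word (which any `sort` guarantees), the counts can be summed one word at a time with `itertools.groupby`, holding only the current group in memory. The function name `reduce_sorted` and the sample lines are hypothetical and not part of the gist.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield (word, total) pairs from "<word> <count>" lines grouped by word."""
    # Split each non-blank line into [word, count].
    pairs = (line.split() for line in lines if line.strip())
    # groupby merges consecutive lines that share the same word.
    for word, group in groupby(pairs, key=lambda fields: fields[0]):
        yield word, sum(int(fields[1]) for fields in group)


if __name__ == "__main__":
    # Sample input as it would arrive from `./mapper.py | sort`.
    sample = ["cat 1\n", "dog 1\n", "dog 1\n", "the 1\n"]
    for word, total in reduce_sorted(sample):
        print("%s %d" % (word, total))
```

Passing `sys.stdin` instead of `sample` would let the same function read from the pipeline. Unlike the dictionary version, this variant gives wrong totals if the input is not sorted, because a word that appears in two separate runs is reported twice.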