@heerensharma
Created July 6, 2017 12:27
Word Counting

Code

The script takes the input file path from the variable file_location.

Preferably, the single function should be divided into three submodules:

  1. A function that reads the file and yields it one line at a time, e.g. read_file.
  2. A function that takes one line as input, cleans it, and updates the output dictionary (this could be split out as a separate function to support a real-time counter).
  3. An output function that writes the result either to stdout or to a specific file with the desired formatting.
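
The three-part split listed above could look roughly like the following sketch. It assumes Python 3. Apart from read_file, which the list suggests, the names count_line and write_output are illustrative, and the cleaning rules simply mirror the single-function version: lowercase the line, replace every punctuation mark except '-' with a space, and skip lone dashes and number-only words.

```python
import string
import sys
from collections import Counter

# Every punctuation character except '-' maps to a space,
# so dashed words such as "sample-line" stay together.
_TABLE = str.maketrans({c: " " for c in string.punctuation if c != "-"})


def read_file(file_path):
    """Yield the non-empty lines of a file one at a time."""
    with open(file_path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield line


def count_line(line, counts):
    """Clean one line and add its words to the running counts."""
    for word in line.lower().translate(_TABLE).split():
        # Skip a lone dash and number-only words.
        if word == "-" or word.isdigit():
            continue
        counts[word] += 1
    return counts


def write_output(counts, out=sys.stdout):
    """Write 'word (count)' lines, most frequent first."""
    for word, count in counts.most_common():
        out.write("{0} ({1})\n".format(word, count))


def word_count_file(file_path):
    """Tie the three steps together for one file."""
    counts = Counter()
    for line in read_file(file_path):
        count_line(line, counts)
    write_output(counts)
    return counts
```

Because write_output accepts any object with a write method, passing an open file handle instead of sys.stdout covers the "specific file" case without a second function.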

Total time taken: 17 minutes to program the code (14 minutes development, 3 minutes testing) and 10 minutes to write the documentation.

The approach stated above has several advantages, one of them being that it lends itself to concurrent code. A function that processes one line at a time can be parallelized, or run concurrently using gevent or asyncio. The line-processing function then acts as the mapper, and instead of updating the output dictionary one word at a time we can simply write a reduce function that merges the per-line results.
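
The mapper and reducer idea can be sketched as follows. This is only an illustration, assuming Python 3: map_line, merge_counts and count_lines are hypothetical names, and the sketch runs sequentially with the built-in map by default. Nothing here uses gevent or asyncio; the map_fn parameter merely marks the point where a parallel map (for example the map method of a multiprocessing.Pool) could be substituted.

```python
import string
from collections import Counter
from functools import reduce

# Every punctuation character except '-' maps to a space.
_TABLE = str.maketrans({c: " " for c in string.punctuation if c != "-"})


def map_line(line):
    """Mapper: turn one line into its own independent word count."""
    words = line.lower().translate(_TABLE).split()
    # Skip a lone dash and number-only words.
    return Counter(w for w in words if w != "-" and not w.isdigit())


def merge_counts(left, right):
    """Reducer: combine two partial counts into one."""
    return left + right


def count_lines(lines, map_fn=map):
    """Map every line to a Counter, then fold the partial results together.

    map_fn defaults to the built-in map; a pool's map method could be
    passed in instead to process the lines in parallel.
    """
    return reduce(merge_counts, map_fn(map_line, lines), Counter())
```

For example, count_lines(["another brick in 1 wall @ pink-floyd;wall", "Another wall"])["wall"] evaluates to 3. Since each call to map_line touches no shared state, the lines can be processed in any order before the reduce step.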

Testing the code

The unit test uses a test file. Sample input lines are listed below, each followed by its expected word counts:

  • a sample-line working - not so fine sample-line => a (1) sample-line (2) working (1) not (1) so (1) fine (1)
  • , * , . Are all punctuations[marks] and are all not valid => are (2) all (2) punctuations (1) marks (1) and (1) not (1) valid (1)
  • another brick in 1 wall @ pink-floyd;wall => another (1) brick (1) in (1) wall (2) pink-floyd (1)
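
The sample cases above can be turned into a unit test along these lines. This is a sketch assuming Python 3. count_words is a hypothetical helper that applies the same cleaning rules to a single line and returns the counts, since the script's word_count_file prints its result rather than returning it. The expected values are the counts those rules produce for each sample line.

```python
import string
import unittest
from collections import Counter

# Every punctuation character except '-' maps to a space.
_TABLE = str.maketrans({c: " " for c in string.punctuation if c != "-"})


def count_words(line):
    """Hypothetical helper: count the words of one line."""
    words = line.lower().translate(_TABLE).split()
    # Skip a lone dash and number-only words.
    return Counter(w for w in words if w != "-" and not w.isdigit())


class WordCountTest(unittest.TestCase):
    def test_dashed_words_stay_together(self):
        counts = count_words("a sample-line working - not so fine sample-line")
        self.assertEqual(dict(counts), {"a": 1, "sample-line": 2, "working": 1,
                                        "not": 1, "so": 1, "fine": 1})

    def test_punctuation_is_dropped(self):
        counts = count_words(", * , . Are all punctuations[marks] and are all not valid")
        self.assertEqual(dict(counts), {"are": 2, "all": 2, "punctuations": 1,
                                        "marks": 1, "and": 1, "not": 1, "valid": 1})

    def test_numbers_are_skipped(self):
        counts = count_words("another brick in 1 wall @ pink-floyd;wall")
        self.assertEqual(dict(counts), {"another": 1, "brick": 1, "in": 1,
                                        "wall": 2, "pink-floyd": 1})
```

Saved in a module, the cases can be run with python -m unittest.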

import string
from collections import defaultdict

file_location = "data.txt"


def word_count_file(file_path):
    """Print each word of the file with its count, most frequent first."""
    try:
        output = defaultdict(int)
        with open(file_path) as f:
            lines = f.readlines()
        punctuations = set(string.punctuation)
        # remove '-' from the set to keep dashed words together
        punctuations.remove('-')
        for line in lines:
            line = line.strip()
            if line:
                # replace every remaining punctuation mark with a space
                cleaned_line = line.lower()
                for punctuation in punctuations:
                    cleaned_line = cleaned_line.replace(punctuation, " ")
                # split() with no arguments already drops empty strings
                for word in cleaned_line.split():
                    # skip a solo '-' and number-only words
                    if word == '-' or word.isdigit():
                        continue
                    output[word] += 1
        sorted_words = sorted(output, key=output.__getitem__, reverse=True)
        for word in sorted_words:
            print("{0} ({1})".format(word, output[word]))
    except Exception as e:
        print("Exception File Processing: %s" % file_path)
        print(e)


if __name__ == "__main__":
    word_count_file(file_location)