heerensharma/description.md

## description.md

      

Takes an input file path in the variable file_location.
Preferably, the single function should be divided into three sub-modules:

1. Reading the file and extracting its lines one by one, e.g. read_file.
2. A function which takes one line as input, cleans it and enriches the output dictionary (this can be made a separate function for a real-time counter).
3. An output function, which can display the result either on stdout or in a specific file with the desired formatting.
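
The three-part split described above might look like the following minimal sketch. Only read_file is named in the text; process_line and write_output are hypothetical names chosen for illustration, and the cleaning rules are copied from word_counter.py.

```python
import string
import sys
from collections import defaultdict

# Every punctuation mark except '-', so dashed words stay together.
PUNCTUATIONS = set(string.punctuation) - {"-"}


def read_file(file_path):
    """Yield the lines of the file one at a time."""
    with open(file_path) as f:
        for line in f:
            yield line


def process_line(line, output):
    """Clean one line and add its words to the output dictionary.

    Hypothetical name; the description only asks for such a function.
    """
    cleaned_line = line.strip().lower()
    for punctuation in PUNCTUATIONS:
        cleaned_line = cleaned_line.replace(punctuation, " ")
    for word in cleaned_line.split():
        # Skip a solo '-' and number-only words.
        if word == "-" or word.isdigit():
            continue
        output[word] += 1
    return output


def write_output(output, stream=None):
    """Write 'word (count)' lines, most frequent first (hypothetical name).

    The stream defaults to stdout; any open text file works as well.
    """
    stream = stream if stream is not None else sys.stdout
    for word in sorted(output, key=output.__getitem__, reverse=True):
        stream.write("{0} ({1})\n".format(word, output[word]))
```

A caller would create one defaultdict(int), pass every line from read_file through process_line, and finish with write_output, handing it an open file object when the result should go to a file instead of stdout.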

Total time taken: 17 minutes to program the code (14 minutes of development and 3 minutes of testing) and 10 minutes to write the documentation.
The approach stated above has quite a few advantages, one of them being concurrency. A function that processes one line at a time can be parallelized, or run concurrently using gevent or asyncio.
The line-processing function then becomes a mapper, and instead of enriching the output dictionary one line at a time we can simply create a reduce function that merges the per-line results.
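
The mapper and reducer idea can be sketched as follows. This is an illustration under assumptions, not part of word_counter.py: map_line, reduce_counts and count_words are hypothetical names, and collections.Counter stands in for the output dictionary because two Counters can be added together.

```python
import string
from collections import Counter
from functools import reduce

# Every punctuation mark except '-', so dashed words stay together.
PUNCTUATIONS = set(string.punctuation) - {"-"}


def map_line(line):
    """Mapper (hypothetical name): turn one line into its own word counts."""
    cleaned = line.strip().lower()
    for punctuation in PUNCTUATIONS:
        cleaned = cleaned.replace(punctuation, " ")
    return Counter(
        word for word in cleaned.split()
        if word != "-" and not word.isdigit()
    )


def reduce_counts(left, right):
    """Reducer (hypothetical name): merge two partial counts into one."""
    return left + right


def count_words(lines):
    """Map every line independently, then fold the partial counts together."""
    return reduce(reduce_counts, map(map_line, lines), Counter())
```

Because map_line depends only on its own line, the mapping step shares no state between lines, which is the property that makes it a candidate for running in parallel; the sketch itself runs sequentially.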

Testing the code

Unit testing is done with a test file.
Sample input lines are listed below, each followed by its expected output:

a sample-line working - not so fine sample-line      => a (1) sample-line (2) working (1) not (1) so (1) fine (1)


, * , . Are all punctuations[marks] and are all not valid => are (2) all (2) punctuations (1) marks (1) and (1) not (1) valid (1)


another brick in 1 wall @ pink-floyd;wall => another (1) brick (1) in (1) wall (2) pink-floyd (1)
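
The sample cases above could be turned into unit tests along these lines. This is a sketch under assumptions: count_line is a hypothetical per-line helper that applies the same cleaning rules as word_counter.py, since the script itself only prints its result. Counts are per distinct word, so wall in the third case totals 2.

```python
import string
import unittest
from collections import Counter

# Every punctuation mark except '-', so dashed words stay together.
PUNCTUATIONS = set(string.punctuation) - {"-"}


def count_line(line):
    """Hypothetical helper: count the words of a single line."""
    cleaned = line.strip().lower()
    for punctuation in PUNCTUATIONS:
        cleaned = cleaned.replace(punctuation, " ")
    return Counter(
        word for word in cleaned.split()
        if word != "-" and not word.isdigit()
    )


class CountLineTest(unittest.TestCase):
    def test_dashed_words_stay_together(self):
        line = "a sample-line working - not so fine sample-line"
        expected = {"a": 1, "sample-line": 2, "working": 1,
                    "not": 1, "so": 1, "fine": 1}
        self.assertEqual(dict(count_line(line)), expected)

    def test_punctuation_is_dropped(self):
        line = ", * , . Are all punctuations[marks] and are all not valid"
        expected = {"are": 2, "all": 2, "punctuations": 1, "marks": 1,
                    "and": 1, "not": 1, "valid": 1}
        self.assertEqual(dict(count_line(line)), expected)

    def test_numbers_are_skipped(self):
        line = "another brick in 1 wall @ pink-floyd;wall"
        expected = {"another": 1, "brick": 1, "in": 1, "wall": 2,
                    "pink-floyd": 1}
        self.assertEqual(dict(count_line(line)), expected)
```

Saved to a file, the tests can be run with python -m unittest.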


## word_counter.py
import string
from collections import defaultdict


file_location = "data.txt"


def word_count_file(file_path):
    """Print every word in the file with its count, most frequent first."""
    try:
        output = defaultdict(int)
        # every punctuation mark except '-', to keep dashed words together
        punctuations = set(string.punctuation)
        punctuations.remove('-')
        with open(file_path) as f:
            for line in f:
                # lower-case the line and replace its punctuation with spaces
                cleaned_line = line.strip().lower()
                for punctuation in punctuations:
                    cleaned_line = cleaned_line.replace(punctuation, " ")
                # split() without arguments never yields empty strings
                for word in cleaned_line.split():
                    # skip a solo '-' and number-only words
                    if word == '-' or word.isdigit():
                        continue
                    output[word] += 1
        sorted_words = sorted(output, key=output.__getitem__, reverse=True)
        for word in sorted_words:
            print("{0} ({1})".format(word, output[word]))
    except Exception as e:
        print("Exception File Processing: %s" % file_path)
        print(e)


if __name__ == "__main__":
    word_count_file(file_location)