This program must be run with Python 3.6 or later.
What I would do in order to run it in production:
- I would make it an executable and add an argument parser to let the user select input and output files from the command line. With a more sensible program name, an invocation might look like:
$ word_counter -i data.txt -o result.txt
This would also include exception handling in case the input file does not exist.
- I would change the format of the output file to tab- or comma-separated values, since that format is more common and easier to process. Come to think of it, why not add another command-line option for selecting the output format, so that the user can choose JSON, YAML, XML or whatever best fits their needs?
- I would revise the function clean_words. First, I would make it comply with the Unicode standard and respect languages that use letters not included in ASCII. Second, some words contain dashes but should still count as a single word.
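The revision of clean_words could go roughly as follows. This is only a sketch: the name clean_words_unicode, its signature (a string in, a list of words out) and the lowercasing step are assumptions for illustration, not the behaviour of the existing function. It normalises the text to NFC so that accented letters are single code points, then matches runs of Unicode letters that may be joined by single dashes.

```python
import re
import unicodedata

# [^\W\d_] is "word character minus digits and underscore", which is
# approximately "any Unicode letter" for str patterns in Python 3.
# An optional group lets single dashes join runs of letters into one word.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def clean_words_unicode(text):
    """Return the lowercased words in text (hypothetical replacement sketch)."""
    # NFC composes sequences such as "e" + combining acute accent into "é",
    # so a combining mark does not split a word in two.
    normalized = unicodedata.normalize("NFC", text)
    return WORD_PATTERN.findall(normalized.lower())


print(clean_words_unicode("A well-known café"))
# → ['a', 'well-known', 'café']
```

Apostrophes ("don't") and other dash characters such as U+2010 are not handled here; whether they should be depends on what the counts are meant to be used for.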
Bonus points: I would finish the docstrings, run pylint on the file, and write unit and performance tests. I actually ran the program on a 10^9-byte text file containing a dump of English Wikipedia and it finished in 19.5 seconds, so I think the performance is all right at the moment.
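For the first two points in the list, a command-line front end might look something like the sketch below. It uses only the standard library, so it offers tsv, csv and json but not yaml or xml. The -f/--format option, the function names and the Counter-based counting are assumptions made for illustration; the real program would call its own counting code instead.

```python
import argparse
import csv
import json
import sys
from collections import Counter


def build_parser():
    """Build the parser for the hypothetical word_counter command."""
    parser = argparse.ArgumentParser(
        prog="word_counter", description="Count words in a text file.")
    parser.add_argument("-i", "--input", required=True,
                        help="path to the input text file")
    parser.add_argument("-o", "--output", required=True,
                        help="path to the output file")
    parser.add_argument("-f", "--format", choices=["tsv", "csv", "json"],
                        default="tsv", help="output format (default: tsv)")
    return parser


def write_counts(counts, out, fmt):
    """Write a Counter to the open text file out, most frequent word first."""
    rows = counts.most_common()
    if fmt == "json":
        json.dump(dict(rows), out, ensure_ascii=False, indent=2)
        out.write("\n")
    else:
        delimiter = "\t" if fmt == "tsv" else ","
        writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
        writer.writerows(rows)


def main(argv=None):
    """Parse arguments, count words, write the result; return an exit code."""
    args = build_parser().parse_args(argv)
    try:
        with open(args.input, encoding="utf-8") as infile:
            # Stand-in for the program's real cleaning and counting logic.
            counts = Counter(infile.read().lower().split())
    except FileNotFoundError:
        print(f"word_counter: input file not found: {args.input}",
              file=sys.stderr)
        return 1
    with open(args.output, "w", encoding="utf-8", newline="") as outfile:
        write_counts(counts, outfile, args.format)
    return 0
```

The script's entry point would then call sys.exit(main()), so that a missing input file gives a short message and a non-zero exit status instead of a traceback.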