@jtaxen
Last active September 22, 2019 15:53
Word counting

This program must be run with Python 3.6 or later.

What I would do in order to run it in production:

  • I would make it executable and add an argument parser so that the user can select the input and output files from the command line. With a more sensible program name, an invocation might look like
$ word_counter -i data.txt -o result.txt

This would also include exception handling in case the input file does not exist.

  • I would change the format of the output file to tab- or comma-separated values, since that format is more common and easier to process. Come to think of it, why not add another command line option for selecting the output format, so that the user can choose JSON, YAML, XML or whatever suits their needs best?

  • I would revise the function clean_words. First, I would make it comply with the Unicode standard and respect that there are languages that use letters not included in ASCII. Second, there may be words that contain dashes but still count as a single word.
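The first two points could be sketched roughly as follows. This is only an illustration under assumptions: the `-f/--format` flag, its choices (`txt`, `tsv`, `csv`, `json`) and the helper names `parse_args` and `format_word_count` are invented for this example and are not part of the gist. The counting step inside `main` is a simple stand-in for the gist's `read_words` and `sort_word_count`.

```python
import argparse
import csv
import io
import json
import sys
from collections import Counter, OrderedDict


def parse_args(argv=None):
    """Parse the command line; the -f/--format flag is a hypothetical addition."""
    parser = argparse.ArgumentParser(prog='word_counter',
                                     description='Count words in a text file.')
    parser.add_argument('-i', '--input', default='data.txt',
                        help='input text file')
    parser.add_argument('-o', '--output', default='result.txt',
                        help='output file')
    parser.add_argument('-f', '--format', default='txt',
                        choices=['txt', 'tsv', 'csv', 'json'],
                        help='output format')
    return parser.parse_args(argv)


def format_word_count(word_count, output_format):
    """Serialise a {word: count} mapping to a string in the chosen format."""
    if output_format == 'json':
        return json.dumps(word_count, ensure_ascii=False, indent=2) + '\n'
    if output_format in ('tsv', 'csv'):
        buffer = io.StringIO()
        delimiter = '\t' if output_format == 'tsv' else ','
        writer = csv.writer(buffer, delimiter=delimiter, lineterminator='\n')
        writer.writerow(['word', 'count'])
        writer.writerows(word_count.items())
        return buffer.getvalue()
    # Fall back to the original "word (count)" layout.
    return ''.join(f'{word} ({count})\n' for word, count in word_count.items())


def main(argv=None):
    """Hypothetical entry point; returns a process exit code."""
    args = parse_args(argv)
    try:
        with open(args.input, encoding='utf-8') as input_file:
            # Stand-in for the gist's read_words/sort_word_count pipeline.
            counts = Counter(input_file.read().lower().split())
    except FileNotFoundError:
        print(f'word_counter: input file not found: {args.input}',
              file=sys.stderr)
        return 1
    ordered = OrderedDict(counts.most_common())
    with open(args.output, 'w', encoding='utf-8') as output_file:
        output_file.write(format_word_count(ordered, args.format))
    return 0
```

Wired up with `sys.exit(main())` under the usual `if __name__ == '__main__':` guard, an invocation such as `word_counter -i data.txt -o result.csv -f csv` would then write a header row followed by one `word,count` row per word. YAML and XML are left out here because they would need more than a few lines each.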

Bonus points: I would finish the docstrings, run pylint on the file and write unit and performance tests. I actually ran the program on a 10^9-byte text file containing a dump from English Wikipedia, and it finished in 19.5 seconds, so I think the performance is all right at the moment.
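For the third point and the unit tests mentioned above, one possible direction is sketched below. It is an assumption about what a revised `clean_words` could look like, not the gist's implementation: `str.isalpha()` accepts letters from any script, hyphens are kept only between letters, and the names `clean_word_unicode` and `CleanWordUnicodeTest` are invented for this example.

```python
import unittest


def clean_word_unicode(word):
    """Lower-case a token, keeping Unicode letters and hyphens between letters."""
    kept = ''.join(c for c in word if c.isalpha() or c == '-')
    # Hyphens only count inside a word, so strip them from both ends.
    return kept.strip('-').lower()


class CleanWordUnicodeTest(unittest.TestCase):
    def test_keeps_non_ascii_letters(self):
        self.assertEqual(clean_word_unicode('Ångström,'), 'ångström')

    def test_keeps_inner_hyphen(self):
        self.assertEqual(clean_word_unicode('well-constructed'),
                         'well-constructed')

    def test_strips_outer_punctuation_and_hyphens(self):
        self.assertEqual(clean_word_unicode('"-sub-field-",'), 'sub-field')

    def test_token_without_letters_becomes_empty(self):
        # Callers would still need to skip empty results before counting.
        self.assertEqual(clean_word_unicode('[1]'), '')
```

Saved in a test module, these cases run with `python -m unittest`. Digits are still dropped, so a token like `2D` would keep reducing to `d`; whether that is desirable is a separate decision.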

Computer Graphics
Computer graphics are graphics created using computers and the representation of image data by a computer specifically with help from specialized graphic hardware and software.
The interaction and understanding of computers and interpretation of data has been made easier because of computer graphics. Computer graphic development has had a significant impact on many types of media and have revolutionized animation, movies and the video game industry.
Overview
The term computer graphics has been used in a broad sense to describe "almost everything on computers that is not text or sound".[1] Typically, the term computer graphics refers to several different things:
- the representation and manipulation of image data by a computer
- the various technologies used to create and manipulate images
- the sub-field of computer science which studies methods for digitally synthesizing and manipulating visual content, see study of computer graphics
Computer graphics is widespread today. Computer imagery is found on television, in newspapers,
for example in weather reports, or for example in all kinds of medical investigation and surgical
procedures. A well-constructed graph can present
complex statistics in a form that is easier to understand and interpret. In the media "such graphs
are used to illustrate papers, reports, thesis", and other presentation material.[2]
Many powerful tools have been developed to visualize data. Computer generated imagery can be categorized into several different types: two dimensional (2D), three dimensional (3D), and animated graphics. As technology has improved, 3D computer graphics have become more common, but 2D computer graphics are still widely used. Computer graphics has emerged as a sub-field of computer science which studies methods for digitally synthesizing and manipulating visual content. Over the past decade, other specialized fields have been developed like information visualization, and scientific visualization more concerned with "the visualization of three dimensional phenomena (architectural, meteorological, medical, biological, etc.), where the emphasis is on realistic renderings of volumes, surfaces, illumination sources, and so forth, perhaps with a dynamic (time) component".[3]
import string
from collections import OrderedDict


def clean_words(word):
    """
    Remove characters that are not letters and turn upper case letters to lower
    case.

    :param word: String that is to be reformatted
    :return: String with a single word
    """
    return "".join([c for c in word if c in string.ascii_letters]).lower()


def read_words(input_file_name):
    """
    Read words from an input file and return a dictionary with word count for
    each word that appears in the file.

    :param input_file_name: Name of the input file
    :return: Dictionary with words as keys and wordcount as value
    """
    word_count = {}
    with open(input_file_name) as input_file:
        for line in input_file.readlines():
            # Tokens without any ASCII letters are reduced to '' and are
            # counted as well, which is why the output has an empty entry.
            words = list(map(clean_words, line.split(' ')))
            for word in words:
                word_count[word] = word_count.get(word, 0) + 1
    return word_count


def write_word_count(output_file_name, word_count):
    """
    Write the word count to a file, one "word (count)" entry per line.

    :param output_file_name: Name of the output file
    :param word_count: Dictionary with words as keys and wordcount as value
    """
    with open(output_file_name, 'w') as output_file:
        for word, count in word_count.items():
            output_file.write(f'{word} ({count})\n')


def sort_word_count(unsorted_word_count):
    """
    Return the word count ordered by count, most frequent word first.

    :param unsorted_word_count: Dictionary with words as keys and wordcount as value
    :return: OrderedDict sorted by descending count
    """
    return OrderedDict(sorted(unsorted_word_count.items(),
                              key=lambda w: w[1],
                              reverse=True))


def main(input_file_name='data.txt', output_file_name='result.txt'):
    """
    Read words from an input file and write the word count to an output file.

    :param input_file_name: Name of the input file
    :param output_file_name: Name of the output file
    """
    word_count = read_words(input_file_name)
    sorted_word_count = sort_word_count(word_count)
    write_word_count(output_file_name, sorted_word_count)


if __name__ == '__main__':
    main()
computer (17)
and (16)
graphics (12)
the (12)
of (12)
a (8)
(7)
in (6)
to (6)
has (5)
is (5)
data (4)
been (4)
on (4)
have (4)
used (4)
for (4)
d (4)
are (3)
computers (3)
with (3)
dimensional (3)
visualization (3)
representation (2)
image (2)
by (2)
specialized (2)
graphic (2)
easier (2)
many (2)
types (2)
media (2)
term (2)
that (2)
or (2)
several (2)
different (2)
subfield (2)
science (2)
which (2)
studies (2)
methods (2)
digitally (2)
synthesizing (2)
manipulating (2)
visual (2)
content (2)
imagery (2)
example (2)
reports (2)
medical (2)
can (2)
other (2)
developed (2)
three (2)
as (2)
more (2)
created (1)
using (1)
specifically (1)
help (1)
from (1)
hardware (1)
software (1)
interaction (1)
understanding (1)
interpretation (1)
made (1)
because (1)
development (1)
had (1)
significant (1)
impact (1)
revolutionized (1)
animation (1)
movies (1)
video (1)
game (1)
industry (1)
overview (1)
broad (1)
sense (1)
describe (1)
almost (1)
everything (1)
not (1)
text (1)
sound (1)
typically (1)
refers (1)
things (1)
manipulation (1)
various (1)
technologies (1)
create (1)
manipulate (1)
images (1)
see (1)
study (1)
widespread (1)
today (1)
found (1)
television (1)
newspapers (1)
weather (1)
all (1)
kinds (1)
investigation (1)
surgical (1)
procedures (1)
wellconstructed (1)
graph (1)
present (1)
complex (1)
statistics (1)
form (1)
understand (1)
interpret (1)
such (1)
graphs (1)
illustrate (1)
papers (1)
thesis (1)
presentation (1)
material (1)
powerful (1)
tools (1)
visualize (1)
generated (1)
be (1)
categorized (1)
into (1)
two (1)
animated (1)
technology (1)
improved (1)
become (1)
common (1)
but (1)
still (1)
widely (1)
emerged (1)
over (1)
past (1)
decade (1)
fields (1)
like (1)
information (1)
scientific (1)
concerned (1)
phenomena (1)
architectural (1)
meteorological (1)
biological (1)
etc (1)
where (1)
emphasis (1)
realistic (1)
renderings (1)
volumes (1)
surfaces (1)
illumination (1)
sources (1)
so (1)
forth (1)
perhaps (1)
dynamic (1)
time (1)
component (1)