@kchawla-pi
Last active January 5, 2020 19:47
# Import each library on its own line, and arrange the imports alphabetically.
import csv
import glob
import hashlib
import os
# replace with file location
r'''
path = r'/mnt/chromeos/MyFiles/Downloads/CSVs/'
files = glob.glob(path + r"/*.csv")
The original code had:
output_filename = r"\output.csv"
You are working on a Unix system, but the original script used '\' in the output path, which is the path separator on Windows only.
Concatenating strings by hand is not the best way to construct paths in Python. There are several better ways; os.path.join is a good place to start:
'''
base_path = os.path.join(os.sep, 'mnt', 'chromeos', 'MyFiles', 'Downloads', 'CSVs')  # os.path.join takes separate arguments, not a list. 'path' is a generic name and is easily confused with the os.path module.
csv_files = glob.glob(os.path.join(base_path, '*.csv'))  # 'files' is a very generic name and doesn't say which kind of files it refers to.
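'''
One of the other ways mentioned above is the standard-library pathlib module.
This is only a minimal sketch, assuming the same directory layout; the names
find_csv_files, base_dir, and output_file are illustrative and not part of the original script.

```python
from pathlib import Path


def find_csv_files(base_dir):
    """Return the .csv files directly inside base_dir, sorted for a stable order."""
    return sorted(Path(base_dir).glob('*.csv'))


# The '/' operator joins path components, so no separator is typed by hand.
base_dir = Path('/') / 'mnt' / 'chromeos' / 'MyFiles' / 'Downloads' / 'CSVs'
output_file = base_dir / 'output.csv'
```

Path objects can be passed straight to open(), so the rest of a script would not need to change.
'''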
# this is where we'll store the values
hits = []
'''
The comment above was unnecessary; the code makes it fairly clear what this list is used for.
There were many such comments in the original code, and I have removed almost all of them.
Many people write unneeded comments, even experienced devs.
It takes time, practice, and deliberate intent to learn to recognize them.
Unneeded comments clutter the code, so a genuinely useful comment becomes hard to spot.
The guideline for commenting is this:
The code should be written so that reading it is sufficient to understand what is being done and why.
Good variable, function, and class names help tremendously with this.
If you need a comment to explain what is being done or why, it is worth looking at the code again to
check whether it can be simplified.
Only use comments when doing something non-obvious, or doing it in an unusual way.
'''
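'''
To illustrate the guideline, here is a small sketch. The function is_hit is hypothetical
and not from the original script. The names say what is being done, and the single
comment is kept because it explains something non-obvious.

```python
def is_hit(word, first_letter, valid_lengths):
    """Return True if word starts with first_letter and has an allowed length."""
    # str.startswith('') is always True, so reject an empty first_letter explicitly.
    if not first_letter:
        return False
    return word.startswith(first_letter) and len(word) in valid_lengths
```
'''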
first_letter = 't'  # snake_case, to match the other variable names.
valid_word_lengths = {5, 7}  # set(5, 7) raises a TypeError; use a set literal or set([5, 7]).
# read the files
for file in csv_files:
    with open(file, 'r') as f:  # the 'with' context manager ensures the file is closed automatically, even if an exception is raised inside the block.
        lines = list(f)  # 'f' is an iterator over the lines of the file, so converting it to a list gives you a list of lines.
    all_words = []  # the datatype should not be part of the variable name. Just call it 'words' or 'all_words', not 'word list'.
    for line_ in lines:  # adding an underscore to the singular element of a plural variable reduces the chance of misreading it.
        line_ = line_.replace(',', ' ')  # your original script replaced these chars with empty strings; this replaces them with a space.
        line_ = line_.replace('\n', ' ')  # side note - in text mode Python translates Windows '\r\n' line endings to '\n' on reading, so this handles files from Windows too.
        words_per_line = line_.split(' ')
        all_words.extend(words_per_line)  # append adds an object (a dict, a list, etc.) intact. extend merges two lists, so you don't have to loop again to make a single list.
    for word in all_words:
        if word.startswith(first_letter) and len(word) in valid_word_lengths:
            # note: this md5 is computed from the file path string, not from the file's contents.
            hits.append((word, file, hashlib.md5(file.encode('utf-8')).hexdigest()))  # instead of adding a list and then converting each one to a tuple in a map operation, just add a tuple here.
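'''
The difference between append and extend, shown on throwaway lists.
The variable names and sample words are illustrative only.

```python
appended = ['this', 'that']
appended.append(['these', 'those'])  # the new list is added intact, as one nested element

extended = ['this', 'that']
extended.extend(['these', 'those'])  # the new list's items are merged in one by one

print(appended)  # ['this', 'that', ['these', 'those']]
print(extended)  # ['this', 'that', 'these', 'those']
```
'''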
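# remove duplicates from our hits list
hits = set(hits)  # map() needs a function and an iterable, so map(hits) raises a TypeError; set() alone removes the duplicates. If the order is unimportant, there's no need to convert the set back to a list before writing it to a csv.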
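'''
If the order of the hits does matter, a set will not preserve it. Here is a sketch of one
alternative, relying on dicts keeping insertion order (guaranteed since Python 3.7).
The sample tuples are made up for illustration.

```python
sample_hits = [('these', 'a.csv'), ('those', 'b.csv'), ('these', 'a.csv')]

# dict keys are unique and keep insertion order, so this drops repeats
# while leaving the first occurrence of each tuple in place.
unique_hits = list(dict.fromkeys(sample_hits))

print(unique_hits)  # [('these', 'a.csv'), ('those', 'b.csv')]
```
'''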
output_filename = 'output.csv'  # Variable names should afford high clarity. 'output_filename' is far clearer than 'output'.
output_filepath = os.path.join(base_path, output_filename)  # moved this here, so variable is declared close to where it is being used.
with open(output_filepath, 'w', newline='') as f:  # 'file' is a generic word, avoid it.
    writer = csv.writer(f)
    writer.writerows(hits)
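'''
The md5 stored with each hit is computed from the file path string. If a checksum of the
file's contents is what is wanted instead (an assumption about the intent, not something
the original script states), a sketch could look like this. The name md5_of_file and the
chunk size are hypothetical choices.

```python
import hashlib


def md5_of_file(filepath, chunk_size=65536):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(filepath, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat even for large files.
'''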