Skip to content

Instantly share code, notes, and snippets.

@JnyJny
Created March 12, 2020 16:39
Show Gist options
  • Save JnyJny/39e6cea5fec4bb523aab297c472bfb5c to your computer and use it in GitHub Desktop.
Save JnyJny/39e6cea5fec4bb523aab297c472bfb5c to your computer and use it in GitHub Desktop.
Comments for PyBites 178
from dateutil.parser import parse
import re
import datetime
def get_min_max_amount_of_commits(
commit_log: str = commits, year: int = None
) -> (str, str):
"""
Calculate the amount of inserts / deletes per month from the
provided commit log.
Takes optional year arg, if provided only look at lines for
that year, if not, use the entire file.
Returns a tuple of (least_active_month, most_active_month)
"""
with open(commits) as commit:
commit = commit.readlines()
# 1. The variable 'commit' is cobbered here, going from a file object
# to list of strings.
#
# 2. The context manager will hold the commit log file open until
# the manager exits, but you are reading all the data out of the
# file in one go.
data = [i.strip("\n").split("|") for i in commit]
# 3. Because we read all the data out of the file in one blob,
# we are now forced to perform several interations thru the
# rather than interacting with a line of data at a time.
# When the size of the file becomes large, this will become
# a performance issue that will be difficult to address with
# this design.
#
# 4. Many old-school programmers (C and Fortran especially) will
# use single letter variables to denote index variables. It is
# tempting even in comprehensions, but I would use a more descriptive
# name, e.g substitue 'line' for 'i' in the list comprehension above.
date = [parse(i[0], fuzzy=True).strftime("%Y-%m-%d %H:%M:%S") for i in data]
# 5. Here we are again traversing 'data' to parse the date out of each
# line in the file and have increased the programs memory demands by
# keeping all those dates around in memory.
# 6. Using indices (0 and 1) to reference the date section and
# the payload section of a split line is confusing to the
# reader and the programmer who needs to fix a bug.
# 7. I don't think it's necesary to use strftime on the returned
# datetime.datetime object.
changes = [i[1] for i in data]
# 7. We've just made a copy of the payload section of all the lines,
# increasing our memory footprint again. The semantic content
# has improved, 'changes' versus 'i[1]', it comes at a price.
change_count = []
for i in changes:
temp = re.findall(r"\d+", i)
change = list(map(int, temp))
change_count.append(sum(change[1:]))
# 8. If ever you have a problem and you say "I could use regex
# to solve that" you will soon find that you have two (or
# more) problems. The first problem is regular expression
# matching is slow. The next problem is crafting a regex that
# reliably matches the text you trying to parse out. The
# third problem is they can be very sensitive to changes in
# the input text.
# 9. We are relying on date, changes and change_count lists having
# the same number of items. If one of the operations that builds
# those lists results in a shorter or longer list, it will make
# matching counts to dates more error prone.
# 10. While map is part of the language, I would use a list comprehension
# instead.
commit_dict = dict(zip(date, change_count))
# 11. We've expanded our memory usage again, creating a new dictionary
# with the computed date string.
key_list = []
for key in commit_dict.keys():
if key[0:7] not in key_list:
key_list.append(key[0:7])
# 12. This construct loops thru all the keys in the dictionary
# we just made to 'unique'-ify the keys. It's confusiong
# and not obvious why would want to do this since we are
# using a slice to pull the first seven characters off the
# front of the key. We'd have to go back to the strftime
# to figure out what the slice refers to.
commit_totals = {}
for item in key_list:
val = sum([commit_dict.get(i) for i in commit_dict.keys() if item in i])
commit_totals[item] = val
# 13. We're building another dictionary (memory usage) and
# iterating thru the key_list and the list comprehension
# is pretty complicated (and another embedded interation
# which is a performance concern). While not incorrect to
# use the dictionary keys method, iterating on the bare
# dictionary returns the keys.
if year == None:
least_active_month = min(commit_totals, key=commit_totals.get)
most_active_month = max(commit_totals, key=commit_totals.get)
# 14. While not wrong, using the 'key' keyword argument with
# the get method is redundant. Min and Max both do the
# right thing for dictionaries and arrays.
return (least_active_month, most_active_month)
else:
year_dict = {k: v for (k, v) in commit_totals.items() if str(year) in k}
least_active_month = min(year_dict, key=year_dict.get)
most_active_month = max(year_dict, key=year_dict.get)
return (least_active_month, most_active_month)
# 15. If the user asked for a specific year range, then we
# create another dictionary filtered for the requested year.
# The comprehension has a performance concern where the
# the year is converted to a string for every item in
# commit_totals. This is called a 'loop invariant' since
# the value doesn't change in the body of the loop. A
# better practice would have been to convert year to a string
# value and then use that in the comprehension.
# 16. In my opinion, the else clause isn't necessary here
# since the if clause returns. I like it since it brings
# the indention level back to base level of the function
# body.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment