jdavidheiser/Answers.txt

## Answers.txt
The following questions were posed by CodeKata 4.  Answers are provided inline.

* To what extent did the design decisions you made when writing the original programs make it easier or harder to factor out common code?

Part 1 was very direct, with details such as column numbers hard-coded inline.  For a first-pass prototype this was the fastest approach, compared with adding a lot of variables and creating functions.  Part 2, however, kept the basic structure of part 1 and made it more abstract, so by the time part 3 rolled around it was extremely easy to factor out the common ideas.
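As a rough illustration of that difference (a sketch, not the code from either part verbatim), the same row can be read with inline indexes or with column positions named once up front. The sample row and the team name `Example_FC` are made up for illustration; only the column positions mirror the ones used for football.dat in part 2.

```python
# Illustrative row laid out like a football.dat data row (values are made up).
row = "    1. Example_FC      38    20  10   8    50  -  40    70"
fields = row.split()

# Part 1 style: column positions written inline where they are used.
spread_inline = abs(float(fields[6]) - float(fields[8]))

# Part 2 style: the same positions named once, so only these lines change
# when the file layout changes.
column_for = 6
column_against = 8
column_name = 1
spread_named = abs(float(fields[column_for]) - float(fields[column_against]))

print(fields[column_name], spread_named)
```

Both forms compute the same value; the named version is the one that turns into function parameters almost unchanged in part 3.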

* Was the way you wrote the second program influenced by writing the first?

Yes - seeing the parallels between the two problems, it made sense to start abstracting some things in the second program already, making the code more generic.

* Is factoring out as much common code as possible always a good thing? Did the readability of the programs suffer because of this requirement? How about the maintainability?

In this case, it was a good thing - the resulting program stayed readable and is generic enough that it could easily be extended to other data files, and indeed might work for many of them without any changes.

HOWEVER, small distinctions like the extra asterisks in weather.dat or the horizontal dividing line in football.dat informed the process used to parse the data files.  If this program were eventually expanded to handle data files with significantly different formats, the combined, refactored code could become cumbersome quite quickly, as various filters and hacks were worked in to 'massage' the data into a more usable form.  In that case, a more robust data parsing library would be appropriate, possibly along with some sort of configuration file describing the expected data format.
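One way the configuration-file idea could look is sketched below. This is a minimal illustration under stated assumptions, not part of the submitted solution: the JSON layout, the key names (`name_column`, `column_a`, `column_b`, `min_columns`, `strip_chars`) and the helpers `parse_row` and `minimum_spread` are all hypothetical, and the sample rows are made up rather than taken from weather.dat.

```python
import json

# Hypothetical configuration text; in practice this would live in its own file.
CONFIG_TEXT = """
{
  "football": {"name_column": 1, "column_a": 6, "column_b": 8,
               "min_columns": 10, "strip_chars": ""},
  "weather":  {"name_column": 0, "column_a": 1, "column_b": 2,
               "min_columns": 3, "strip_chars": "*"}
}
"""
FORMATS = json.loads(CONFIG_TEXT)


def to_number(text, strip_chars):
    """Convert a field to float, first stripping any configured marker characters."""
    if strip_chars:
        text = text.strip(strip_chars)
    return float(text)


def parse_row(row, fmt):
    """Return (name, spread) for a data row, or None if the row is not data."""
    fields = row.split()
    if len(fields) < fmt["min_columns"]:
        return None
    try:
        a = to_number(fields[fmt["column_a"]], fmt["strip_chars"])
        b = to_number(fields[fmt["column_b"]], fmt["strip_chars"])
    except ValueError:
        return None  # header or divider row: not numeric where data is expected
    return fields[fmt["name_column"]], abs(a - b)


def minimum_spread(rows, fmt):
    """Return the (name, spread) pair with the smallest spread among data rows."""
    parsed = [p for p in (parse_row(r, fmt) for r in rows) if p is not None]
    return min(parsed, key=lambda item: item[1])


# Made-up rows shaped like a weather file: a header, a blank line, two data rows.
sample_rows = [
    "  Dy MxT   MnT   AvT",
    "",
    "   1  90    60    75",
    "   2  80*   70    75",
]
print(minimum_spread(sample_rows, FORMATS["weather"]))
```

With the format details moved into data, supporting a new file means adding a configuration entry rather than another special case in the parsing code, although formats that differ by more than column positions would still call for a real parsing library.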

## CodeKata4p1.py
'''
Example solution of CodeKata number Four: Data Munging, part 1
http://codekata.pragprog.com/2007/01/kata_four_data_.html#more
James Davidheiser
September 4, 2013
'''
filename = "weather.dat"

'''
The problem stated is to load a data file and determine the day which had the smallest difference
between the high and low temperatures, and print out that information.
'''

with open(filename) as delimfile:
    '''
    the data is stored in an ugly fashion, with headers, HTML tags, arbitrary blank lines, etc.
    a more sophisticated file reader add-on would be appropriate here,
    but for the purposes of the exercise, let's brute force it using only Python built-ins
    '''
    day=[]
    spread=[]
    for row in delimfile:
        # here we want to grab only rows with information in them, and then only the rows which
        # start with a number (for the day of the month).
        if len(row) > 1:
            tmp = row.split()
            if tmp[0].isdigit():
                day.append(tmp[0])
                spread.append(float(tmp[1].strip('*'))-float(tmp[2].strip('*')))

    # Find the day that corresponds to the minimum temperature difference
    # NumPy could do this much faster, but for this short data file this works well and is readable
    min_index = spread.index(min(spread))

    print "The day with the smallest temperature spread was:" , day[min_index] , "with a spread of" , spread[min_index] , "degrees"


## CodeKata4p2.py
'''
Example solution of CodeKata number Four: Data Munging, part 2
http://codekata.pragprog.com/2007/01/kata_four_data_.html#more
James Davidheiser
September 4, 2013
'''

filename = "football.dat"

'''
The problem stated is to load a data file and determine the team which had the smallest difference between
the goals for and against, and print out that information.  Here we define which columns of the data will
contain the relevant information, after being split by whitespace.
'''
column_for = 6
column_against = 8
column_name = 1
min_data_columns = 9 # a row must split into more than this many fields to count as data; some rows may
                     # have extra entries beyond the columns we care about

with open(filename) as delimfile:
    team = []
    spread = []

    for row in delimfile:
        tmp = row.split()
        # the previous example checked the row length before splitting - it's more logical to check it after the split
        # operation, because we can count the number of columns and more easily exclude non-data rows.
        if len(tmp) > min_data_columns:
            if tmp[column_for].isdigit():
                # add absolute value here because we care about the smallest difference, regardless of who won
                spread.append(abs(float(tmp[column_for])-float(tmp[column_against])))
                team.append(tmp[column_name])
    index_min = spread.index(min(spread))
    print "The team",team[index_min], "had the smallest difference in 'for' and 'against' goals, with a difference of",int(spread[index_min])


## CodeKata4p3.py
'''
Example solution of CodeKata number Four: Data Munging, part 3
http://codekata.pragprog.com/2007/01/kata_four_data_.html#more
James Davidheiser
September 4, 2013

The problem stated is to take the previous two examples, determining minimum scoring differential and minimum
temperature swings, and refactor them to work with some shared code.  Typically this would be placed in a separate
module file which is imported, but for the sake of brevity for this exercise, we include everything in a single
file and simply define a function that is called twice at the end.
'''

import sys

# It's possible that data files could have extraneous characters in the columns corresponding to
# data output values.  Strip those out of the column completely.

deletechars = '*'

def get_minimum_difference(filename,column_A,column_B,column_name,min_data_columns):
    '''
    get_minimum_difference finds the smallest difference between column A and column B in the text data
    file (filename), and returns a tuple containing the corresponding name from column_name, as well
    as the difference value
    '''
    with open(filename) as delimfile:
        name = []
        spread = []

        for row in delimfile:
            tmp = row.split()
            if len(tmp) > min_data_columns:

                # Rather than checking manually whether both columns hold numbers, use the
                # Pythonic approach with try and except blocks.  If either entry cannot be
                # converted to float, the row is not in the expected format and we skip it
                # gracefully.
                # HOWEVER - there is a caveat to this approach.  We could silently skip lines that
                # are merely formatted differently, so print those rows to stderr to warn the user.
                try:
                    spread.append(abs(float(tmp[column_A].translate(None,deletechars)) - \
                                          float(tmp[column_B].translate(None,deletechars))))
                    name.append(tmp[column_name])
                except ValueError:
                    print >> sys.stderr, "Warning, ignoring row: ", row
        index_min = spread.index(min(spread))


    return (name[index_min],spread[index_min])


if __name__ == '__main__':

    score_tuple = get_minimum_difference('football.dat',column_A=6,column_B=8,column_name=1,min_data_columns=9)
    print "The team name and smallest point differential in football.dat are:",score_tuple

    print "\n"

    temp_tuple = get_minimum_difference('weather.dat',column_A=1, column_B=2,column_name=0,min_data_columns=14)
    print "The day of June 2002 with the smallest difference between the high and low temperature is:", temp_tuple