@aled1027
Last active June 19, 2016 20:46
Scientific Programming

Guidelines for writing Scientific Programs

Scientific programming is meaningfully different from typical software engineering. Where a web service runs continuously, a scientific program is run once - taking anywhere from a few seconds to a few days - and the results are analyzed. We may find that the script needs to be updated, or we may run the script on different data; in either case we make small changes to the script and run it again. The scientific programming process can be simplified to repeated iterations of this cycle.

As a result of this process, and the fact that scientific programs tend to be relatively small (ranging from a few hundred lines of code to a few thousand), I take a particular approach to writing them. My broad aims in writing scientific programs are the following:

  • Make the core algorithms clear, readable and easily vetted by others
  • Make the code easy to read and adopt for non-programmers
  • Make the code easy to improve or alter (e.g. adding parameters)

In order to achieve these aims, I find the following concrete guidelines to be immensely useful. I explain each of these guidelines in detail below.

  • Separate IO from algorithms
  • Use a parameters dictionary
  • Separate helper functions
  • Use namedtuple to parse files
  • Refactor

1. Separate data IO from algorithms

By separating data IO from the algorithms, our code is automatically more modular, readable and easier to test. Each function in well-written code does one thing well. In the case of scientific programming, that means one set of functions handles data IO well and another performs the algorithm well.

To isolate data IO from algorithms, I encourage you to write a function that loads your data - probably from a CSV file or similar format - into a list or dictionary and returns it. The algorithm function will take a list or dictionary of data, work on it, and also return a list or dictionary. Finally, the data output function will be called; it takes a list and saves the data to disk.

Here's an example template.

import csv

def read_data(filename):
    """Reads data from a file and returns a list containing the file's contents"""
    with open(filename, 'r') as csvfile:
        reader = csv.reader(csvfile)
        data = [row for row in reader]
    return data
    
def algorithm(data):
    """Computes some algorithm that does something. 
    Returns the data in such and such form"""
    # do things to data
    return data

def save_data(data, filename):
    """Saves data to filename as a comma-delimited csv"""
    with open(filename, 'w') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(data)

def main(in_file, out_file):
    """Main handler for algorithm that manages the data and calls other functions"""
    data = read_data(in_file)
    processed_data = algorithm(data)
    save_data(processed_data, out_file)

Check out this blog post for more information: http://www.machinalis.com/blog/separate-io-from-algorithms/

2. Use a parameters dictionary

Often you'll have an algorithm that takes many parameters as inputs, and which parameters are needed can vary with the values of other parameters or the model being computed. To prevent your code from becoming a mess of functions with many arguments, or functions with unpredictable arguments - that is, a strange, unpredictable internal interface - I recommend putting the parameters into a dictionary and passing the dictionary around. Generally, a function will receive the data it needs to perform its specific task, plus the dictionary of parameters.

The dictionary, which I usually call params, should be treated as static or immutable: set the parameters at the beginning of your code and don't change them afterwards - they are parameters, not variables. I prefer a params dictionary over globals because globals will make your life harder if your code happens to become more complex. With a parameters dictionary, if a collaborator wants to add another parameter to a specific algorithm, they can do so by simply adding it to the dictionary and accessing it where necessary.

Here's some example code that uses a parameter dictionary.

import argparse

from other_module import get_data, save_data

def check_params(params):
    """Validates that params contains valid values, set as needed"""
    assert 'param1' in params
    assert 'param2' in params
    assert 0 < params['param1'] < 1
    assert -1 < params['param2'] < 0

def go(data, params):
    """Computes our algorithm, taking data and parameters as input and outputting processed data"""
    check_params(params)

    # grab params that this function uses
    param1 = params['param1']
    param2 = params['param2']

    # work on data with param1 and param2
    # ...

    return data

if __name__ == '__main__':
    # use argparse to get command line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('in_file')
    parser.add_argument('out_file')
    parser.add_argument('--param1', type=float, default=0.5)
    parser.add_argument('--param2', type=float, default=-0.5)
    args = parser.parse_args()

    params = {'param1': args.param1,
              'param2': args.param2}
    data = get_data(args.in_file)
    processed_data = go(data, params)
    save_data(processed_data, args.out_file)

3. Separate helper functions

If you use other packages in your code, such as numpy, scipy or networkx, you may find yourself writing helper functions that perform tasks on top of the package's data objects. In these cases, it's often useful to isolate that computation into a helper function that you can call as needed and easily reuse in future work.

I have recently been using the Python networkx package, which provides Graph data objects and many algorithms that operate on graphs. One function that I needed but couldn't find built into networkx was a node-merge function. It takes a set of nodes and merges them all into a single node, preserving edges and creating loops. I wrote the merge_nodes function and put it into a module called networkx_helpers.py with other functions that I have written on top of networkx.

def merge_nodes(graph, merge_set, new_node):                                               
    """
    Squashes, or merges, all nodes in merge_set into a node called new_node
    Works for digraphs where all edges go in both directions,
    because that works best for PathLinker
    """
    graph.add_node(new_node)
    for node in merge_set:
        for edge in graph.edges(node):
            if edge[0] == node:
                graph.add_edge(new_node, edge[1])
                graph.add_edge(edge[1], new_node)
            elif edge[1] == node:
                graph.add_edge(edge[0], new_node)
                graph.add_edge(new_node, edge[0])

    # remove merge_set nodes
    for node in merge_set:
        if node in graph.nodes():
            graph.remove_node(node)

    return graph

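For example, merging two nodes squashes their edges onto the new node. A self-contained sketch (the merge_nodes definition is repeated so the snippet runs on its own; the graph and node names are made up):

```python
import networkx as nx

def merge_nodes(graph, merge_set, new_node):
    """Repeated from above so this example runs on its own"""
    graph.add_node(new_node)
    for node in merge_set:
        for edge in graph.edges(node):
            if edge[0] == node:
                graph.add_edge(new_node, edge[1])
                graph.add_edge(edge[1], new_node)
            elif edge[1] == node:
                graph.add_edge(edge[0], new_node)
                graph.add_edge(new_node, edge[0])
    for node in merge_set:
        if node in graph.nodes():
            graph.remove_node(node)
    return graph

# a digraph with edges in both directions: a<->b and c<->d
graph = nx.DiGraph()
for u, v in [('a', 'b'), ('c', 'd')]:
    graph.add_edge(u, v)
    graph.add_edge(v, u)

# squash 'b' and 'c' into a single node 'bc'
graph = merge_nodes(graph, ['b', 'c'], 'bc')

print(sorted(graph.nodes()))  # ['a', 'bc', 'd']
```

After the merge, 'bc' keeps edges to 'a' and 'd', in both directions.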

4. Use namedtuple to parse files

When parsing files, it is often useful to use a namedtuple to make the code more readable. By using namedtuple, you can clarify what each column in the csv file represents and access the data more easily.

import collections
import csv

Data = collections.namedtuple('Data', ['header1', 'header2'])

def get_data(filename):
    """Returns csv data as named tuples"""
    with open(filename, 'r') as csvfile:
        reader = csv.reader(csvfile)
        data = [Data(*row) for row in reader]
    return data
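Rows parsed this way are then accessed by field name rather than numeric index; a self-contained demonstration (the file contents are made up):

```python
import collections
import csv
import tempfile

Data = collections.namedtuple('Data', ['header1', 'header2'])

# write a small two-column csv to demonstrate (the values are made up)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('0.5,1.2\n0.7,3.4\n')
    filename = f.name

with open(filename, 'r') as csvfile:
    rows = [Data(*row) for row in csv.reader(csvfile)]

# each field is accessed by name instead of a numeric index
print(rows[0].header1)  # 0.5
print(rows[1].header2)  # 3.4
```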

5. Refactor

When you are first writing your code, don't worry too much about organization. Scientific programming is hard, and it's easy to make a mistake. But after you have a reasonable amount of code, say 100-200 lines, it's probably time to refactor.

When I refactor, I like to focus on the following things:

  • Separate IO from the algorithm (like suggestion 1)
    • To achieve this, make a new module called data_io.py and start moving IO functionality into this module
  • Isolate helper functions (like suggestion 3)
    • Move these into a new module called helpers.py
  • Separate out individual functions.
    • Try to make the innards of a for loop as few lines as possible.
    • Rewrite for loops as list comprehensions or generator expressions.
  • Make names clearer and more precise
    • Simply go back and rename things to reflect their actual values
    • Use namedtuple where it's helpful
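On shortening loop bodies, a for loop that just filters and transforms can usually collapse into a comprehension; a small made-up illustration:

```python
# before: an explicit loop with mutation
results = []
for a, b in [(1, 2), (3, 4), (5, 6)]:
    value = a * b
    if value > 5:
        results.append(value)

# after: a list comprehension says the same thing in one expression
results = [a * b for a, b in [(1, 2), (3, 4), (5, 6)] if a * b > 5]

print(results)  # [12, 30]
```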

More advanced

6. Use closures
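One natural use of closures in this setting is binding the params dictionary once, so downstream code receives a ready-to-call function instead of threading params through every call. A sketch (make_algorithm and scale are hypothetical names):

```python
def make_algorithm(params):
    """Returns a function with the parameters captured in a closure"""
    scale = params['scale']

    def algorithm(data):
        # scale is read from the enclosing scope, not passed on each call
        return [x * scale for x in data]

    return algorithm

algorithm = make_algorithm({'scale': 2})
print(algorithm([1, 2, 3]))  # [2, 4, 6]
```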

7. Use generators
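A sketch of the idea applied to read_data from suggestion 1: a generator yields one row at a time, so a file larger than memory can still be processed (the file contents here are made up):

```python
import csv
import tempfile

def read_rows(filename):
    """Yields rows one at a time, so the whole file never sits in memory"""
    with open(filename, 'r') as csvfile:
        for row in csv.reader(csvfile):
            yield row

# demonstrate on a small made-up file
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('a,1\nb,2\n')
    filename = f.name

rows = list(read_rows(filename))
print(rows)  # [['a', '1'], ['b', '2']]
```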

8. Use exceptions
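A sketch of how exceptions could replace the bare asserts in check_params from suggestion 2; a raised ValueError tells a collaborator exactly which parameter was wrong (the bounds are the made-up ones from earlier):

```python
def check_params(params):
    """Validates params, raising an informative error instead of a bare assert"""
    if 'param1' not in params:
        raise ValueError("missing required parameter 'param1'")
    if not 0 < params['param1'] < 1:
        raise ValueError("param1 must be between 0 and 1, got %r" % params['param1'])

try:
    check_params({'param1': 2})
except ValueError as err:
    print(err)  # param1 must be between 0 and 1, got 2
```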
