gwax/data_paths.md

## data_paths.md

      
A couple of pointers for dealing with files on the filesystem from inside Python:

- don't modify sys.path
- don't use relative paths unless they are relative to something
- always use os.path.join
- don't rely on the environment
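As a minimal sketch of the second and third pointers, the helper below anchors a relative path to an explicit base directory using os.path.join. The name `anchor_path` and the example directories are hypothetical, chosen only for illustration.

```python
import os


def anchor_path(base_dir, relative_path):
    """Resolve relative_path against base_dir instead of the current directory.

    os.path.join uses the right separator for the platform, and
    os.path.normpath collapses any '..' components in the result.
    """
    return os.path.normpath(os.path.join(base_dir, relative_path))


# Hypothetical base directory; in real code this would come from
# __file__ or from a command-line argument.
base = os.path.join(os.sep, 'srv', 'app')
csv_path = anchor_path(base, os.path.join('data', 'some_file.csv'))
```

Note that os.path.join discards `base_dir` if `relative_path` is itself absolute, so this only helps when the second argument really is relative.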

Now for a pattern that I strongly suggest:
Start with a code tree where your data files are set aside but still inside your code tree:
repo_root
|- clover/
   | - __init__.py
   | - some_package/
       | - __init__.py
       | - some_code.py
       | - data/
           | - some_file.csv

In this case, our data file (some_file.csv) is inside the data directory of our package (some_package).
In code that wants to access some_file.csv, you should build the data directory path relative to some_package. You can do this by importing the package and using its __file__ attribute:
import os

import clover.some_package

DATA_DIR = os.path.join(
    os.path.dirname(clover.some_package.__file__),
    'data')

with open(os.path.join(DATA_DIR, 'some_file.csv')) as some_file:
    # do something with your csv file; reading it all is just a placeholder
    contents = some_file.read()
This works because clover.some_package.__file__ gives you the path to clover/some_package/__init__.py regardless of which directory you ran Python from. Now we can use os.path.join to create a constant (DATA_DIR) with the location of our data, which we can use wherever we want.
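If several packages need the same treatment, the pattern can be wrapped in a small helper. This is only a sketch: the function names `package_data_dir` and `package_data_file` are hypothetical, and it assumes the package is a regular package with an `__init__.py` (so that `__file__` is set).

```python
import os


def package_data_dir(package, subdir='data'):
    """Return the absolute path of `subdir` inside an imported package.

    `package` is the imported module object, e.g. clover.some_package.
    The result depends only on where the package lives on disk,
    not on the current working directory.
    """
    package_root = os.path.dirname(os.path.abspath(package.__file__))
    return os.path.join(package_root, subdir)


def package_data_file(package, filename, subdir='data'):
    """Return the absolute path of one data file inside a package."""
    return os.path.join(package_data_dir(package, subdir), filename)
```

With the tree above, `package_data_file(clover.some_package, 'some_file.csv')` would point at clover/some_package/data/some_file.csv no matter where the script was started.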

  
## input_file_paths.md

      
If you are running code that needs to read input data (data not committed to the repo), you should take the data file (or path) as an argument and do everything relative to that argument.
I recommend using argparse to get arguments.
Here's a snippet for dealing with a single input file:
import argparse

def get_args():
    parser = argparse.ArgumentParser(description='Some script that takes a data file.')
    parser.add_argument('infile', type=argparse.FileType('r'))
    return parser.parse_args()

def main():
    args = get_args()
    print(args.infile.read())  # or do something more interesting with your input file

if __name__ == '__main__':
    main()
Or, if you prefer using a data folder:
import argparse
import os

def get_args():
    parser = argparse.ArgumentParser(description='Some script that takes multiple data files from a datapath.')
    parser.add_argument('datapath')
    return parser.parse_args()

def main():
    args = get_args()
    with open(os.path.join(args.datapath, 'expected_file.csv'), 'r') as csv_file:
        # do something with your file; printing it is just a placeholder
        print(csv_file.read())

if __name__ == '__main__':
    main()
If you do it this way, your path is specified from outside and everything you do is relative to that path. It doesn't matter where you run your script from or where the data is; you just tell your program up front and it does what you ask.
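One optional refinement, sketched here as an assumption rather than part of the pattern above, is to have argparse check the datapath up front so that a bad path fails with a usage message instead of a traceback later. The name `existing_dir` is hypothetical; `argparse.ArgumentTypeError` and the `type=` parameter are standard argparse features.

```python
import argparse
import os


def existing_dir(value):
    """argparse `type=` callable: accept only paths to existing directories."""
    if not os.path.isdir(value):
        raise argparse.ArgumentTypeError('not a directory: %r' % value)
    # Store an absolute path so later os.path.join calls do not
    # depend on the current working directory.
    return os.path.abspath(value)


def build_parser():
    parser = argparse.ArgumentParser(
        description='Some script that takes multiple data files from a datapath.')
    parser.add_argument('datapath', type=existing_dir)
    return parser
```

Calling `build_parser().parse_args()` then either returns an absolute `datapath` or exits with an error naming the offending argument.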