@gwax
Last active April 20, 2023 12:11
handling data file paths in python

A few pointers for dealing with files on the filesystem from inside Python:

  1. don't modify sys.path
  2. don't use relative paths unless they are relative to something
  3. always use os.path.join
  4. don't rely on the environment
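To illustrate the second pointer, here is a minimal sketch of what goes wrong with a bare relative path: it is resolved against the current working directory, so the same string points at different files depending on where the script was started. The helper name resolved_from and the temporary directories are hypothetical and exist only for the demonstration.

```python
import os
import tempfile


def resolved_from(cwd, relative_path):
    """Return what a bare relative path resolves to when the process runs from cwd."""
    previous = os.getcwd()
    os.chdir(cwd)
    try:
        # abspath resolves against the current working directory
        return os.path.abspath(relative_path)
    finally:
        # always restore the original working directory
        os.chdir(previous)


relative = os.path.join('data', 'some_file.csv')
with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    a = resolved_from(first, relative)
    b = resolved_from(second, relative)

print(a != b)  # the same relative path resolved to two different locations
```

Anchoring the path to something explicit (a package directory or a command-line argument) removes that dependence on the working directory.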

Now for a pattern that I strongly suggest:

Start with a code tree where your files are set aside but inside your code tree:

repo_root
|- clover/
   | - __init__.py
   | - some_package/
       | - __init__.py
       | - some_code.py
       | - data/
           | - some_file.csv

In this case, our data file (some_file.csv) is inside the data directory of our package (some_package).

In code that wants to access some_file, you should retrieve the data directory path relative to some_package. You can do this via import and the __file__ attribute:

import os

import clover.some_package

DATA_DIR = os.path.join(
    os.path.dirname(clover.some_package.__file__),
    'data')

with open(os.path.join(DATA_DIR, 'some_file.csv')) as some_file:
    contents = some_file.read()  # or do something more interesting with your csv file

This works because clover.some_package.__file__ gives you the path to clover/some_package/__init__.py regardless of where you started. Now we can use os.path.join to create a constant (DATA_DIR) with the location of our data, which we can use wherever we want.
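If several packages need the same treatment, the pattern can be wrapped in a small helper. This is only a sketch: the function name package_data_path is hypothetical, and it assumes the package object has a __file__ attribute (true for ordinary packages with an __init__.py on disk, but not for every kind of module).

```python
import os


def package_data_path(package, *parts):
    """Build a path under the directory that contains package.__file__.

    `package` is any imported package or module with a __file__ attribute;
    `parts` are the path components to join below its directory.
    """
    package_dir = os.path.dirname(os.path.abspath(package.__file__))
    return os.path.join(package_dir, *parts)
```

With the tree above, package_data_path(clover.some_package, 'data', 'some_file.csv') would give the same location as os.path.join(DATA_DIR, 'some_file.csv').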

If you are running code and you want to read input data (data not committed to the repo), you should take the data file (or its path) as an argument and do everything relative to that argument.

I recommend using argparse to get arguments.

Here's a snippet for dealing with a single input file:

import argparse

def get_args():
    parser = argparse.ArgumentParser(description='Some script that takes a data file.')
    parser.add_argument('infile', type=argparse.FileType('r'))
    return parser.parse_args()

def main():
    args = get_args()
    print(args.infile.read())  # or do something more interesting with your input file

if __name__ == '__main__':
    main()

Or, if you prefer using a data folder:

import argparse
import os

def get_args():
    parser = argparse.ArgumentParser(description='Some script that takes multiple data files from a datapath.')
    parser.add_argument('datapath')
    return parser.parse_args()

def main():
    args = get_args()
    with open(os.path.join(args.datapath, 'expected_file.csv'), 'r') as csv_file:
        print(csv_file.read())  # or do something more interesting with your file

if __name__ == '__main__':
    main()

If you do it this way, your path is specified from outside and everything you do is relative to that path. It doesn't matter where you run your script from or where the data lives: you tell your program up front and it does what you ask.
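One way to keep "everything relative to that path" in a single place is a small helper that joins file names onto the supplied datapath. The sketch below is an assumption-laden variant of the snippet above: the data_file helper is hypothetical, and get_args accepts an optional argv list (which argparse's parse_args supports) so it can be exercised without a real command line.

```python
import argparse
import os


def get_args(argv=None):
    parser = argparse.ArgumentParser(description='Read files relative to a datapath.')
    parser.add_argument('datapath')
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)


def data_file(datapath, filename):
    """Join filename onto the user-supplied datapath, made absolute once."""
    return os.path.join(os.path.abspath(datapath), filename)


args = get_args(['some_data_dir'])
print(data_file(args.datapath, 'expected_file.csv'))
```

Making the datapath absolute up front means a later change of working directory inside the program does not silently change which files are read.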

@john-h-rogers

This is great! Thanks George!
