@jbusecke
Created September 17, 2018 19:03
Dataset storage on tigress at Princeton University

Shared datasets on the Princeton HPC filesystems

We often use datasets covering a variety of variables (e.g. temperature, salinity, chlorophyll) to validate model runs or conduct analyses.

Ideally, we want to:

  1. avoid storing duplicates of files on the server.
  2. be able to share these datasets with others in the group (downloading and postprocessing is boring).

Our group will get to cool results much more quickly if each person takes a bit more time to do the postprocessing and documentation properly and consistently. The next time you need a different dataset, it might already be sitting there.

Steps:

  • All files should be organized into a central folder in !!!. From there, folders should be labeled by variable etc., so that individual datasets can be found easily without having to ask (more specific rules need to be added).

  • Download the raw data and DO NOT TOUCH IT. Each dataset should be fetched from the source in its original file/folder structure and naming. This makes updating to newer versions and migration to different machines a lot easier.

  • Then postprocess your heart out, but SAVE + DOCUMENT! A lot of datasets are saved in unspeakably weird file structures, begging for postprocessing. Write a script to convert them and save the results into a processed folder. Make sure to INCLUDE a README file as well as the script/code (or a pointer to where it lives), so that results are reproducible.
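The steps above can be sketched roughly as follows. This is a minimal illustration, not a prescribed layout: the dataset name, the `raw`/`processed` folder names, and the placeholder conversion are all assumptions, and a real conversion would be dataset-specific (e.g. regridding or renaming variables).

```python
import tempfile
from pathlib import Path

# Hypothetical dataset folder; on tigress this would live in the shared
# central folder instead of a temporary directory.
base = Path(tempfile.mkdtemp()) / "example_dataset"

# raw/ holds the files exactly as downloaded -- never modified.
raw_dir = base / "raw"
raw_dir.mkdir(parents=True, exist_ok=True)
(raw_dir / "sample.dat").write_text("temperature data\n")  # stand-in raw file

# processed/ holds the converted files plus documentation.
processed_dir = base / "processed"
processed_dir.mkdir(parents=True, exist_ok=True)

# Convert each raw file; the .upper() call is a placeholder for whatever
# dataset-specific postprocessing is actually needed.
for raw_file in sorted(raw_dir.glob("*.dat")):
    out_file = processed_dir / (raw_file.stem + ".txt")
    out_file.write_text(raw_file.read_text().upper())

# Document how the processed files were produced, for reproducibility.
readme = processed_dir / "README"
readme.write_text(
    "Processed from ../raw by this script.\n"
    "Raw files were downloaded unmodified from the original source.\n"
)
```

The key design point is the strict separation: `raw/` is append-only and mirrors the source, while everything in `processed/` can be regenerated from it by rerunning the documented script.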
