@jbusecke
Created September 17, 2018 19:03
Dataset storage on tigress at Princeton University

Shared datasets on the Princeton HPC filesystems

We often use datasets covering a variety of variables (e.g. temperature, salinity, chlorophyll) to validate model runs or conduct analyses.

Ideally, we want to:

  1. avoid storing duplicates of files on the server.
  2. be able to share these datasets with others in the group (downloading and postprocessing is boring).

Our group will get to cool results much more quickly if each person takes a bit more time to do the postprocessing and documentation properly and consistently. The next time you need a different dataset, it might already be sitting there.

Steps:

  • All files should be organized into a central folder in !!!. From there, folders should be labeled by variable etc., so that individual datasets can be found easily without having to ask (more specific rules need to be added).

  • Download the raw data and DO NOT TOUCH IT. Each dataset should be fetched from the source in its original file/folder structure and naming. This makes updating to newer versions and migration to different machines a lot easier.

  • Then postprocess your heart out, but SAVE + DOCUMENT! A lot of datasets are saved in unspeakably weird file structures, begging for postprocessing. Write a script to convert them and save the results into a processed folder. Make sure to INCLUDE a README file as well as the script/code (or a pointer to where it lives), so that results are reproducible.
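The steps above can be sketched roughly as follows. This is a minimal illustration, not a prescribed layout: the dataset name, the `raw`/`processed` folder names, and the placeholder conversion are all assumptions, and a real conversion would be dataset-specific (e.g. regridding or renaming variables).

```python
import tempfile
from pathlib import Path

# Hypothetical dataset folder; on tigress this would live in the shared
# central folder instead of a temporary directory.
base = Path(tempfile.mkdtemp()) / "example_dataset"

# raw/ holds the files exactly as downloaded -- never modified.
raw_dir = base / "raw"
raw_dir.mkdir(parents=True, exist_ok=True)
(raw_dir / "sample.dat").write_text("temperature data\n")  # stand-in raw file

# processed/ holds the converted files plus documentation.
processed_dir = base / "processed"
processed_dir.mkdir(parents=True, exist_ok=True)

# Convert each raw file; the .upper() call is a placeholder for whatever
# dataset-specific postprocessing is actually needed.
for raw_file in sorted(raw_dir.glob("*.dat")):
    out_file = processed_dir / (raw_file.stem + ".txt")
    out_file.write_text(raw_file.read_text().upper())

# Document how the processed files were produced, for reproducibility.
readme = processed_dir / "README"
readme.write_text(
    "Processed from ../raw by this script.\n"
    "Raw files were downloaded unmodified from the original source.\n"
)
```

The key design point is the strict separation: `raw/` is append-only and mirrors the source, while everything in `processed/` can be regenerated from it by rerunning the documented script.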
