@kavanagh
Last active June 9, 2016 09:09
How to include data in a project repo

Notes on how to include data in a project repo...

An ideal folder structure:

/my-project
  .gitignore
  /data
    - things.csv
    - people.csv
    - countries.csv
    - make-countries.js
    - Makefile
    - make-things.sh
    - README.md

Rules

  • The directory structure must be as flat as possible.
  • Write /data/README.md. See the notes below about what it should contain.
  • Think carefully before adding large data files. Consider treating them as temporary files that are generated by a script and excluded via .gitignore. If you must include a large file in the repo, you may need Git LFS.
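
As a rough sketch of the last rule, a script such as the make-things.sh shown in the structure above could keep its intermediate files in a temporary directory and write only the small final CSV into the data folder. The `tmp` directory name, the sample rows, and the column names are assumptions made for illustration; a real script would download or export the raw data rather than write it inline.

```shell
#!/bin/sh
# make-things.sh: hypothetical sketch that builds things.csv from a temporary raw file.
set -eu

# Directory for intermediate files (assumed name, meant to be git-ignored).
TMP_DIR="tmp"
mkdir -p "$TMP_DIR"

# Stand-in for fetching the raw data; a real script might download or export it here.
cat > "$TMP_DIR/things-raw.csv" <<'EOF'
id,name,notes
1,widget,internal note
2,gadget,another note
EOF

# Keep only the first two columns and write the file that gets committed.
cut -d, -f1,2 "$TMP_DIR/things-raw.csv" > things.csv
```

With `data/tmp/` listed in the project's .gitignore, only things.csv and the script itself end up in the repo, and anyone can regenerate the output by rerunning the script.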

Naming files

  • For data choose a name similar to what you would call a database table.
  • If a script outputs a single file, base the script's name on the output file, e.g. make-things.sh makes things.csv.
  • Use common sense for everything else.

Write a README.md

Document the data and the scripts in the data folder. Allow developers to understand what's what. Make it easier to audit and fact check the data, or remake it later.

Things to document about each data file:

  • Where the data came from. URLs are helpful.
  • Any transformations applied to the data after it was obtained from the source
  • What each column is for
  • When a file includes data stitched together from multiple sources, say where each part came from.
  • Bash or command-line snippets used to produce the file
  • The tools required to generate any temporary data files
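
One way to get started on the checklist above is a small helper that appends a documentation stub for a data file to README.md, with one entry per column taken from the CSV header row. This is only a sketch: the stub layout, the TODO placeholders, and the sample things.csv content are assumptions, not a prescribed format.

```shell
#!/bin/sh
# Hypothetical helper: append a documentation stub for one data file to README.md.
set -eu

# The data file to document (assumed name; change as needed).
DATA_FILE="things.csv"

# Create a tiny sample file if none exists, so the sketch runs on its own.
[ -f "$DATA_FILE" ] || printf 'id,name\n1,widget\n' > "$DATA_FILE"

{
  printf '## %s\n\n' "$DATA_FILE"
  printf '%s\n' '- Source: TODO (URL)'
  printf '%s\n' '- Transformations: TODO'
  printf '%s\n' '- Tools required: TODO'
  printf '%s\n' '- Columns:'
  # One sub-bullet per column, read from the CSV header row.
  head -n 1 "$DATA_FILE" | tr ',' '\n' | sed 's/^/  - /; s/$/: TODO/'
  printf '\n'
} >> README.md
```

The TODO entries still have to be filled in by hand; the helper only makes sure that no column or checklist item is forgotten.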

Ideal file formats

Priority ordered:

Data

  • JSON
  • CSV
  • TSV
  • TopoJSON
  • GeoJSON

Script

  • sh, bash
  • js
  • r
  • Makefile