How to organize your Python data science project

Having worked on a number of data projects over the years, and having seen many more on GitHub, I've found that projects vary widely in how "readable" they are. I'd like to share some practices I've adopted in my own projects, in the hope that they bring some organization to yours.

Disclaimer: I'm hoping nobody takes this to be "the definitive guide" to organizing a data project; rather, I hope you, the reader, find useful tips that you can adapt to your own projects.

Disclaimer 2: What I’m writing below is primarily geared towards Python users. Some ideas may transfer to other languages; others may not. Please feel free to remix whatever you see here!

Disclaimer 3: I found the Cookiecutter Data Science page after finishing this blog post. Many ideas overlap here, though some directories are irrelevant in my work -- which is to

Installing Postgres via Brew


Brew Package Manager

At your command line, run the following commands:

  1. brew doctor
  2. brew update
package com.hrishikeshmishra.practices.string;
import java.util.Arrays;
import static com.hrishikeshmishra.practices.string.LexicographicOrder.getNextPermutation;
/**
 * Problem:
 * Lexicographic Order
 * Generates permutations using lexicographic ordering.
 */
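The Java header above names the technique but the body of `getNextPermutation` is not shown here. As a rough illustration only, here is a minimal Python sketch of the standard "next permutation in lexicographic order" step. The function name `next_permutation` and its return convention (`None` when the input is already the last permutation) are assumptions made for this sketch, not the original author's code.

```python
def next_permutation(items):
    """Return the next permutation of items in lexicographic order.

    Returns None when items is already the last (descending) permutation.
    Hypothetical helper for illustration; the input is not modified.
    """
    seq = list(items)

    # 1. Find the rightmost position i where seq[i] < seq[i + 1].
    i = len(seq) - 2
    while i >= 0 and seq[i] >= seq[i + 1]:
        i -= 1
    if i < 0:
        return None  # fully descending: no next permutation

    # 2. Find the rightmost element that is larger than seq[i].
    j = len(seq) - 1
    while seq[j] <= seq[i]:
        j -= 1

    # 3. Swap them, then reverse the (descending) suffix to make it ascending.
    seq[i], seq[j] = seq[j], seq[i]
    seq[i + 1:] = reversed(seq[i + 1:])
    return seq
```

Starting from a sorted list and calling the function repeatedly until it returns `None` walks through every permutation in lexicographic order.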
// Now with less jQuery
// 1) Go to your my-list page and scroll to the bottom to make sure it's all loaded.
// 2) Next, paste this in your developer tools console and hit enter:
[...document.querySelectorAll('.slider [aria-label]')].map(ele => ele.getAttribute('aria-label'))
// Or use this to copy the list to your clipboard:
copy([...document.querySelectorAll('.slider [aria-label]')].map(ele => ele.getAttribute('aria-label')))
/**
 * Fancy ID generator that creates 20-character string identifiers with the following properties:
 * 1. They're based on a timestamp, so they sort *after* any existing IDs.
 * 2. They contain 72 bits of random data after the timestamp, so IDs won't collide with other clients' IDs.
 * 3. They sort *lexicographically* (so the timestamp is converted to characters that will sort properly).
 * 4. They're monotonically increasing. Even if you generate more than one in the same timestamp, the
 *    later ones will sort after the earlier ones. We do this by reusing the previous random bits
 *    but "incrementing" them by 1 (only in the case of a timestamp collision).
 */
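The four properties listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the original implementation: it assumes each character carries 6 bits (so 72 random bits take 12 characters, leaving 8 characters for the timestamp), and the alphabet, the class name `PushIdGenerator`, and the `now_ms` parameter are all chosen for this sketch. Any 64 characters in ascending ASCII order would serve the same purpose.

```python
import random
import time

# Assumed alphabet: 64 characters in ascending ASCII order, so comparing
# encoded strings gives the same result as comparing the underlying numbers.
PUSH_CHARS = "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"


class PushIdGenerator:
    """Hypothetical generator of 20-character, time-ordered IDs."""

    def __init__(self):
        self._last_time = None       # timestamp (ms) of the previous ID
        self._last_rand = [0] * 12   # previous 12 random 6-bit values

    def generate(self, now_ms=None):
        if now_ms is None:
            now_ms = int(time.time() * 1000)
        duplicate_time = now_ms == self._last_time
        self._last_time = now_ms

        # Encode the timestamp as 8 base-64 digits, most significant first.
        ts_chars = [""] * 8
        remaining = now_ms
        for i in range(7, -1, -1):
            ts_chars[i] = PUSH_CHARS[remaining % 64]
            remaining //= 64
        if remaining != 0:
            raise ValueError("timestamp does not fit in 48 bits")

        if not duplicate_time:
            # New timestamp: draw 12 fresh random values (72 bits in total).
            self._last_rand = [random.randrange(64) for _ in range(12)]
        else:
            # Same timestamp: add 1 to the previous random value, with carry.
            # If every digit is 63 the value wraps to zero (a choice made
            # for this sketch; the chance of hitting it is negligible).
            i = 11
            while i >= 0 and self._last_rand[i] == 63:
                self._last_rand[i] = 0
                i -= 1
            if i >= 0:
                self._last_rand[i] += 1

        return "".join(ts_chars) + "".join(PUSH_CHARS[v] for v in self._last_rand)
```

Passing `now_ms` explicitly makes the timestamp-collision path easy to exercise: two calls with the same value share the first 8 characters, and the second ID sorts after the first.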