@jgwerner
Created June 20, 2017 17:26
Standard data science workflow

Data Science Workflow

Define the Problem

  • What is the problem? Provide formal and informal definitions.
  • Why does the problem need to be solved? Motivation, benefits, how it will be used.
  • How would I solve the problem? Describe how the problem would be solved manually to flush out domain knowledge.

Prepare Data

  • Data Selection. Availability, what is missing, what can be removed.
  • Data Preprocessing. Organize selected data by formatting, cleaning and sampling.
  • Data Transformation. Feature engineering using scaling, attribute decomposition and attribute aggregation.
  • Data Visualization. Inspect attribute distributions with plots such as histograms.
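
The transformation step above can be sketched in plain Python. This is a minimal illustration rather than a prescribed method: the records, the field names (`signup`, `age`, `web_logins`, `mobile_logins`) and the helper functions are all hypothetical, and min-max scaling is just one of several possible scaling choices.

```python
from datetime import datetime

def min_max_scale(values):
    """Scaling: map numbers onto the [0, 1] range (a constant column maps to 0.0)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def decompose_timestamp(ts):
    """Attribute decomposition: split one ISO timestamp into simpler parts."""
    dt = datetime.fromisoformat(ts)
    return {"hour": dt.hour, "weekday": dt.weekday()}  # Monday is 0

def aggregate_logins(row):
    """Attribute aggregation: combine per-channel counts into one total."""
    return row["web_logins"] + row["mobile_logins"]

# Hypothetical records, invented for illustration only.
rows = [
    {"signup": "2017-06-20T17:26:00", "age": 20, "web_logins": 3, "mobile_logins": 1},
    {"signup": "2017-06-21T09:00:00", "age": 30, "web_logins": 0, "mobile_logins": 5},
    {"signup": "2017-06-24T23:15:00", "age": 40, "web_logins": 2, "mobile_logins": 2},
]

scaled_ages = min_max_scale([r["age"] for r in rows])

features = []
for row, age in zip(rows, scaled_ages):
    feature = {"age_scaled": age, "total_logins": aggregate_logins(row)}
    feature.update(decompose_timestamp(row["signup"]))
    features.append(feature)

for feature in features:
    print(feature)
```

Keeping the raw rows untouched and writing engineered attributes to a separate `features` list makes it easy to produce several transformed versions of the same dataset for the spot-check step.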

Spot Check Algorithms

  • Build a test harness and run each algorithm with its default parameter values.
  • Run a family of algorithms across all the transformed and scaled versions of the dataset.
  • View comparisons with box plots.
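
A spot-check harness along these lines might look like the sketch below. It is a dependency-free illustration under stated assumptions: the two toy models (`MajorityClass`, `NearestNeighbor`), the k-fold helper and the synthetic data are all invented here, and in practice a library would normally supply them. The per-fold scores it collects are the values a box plot would summarise.

```python
import random
import statistics
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices once, then deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

class MajorityClass:
    """Baseline: always predict the most common training label."""
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self
    def predict(self, X):
        return [self.label for _ in X]

class NearestNeighbor:
    """1-nearest-neighbour using squared Euclidean distance."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict(self, X):
        def closest(p):
            dists = [sum((a - b) ** 2 for a, b in zip(p, q)) for q in self.X]
            return self.y[dists.index(min(dists))]
        return [closest(p) for p in X]

def cross_val_accuracy(make_model, X, y, k=5, seed=0):
    """Test harness: return one accuracy score per held-out fold."""
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = make_model().fit([X[i] for i in train_idx],
                                 [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in test_idx])
        hits = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(hits / len(test_idx))
    return scores

# Synthetic two-cluster data, generated for illustration only.
rng = random.Random(42)
X = [[rng.gauss(c, 0.5), rng.gauss(c, 0.5)] for c in (0, 3) for _ in range(30)]
y = [label for label in (0, 1) for _ in range(30)]

# Every algorithm runs with its defaults on the same folds.
algorithms = {"majority": MajorityClass, "1-NN": NearestNeighbor}
results = {name: cross_val_accuracy(factory, X, y)
           for name, factory in algorithms.items()}

for name, scores in results.items():
    print(name, "min/median/max:",
          min(scores), statistics.median(scores), max(scores))
```

Using the same seed for every algorithm keeps the folds identical, so differences in the score distributions reflect the algorithms rather than the split.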

Improve Results (Tuning)

  • Algorithm Tuning: searching the model's parameter space for the best-performing configurations. This may include hyperparameter optimization with additional helper services.
  • Ensemble Methods: where the predictions made by multiple models are combined.
  • Feature Engineering: where the attribute decomposition and aggregation seen in data preparation is tested further.
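
As a rough sketch of the first two ideas, the example below tunes the `k` of a toy k-nearest-neighbour classifier with an exhaustive grid search on a validation split, then combines several settings with a majority vote. Everything here is an assumption made for illustration: the helper names, the candidate grid and the synthetic data are hypothetical, and grid search is only one of many tuning strategies.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Classify one point by majority vote among its k nearest training rows."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], point)),
    )
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def accuracy(preds, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def grid_search_k(train_X, train_y, val_X, val_y, candidates):
    """Algorithm tuning: score every candidate k on the validation split."""
    scored = {}
    for k in candidates:
        preds = [knn_predict(train_X, train_y, p, k) for p in val_X]
        scored[k] = accuracy(preds, val_y)
    best_k = max(scored, key=scored.get)
    return best_k, scored

def majority_vote(predictions_per_model):
    """Ensemble: combine per-model predictions row by row with a majority vote."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]

# Synthetic, overlapping two-class data, generated for illustration only.
rng = random.Random(7)
points = [([rng.gauss(c, 1.0), rng.gauss(c, 1.0)], label)
          for label, c in ((0, 0.0), (1, 2.0)) for _ in range(40)]
rng.shuffle(points)
train, val = points[:60], points[60:]
train_X, train_y = [p for p, _ in train], [t for _, t in train]
val_X, val_y = [p for p, _ in val], [t for _, t in val]

# Odd k values avoid tied votes when there are two classes.
best_k, scores_by_k = grid_search_k(train_X, train_y, val_X, val_y, (1, 3, 5, 7))
print("validation accuracy by k:", scores_by_k, "best k:", best_k)

# Ensemble of three differently configured models (an odd count avoids ties).
per_model = [[knn_predict(train_X, train_y, p, k) for p in val_X] for k in (1, 3, 5)]
ensemble_preds = majority_vote(per_model)
ensemble_acc = accuracy(ensemble_preds, val_y)
print("ensemble validation accuracy:", ensemble_acc)
```

Scores measured on the split used for tuning tend to be optimistic, so a final estimate would normally come from data that played no part in choosing `k`.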

Present Results

  • Context (Why): how the problem definition arose in the first place.
  • Problem (Question): describe the problem as a question.
  • Solution (Answer): describe the answer to the question in the previous step.
  • Findings: a bulleted list of discoveries made along the way that will interest the audience. May include discoveries in the data, methods that did or did not work, or the model performance benefits you observed.
  • Limitations: describe where the model does not work.
  • Conclusions (Why + Question + Answer): restate the context, the question and the answer in a short wrap-up.