@rebeccabilbro
Created February 25, 2016 22:32
initial outline for Brittne's blog post

Title

The top n questions data scientists ask

Introduction

Data science doesn’t start with data; it starts with a problem…

The pipeline model is useful, but data scientists progress through a series of questions. What are those questions?

Scoping

Questions data scientists ask to determine the project objective and scope

Requirements

  • Who is the client?
  • What is the desired output?
  • Is there a clear vision of what I need to do and why?
  • How is this going to be used?
  • Who is going to use this?
  • How much time do I have?
  • What is the quantitative question?
  • What does the literature say about this?
  • Is there an existing model, algorithm, or baseline?

Data Availability

  • What data is available?
  • Is this the right data to answer the question?

Methods

Questions data scientists ask to decide which methods and tools to use

Workflow

  • Does my analysis make sense?
  • Can it work?
  • Will it scale?
  • Can I explain it clearly?
  • Is it viable?
  • How does it fit into the current workflow?
  • Will my analysis actually answer the question?
  • Will it do what I want it to do?

Tools

  • What tools are available to me?
  • Do I need to supplement the data?
  • Is there other/similar/more data available somewhere else?
  • What kind of experiment can I do?

Interpretation

Questions data scientists ask to evaluate their results as they iterate

Reading Data

  • What is the size of the data?
  • What is the shape of the data?
  • Is it normalized?
  • Are things correlated?
  • How many features are there (feature discovery)?
  • How do I extract meaning from the features?
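
The first few questions above can be answered mechanically. Below is a minimal sketch using only the Python standard library. The toy `rows` table and the helper names (`shape`, `column`, `is_standardized`, `pearson`) are hypothetical, and "normalized" is assumed here to mean standardized to mean 0 and standard deviation 1, which is only one of several possible readings.

```python
import math
import statistics

# Hypothetical toy table: each inner list is one record, each position one feature.
rows = [
    [1.0, 2.0, 9.0],
    [2.0, 4.1, 7.0],
    [3.0, 5.9, 8.0],
    [4.0, 8.0, 2.0],
]

def shape(rows):
    """Return (n_rows, n_columns), assuming every row has the same length."""
    return (len(rows), len(rows[0]) if rows else 0)

def column(rows, j):
    """Extract feature j as a flat list."""
    return [row[j] for row in rows]

def is_standardized(values, tol=1e-6):
    """True if the values have mean ~0 and population standard deviation ~1."""
    return (abs(statistics.fmean(values)) < tol
            and abs(statistics.pstdev(values) - 1.0) < tol)

def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)

n_rows, n_cols = shape(rows)
print("shape:", (n_rows, n_cols))
for j in range(n_cols):
    print(f"feature {j} standardized:", is_standardized(column(rows, j)))
for a in range(n_cols):
    for b in range(a + 1, n_cols):
        print(f"corr({a}, {b}) = {pearson(column(rows, a), column(rows, b)):.3f}")
```

In practice a dataframe library would likely replace these helpers; the point is only that size, shape, scaling, and correlation are quick checks to run before any modeling.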

Visual Analytics

  • What does the data look like visually?
  • Are there clusters?
  • Are there outliers, anomalies, or other weird things?
  • What is the distribution?
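
As a rough illustration of the outlier and distribution questions above, here is a small standard-library sketch. It assumes the common 1.5 × IQR rule for flagging outliers and prints a text histogram in place of a real plot. The `values` list and the function names `iqr_outliers` and `bin_counts` are hypothetical.

```python
import statistics

# Hypothetical measurements with one obviously odd value.
values = [10, 12, 11, 13, 12, 11, 10, 95]

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def bin_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything lands in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so the maximum value falls in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

print("outliers:", iqr_outliers(values))
for i, count in enumerate(bin_counts(values, 5)):
    print(f"bin {i}: {'#' * count}")
```

A plotting library would normally do this job; the text histogram just keeps the sketch self-contained.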

Optimization

  • How much error will be accepted?
  • How many steps are necessary to answer the question?
  • Can I reduce those steps?
  • How does changing one part change other parts?

Conclusion

  • What does "done" look like?

Further Reading
