The top n questions data scientists ask
Data science doesn’t start with data; it starts with a problem…
The pipeline model is useful, but in practice data scientists progress through a series of questions. What are those questions?
- Who is the client?
- What is the desired output?
- Is there a clear vision of what I need to do and why?
- How is this going to be used?
- Who is going to use this?
- How much time do I have?
- What is the quantitative question?
- What does the literature say about this?
- Is there an existing model, algorithm, or baseline?
- What data is available?
- Is this the right data to answer the question?
- Does my analysis make sense?
- Can it work?
- Will it scale?
- Can I explain it clearly?
- Is it viable?
- How does it fit into the current workflow?
- Will my analysis actually answer the question?
- Will it do what I want it to do?
- What tools are available to me?
- Do I need to supplement the data?
- Is there other/similar/more data available somewhere else?
- What kind of experiment can I do?
- What is the size of the data?
- What is the shape of the data?
- Is it normalized?
- Are things correlated?
- How many features are there (feature discovery)?
- How do I extract meaning from the features?
- What does the data look like visually?
- Are there clusters?
- Are there outliers, anomalies, or other oddities?
- What is the distribution?
- How much error is acceptable?
- How many steps are necessary to answer the question?
- Can I reduce those steps?
- How does changing one part change other parts?
- What does "done" look like?
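
Several of the data-facing questions above (size, shape, correlation, outliers, distribution) can get a rough first answer from a few lines of code. The sketch below is one possible way to do that using only the Python standard library. The function names (`pearson`, `iqr_outliers`, `profile`), the choice of Tukey's 1.5 × IQR rule for outliers, and the toy dataset are all assumptions made for illustration; they are not part of the original list.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / (sx * sy)


def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR outside the quartiles (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def profile(columns):
    """Summarize a dict of {column name: list of numbers}."""
    names = list(columns)
    n_rows = len(next(iter(columns.values())))
    summary = {
        "shape": (n_rows, len(names)),  # what is the size and shape?
        "stats": {},                    # what is the distribution, roughly?
        "outliers": {},                 # are there outliers or anomalies?
        "correlations": {},             # are things correlated?
    }
    for name in names:
        values = columns[name]
        summary["stats"][name] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.fmean(values),
            "stdev": statistics.stdev(values),
        }
        summary["outliers"][name] = iqr_outliers(values)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            summary["correlations"][(a, b)] = pearson(columns[a], columns[b])
    return summary


# Made-up toy data, for illustration only.
toy = {
    "height_cm": [150, 160, 165, 170, 175, 180, 185, 190],
    "weight_kg": [50, 58, 63, 68, 72, 80, 85, 200],
}
report = profile(toy)
print(report["shape"])     # (rows, columns)
print(report["outliers"])  # values flagged by the IQR rule, per column
```

A summary like this only supports the earlier questions; whether the flagged values are errors or the most interesting part of the data still depends on the problem and the client.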