My advice on what you need to do to become a data scientist...

If you were to give recommendations to your "little brother/sister" on things that they need to do to become a data scientist, what would those things be?

I think the "Data Science Venn Diagram" (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram) is a great place to start. You need three things to be a good data scientist:

  • Statistical knowledge
  • Programming/hacking skills
  • Domain expertise

Statistical knowledge

You need to be able to think "statistically": you need to be able to turn sample data into inferences about the underlying population. I'm not sure how you develop statistical thinking - I did it through a master's and then a PhD in statistics, but that's obviously a big time investment!
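One way to make "turning sample data into inferences about the population" concrete is the bootstrap. The following is only a minimal sketch in Python using the standard library; the measurements are invented, and the function name `bootstrap_mean_ci` and its parameters are my own illustrative choices, not part of any package.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the population mean (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Drop an equal number of resampled means from each tail.
    k = round((1.0 - level) / 2.0 * n_resamples)
    return means[k], means[-k - 1]


# Made-up measurements, purely for illustration.
sample = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 5.7, 4.4, 5.2]
sample_mean = statistics.fmean(sample)
ci_low, ci_high = bootstrap_mean_ci(sample)

print(f"sample mean: {sample_mean:.2f}")
print(f"95% bootstrap interval for the population mean: [{ci_low:.2f}, {ci_high:.2f}]")
```

The point is the shift in question: not "what is the mean of these ten numbers?" but "what range of population means is plausible given these ten numbers?"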

I think you need some knowledge of specific statistical/machine learning techniques, but a deep theoretical understanding is not that important. What matters is understanding the strengths and weaknesses of each technique. The vast majority of data science problems can be solved by a creative assembly of off-the-shelf techniques, and don't require new theory.

I'd recommend developing a familiarity with linear models and their variations (esp. generalised linear models, splines and the lasso). Yes, they are linear, but a linear approximation is a good place to start for many problems. For problems that focus more on prediction than understanding, make sure you're familiar with the most popular ML techniques, e.g. random forests and support vector machines.
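To illustrate why a linear approximation is a reasonable first step, here is a small sketch of ordinary least squares with one predictor, written in plain Python. The data is made up (a deliberately curved relationship) and `fit_line` is a hypothetical helper; in practice you would use an existing modelling library rather than writing this yourself.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: a curved relationship, y = x**2 on 0..4.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"intercept = {intercept:.1f}, slope = {slope:.1f}")  # intercept = -2.0, slope = 4.0
print("residuals:", residuals)  # positive at the ends, negative in the middle
```

The line captures the overall trend, and the pattern in the residuals tells you what it misses. That is the cue to reach for a variation such as a spline.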

Programming skills

You need to be fluent with either R or Python. There are other options, but none of them have the community that R and Python have, which means you'll need to spend a lot of time reinventing tools that already exist elsewhere. Obviously, I prefer R, and contrary to what some people claim, it is a well-founded programming language that is well tailored for its domain.

If you use R you want to be conversant with a set of packages that allows you to solve the following practical problems:

  • Ingest: how do you get your data into R?
  • Manipulation: how do you filter, summarise, mutate, etc.?
  • Visualisation: how do you explore your data visually?
  • Modelling: once you have a precise question, how do you answer it with a model?
  • Reporting: once you've figured out the solution, how do you communicate it to others?

My recommendations for starting places are:

  • Ingest: readr (flat files), DBI (databases), tidyr (data tidying)
  • Manipulation: dplyr
  • Visualisation: ggplot2 (and ggvis in a year or two)
  • Modelling: caret
  • Reporting: Rmarkdown and shiny
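The same stages exist whichever language you pick. As a rough, language-neutral illustration, here is a toy pass through ingest, manipulation, visualisation and reporting in Python using only the standard library. The data is invented and inlined, the column names are hypothetical, and the modelling stage is left out to keep the sketch short. It is not a substitute for the R packages listed above.

```python
import csv
import io
import statistics
from collections import defaultdict

# Ingest: parse a small CSV. The data is invented and inlined so the
# sketch is self-contained; a real project would read a file or database.
RAW = """species,height_cm
a,10.0
a,12.0
b,20.0
b,22.0
b,
"""
rows = list(csv.DictReader(io.StringIO(RAW)))

# Manipulation: filter out rows with a missing height, then convert types.
clean = [
    {"species": r["species"], "height_cm": float(r["height_cm"])}
    for r in rows
    if r["height_cm"]
]

# Manipulation: summarise mean height by species.
by_species = defaultdict(list)
for r in clean:
    by_species[r["species"]].append(r["height_cm"])
summary = {s: statistics.fmean(h) for s, h in by_species.items()}

# Visualisation: a crude text bar chart, one '#' per full 2 cm of mean height.
bars = {s: "#" * int(m // 2) for s, m in summary.items()}

# Reporting: turn the summary into lines someone else can read.
report = [f"{s}: mean height {summary[s]:.1f} cm {bars[s]}" for s in sorted(summary)]
print("\n".join(report))
```

Each commented step corresponds to one of the practical problems listed earlier; dedicated packages do the same jobs with far less code and far more care.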

You should also invest some time in learning how to be a productive R programmer (e.g. http://adv-r.had.co.nz) and learning how to write packages (http://r-pkgs.had.co.nz). Start by learning the basics of functional programming - this will have the biggest payoff for your productivity in R.
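The basics of functional programming carry across languages: functions that return functions, functions that take functions as arguments, and the map/filter/reduce trio (base R spells these `Map`, `Filter` and `Reduce`, alongside `lapply` and friends). Here is a minimal sketch in Python; the helper names `make_scaler`, `add_one` and `compose` are hypothetical and chosen only for illustration.

```python
from functools import reduce


def make_scaler(factor):
    """Return a new function that multiplies its input by `factor` (a closure)."""
    def scale(x):
        return x * factor
    return scale


def add_one(x):
    return x + 1


def compose(*funcs):
    """Compose left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)


double = make_scaler(2)
double_then_add_one = compose(double, add_one)

values = [1, 2, 3, 4, 5]
mapped = list(map(double_then_add_one, values))  # [3, 5, 7, 9, 11]
kept = list(filter(lambda v: v > 5, mapped))     # [7, 9, 11]
total = reduce(lambda a, b: a + b, kept)         # 27

print(mapped, kept, total)
```

The payoff is that small, single-purpose functions can be combined instead of copying and editing loops, which is where most of the productivity gain comes from.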

Domain knowledge

This obviously depends on the domain, but as a data scientist you should be able to contribute meaningfully to any project, even if you're not intimately familiar with the specifics. I think this means you should be generally well read (e.g. at the level of New Scientist for the sciences) and an able communicator. A good data scientist will help the real domain experts refine and frame their questions in a helpful way. Unfortunately I don't know of any good resources for learning how to ask questions.

@zmjones
zmjones commented Mar 13, 2015

think you should mention or substitute mlr for caret

@smach
smach commented Mar 14, 2015

We journalists are always looking at ways to ask better questions :-) Example from Poynter, a media training institute: http://www.poynter.org/news/media-innovation/205518/how-journalists-can-become-better-interviewers/

@stonebig

reader ? (typo)

@ndaniel
ndaniel commented Mar 17, 2015

Also knowing Information Theory is very important for a data scientist!

@jonrobinson2

I'd add knitr to the reporting section (for PDF report generation).

@geneorama

I recommend the book "Applied Predictive Modeling" written by the authors of the caret package. The book is rigorous and detailed without being a math textbook, and conceptually useful even if you don't use caret.

@ndaniel
ndaniel commented Mar 20, 2015

I recommend also the book Optimal Parameter Estimation by Jorma Rissanen. Here is a quote from this book (from page 2):

Very few statisticians have been studying information theory, the result of which,
I think, is the disarray of the present discipline of statistics.

@springcoil

I would recommend, and already have on Twitter, the excellent Thinking with Data by Max Shron. This brief read is a good introduction to the challenges of asking good questions and conveying results to stakeholders. If we use OSEMN, the data analytics taxonomy from Chris Wiggins and Hilary Mason, then we can consider Max's book to be an answer to some of the 'iNterpret' part of the taxonomy. I particularly like the CoNVO framework he recommends and I often use it during the data consult period. I know Hadley mentioned some similarities between this and taking a medical history. When I tutor students in mathematics, I often run into the same challenge: how do you elucidate what they actually know, since they often won't tell you explicitly?
