@hadley
Created March 13, 2015 18:49
My advice on what you need to do to become a data scientist...

If you were to give recommendations to your "little brother/sister" on things that they need to do to become a data scientist, what would those things be?

I think the "Data Science Venn Diagram" (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram) is a great place to start. You need three things to be a good data scientist:

  • Statistical knowledge
  • Programming/hacking skills
  • Domain expertise

Statistical knowledge

You need to be able to think "statistically": you need to be able to turn sample data into inferences about the underlying population. I'm not sure how you develop statistical thinking - I did it through a master's and then a PhD in statistics, but that's obviously a big time investment!
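As a small illustration of turning sample data into a statement about the population, here is a minimal sketch of a percentile bootstrap confidence interval for a mean. The data, the function name `bootstrap_mean_ci`, and its parameters are all hypothetical, and the bootstrap is just one of several ways to do this; the sketch uses only the Python standard library.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the population mean."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(sample)
    # Resample with replacement, recording the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical sample: 12 measurements drawn from some larger population.
sample = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 5.8, 4.4, 5.0, 5.6, 4.7]
sample_mean = statistics.fmean(sample)
lower, upper = bootstrap_mean_ci(sample)
print(f"sample mean = {sample_mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval is a statement about where the population mean plausibly lies, not just a summary of the twelve numbers in hand.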

I think you need some knowledge of specific statistical/machine learning techniques, and you need to understand the strengths and weaknesses of each one, but a deep theoretical understanding is not that important. The vast majority of data science problems can be solved by a creative assembly of off-the-shelf techniques, and don't require new theory.

I'd recommend developing a familiarity with linear models and their variations (esp. generalised linear models, splines and the lasso). Yes, they are linear, but a linear approximation is a good place to start for many problems. For problems that focus more on prediction than understanding, make sure you're familiar with the most popular ML techniques, e.g. random forests and support vector machines.
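To make the "linear approximation is a good place to start" point concrete, here is a minimal sketch that fits a straight line by ordinary least squares to data that is actually curved, then checks how much variance the line explains. The data and the helper names `fit_line` and `r_squared` are hypothetical; in practice you would use a modelling library rather than hand-rolled formulas.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    y_bar = statistics.fmean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data with a curved (quadratic) relationship.
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
print(f"y ~ {intercept:.2f} + {slope:.2f} * x, R^2 = {r2:.3f}")
```

Even though the true relationship here is not linear, the straight line captures most of the variation over this range, which is often enough to get started before reaching for splines or other extensions.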

Programming skills

You need to be fluent with either R or Python. There are other options, but none of them have the community that R and Python have, which means that with anything else you'll spend a lot of time reinventing tools that already exist elsewhere. Obviously, I prefer R, and contrary to what some people claim, it is a well-founded programming language that is well tailored for its domain.

If you use R you want to be conversant with a set of packages that allows you to solve the following practical problems:

  • Ingest: how do you get your data into R?
  • Manipulation: how do you filter, summarise, mutate, etc.?
  • Visualisation: how do you explore your data visually?
  • Modelling: once you have a precise question, how do you answer it with a model?
  • Reporting: once you've figured out the solution, how do you communicate it to others?

My recommendations for starting places are:

  • Ingest: readr (flat files), DBI (databases), tidyr (data tidying)
  • Manipulation: dplyr
  • Visualisation: ggplot2 (and ggvis in a year or two)
  • Modelling: caret
  • Reporting: Rmarkdown and shiny
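
The stages listed above apply whichever language you pick. As a rough, language-neutral illustration, here is a minimal sketch of the ingest, manipulation and reporting steps using only the Python standard library; it is not a substitute for the R packages named above. The data and the function names `ingest`, `manipulate` and `report` are hypothetical, and the visualisation and modelling stages are left out.

```python
import csv
import io
from collections import defaultdict
from statistics import fmean

# Hypothetical data standing in for a flat file on disk.
RAW = """species,mass_g
finch,18.5
finch,21.0
sparrow,27.5
sparrow,
sparrow,30.5
"""


def ingest(text):
    """Ingest: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def manipulate(rows):
    """Manipulation: filter out rows with a missing mass, then
    summarise the mean mass per species."""
    groups = defaultdict(list)
    for row in rows:
        if row["mass_g"]:  # an empty string means the value is missing
            groups[row["species"]].append(float(row["mass_g"]))
    return {species: fmean(values) for species, values in groups.items()}


def report(summary):
    """Reporting: render the summary as plain text, one line per species."""
    return "\n".join(
        f"{species}: mean mass {mean:.1f} g"
        for species, mean in sorted(summary.items())
    )


rows = ingest(RAW)
summary = manipulate(rows)
print(report(summary))
```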

You should also invest some time in learning how to be a productive R programmer (e.g. http://adv-r.had.co.nz) and learning how to write packages (http://r-pkgs.had.co.nz). Start by learning the basics of functional programming - this will have the biggest payoff for your productivity in R.

Domain knowledge

This obviously depends on the domain, but as a data scientist you should be able to contribute meaningfully to any project, even if you're not intimately familiar with the specifics. I think this means you should be generally well read (e.g. at the level of New Scientist for the sciences) and an able communicator. A good data scientist will help the real domain experts refine and frame their questions in a helpful way. Unfortunately I don't know of any good resources for learning how to ask questions.

@zmjones
zmjones commented Mar 13, 2015

think you should mention or substitute mlr for caret

@smach
smach commented Mar 14, 2015

We journalists are always looking at ways to ask better questions :-) Example from Poynter, a media training institute: http://www.poynter.org/news/media-innovation/205518/how-journalists-can-become-better-interviewers/

@stonebig
reader ? (typo)

@ndaniel
ndaniel commented Mar 17, 2015

Also knowing Information Theory is very important for a data scientist!

@jonrobinson2
I'd add knitr to the reporting section (for PDF report generation).

@geneorama
I recommend the book "Applied Predictive Modeling", written by the authors of the caret package. The book is rigorous and detailed without being a math textbook, and conceptually useful even if you don't use caret.

@ndaniel
ndaniel commented Mar 20, 2015

I recommend also the book Optimal Parameter Estimation by Jorma Rissanen. Here is a quote from this book (from page 2):

Very few statisticians have been studying information theory, the result of which,
I think, is the disarray of the present discipline of statistics.

@springcoil
I would recommend (and already have on Twitter) the excellent Thinking with Data by Max Shron. This brief read is a good introduction to the challenges of asking good questions and conveying results to stakeholders. If we use OSEMN, the data analytics taxonomy from Chris Wiggins and Hilary Mason, then we can consider Max's book to be an answer to some of the 'iNterpret' part of the taxonomy. I particularly like the CoNVO framework he recommends, and I often use it during the data consult period. I know Hadley mentioned some similarities between this and taking a medical history. When I tutor students in mathematics, I often run into the same challenge: how do you elucidate what they actually know, since they often won't tell you explicitly?

@dan-reznik
1st sentence: should it be "advice" (the noun) instead of "advise" (the verb)? cheers!

@dan-reznik
dan-reznik commented Mar 18, 2019

Modeling, supposedly the icing on the data science cake, turns out to be grossly over-rated. Everyone wants to jump straight into it, so they too can say "Machine Learning (or even worse, AI) Specialist" on their resumes, but they ignore the vast ocean of techniques and tools required for data ANALYSIS, cleansing, transformation, integration, and Excel replacement/automation.

Mastering the tidyverse is far more important than reading copious books about the 20-odd ML models out there, or choosing the right lasso parameters or the right number of trees or deep learning layers. The reason is: garbage in, garbage out.

This is why I love the path the tidyverse makes you go through. If you do it properly you will first become a master data analyst, cleanser, preparer, integrator. You will have mastered GitHub, packages, and so many other crucial techniques. Only later you may decide to jump into feature engineering (closely related to dataprep) and models, from linear to more complex. Curiously, the icing on the cake is getting fully automated by H2O and DataRobot. Notice dataprep is NOT!

@carlosgino
We journalists are always looking for ways to ask better questions :-) Example from Poynter, a media training institute: http://www.poynter.org/news/media-innovation/205518/how-journalists-can-become-better-interviewers/

The new link is https://www.poynter.org/reporting-editing/2013/how-journalists-can-become-better-interviewers/
