@chrishwiggins
Last active May 24, 2020 00:45
frequently asked question:

Q: I would like to ask your advice about preparing for a role in data science

A:

my advice would be to put together a portfolio of projects, on GitHub, evidencing that you know how to

  • get data (e.g., via wget/curl)

  • scrub data (wisely choose and reproducibly remove "outliers")

  • model using a variety of approaches (supervised, unsupervised, exploratory) in python or possibly R (usually an employer will prefer one or the other, with more and more employers in my experience preferring python; in the Data Science Group at NYT it's helpful to know your way around SQL and scikit-learn. We don't do much in R, and nothing in SAS, SPSS, MATLAB, Mathematica, or... )

  • write a coherent description of what you learned, and what this implies for the stakeholder/collaborator/world;

    as well as

    how you chose the approach you took, what assumptions you made along the way, what the weaknesses in your approach are, and what the next steps are.

    Update 1: Also consider getting your hands on some fun data to play with. Definition of "fun" is highly personal, so I list several sets which might be of interest: https://gist.github.com/chrishwiggins/84a6319246a7b8f547c4

    Update 2: Also consider taking a class (cf. http://datascience.columbia.edu/data-science-academics)

    Update 3: Also consider enrolling in a "data science boot camp", e.g., http://insightdatascience.com/
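To make the "scrub" and "model" steps above concrete, here is a minimal sketch using only the Python standard library. The function names (`split_outliers`, `fit_line`), the 1.5 × IQR fence, and the choice of a straight-line fit are illustrative assumptions, not a recommended recipe; a real portfolio project would more likely use pandas and scikit-learn. The point it tries to show is that the outlier rule is written down as code and the removed points are returned rather than silently dropped, so the scrubbing is reproducible and can be described in the write-up.

```python
import statistics


def split_outliers(values, k=1.5):
    """Split values into (kept, removed) using an IQR fence.

    The rule (k * IQR beyond the quartiles) is an assumption chosen for
    illustration; what matters is that the rule is explicit and that the
    removed points are reported.
    """
    q1, _, q3 = statistics.quantiles(sorted(values), n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    removed = [v for v in values if v < low or v > high]
    return kept, removed


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    kept, removed = split_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
    print("kept:", kept)
    print("removed:", removed)
    print("fit:", fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
```

Returning the removed points alongside the kept ones makes it easy to state in the project description exactly what was dropped and why.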

For more info:

My thoughts: http://www.columbia.edu/itc/applied/wiggins/DSatW-wiggins.pdf

Hammerbacher: https://goo.gl/cVB4hn

@tommiechen

great advice. thanks

@zgmartin

zgmartin commented Dec 5, 2014

This was great advice. I would like to add to it.

- scrape data from the web (scrapy)
- store data in a database (MongoDB or SQL)
- extract data (pandas: split, merge, transform)
- model data (machine learning)
- document results (info, plots, error rates)
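As one possible illustration of the "store" and "extract" steps in this list, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table name `obs`, its two columns, and the helper `store_and_summarize` are hypothetical and chosen only for the example; the comment above mentions MongoDB or SQL in general, not SQLite specifically.

```python
import sqlite3


def store_and_summarize(rows):
    """Store (label, value) rows in SQLite and return per-label count and mean.

    Uses an in-memory database so the example leaves no files behind;
    the schema is an assumption made for illustration.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE obs (label TEXT, value REAL)")
        # Parameterized inserts avoid building SQL strings by hand.
        conn.executemany("INSERT INTO obs VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT label, COUNT(*), AVG(value) "
            "FROM obs GROUP BY label ORDER BY label"
        )
        return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for label, n, mean in store_and_summarize(
        [("a", 1.0), ("a", 3.0), ("b", 10.0)]
    ):
        print(label, n, mean)
```

The same query result could then be handed to pandas for the split/merge/transform work mentioned above.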
