@chrislkeller
Last active August 29, 2015 14:01
Adding statistics & analysis skills to the beginning data journo's toolbelt

Code allows us to make all kinds of visuals and tools that display data for analysis.

But when you're starting to mix code, data and journalism - and you lack a deep statistics background to draw upon - everything looks like a nail you can whack with your shiny hammer. And everything - from scatterplots to nearest neighbor to regression - seems equally important.

So how do you move from citing only the average, median & percent change in all of your work and begin to build skills and knowledge that can lead to a deeper analysis of datasets?

I propose a discussion that helps beginning data journalists/news apps developers better understand which analytical and statistical methods are best suited to different data situations.

For example:

  • What kind of data lends itself to a scatterplot, and what does the resulting graph tell you? (See the sketch after this list.)
  • When does it make sense to use distribution graphs?
  • Why are correlations "always meaningful but not necessarily useful"?
  • Are regression -- both linear and logistic -- and nearest neighbor that scary?
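
To ground the discussion, here is a minimal sketch of the first two bullets in Python, using pandas, SciPy and matplotlib. The dataset and column names ("counties.csv", "income", "life_expectancy") are hypothetical, just stand-ins for whatever two continuous variables you're exploring:

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("counties.csv")  # hypothetical dataset
x, y = df["income"], df["life_expectancy"]

# Scatterplot: a first look at whether two continuous variables move together.
plt.scatter(x, y, alpha=0.5)
plt.xlabel("income")
plt.ylabel("life_expectancy")

# Pearson r: strength and direction of a *linear* relationship.
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.2f}, p = {p_value:.3f}")

# Simple linear regression: slope and intercept of the best-fit line.
result = stats.linregress(x, y)
print(f"slope = {result.slope:.3f}, R^2 = {result.rvalue ** 2:.2f}")

plt.plot(x, result.intercept + result.slope * x, color="red")
plt.show()
```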
@gotoplanb

Super good pitch.

Would be great to showcase (not perform) different regression/modeling techniques to show how things go haywire when the wrong data types are used (like categorical instead of continuous), or when models that assume independence, like GLMs (linear and logistic regression), are applied to data that obviously has dependence (particularly time series stuff).
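
To make the first failure mode concrete, here's a hedged sketch using statsmodels, with data fabricated purely for illustration. Coding a four-level category as the numbers 1-4 and regressing on it fits a single slope through arbitrary labels; wrapping the variable in `C()` dummy-codes it properly:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# "region" is a category coded 1-4; the codes carry no numeric meaning.
df = pd.DataFrame({"region": rng.integers(1, 5, 200)})
df["outcome"] = df["region"].map({1: 10, 2: 30, 3: 15, 4: 25}) + rng.normal(0, 2, 200)

# Wrong: treats the region codes as a continuous quantity.
wrong = smf.ols("outcome ~ region", data=df).fit()

# Right: C() tells statsmodels to dummy-code the categories.
right = smf.ols("outcome ~ C(region)", data=df).fit()

# The miscoded model fits far worse, because the group means
# aren't ordered by their arbitrary labels.
print(wrong.rsquared, right.rsquared)
```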

@gotoplanb

There are a decent number of conference presentations showing foo method to solve bar problem, but there is less emphasis on what happens when the assumptions are violated. The model will still give you something that looks useful and can tell a story, but the way it explains error/variance is no longer valid. The person performing the analysis wouldn't know from the model output that the results are flawed unless they ran the appropriate diagnostics along the way. Most modeling techniques have specific diagnostics for their assumptions.
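
As one concrete example of such a diagnostic, here's a minimal sketch on simulated data: OLS is fit to a time series with autocorrelated errors, the coefficients come out looking perfectly reasonable, but the Durbin-Watson statistic on the residuals flags the violated independence assumption (values near 2 suggest independence; values near 0 or 4 suggest positive/negative autocorrelation):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
n = 200
t = np.arange(n)

# Build a series whose errors follow an AR(1) process - the kind of
# dependence time-series data often carries.
errors = np.zeros(n)
for i in range(1, n):
    errors[i] = 0.9 * errors[i - 1] + rng.normal()
y = 0.5 * t + errors

# An ordinary OLS fit will happily return coefficients and p-values...
model = sm.OLS(y, sm.add_constant(t)).fit()

# ...but the diagnostic shows the independence assumption doesn't hold.
print(f"Durbin-Watson: {durbin_watson(model.resid):.2f}")  # well below 2
```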
