Code allows us to make all kinds of visuals and tools that display data for analysis.
But when you're starting to mix code, data and journalism, and you lack a deep statistics background to draw upon, everything looks like a nail you can whack with your shiny hammer. And everything, from scatterplots to nearest neighbors to regression, seems equally important.
So how do you move beyond citing only the average, median and percent change in your work, and begin building the skills and knowledge that lead to a deeper analysis of datasets?
I propose a discussion that helps beginning data journalists/news apps developers better understand which analytical and statistical methods are best suited to different data situations.
For example:
- What kind of data lends itself to a scatterplot, and what does the resulting graph tell you?
- When does it make sense to use distribution graphs?
- Why are correlations "always meaningful but not necessarily useful"?
- Are regression -- both linear and logistic -- and nearest neighbor really that scary?
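To make the correlation question concrete: a minimal sketch, with entirely made-up numbers, of two variables that correlate strongly only because a third variable (a confounder) drives both. The correlation is real and "meaningful," but it tells you nothing useful about one variable causing the other. The scenario (city population driving both ice cream sales and drowning incidents) is hypothetical.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)
# Hypothetical confounder: city population drives both series.
population = [random.uniform(10_000, 500_000) for _ in range(200)]
ice_cream_sales = [0.02 * p + random.gauss(0, 500) for p in population]
drownings = [0.0001 * p + random.gauss(0, 5) for p in population]

r = pearson_r(ice_cream_sales, drownings)
print(f"r = {r:.2f}")  # strongly positive, yet neither causes the other
```

The correlation comes out strongly positive, but banning ice cream would not prevent drownings; both simply rise with population.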
Super good pitch.
Would be great to showcase (not perform) different regression/modeling techniques to show how things go haywire when the wrong data types are used (like categorical instead of continuous), or when models that assume independent observations, like GLMs (linear and logistic regression), are fit to data with obvious dependence (particularly time series).
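One way to showcase the categorical-instead-of-continuous failure: a minimal sketch, with made-up income figures, that fits an ordinary least squares slope to region labels coded as numbers. Because the numeric coding of categories is arbitrary, recoding the same regions in a different order produces a completely different "effect" — the slope is an artifact of the coding, not the data.

```python
import statistics

def ols_slope(xs, ys):
    """Slope of an ordinary least squares line fit."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical data: median income (in $1000s) by region.
incomes = {"North": 52, "South": 41, "East": 67, "West": 48}

# Two equally arbitrary ways to code the same categories as numbers.
coding_a = {"North": 1, "South": 2, "East": 3, "West": 4}
coding_b = {"East": 1, "North": 2, "West": 3, "South": 4}

ys = list(incomes.values())
slope_a = ols_slope([coding_a[r] for r in incomes], ys)
slope_b = ols_slope([coding_b[r] for r in incomes], ys)
print(slope_a, slope_b)  # same data, contradictory "trends"
```

Here the first coding yields a positive slope and the second a negative one, from identical data — a vivid way to show an audience why categorical variables need dummy/indicator encoding rather than arbitrary numeric codes.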