szeitlin/gist:7c39bded1fa770c34f30fb013bcf4d7c

## gistfile1.txt
Challenges deploying analytics to production/as a service
Challenges testing, monitoring, analytics of analytics

Hire Data Scientists to make products smarter!

"cron + notebooks is like mimicking a person with a marionette"

"every data science problem should be treated as a software engineering problem"

Ultimately want to make environments scalable & elastic - no easy solution for this right now

Don't use a database as an API! This is bad: poll for result --> SQL --> report

For streaming, do something like this:

spark --> kafka --> reporting app --> save
(don't save from spark to data warehouse, use the reporting app to do that)
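A minimal sketch of that flow, assuming a plain in-memory queue as a stand-in for the Kafka topic and a list as a stand-in for the data warehouse. The names `streaming_job`, `ReportingApp`, and `warehouse` are made up for illustration; a real setup would use Spark and a Kafka client. The point is only that the processing job publishes results and the reporting app is the one component that saves them.

```python
import json
import queue

# Stand-in for a Kafka topic (assumption: a real deployment uses a Kafka client).
topic = queue.Queue()


def streaming_job(events):
    """Plays the role of the Spark job: aggregate and publish, never save."""
    counts = {}
    for event in events:
        counts[event["user"]] = counts.get(event["user"], 0) + 1
    for user, clicks in sorted(counts.items()):
        topic.put(json.dumps({"user": user, "clicks": clicks}))


class ReportingApp:
    """Consumes results and is the only component that writes to storage."""

    def __init__(self):
        self.warehouse = []  # hypothetical stand-in for the data warehouse

    def drain(self, source):
        # Read every pending message and persist it.
        while not source.empty():
            self.warehouse.append(json.loads(source.get()))
        return self.warehouse


streaming_job([{"user": "a"}, {"user": "b"}, {"user": "a"}])
app = ReportingApp()
rows = app.drain(topic)
```

Because the job only knows about the topic, the storage layer can change without touching the processing code.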

Ultimate deliverable has these pieces:
- monitoring & testing
- serving layer
- raw data in json/parquet/whatever
- batch processing engine

Regression testing is essential: data drifts, and things fail SILENTLY
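One possible shape for such a regression check, as a sketch: compare the mean of a new batch against a stored baseline and fail loudly when it moves too far. The function names and the 3-standard-deviation threshold are assumptions for illustration, not something from the talk.

```python
import statistics


def drift_report(baseline, current, max_shift=3.0):
    """Measure how far the current mean sits from the baseline mean,
    in units of the baseline standard deviation."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    if base_std == 0:
        raise ValueError("baseline has no variance; cannot scale the shift")
    shift = abs(statistics.mean(current) - base_mean) / base_std
    return {"shift_in_std": shift, "drifted": shift > max_shift}


def check_batch(baseline, current):
    """Raise instead of letting a drifted batch pass silently."""
    report = drift_report(baseline, current)
    if report["drifted"]:
        raise ValueError(
            f"data drift detected: {report['shift_in_std']:.1f} std devs from baseline"
        )
    return report
```

Running `check_batch` on every incoming batch turns a silent failure into an explicit one.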

Data validation should be ongoing all the time

What works best is smart data structures & dumb code

Ideally, want a data structure that is smart enough to do its own QA
(build the data validation into the data structure itself)
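As a small illustration of the idea, here is a sketch of a record type that refuses to exist in an invalid state. The `Reading` class, its fields, and the temperature bounds are hypothetical; the pattern is what matters: validation lives in the data structure, so the code that uses it can stay dumb.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """A sensor reading that validates itself on construction."""

    sensor_id: str
    temperature_c: float

    def __post_init__(self):
        # QA is part of the structure: bad records never get created.
        if not self.sensor_id:
            raise ValueError("sensor_id must be non-empty")
        if not -90.0 <= self.temperature_c <= 60.0:
            raise ValueError(f"temperature_c out of range: {self.temperature_c}")


def mean_temperature(readings):
    """Dumb code: no checks needed, every Reading is already valid."""
    return sum(r.temperature_c for r in readings) / len(readings)
```

For example, `Reading("s1", 500.0)` raises `ValueError`, while `mean_temperature` can assume clean input.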

Feed job monitoring data back into the model, to help the model run better

Metrics for performance, etc. help alert you to when something's wrong in the pipeline, before it breaks your model
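A rough sketch of what such an alert could look like: compare each job metric with its recent average and flag anything that moved by more than a set fraction. The metric names, the `check_job_metrics` helper, and the 50% tolerance are assumptions made up for this example.

```python
def check_job_metrics(metrics, history, tolerance=0.5):
    """Return alert messages for metrics that deviate from their
    historical average by more than `tolerance` (as a fraction)."""
    alerts = []
    for name, value in metrics.items():
        past = history.get(name, [])
        if not past:
            continue  # nothing to compare against yet
        avg = sum(past) / len(past)
        if avg != 0 and abs(value - avg) / abs(avg) > tolerance:
            alerts.append(f"{name}: {value} vs. typical {avg:.1f}")
    return alerts


history = {"rows_out": [1000, 1100, 900], "runtime_s": [60, 62, 58]}
alerts = check_job_metrics({"rows_out": 200, "runtime_s": 61}, history)
```

Here the sudden drop in `rows_out` is flagged before a model ever trains on the short batch, while the normal runtime stays quiet.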

The speaker's slides are here: http://www.slideshare.net/StepanPushkarev/devops-for-datascience