@szeitlin
Last active January 12, 2017 19:39
Notes from SF Analytics Meetup: DevOps for Data Science, by Stepan Pushkarev, CTO at Hydrosphere.io
Challenges deploying analytics to production/as a service
Challenges testing, monitoring, analytics of analytics
Hire Data Scientists to make products smarter!
"cron + notebooks is like mimicking a person with a marionette"
"every data science problem should be treated as a software engineering problem"
Ultimately want to make environments scalable & elastic - no easy solution for this right now
Don't use a database as an API! This is bad: poll for result --> SQL --> report
For streaming, do something like this:
spark --> kafka --> reporting app --> save
(don't save from spark to data warehouse, use the reporting app to do that)
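A minimal sketch of that flow, not the speaker's code: the processing job only publishes results, and a separate reporting app consumes and saves them. A standard-library `queue.Queue` stands in for the Kafka topic here so the example runs on its own; `publish_result`, `run_reporting_app`, and the sample records are hypothetical names made up for illustration.

```python
import json
import queue

# Stand-in for a Kafka topic (assumption: a real pipeline would use a Kafka client).
results_topic = queue.Queue()


def publish_result(record):
    """Processing-job side: publish a result as a JSON message.

    The job never writes to the data warehouse itself.
    """
    results_topic.put(json.dumps(record))


def run_reporting_app(storage):
    """Reporting-app side: drain pending messages and persist each one.

    `storage` is a plain list standing in for the warehouse.
    """
    while True:
        try:
            message = results_topic.get_nowait()
        except queue.Empty:
            break
        storage.append(json.loads(message))
    return storage


# The job publishes; only the reporting app saves.
publish_result({"metric": "daily_active_users", "value": 1200})
publish_result({"metric": "churn_rate", "value": 0.03})
saved = run_reporting_app([])
```

The point of the split is that consumers read from the message stream rather than polling a database table for results.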
Ultimate deliverable has these pieces:
- monitoring & testing
- serving layer
- raw data in json/parquet/whatever
- batch processing engine
Regression testing is essential: data drifts, and things fail SILENTLY
Data validation should be ongoing all the time
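One way to catch silent drift is to compare each new batch against a stored baseline and fail loudly when they disagree. The sketch below uses a deliberately crude heuristic (shift of the batch mean, measured in baseline standard deviations); the function name `check_drift`, the `max_z` threshold, and the sample numbers are all assumptions for illustration, not something from the talk.

```python
from statistics import mean, stdev


def check_drift(baseline, current, max_z=3.0):
    """Return a list of problems; an empty list means the batch looks like the baseline.

    `baseline` needs at least two values so a standard deviation can be computed.
    """
    if not current:
        return ["current batch is empty"]
    problems = []
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    shift = abs(mean(current) - base_mean)
    if base_std == 0:
        # Constant baseline: any change in the mean counts as drift.
        if shift > 0:
            problems.append("mean changed from a constant baseline")
    elif shift / base_std > max_z:
        problems.append(
            f"mean shifted by {shift / base_std:.1f} baseline standard deviations"
        )
    return problems


baseline = [10, 11, 9, 10, 12, 10, 9, 11]
ok_batch = [10, 11, 10, 9]
drifted_batch = [20, 21, 19, 22]

ok_problems = check_drift(baseline, ok_batch)
drift_problems = check_drift(baseline, drifted_batch)
```

Run on every batch, a check like this turns a silent failure into an explicit one that a test or alert can pick up.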
What works best is smart data structures & dumb code
Ideally, want a data structure that is smart enough to do its own QA
(build the data validation into the data structure itself)
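A rough illustration of "smart data structure, dumb code": a record type that refuses to exist with invalid values, so downstream code needs no checks of its own. The `Reading` class, its fields, and the temperature bounds are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """A sensor reading that validates itself on construction."""

    sensor_id: str
    temperature_c: float

    def __post_init__(self):
        # The validation rules live with the data, not in every consumer.
        if not self.sensor_id:
            raise ValueError("sensor_id must not be empty")
        if not -90.0 <= self.temperature_c <= 60.0:
            raise ValueError(f"temperature_c out of range: {self.temperature_c}")


def average_temperature(readings):
    """'Dumb' code: no validation needed, every Reading is already valid."""
    return sum(r.temperature_c for r in readings) / len(readings)


readings = [Reading("s1", 20.0), Reading("s2", 22.0)]
avg = average_temperature(readings)
```

Constructing `Reading("s3", 500.0)` raises `ValueError`, so bad data is rejected at the point where it enters the pipeline.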
Feed job monitoring data back into the model, to help the model run better
Metrics for performance, etc. help alert you to when something's wrong in the pipeline, before it breaks your model
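As a sketch of metric-based alerting, the monitor below keeps a short rolling window of a job metric and flags any value far above the recent average. The class name `MetricMonitor`, the window size, and the 2x threshold are illustrative assumptions; a real setup would more likely send these metrics to a monitoring system.

```python
from collections import deque


class MetricMonitor:
    """Track a pipeline metric and flag values far above the recent average."""

    def __init__(self, name, window=5, max_ratio=2.0):
        self.name = name
        self.history = deque(maxlen=window)  # keeps only the last `window` values
        self.max_ratio = max_ratio

    def record(self, value):
        """Record a value; return an alert string, or None if it looks normal."""
        alert = None
        if self.history:
            recent_avg = sum(self.history) / len(self.history)
            if recent_avg > 0 and value > self.max_ratio * recent_avg:
                alert = (
                    f"{self.name}: {value} is more than {self.max_ratio}x "
                    f"the recent average {recent_avg:.1f}"
                )
        self.history.append(value)
        return alert


monitor = MetricMonitor("job_runtime_seconds")
alerts = [monitor.record(v) for v in [30, 32, 31, 95]]
```

Here the first three runtimes pass quietly and the fourth produces an alert, which is the early warning the notes describe.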
His slides are here: http://www.slideshare.net/StepanPushkarev/devops-for-datascience