Last active
January 12, 2017 19:39
Notes from SF Analytics Meetup: DevOps for Data Science, by Stepan Pushkarev, CTO at Hydrosphere.io
Challenges deploying analytics to production / as a service
Challenges testing, monitoring, analytics of analytics
Hire Data Scientists to make products smarter!
"cron + notebooks is like mimicking a person with a marionette"
"every data science problem should be treated as a software engineering problem"
Ultimately want to make environments scalable & elastic - no easy solution for this right now
Don't use a database as an API! This is bad: poll for result --> SQL --> report
For streaming, do something like this:
spark --> kafka --> reporting app --> save
(don't save from Spark straight to the data warehouse; let the reporting app do that)
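
A minimal sketch of the streaming flow above, not the speaker's implementation. It assumes an in-process `queue.Queue` as a stand-in for a Kafka topic and a plain list as a stand-in for the data warehouse; `publish_result` and `ReportingApp` are hypothetical names. The point it illustrates is that the producer only publishes, and the reporting app is the only component that saves.

```python
import json
import queue

# Stand-in for a Kafka topic (assumption: a real pipeline would use a Kafka client here).
results_topic = queue.Queue()


def publish_result(record):
    """Producer side (the Spark job in the notes): publish a result, never write to the warehouse."""
    results_topic.put(json.dumps(record))


class ReportingApp:
    """Consumer side: reads messages from the topic and is the only writer to storage."""

    def __init__(self):
        self.saved = []  # stand-in for the data warehouse

    def drain(self, topic):
        """Consume every message currently on the topic and save it."""
        while True:
            try:
                message = topic.get_nowait()
            except queue.Empty:
                break
            self.saved.append(json.loads(message))
        return len(self.saved)


publish_result({"user_id": 1, "score": 0.93})
publish_result({"user_id": 2, "score": 0.41})

app = ReportingApp()
app.drain(results_topic)
```

Nothing polls a database for results here: consumers subscribe to the topic, and the warehouse write happens in one place.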
Ultimate deliverable has these pieces:
- monitoring & testing
- serving layer
- raw data in JSON/Parquet/whatever
- batch processing engine
Regression testing is essential: data drifts, and things fail SILENTLY
Data validation should be ongoing all the time
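
One way to make silent drift loud is to compare each new batch against a stored baseline. The sketch below is an illustration only: it checks the relative shift of the mean, and `drift_report` and the `tolerance` value are hypothetical choices, not something from the talk. Real pipelines would usually compare more than one statistic.

```python
import statistics


def drift_report(baseline, batch, tolerance=0.25):
    """Compare a new batch's mean against a baseline mean.

    tolerance is a hypothetical threshold on the relative shift.
    """
    base_mean = statistics.mean(baseline)
    batch_mean = statistics.mean(batch)
    scale = abs(base_mean) or 1.0  # avoid dividing by zero when the baseline mean is 0
    shift = abs(batch_mean - base_mean) / scale
    return {
        "baseline_mean": base_mean,
        "batch_mean": batch_mean,
        "relative_shift": shift,
        "drifted": shift > tolerance,
    }


baseline = [10.0, 11.0, 9.0, 10.0]
steady = [10.5, 9.5, 10.0, 10.0]
shifted = [15.0, 16.0, 14.0, 15.0]

steady_report = drift_report(baseline, steady)
shifted_report = drift_report(baseline, shifted)
```

Run as a regression test on every batch, a check like this fails the job instead of letting the model quietly consume shifted data.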
What works best is smart data structures & dumb code
Ideally, want a data structure that is smart enough to do its own QA
(build the data validation into the data structure itself)
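
A rough sketch of what "smart data structure, dumb code" could look like, assuming a Python dataclass that validates itself on construction. The `Measurement` record, its fields, and the allowed units are all hypothetical; they only illustrate the idea of building QA into the structure.

```python
import math
from dataclasses import dataclass

ALLOWED_UNITS = {"C", "F"}  # hypothetical set of valid units


@dataclass(frozen=True)
class Measurement:
    """A record that refuses to exist in an invalid state."""

    sensor_id: str
    value: float
    unit: str

    def __post_init__(self):
        # Validation lives in the data structure, so every consumer gets it for free.
        if not self.sensor_id:
            raise ValueError("sensor_id must be non-empty")
        if isinstance(self.value, bool) or not isinstance(self.value, (int, float)):
            raise ValueError("value must be a number")
        if math.isnan(self.value):
            raise ValueError("value must not be NaN")
        if self.unit not in ALLOWED_UNITS:
            raise ValueError(f"unknown unit: {self.unit!r}")


def mean_value(measurements):
    """The 'dumb code': no defensive checks needed, the inputs are already valid."""
    return sum(m.value for m in measurements) / len(measurements)


readings = [Measurement("s1", 20.0, "C"), Measurement("s2", 22.0, "C")]
average = mean_value(readings)
```

Bad records fail at the point of creation with a clear error, rather than surfacing later as a wrong number in a report.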
Feed job monitoring data back into the model, to help the model run better
Metrics for performance, etc. help alert you to when something's wrong in the pipeline, before it breaks your model
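
As an illustration of alerting on pipeline metrics, the sketch below flags a job runtime that sits far from its recent history using a simple z-score. The function name, the threshold of 3.0, and the sample runtimes are assumptions made for the example; the talk did not prescribe a specific method.

```python
import statistics


def is_anomalous(history, latest, z_threshold=3.0):
    """Return True when `latest` is more than z_threshold sample standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold


# Hypothetical job runtimes in seconds from recent runs.
runtimes = [61.0, 59.0, 60.0, 62.0, 58.0]

normal_run = is_anomalous(runtimes, 61.0)
slow_run = is_anomalous(runtimes, 95.0)
```

The same check can be applied to row counts, null rates, or latency, so an alert fires before the problem reaches the model.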
His slides are here: http://www.slideshare.net/StepanPushkarev/devops-for-datascience