@aronchick
Last active May 24, 2023 05:16
Issues with Data Science
----
Inappropriate HW/SW stack
Mismatched driver versions
Crash looping deployment
Data/model versioning [Nick Walsh]
Non-standard images/OS version
Training pre-processing code doesn’t match production pre-processing
Production data doesn’t match training/test data
Output of the model doesn’t match application expectations
Hand-coded heuristics perform better than the model [Adam Laiacano]
Model freshness (trained on out-of-date data, or input shape changed)
Test/production statistics/population shape skew
Overfitting on training/test data
Bias introduced (or never tested for)
Over/under HW provisioning
Latency issues
Permissions/certs
Failure to obey health checks
Killed the production model before rolling out the new one, or rolled out in the wrong order
Thundering herd for new model
Logging to the wrong location
Storage for model not allocated properly/accessible by deployment tooling
Route to artifacts not available for download
API signature changes not propagated/expected
Cross-data center latency
Expected benefit doesn’t materialize (e.g. multiple components in the app change simultaneously)
Get wrong/no traffic because A/B config didn’t roll out
No CI/CD; manual changes untracked [Jon Peck]
Get too much traffic too soon (expected a canary or exponential rollout)
Outliers not predicted [MikeBSilverman]
Change was good, but wasn’t communicated to the rest of the team (so you must roll back)
No dates! (date to measure impact/improvement against a pre-agreed measure; date scheduled to assess data changes) [Mary Branscombe]
LACK OF DOCUMENTATION!! (the problem, the testing, the solution, lots more) [Terry Christiani]
Successful model causes pain elsewhere in the organization (e.g. detecting faults previously missed) [Mark Round]
Lack of visibility into real-time model behavior (detecting data drift, live data distribution vs. training data, etc.) [Nick Walsh]
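
Several items above (production data not matching training data, population skew, lack of visibility into drift) come down to comparing a live feature distribution against the training one. A minimal sketch of one common way to do that, the Population Stability Index, is below. The function name, the equal-width binning, and the smoothing constant are illustrative assumptions, not something the list prescribes.

```python
import math


def population_stability_index(train, live, bins=10):
    """PSI between two numeric samples, binned over the training range.

    Hypothetical helper for illustration: equal-width bins are derived
    from the training sample, and live values outside that range are
    clipped into the first or last bin.
    """
    lo, hi = min(train), max(train)
    width = (hi - lo) / bins or 1.0  # fall back if all training values are equal

    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        eps = 1e-6  # assumed smoothing so empty bins don't produce log(0)
        return [max(c / len(values), eps) for c in counts]

    p = bucket_fractions(train)
    q = bucket_fractions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    training_sample = list(range(100))
    shifted_sample = [x + 50 for x in training_sample]
    print(population_stability_index(training_sample, training_sample))
    print(population_stability_index(training_sample, shifted_sample))
```

Identical samples score zero and the score grows as the live sample moves away from the training one. Any alert threshold is a judgment call that should be agreed on per feature.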
Before You Move
----
Bandwidth costs
Speed of insights
De/compression time
Ingestion time/cost
Removing PII
Sanitizing data (from Attacks)
Recording Metadata about Capture
Overloading Network
Changing Security Criteria
Defining a Long-term Schema in Advance (Ewan Leith)
Data Ordering
Distributed Caching Problems
Consistent Deletion / Duplication
Data Residency or Compliance Requirements (Andre)
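
As one concrete instance of "Removing PII" before data moves, the sketch below redacts email addresses from free text with a regular expression. The pattern, the function name, and the placeholder are assumptions for illustration; a real pipeline would need to cover many more identifier types (names, phone numbers, addresses) and would likely use a dedicated tool.

```python
import re

# Deliberately simple pattern (an assumption): it will miss some valid
# addresses and should not be treated as a complete PII detector.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text, placeholder="[EMAIL]"):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


if __name__ == "__main__":
    print(redact_emails("contact jane.doe@example.com now"))
```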
Owning a Lake
----
Frequency of Loads (Ewan Leith)
Deleting Data on Demand (Ewan Leith)
Export, Integrations including Modeling and Search (Rob M)
Ongoing Maintenance and Pruning (Rob M)
Incremental Weight of Queries
File/Compression Formats
Partition Sizes
Fulfilling a DSR (Data Subject Request) (Helena Jackson)
Authentication and Granularity of Permissions (Tymac)
Facilitating Queries (Tim McNamara)
Managing Long Term Responsibility (Jacob O'Farrell)
Centralized Funding Model (Jacob O'Farrell)
Team Ownership of a Central Resource (Jacob O'Farrell)
Orchestrator and Deletion Workers (Torfinn Olsen)
Data Clawbacks (Randall Hunt)