aronchick/issues with data lakes

## gistfile1.txt
Inappropriate HW/SW stack
Mismatched driver versions
Crash looping deployment
Data/model versioning [Nick Walsh]
Non-standard images/OS version
Training pre-processing code doesn’t match production pre-processing
Production data doesn’t match training/test data
Output of the model doesn’t match application expectations
Hand-coded heuristics better than model [Adam Laiacano]
Model freshness (trained on out-of-date data/input shape changed)
Test/production statistics/population shape skew
Overfitting on training/test data
Bias introduced (or never tested for)
Over/under HW provisioning
Latency issues

Permissions/certs
Failure to obey health checks
Killed production model before rolling out the new one/rolled out in the wrong order
Thundering herd for new model
Logging to the wrong location
Storage for model not allocated properly/accessible by deployment tooling
Route to artifacts not available for download
API signature changes not propagated/expected
Cross-data center latency
Expected benefit doesn’t materialize (e.g. multiple components in the app change simultaneously)
Get wrong/no traffic because A/B config didn’t roll out
No CI/CD; manual changes untracked [Jon Peck]
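
One common way to soften the "thundering herd for new model" item above is to have each client wait a random, exponentially bounded delay before it hits the new deployment. The sketch below is only an illustration of that idea (often called full-jitter backoff); the function name `jittered_delay` and the `base`/`cap` defaults are assumptions, not something from the original list.

```python
import random


def jittered_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Return a random delay in seconds for the given retry attempt.

    The upper bound doubles with each attempt (base * 2**attempt) but never
    exceeds `cap`. Picking a uniform value below that bound spreads clients
    out instead of letting them all reconnect at the same instant.
    `base` and `cap` are hypothetical defaults chosen for illustration.
    """
    upper = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, upper)


if __name__ == "__main__":
    # Each client would sleep for its own delay before requesting the model.
    for attempt in range(5):
        print(attempt, round(jittered_delay(attempt), 3))
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests.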

Get too much traffic too soon (expected a canary/exponential rollout)
Outliers not predicted [MikeBSilverman]
Change was good, but wasn’t communicated to the rest of the team (so you must roll back)
No dates! (date to measure impact/improvement against a pre-agreed measure; date scheduled to assess data changes) [Mary Branscombe]
LACK OF DOCUMENTATION!! (the problem, the testing, the solution, lots more) [Terry Christiani]
Successful model causes pain elsewhere in the organization (e.g. detecting faults previously missed) [Mark Round]
Lack of visibility into real-time model behavior (detecting data drift, live data distribution vs train data, etc) [Nick Walsh]
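
For the skew and drift items above (test/production population skew, live data distribution vs. train data), one widely used rough signal is the Population Stability Index, which compares how a single numeric feature is binned in training data versus live data. This is a minimal sketch assuming non-empty lists of floats and equal-width bins taken from the training range; the function name `psi`, the bin count, and the `eps` floor are illustrative choices, and any alerting threshold would need to be agreed on by the team.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    `expected` is the training sample, `actual` the live sample.
    Bins are equal-width over the range of `expected`; live values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1
        # Floor each fraction at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    train = [i / 100 for i in range(100)]
    live = [0.5 + i / 200 for i in range(100)]  # shifted towards high values
    print("same data:", psi(train, train))
    print("shifted  :", psi(train, live))
```

Identical samples score zero, and the score grows as the live distribution moves away from the training one.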

## issues with data lakes
Before You Move
----
Bandwidth costs
Speed of insights
De/compression time
Ingestion time/cost
Removing PII
Sanitizing data (against Attacks)
Recording Metadata about Capture
Overloading Network
Changing Security Criteria
Defining a Long-term Schema in Advance (Ewan Leith)
Data Ordering
Distributed Caching Problems
Consistent Deletion / Duplication
Data Residency or Compliance Requirements (Andre)
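
As a small illustration of the "Removing PII" item above, the sketch below redacts e-mail addresses from free text before it is written to the lake. It is deliberately narrow: the pattern is a simplified assumption that will miss some valid addresses, and real PII removal covers far more than e-mail (names, phone numbers, IDs, and so on). The names `EMAIL_RE` and `redact_emails` are hypothetical.

```python
import re

# Simplified e-mail pattern, assumed for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text, placeholder="[REDACTED_EMAIL]"):
    """Replace every substring that looks like an e-mail address."""
    return EMAIL_RE.sub(placeholder, text)


if __name__ == "__main__":
    print(redact_emails("contact jane.doe@example.com now"))
```

Running redaction at ingestion time, rather than after the move, keeps the raw values out of the lake entirely.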

Owning a Lake
----
Frequency of Loads (Ewan Leith)
Deleting Data on Demand (Ewan Leith)
Export, Integrations including Modeling and Search (Rob M)
Ongoing Maintenance and Pruning (Rob M)
Incremental Weight of Queries
File/Compression Formats
Partition Sizes
Fulfilling a DSR (Data Subject Request) (Helena Jackson)
Authentication and Granularity of Permissions (Tymac)
Facilitating Queries (Tim McNamara)
Managing Long Term Responsibility (Jacob O'Farrell)
Centralized Funding Model (Jacob O'Farrell)
Team Ownership of a Central Resource (Jacob O'Farrell)
Orchestrator and Deletion Workers (Torfinn Olsen)
Data Clawbacks (Randall Hunt)
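
Several items above (deleting data on demand, fulfilling a DSR, deletion workers) depend on knowing where a given person's records live. One possible approach, sketched here under the assumption that ingestion can emit `(partition_path, subject_id)` pairs, is to keep an index from subject to partitions so a deletion worker knows which files to rewrite. The function names and the partition paths are made up for illustration.

```python
from collections import defaultdict


def build_subject_index(records):
    """Map each subject ID to the set of partitions holding its records.

    `records` is an iterable of (partition_path, subject_id) pairs,
    assumed to be produced at ingestion time.
    """
    index = defaultdict(set)
    for partition, subject_id in records:
        index[subject_id].add(partition)
    return index


def partitions_to_rewrite(index, subject_id):
    """Return the sorted partitions a deletion worker must rewrite."""
    return sorted(index.get(subject_id, set()))


if __name__ == "__main__":
    records = [
        ("dt=2024-01-01/part-0", "u1"),
        ("dt=2024-01-02/part-0", "u1"),
        ("dt=2024-01-01/part-0", "u2"),
    ]
    index = build_subject_index(records)
    print(partitions_to_rewrite(index, "u1"))
```

Without an index like this, each request tends to require a full scan of the lake, which is where the cost of on-demand deletion usually comes from.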