1. When good algorithms go bad. Panel with Josh Wills of Slack, Anu Tewari of Intuit, Jon Bruner of O'Reilly, moderated by Pete Skomoroch.
Pete asked: why are we surprised when things go wrong with real user data?
"I Wear the Black Hat" by Chuck Klosterman: idea that the villain is always the one who "knows the most and cares the least".
Josh said: our responsibility is to care. Example: a 2009 Google toolbar app that provided info on browsing habits (an early version of ad re-targeting) was deemed "too creepy to launch." Then someone else did it and "no one cared" - maybe because when the ads are useful, it seems less intrusive?
Elon Musk's manifesto on how it's necessary to deploy before perfect (the good outweighs the risks)
Anu said: Intuit's founder Scott Cook says they never do anything that's only for Intuit, it's always for the benefit of the user.
*something about recommender systems not showing women ads for high-paying jobs?
Editorial judgment: how to anticipate bias and address it
"The enemy is us"
Ex. How to make chat bots that won't be abused: don't be too clever **
EU rule that algorithms have to be explainable
Facebook has an IRB
Josh says they did worse things at Google and just didn't tell anyone. Pete says at LinkedIn they decided not to do things.
Idea that when money is the driver, there's a lot of pressure.
2. Jeremy Stanley at Instacart
Optimizing delivery:
- delivery fee
- tips to shoppers (helps cover wages)
- product relationships
- retail partnerships
- transactions & insurance
- shopping time
- delivery time (want to lower number of minutes per delivery)
(Note: he didn't mention anything about gas/car maintenance costs/public transportation costs? Do the shoppers swallow that cost?)
Variance: weather, special events, traffic
"Every minute counts" as one of their internal slogans
Supply & demand. What's demand? Measure orders & visitors; want to predict checkout vs. non-conversion.
Forecasting: not much success with time series models (too volatile), so they've been preferring simpler models, with simulation and events added in.
Scheduling shopping --> grouping deliveries. Variance is more important than the mean. GBMs updated in real time, optimized against probability rather than actual time to completion.
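A minimal sketch of the "optimize against probability" idea using scikit-learn gradient boosting; every feature name, threshold, and data point below is invented for illustration, not Instacart's actual model.

```python
# Train a gradient-boosted model to predict P(delivery is late), then
# schedule against that probability instead of a point estimate of minutes.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 2000
# invented features: items in order, distance (miles), local demand load
X = np.column_stack([
    rng.integers(1, 40, n),        # item_count
    rng.uniform(0.5, 10.0, n),     # distance_miles
    rng.uniform(0.0, 1.0, n),      # demand_load
])
# synthetic label: big/far orders under heavy load are more often late
late = (0.02 * X[:, 0] + 0.05 * X[:, 1] + 0.8 * X[:, 2]
        + rng.normal(0, 0.3, n)) > 1.0

model = GradientBoostingClassifier().fit(X, late)
p_late = model.predict_proba(X)[:, 1]   # schedule against this probability

# e.g. only batch an order into a multi-delivery trip if P(late) stays low
batchable = p_late < 0.2
print(f"{batchable.mean():.0%} of orders safely batchable")
```

Optimizing against the probability of lateness rather than the expected minutes is what makes variance, not the mean, the quantity being controlled.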
(Q: how much of Bay Area traffic is individual ride service/delivery service drivers?)
Routing: greedy heuristics, update every minute, delay dispatching until the end ~4 sub problems optimized, e.g. Batching deliveries
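The greedy batching heuristic might look roughly like this toy sketch; the distance threshold, batch size, and flat-plane distance model are all invented for illustration.

```python
# Toy greedy routing: delay dispatch, then greedily batch nearby pending
# deliveries onto one shopper, seeded by the oldest order.
from math import hypot

def greedy_batches(orders, max_batch=3, max_dist=2.0):
    """orders: list of (order_id, x, y); returns batches as lists of ids."""
    pending = list(orders)
    batches = []
    while pending:
        seed = pending.pop(0)                  # oldest order seeds a batch
        batch = [seed]
        # greedily pull in the nearest orders within max_dist of the seed
        pending.sort(key=lambda o: hypot(o[1] - seed[1], o[2] - seed[2]))
        while pending and len(batch) < max_batch:
            cand = pending[0]
            if hypot(cand[1] - seed[1], cand[2] - seed[2]) <= max_dist:
                batch.append(pending.pop(0))
            else:
                break
        batches.append([o[0] for o in batch])
    return batches

print(greedy_batches([("a", 0, 0), ("b", 0.5, 0.5), ("c", 5, 5), ("d", 0.2, 0.1)]))
```

Re-running a cheap heuristic like this every minute, rather than solving a global routing problem once, is what makes "delay dispatching until the end" practical.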
20% fewer late deliveries, 20% faster, shoppers 15% more busy (~85% utilized). If shoppers are too busy, orders get lost.
15 product teams, mission-driven, each has 1 analyst, 1 PM, 1 DS, 1 designer, 1-2 engineers, etc. Working groups for cross-team changes
Urgent: set clear goals, be uncomfortable
Timeliness vs. completeness of orders for customer happiness (don't be too early or even close to late - minimize variance)
3. Moritz Sudhof - Kanjoya
We know all kinds of factoids about basketball players; why don't we know as much about our MDs or engineers? How to present actionable insights for a) performance management and b) employee engagement.
For (a): the 5-pt scale was invented in the 1930s, pay for performance in the 1960s, and the conclusion that it doesn't work came in the 2000s. For (b): only since ~1970s; a proxy for/related to productivity/retention.
Recency bias, confirmation bias, unconscious bias
GIGO: surveys actually decrease employee trust (long, multiple choice, not actionable)
Measuring the wrong things, no why
Open-ended comments are the right source; NLP for attitude, topic, and sentiment analysis as a speed bump against confirmation bias
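A toy illustration of the "speed bump" idea: score a comment's sentiment before the reader forms a judgment, so a surprising score can flag possible confirmation bias. Kanjoya's actual NLP is far richer; the lexicon and comments here are invented.

```python
# Crude lexicon-based sentiment on open-ended review comments.
POSITIVE = {"great", "helpful", "clear", "supportive", "strong"}
NEGATIVE = {"late", "unclear", "dismissive", "weak", "missed"}

def sentiment(comment: str) -> float:
    """Score in [-1, 1]: (pos - neg) / matched words; 0 if no matches."""
    words = [w.strip(".,") for w in comment.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

print(sentiment("Great mentor, always clear and supportive."))   # positive
print(sentiment("Dismissive in reviews and late on feedback."))  # negative
```

If a manager's rating disagrees sharply with the aggregated comment sentiment, that mismatch is the prompt to slow down and re-read.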
UI errors can break down trust
Make suggestions & provide training for improvement
Collaborating with Anita Borg Institute on unconscious bias
4. Sanny Liao: IoT at IFTTT
Flic button for triggering a fake emergency call. Big increase in IoT in the last 2 years.
Most likely starter category is home (~60%), then fitness/wearables (~21%), DIY (7%), security & monitoring (4%), car (3%). But security & monitoring users are the most connected (avg # of devices per user); in fact connectedness goes in approximately the reverse order, though the differences are small.
US adoption of IoT devices (people who have at least 1) is only ~20%; UK is ~30%; Netherlands, Denmark, and Switzerland are >40% (partly due to government efforts for smart cities).
Most people want to have their devices on a schedule (connected home): they don't want to control them, they want them to respond to predictable factors (time, weather).
About 30% of users disappear after 30 days
See blog.ifttt.com
5. Panel: metrics before models. Michelle Casbon from Qordoba, Leah McGuire from Salesforce, Xiangrui Meng from Databricks, moderated by Sean Owen.
Topic came from a comment Leah made: "Metrics are the unit testing of DS".
- Counts; for the ETL pipeline, check data at in & out; for ML, check transformations for skewing and how models are performing, training vs. actual (clicks --> engagement)
Want to avoid silent failures. Sanity checks should go in early, along with actual unit tests. Cross-validation; re-apply and check for reproducibility.
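The sanity-check idea might look like this sketch: cheap assertions on a batch at pipeline entry and exit so failures are loud rather than silent. Field names, thresholds, and the toy transformation are invented.

```python
# Sanity checks at the in- and out-points of a toy ETL step.
def check_batch(rows, expected_min=1):
    assert len(rows) >= expected_min, f"too few rows: {len(rows)}"
    for r in rows:
        assert r["clicks"] >= 0, f"negative clicks: {r}"
        assert 0.0 <= r["ctr"] <= 1.0, f"ctr out of range: {r}"
    return True

def transform(rows):
    # example transformation: derive an engagement score from clicks
    return [dict(r, engagement=min(1.0, r["clicks"] / 100)) for r in rows]

batch = [{"clicks": 5, "ctr": 0.02}, {"clicks": 40, "ctr": 0.1}]
check_batch(batch)               # check data going in
out = transform(batch)
check_batch(out)                 # ...and coming out
assert all(0.0 <= r["engagement"] <= 1.0 for r in out)
print("all checks passed")
```

The point is that these run on every batch in production, not just in a test suite, so a skewed transformation fails loudly the day it appears.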
"Maybe just hire a DS who can write production code"
Leah says the idea on her team is to have DS-DE collaborations that go end-to-end. Michelle says to ask more questions re: scalability details.
Q: have you ever faced problems where Java/Python wasn't fast enough & you needed HPC? Michelle said sometimes you're better off simplifying your model/sacrificing some accuracy for speed
Josh Wills asked re: logging & instrumentation, saying analysts are bad at it & engineers are worse; how do we turn that into a skill?
Leah said to focus on how it will be used Michelle said to make it pain-free, add a layer to simplify implementation
5. Mohammed Saffar - Arimo
Deep learning for human behavior
Hidden patterns in clickstream time series data for customer segmentation
Feature definition is the hard part: use NN to learn features Ex. Word2vec embedding (similar words grouped together)
Old TS analysis: aggregate and filter features, lose data
New: let models decide what to remember. Can handle messier data (different length sequences). Similar to word2vec, summarize as embeddings
2 months of clickstream data from 1.5M customers
Hadoop --> Spark for ETL --> distributed TensorFlow for unsupervised learning (on GPUs). Used t-SNE for dimensionality reduction.
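A rough sketch of the embed-then-inspect step. The talk used word2vec-style training on distributed TensorFlow; this illustration substitutes a simple SVD embedding in scikit-learn, on invented sessions, to show the shape of the pipeline.

```python
# Embed page tokens from a session-by-token count matrix, then project the
# session embeddings to 2-D with t-SNE for visual inspection.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

sessions = [                       # invented clickstreams, one per customer
    "home search product cart checkout",
    "home search product product cart",
    "home blog blog home",
    "home search product checkout",
    "home blog about blog",
]
counts = CountVectorizer().fit_transform(sessions)        # session x token
emb = TruncatedSVD(n_components=3, random_state=0).fit_transform(counts)
xy = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(emb)
print(xy.shape)                    # one 2-D point per session
```

As with word2vec, the useful property is that sessions with similar browsing behavior land near each other in the embedding, which is what makes the t-SNE plot segment customers.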
Working on predictions, like fraud detection, cart abandonment
Q: metrics to assess clusters? A: similarity (homogeneity) within clusters, and distance between clusters. Q: reason codes (how do you explain your results)? A: don't use deep learning if you have to explain.
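The cluster-quality answer (within-cluster homogeneity vs. between-cluster distance) is exactly what the standard silhouette score summarizes; a quick sketch on invented blobs.

```python
# Silhouette score: near 1 means tight, well-separated clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# two well-separated invented blobs
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)
print(round(score, 3))
```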
6. Joel Grus - Aristo
FizzBuzz with TensorFlow
Matrix multiplication solution was cute
Tf: what you'd expect
Linear regression and logistic regression gave the same results; NN w/1 hidden layer
Binary encoding > one-hot encoding
100 hidden units was 96% right; 200 was overfit
Deep learning [2000, 2000] ~ 200 epochs
Bit swaps for equivalence classes
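The two encodings, as I understood them from the talk, sketched in NumPy (the 10-digit width is enough for inputs up to 1023; targets are a 4-way one-hot over {number, "fizz", "buzz", "fizzbuzz"}):

```python
import numpy as np

def binary_encode(i: int, num_digits: int = 10) -> np.ndarray:
    """Represent i by its binary digits, least significant first."""
    return np.array([(i >> d) & 1 for d in range(num_digits)])

def fizz_buzz_encode(i: int) -> np.ndarray:
    """One-hot target: [number, fizz, buzz, fizzbuzz]."""
    if i % 15 == 0: return np.array([0, 0, 0, 1])
    if i % 5 == 0:  return np.array([0, 0, 1, 0])
    if i % 3 == 0:  return np.array([0, 1, 0, 0])
    return np.array([1, 0, 0, 0])

print(binary_encode(3))       # [1 1 0 0 0 0 0 0 0 0]
print(fizz_buzz_encode(15))   # [0 0 0 1]
```

Binary inputs beating one-hot makes sense: divisibility by 3 and 5 is a function of the bit pattern, while a one-hot input gives the network nothing to generalize from.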
7. Michael Bentley - from Lookout
Detecting malware at scale
Teams writing malware can be huge & well-funded. Detection on static data (Android hashes, strings/rules, malicious class paths, signed certificates). Tracking & crawlers for discovery. Heuristics are great for stable things, but don't work on more professional malware.
Use pairwise similarity; nearest-neighbor clustering can be better than heuristics. Informative to visualize over time (D3). Scaling on Elastic MapReduce (EMR): 3500 apps --> 12.25M comparison operations. S3 lesson: CSV sizes matter. This analysis doesn't work well on small applications, because support libraries are all the same.
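Pairwise similarity over static features can be sketched with Jaccard similarity on extracted string/class-path sets; the apps, their contents, and the linking threshold here are all invented.

```python
# Jaccard similarity between apps' static string sets, then link near-
# duplicates (nearest-neighbor style) instead of writing per-family rules.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

apps = {
    "app1": {"com/evil/C2", "libssl", "com/ads/Track"},
    "app2": {"com/evil/C2", "libssl", "com/ads/Track", "extra"},
    "app3": {"com/game/Main", "libpng"},
}
names = sorted(apps)
pairs = [(a, b, jaccard(apps[a], apps[b]))
         for i, a in enumerate(names) for b in names[i + 1:]]
similar = [(a, b) for a, b, s in pairs if s > 0.5]   # invented threshold
print(similar)
```

The small-app caveat from the talk shows up directly here: when most of an app's set is shared support-library strings, every pair looks similar and the signal disappears.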
Gilad Lotan's PyCon talk on graphs. Gensim - Python library. "Outside the Closed World" by Robin Sommer.
8. Chris Diehl - Data Guild
Digital vulnerability, data disclosure (willing and unwilling). Idea that gender can be identified from mouse movements (??). "In Russian there's no word for privacy." Risks in existing sociotechnical systems.
On one end: marginalization, discrimination, and unwitting disclosure. On the other end: opportunity, privilege, and deliberate disclosure.
If you have enough privilege, there's no reason not to disclose everything (no risk)
Disclosure risk: ex. secure reporting for victims of sexual assault in the military so the VA can do follow-up care. How do you make the user feel safe?
Prediction risk: Minority Report. Kosinski, Stillwell & Graepel, PNAS. Algorithms reinforcing existing bias.
Actuation risk: IoT systems on the network. Protecting critical infrastructure, ex. healthcare data breaches.
Extremes are: no personalization on one end, measure everything on the other Middle ground? Private, fair, interpretable, equitable
Potential solutions:
- computations on encrypted data
- joint computation over distributed data without directly sharing it
- need trusted OSS crypto
- differential privacy, e.g. Apple's approach
- need best practices and design patterns
bit.ly/equitable_inference_experiment
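The differential-privacy bullet can be illustrated with classic randomized response, the simplest local-DP mechanism; the probability parameter and population are invented, and this is not Apple's actual implementation.

```python
# Randomized response: each user flips their true bit with known probability,
# so any individual answer is deniable but the population rate is recoverable.
import random

def randomize(bit: int, p_truth: float = 0.75) -> int:
    return bit if random.random() < p_truth else 1 - bit

def estimate_rate(reports, p_truth: float = 0.75) -> float:
    """Invert the noise: E[report] = (2p-1)*rate + (1-p)."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

random.seed(0)
true_bits = [1] * 300 + [0] * 700          # true population rate 0.30
reports = [randomize(b) for b in true_bits]
print(round(estimate_rate(reports), 2))    # close to 0.30
```

This is "measure everything" and "private" meeting in the middle: the aggregate statistic survives while no single report is trustworthy on its own.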
Q: What companies have done a good job? A: don't know, lot of lip service for vague "transparency"
9. Kirstin Aschbacher - Jawbone
Wearables as a way to help prevent chronic disease Track steps, sleep, meals, moods, etc.
Clinical health psychologist by training. Ex. a patient with chronic pain: helped him track what affected it; got to bed earlier <-- 15-minute walk outside in the AM.
85% of Jawbone users want to lose weight. Obesity is 2x worse from 1990-2014, and not just in the US but worldwide. 1st time in history that overweight is actually killing more people than underweight.
Trying to teach self-regulation Act in long-term best self-interest Align actions with values
Arnsten, Nat. Rev. Neurosci. 2009
Things that wear us down: threats, resource depletion, inhibition or overwhelm of the reward system
Heatherton & Wagner, Trends CogSci 2011
Study with kids & marshmallows (reward: 2 marshmallows for waiting)
72% of Jawbone UP users set a goal; 62% lose some weight (~25% of goal) by week 12
Built a multilevel regression model. Randomized controlled trial b/c correlation is not causation.
Log your food Evening snacking is predictive of weight gain
Normalize (everybody feels this way); counteract cognitive biases; log; make it easy; be empathetic, do not shame
Under-reporters weigh more (Ventura, Obesity, 2006)
Ground truth/reality vs. self-reporting
Measuring over months, only 50% of people are still entering data
What is predictive of making progress on goals (effect size):
- heavier starting weight: 15%
- time since starting: 12%
- higher food score (proprietary measure of food quality): 11%
- more steps: 10%
- fewer calories/meal: 9%
- longer sleep: 4%
Generally better not to recommend specific foods, unless it is something they are already eating Build a relationship with your wearables so they can know you
More health than fitness, more prevention than disease
Q: publicly announce intention? A: short-term helps, long-term not good b/c many people regain (shame)
Q: social network effect? A: yes, you can duel your friends in the app
10. Sandy Ryza - Clover
Clinical operations: take your meds, see doctors, get tests. Insurance operations: approve claims, authorizations, catch fraud.
Mostly insure seniors with chronic illness, want to minimize trips to the hospital
Strategies: fear, target worst members, polite reminders, segment (rich vs. poor) messaging
DS are ogres: "walk around in the dark and hit stuff with blunt instruments"
Nurse Practitioners do home visits, help close gaps in care
1/3 of all clients have diabetes
Use historical data --> 1 month project
predictions: who will have complications in next 6 months
14 features, including age, hypertension, HbA1c (glycated hemoglobin**), "takes insulin"
Only 8% have complications in a 6 month interval
Accuracy high, precision/recall difficult
Lots of missing data, e.g. HbA1c missing in 40% of cases
Unknown selection bias (could be for good or bad reasons). Imputation didn't work. Hard-coded missing values to an irrelevant value: bad for linear models, fine for decision trees/RFs; ended up with GBT.
scikit-learn object --> A/B test: call randomly vs. call people based on the model
AUROC 0.8 (0.5 = chance). Precision 25%, recall 66%.
Evaluated by chi-squared test
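The modeling choices above (sentinel value for missing HbA1c, gradient-boosted trees, AUROC/precision/recall evaluation) might be sketched like this; the features, data, and sentinel are synthetic inventions, not Clover's actual model or data.

```python
# Hard-code missing HbA1c to an irrelevant sentinel (fine for trees, bad
# for linear models), fit GBT, and report AUROC / precision / recall.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, precision_score, recall_score

rng = np.random.default_rng(0)
n = 3000
age = rng.uniform(65, 90, n)
hba1c = rng.uniform(5.0, 12.0, n)
hba1c[rng.random(n) < 0.4] = -999.0    # ~40% missing --> sentinel value
# synthetic outcome: older age and higher observed HbA1c raise risk
score = 0.1 * (age - 65) + np.where(hba1c > 0, hba1c - 5, 1.5)
y = score + rng.normal(0, 1.5, n) > np.quantile(score, 0.92)  # rare outcome

X = np.column_stack([age, hba1c])
model = GradientBoostingClassifier(random_state=0).fit(X, y)
p = model.predict_proba(X)[:, 1]
auc = roc_auc_score(y, p)
prec = precision_score(y, p > 0.5, zero_division=0)
rec = recall_score(y, p > 0.5, zero_division=0)
print(f"AUROC {auc:.2f}  precision {prec:.2f}  recall {rec:.2f}")
```

With only ~8% positives, accuracy is uninformative by construction, which is why the talk reported AUROC plus precision/recall instead.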
Q: did this include any unstructured data? A: No Q: why 6 months? A: operational reasons
11. Abe Gong - Aspire Health
@abegong, #ethicalalgorithms
Inspired by Clare Corthell's talk last year
Algorithms as gatekeeping for opportunities, e.g. house hunting - credit score; admissions to universities - SAT
Data science version 1: internet advertising! Version 2: insurance & IoT: can't opt out
Parole: the proprietary COMPAS algorithm predicts recidivism from 127 Qs. Not supposed to be used in convictions/sentencing, but is used for assignment to high/low-security prison.
ProPublica: 18k records --> all on github
Some of the Qs are beyond the control of the person (even if they're predictive). (A later question from Sanny pointed out that propensity scoring could help control for that.)