Notes from WrangleConf 2016 in San Francisco
  1. When good algorithms go bad. Panel with Josh Wills of Slack, Anu Tewari of Intuit, and Jon Bruner of O'Reilly, moderated by Pete Skomoroch.

Pete asked: why are we surprised when things go wrong with real user data?

"I wear the black hat" by Chuck Closterman, idea that the villain is always the one who "knows the most and cares the least".

Josh said: our responsibility is to care. Example: a 2009 Google toolbar app that provided info on browsing habits (an early version of ad re-targeting) was deemed "too creepy to launch". Then someone else did it and "no one cared", maybe because when the ads are useful they seem less intrusive?

Elon Musk manifesto on how it's necessary to deploy before things are perfect (that the good outweighs the risks)

Anu said: Intuit's founder Scott Cook says they never do anything that's only for Intuit, it's always for the benefit of the user.

*something about recommender systems not showing women ads for high-paying jobs?

Editorial judgment: how to anticipate bias and address it

"The enemy is us"

Ex. How to make chat bots that won't be abused: don't be too clever **

EU rule that algorithms have to be explainable

Facebook has an IRB

Josh says they did worse things at Google and just didn't tell anyone. Pete says at LinkedIn they decided not to do things.

Idea that when money is the driver, there's a lot of pressure.

2. Jeremy Stanley at Instacart

Optimizing delivery:
- delivery fee
- tips to shoppers (helps cover wages)
- product relationships
- retail partnerships
- transactions & insurance
- shopping time
- delivery time (want to lower the number of minutes per delivery)

(Note: he didn't mention anything about gas/car maintenance costs/public transportation costs? Do the shoppers swallow that cost?)

Variance: weather, special events, traffic

"Every minute counts" as one of their internal slogans

Supply & demand. What's demand? Measure orders & visitors; want to predict checkouts vs. non-conversions.

Forecasting: not much success with time series models (too volatile), so they've been preferring simpler models, simulating, and adding in events

Scheduling shopping --> grouping deliveries. Variance is more important than the mean. GBMs updated in real time, optimized against probability rather than the actual time completed.
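
(My toy sketch of that kind of model, not Instacart's code: a gradient-boosted classifier scoring the probability a delivery lands on time, with made-up feature names.)

```python
# Sketch only: hypothetical features, not Instacart's actual pipeline.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical features: items in order, store busyness, driving minutes, hour of day
X = np.column_stack([
    rng.integers(1, 60, n),        # items in order
    rng.random(n),                 # store busyness score
    rng.normal(15, 5, n),          # estimated driving minutes
    rng.integers(0, 24, n),        # hour of day
])
y = rng.integers(0, 2, n)          # 1 = delivered on time (placeholder labels)

model = GradientBoostingClassifier().fit(X, y)

# Optimize against probability of on-time completion, not the raw time estimate
p_on_time = model.predict_proba(X[:5])[:, 1]
print(p_on_time)
```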

(Q: how much of Bay Area traffic is individual ride service/delivery service drivers?)

Routing: greedy heuristics, updated every minute; delay dispatching until the end. ~4 sub-problems optimized, e.g. batching deliveries.
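
(Toy illustration of the greedy idea, my sketch only: assign each pending delivery to whichever shopper can reach it soonest, then repeat.)

```python
# Toy greedy dispatch: each pending delivery goes to the shopper who can reach it soonest.
# Purely illustrative; real routing would re-run every minute with live data.
from math import hypot

shoppers = {"a": (0.0, 0.0), "b": (5.0, 5.0)}          # shopper id -> current (x, y)
deliveries = [(1.0, 1.0), (4.0, 6.0), (0.5, 2.0)]      # pending delivery locations

assignments = []
for loc in deliveries:
    best = min(shoppers, key=lambda s: hypot(shoppers[s][0] - loc[0], shoppers[s][1] - loc[1]))
    assignments.append((best, loc))
    shoppers[best] = loc                               # shopper moves to the delivery point

print(assignments)
```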

Results: 20% fewer late deliveries, 20% faster, shoppers 15% busier (~85% utilized). If shoppers are too busy, orders get lost.

15 product teams, mission-driven; each has 1 analyst, 1 PM, 1 DS, 1 designer, 1-2 engineers, etc. Working groups handle cross-team changes.

Urgent: set clear goals, be uncomfortable

Timeliness vs. completeness of orders for customer happiness (don't be too early or even close to late - minimize variance)

3. Moritz Sudhof - Kanjoya

We know all kinds of factoids about basketball players, so why don't we know as much about our MDs or engineers? How to present actionable insights for (a) performance management and (b) employee engagement.

For (a): the 5-point scale was invented in the 1930s, pay-for-performance in the 1960s, and the conclusion that it doesn't work came in the 2000s. For (b): only since ~the 1970s; a proxy for/related to productivity and retention.

Recency bias, confirmation bias, unconscious bias

GIGO: surveys actually decrease employee trust (long, multiple choice, not actionable)

Measuring the wrong things, no why

Open-ended comments are the right source, NLP for attitude, topic, sentiment analysis as a speed bump against confirmation bias

UI errors can break down trust

Make suggestions & provide training for improvement

Collaborating with Anita Borg Institute on unconscious bias

4. Sanny Liao: IoT at IFTTT

Flic button for triggering a fake emergency call. Big increase in IoT in the last 2 years.

Most likely starter category is home (~60%), then fitness/wearables (~21%), DIY (7%), security & monitoring (4%), car (3%). But security & monitoring users are the most connected (avg # devices per user). (Actually connectedness goes in approximately the reverse order, but the differences are small.)

US adoption of IoT devices (people who have at least 1) is only ~20%; UK is ~30%; Netherlands, Denmark, and Switzerland are >40% (partly due to government efforts for smart cities).

Most people want to have their devices on a schedule (connected home): don't want to control it, they want it to respond to predictable factors (time, weather)

About 30% of users disappear after 30 days

See blog.ifttt.com

  5. Panel: metrics before models. Michelle Casbon from Qordoba, Leah McGuire from Salesforce, Xiangrui Meng from Databricks, moderated by Sean Owen.

Topic came from a comment Leah made: "Metrics are the unit testing of DS"

Counts; for ETL pipelines, check data at input & output; for ML, check transformations for skewing, how models are performing, and training vs. actual (clicks --> engagement)

Want to avoid silent failures. Sanity checks should go in early, along with actual unit tests. Cross-validation; re-apply and check for reproducibility.
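
(A minimal sketch of what "metrics as unit tests" could look like in practice, assuming made-up thresholds and column names, not anything the panelists showed.)

```python
# Sketch of "metrics as unit tests": cheap sanity checks around an ETL/feature step.
# Thresholds and column names here are made up for illustration.
import pandas as pd

def check_counts(df_in: pd.DataFrame, df_out: pd.DataFrame, max_drop: float = 0.01) -> None:
    """Fail loudly (no silent failures) if too many rows vanish between input and output."""
    dropped = 1 - len(df_out) / max(len(df_in), 1)
    assert dropped <= max_drop, f"ETL dropped {dropped:.1%} of rows"

def check_skew(before: pd.Series, after: pd.Series, tol: float = 0.25) -> None:
    """Flag transformations that shift a feature's mean by more than `tol` standard deviations."""
    shift = abs(after.mean() - before.mean()) / (before.std() + 1e-9)
    assert shift <= tol, f"Transformation skewed feature by {shift:.2f} sd"

df_in = pd.DataFrame({"clicks": [1, 2, 3, 4, 5]})
df_out = df_in[df_in["clicks"] > 0]
check_counts(df_in, df_out)
check_skew(df_in["clicks"], df_out["clicks"])
```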

"Maybe just hire a DS who can write production code"

Leah says the idea on her team is to have DS-DE collaborations that go end-to-end. Michelle says to ask more questions re: scalability details.

Q: have you ever faced problems where Java/Python wasn't fast enough & you needed HPC? A: Michelle said sometimes you're better off simplifying your model/sacrificing some accuracy for speed.

Josh Wills asked re: logging & instrumentation, saying analysts are bad at it & engineers are worse: how do we turn that into a skill?

Leah said to focus on how it will be used. Michelle said to make it pain-free and add a layer to simplify implementation.

  6. Mohammed Saffar - Arimo

Deep learning for human behavior

Hidden patterns in clickstream time series data for customer segmentation

Feature definition is the hard part: use a NN to learn features. Ex.: word2vec embeddings (similar words grouped together).

Old time-series analysis: aggregate and filter features, losing data

New: let models decide what to remember. Can handle messier data (different length sequences). Similar to word2vec, summarize as embeddings

2 months of clickstream data from 1.5M customers

Hadoop --> Spark for ETL --> distributed TensorFlow for unsupervised learning (on GPUs). Used t-SNE for dimensionality reduction.
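
(Toy version of the embedding idea using gensim's Word2Vec on clickstream "sentences" plus scikit-learn's t-SNE, not Arimo's distributed TensorFlow pipeline; gensim 4.x API assumed.)

```python
# Treat each user's clickstream as a "sentence", learn event embeddings word2vec-style,
# then project with t-SNE. Event names and sessions are invented for illustration.
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

sessions = [
    ["home", "search", "product", "add_to_cart", "checkout"],
    ["home", "product", "product", "add_to_cart"],
    ["home", "search", "search", "exit"],
]

w2v = Word2Vec(sessions, vector_size=16, window=3, min_count=1, epochs=50)
events = list(w2v.wv.index_to_key)
vectors = w2v.wv[events]

coords = TSNE(n_components=2, perplexity=2, init="random", random_state=0).fit_transform(vectors)
for event, (x, y) in zip(events, coords):
    print(event, round(float(x), 2), round(float(y), 2))
```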

Working on predictions, like fraud detection, cart abandonment

Q: metrics to assess clusters? A: similarity (homogeneity) within clusters, and distance between clusters.

Q: reason codes (how do you explain your results)? A: don't use deep learning if you have to explain.

  7. Joel Grus - Aristo

Fizz buzz with TensorFlow

Matrix multiplication solution was cute

TF: what you'd expect

Linear regression and logistic regression gave the same results; then a NN w/ 1 hidden layer

Binary encoding > one-hot encoding

100 hidden units was 96% right; 200 was overfit

Deep learning [2000, 2000] ~ 200 epochs

Bit swaps for equivalence classes
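
(Sketch of the encodings behind the talk, minus the actual network: 10-bit binary inputs and a 4-class fizz/buzz label.)

```python
# Encodings only, no model: inputs as 10-bit binary vectors, labels as one of 4 classes.
import numpy as np

NUM_DIGITS = 10

def binary_encode(i: int) -> np.ndarray:
    """Encode an integer as its low NUM_DIGITS bits (beats one-hot for generalization)."""
    return np.array([(i >> d) & 1 for d in range(NUM_DIGITS)])

def fizz_buzz_encode(i: int) -> int:
    """Class label: 0 = the number itself, 1 = 'fizz', 2 = 'buzz', 3 = 'fizzbuzz'."""
    if i % 15 == 0:
        return 3
    if i % 5 == 0:
        return 2
    if i % 3 == 0:
        return 1
    return 0

X = np.array([binary_encode(i) for i in range(101, 1024)])   # train above 100, predict 1..100
y = np.array([fizz_buzz_encode(i) for i in range(101, 1024)])
print(X.shape, y.shape)
```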

  8. Michael Bentley - Lookout

Detecting malware at scale

Teams writing malware can be huge & well-funded. Detection on static data (Android hashes, strings/rules, malicious class paths, signed certificates). Tracking & crawlers for discovery. Heuristics are great for stable things, but don't work on more professional malware.

Use pairwise similarity. Nearest-neighbor clustering can be better than heuristics. Informative to visualize over time (D3). Scaling on Elastic MapReduce (EMR): 3500 apps --> 12.25 M operations to compare. S3 lesson: CSV sizes matter. This analysis doesn't work well on small applications, because support libraries are all the same.
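
(Toy version of the pairwise-similarity idea, not Lookout's system: Jaccard similarity over each app's set of strings/class paths, then a nearest neighbor per app.)

```python
# Toy illustration: Jaccard similarity between apps based on their static artifacts.
from itertools import combinations

apps = {
    "app_a": {"com/foo/Bar", "sendSms", "http://ads.example"},
    "app_b": {"com/foo/Bar", "sendSms", "http://ads.example", "readContacts"},
    "app_c": {"com/ui/Button", "openFile"},
}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# All pairwise scores (this is the part that blows up: n apps -> n*(n-1)/2 comparisons)
scores = {(x, y): jaccard(apps[x], apps[y]) for x, y in combinations(apps, 2)}

# Nearest neighbor per app
for name in apps:
    nearest = max((o for o in apps if o != name), key=lambda o: jaccard(apps[name], apps[o]))
    print(name, "->", nearest, round(jaccard(apps[name], apps[nearest]), 2))
```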

Gilad Lotan's PyCon talk on graphs. Gensim - Python library. "Outside the Closed World" by Robin Sommer.

  9. Chris Diehl - Data Guild

Digital vulnerability, data disclosure (willing and unwilling). Idea that gender can be identified from mouse movements (??). "In Russian there's no word for privacy." Risks in existing sociotechnical systems.

On one end: marginalization, discrimination, and unwitting disclosure. On the other end: opportunity, privilege, and deliberate disclosure.

If you have enough privilege, there's no reason not to disclose everything (no risk)

Disclosure risk: ex. secure reporting for victims of sexual assault in the military so the VA can do follow-up care. How do you make the user feel safe?

Prediction risk: Minority Report. Kosinski, Stillwell & Graepel, PNAS. Algorithms reinforcing existing bias.

Actuation risk: IoT systems on the network. Protecting critical infrastructure, ex. healthcare data breaches.

Extremes are: no personalization on one end, measure everything on the other. Middle ground? Private, fair, interpretable, equitable.

Potential solutions:
- computations on encrypted data
- joint computation over distributed data without directly sharing it
- need trusted OSS crypto
- differential privacy, e.g. Apple's approach (see the sketch below)
- need best practices and design patterns
bit.ly/equitable_inference_experiment
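
(Textbook sketch of the differential-privacy idea via the Laplace mechanism on a count query; Apple's actual approach is local differential privacy, which works differently.)

```python
# Laplace mechanism on a count query: add noise calibrated to the query's sensitivity.
import numpy as np

def private_count(values, predicate, epsilon: float = 0.5) -> float:
    """Return a noisy count; the sensitivity of a count is 1, so noise scale is 1/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 61, 38]
print(private_count(ages, lambda a: a > 40))   # noisy answer; smaller epsilon = more noise
```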

Q: What companies have done a good job? A: don't know; a lot of lip service for vague "transparency".

  10. Kirstin Aschbacher - Jawbone

Wearables as a way to help prevent chronic disease. Track steps, sleep, meals, moods, etc.

Clinical health psychologist by training. Ex.: a patient with chronic pain, helped him track what affected it. Got to bed earlier <-- 15-minute walk outside in the AM.

85% of Jawbone users want to lose weight. Obesity is 2x worse from 1990 to 2014, and not just in the US but worldwide. First time in history that overweight is actually killing more people than underweight.

Trying to teach self-regulation: act in long-term best self-interest, align actions with values.

Arnsten, Nat. Rev. Neurosci. 2009

Things that wear us down: threats, resource depletion, inhibition of reward/an overwhelmed reward system

Heatherton & Wagner, Trends CogSci 2011

Study with kids & marshmallows (reward of 2 marshmallows for waiting)

72% of Jawbone UP users set a goal; 62% lose some weight (~25% of goal) by week 12

Built a multilevel regression model (sketch after the predictors list below). Randomized controlled trial b/c correlation is not causation.

Log your food. Evening snacking is predictive of weight gain.

Normalize (everybody feels this way). Counteract cognitive biases. Log. Make it easy. Be empathetic, do not shame.

Under-reporters weigh more (Ventura, Obesity, 2006)

Ground truth/reality vs. self-reporting

Measuring over months, only 50% of people are still entering data

What is predictive of making progress on goals (effect size):
- heavier starting weight: 15%
- time since starting: 12%
- higher food score (proprietary measure of food quality): 11%
- more steps: 10%
- fewer calories/meal: 9%
- longer sleep: 4%
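
(Sketch of what a multilevel regression like the one described above might look like, using statsmodels mixed effects with synthetic data and made-up column names, not Jawbone's actual model.)

```python
# Multilevel (mixed-effects) regression sketch: repeated weekly measures per user,
# with a random intercept per user. All data and column names are synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_users, weeks = 50, 12
df = pd.DataFrame({
    "user_id": np.repeat(np.arange(n_users), weeks),
    "week": np.tile(np.arange(weeks), n_users),
    "steps": rng.normal(8000, 2000, n_users * weeks),
    "food_score": rng.normal(0, 1, n_users * weeks),
    "sleep_hours": rng.normal(7, 1, n_users * weeks),
})
df["weight_change"] = (
    -0.0001 * df["steps"] - 0.2 * df["food_score"] - 0.1 * df["sleep_hours"]
    + rng.normal(0, 1, len(df))
)

# Random intercept per user captures between-person differences (the "multilevel" part)
model = smf.mixedlm("weight_change ~ steps + food_score + sleep_hours",
                    data=df, groups=df["user_id"]).fit()
print(model.summary())
```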

Generally better not to recommend specific foods, unless it is something they are already eating. Build a relationship with your wearables so they can know you.

More health than fitness, more prevention than disease

Q: publicly announce intention? A: helps short-term; not good long-term b/c many people regain (shame)

Q: social network effect? A: yes, you can duel your friends in their app

  11. Sandy Ryza - Clover

Clinical operations: take your meds, see doctors, get tests. Insurance operations: approve claims, authorizations, catch fraud.

Mostly insure seniors with chronic illness, want to minimize trips to the hospital

Strategies: fear, target worst members, polite reminders, segment (rich vs. poor) messaging

DS are ogres: "walk around in the dark and hit stuff with blunt instruments"

Nurse Practitioners do home visits, help close gaps in care

1/3 of all clients have diabetes

Use historical data --> 1 month project

Predictions: who will have complications in the next 6 months

14 features, including age, hypertension, HbA1c (glycated hemoglobin**), "takes insulin"

Only 8% have complications in a 6 month interval

Accuracy high, precision/recall difficult

Lots of missing data, e.g. HbA1c missing in 40% of cases

Unknown selection bias (could be for good or bad reasons). Imputation didn't work. Hard-coded to an irrelevant value: bad for linear models, fine for decision trees/RFs; ended up with GBTs.

Scikit-learn object --> A/B test: call randomly vs. call people based on the model

AUROC 0.8 (0.5 is no better than chance). Precision 25%. Recall 66%.
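
(Sketch of the approach as described: sentinel value for missing HbA1c, gradient-boosted trees, then AUROC/precision/recall, on synthetic data with made-up feature names, not Clover's code.)

```python
# Missing HbA1c hard-coded to a sentinel value (fine for trees, bad for linear models),
# gradient-boosted trees, then AUROC / precision / recall. All data is synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "age": rng.integers(65, 95, n),
    "hba1c": np.where(rng.random(n) < 0.4, np.nan, rng.normal(7.5, 1.5, n)),  # ~40% missing
    "hypertension": rng.integers(0, 2, n),
    "takes_insulin": rng.integers(0, 2, n),
})
y = (rng.random(n) < 0.08).astype(int)          # ~8% have complications in 6 months

df["hba1c"] = df["hba1c"].fillna(-999)          # sentinel value instead of imputation

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=0)
clf = GradientBoostingClassifier().fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]
preds = clf.predict(X_test)
print("AUROC:", roc_auc_score(y_test, probs))
print("precision:", precision_score(y_test, preds, zero_division=0))
print("recall:", recall_score(y_test, preds, zero_division=0))
```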

Evaluated by chi-squared test

Q: did this include any unstructured data? A: no.

Q: why 6 months? A: operational reasons.

  12. Abe Gong - Aspire Health

@abegong, #ethicalalgorithms

Inspired by Clare Corthell's talk last year

Algorithms as gatekeeping for opportunities. E.g. house hunting: credit score. Admissions to universities: SAT.

Data science version 1: internet advertising! Version 2: insurance & IoT: can't opt out

Parole: the proprietary COMPAS algorithm predicts recidivism from 127 questions. Not supposed to be used in convictions/sentencing, but is used for assignment to high/low security prison.

ProPublica: 18k records --> all on GitHub

Some of the questions are beyond the control of the person (even if they're predictive). (A later question from Sanny pointed out that propensity scoring could help control for that.)
