Netflix PRS conference
Started in 2016; past talks are online
"everything is a recommendation"
80% of what people watch on Netflix comes from recs
# Mounia Lalmas - Dir Research at Spotify (based in London)
Home: help users find content quickly
*nice slide w/ overall view of research -> measurement -> modeling -> optimization -> business
1. success metrics
*BaRT - McInerney et al. 2018*
- bandits
- find best card per shelf, and then rank shelves
success = streaming time binarized with a threshold of 30 seconds per playlist (this seems weird to me, 30 seconds per song would make sense?)
exceptions, e.g. sleep playlist success threshold is longer
jazz listeners listen longer than other listeners
reward functions:
- one global function
- one per user x playlist
- groups of users x playlists
Used *Dhillon et al co-clustering KDD 2003*
Histograms + thresholds
Found that mean worked the best for the threshold (vs. additive or cumulative, which seem like straw men comparisons to me)
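(My rough sketch of how the binarized streaming reward with per-playlist thresholds might look; the playlist names and the 600-second sleep cutoff are my own placeholders, not numbers from the talk:)

```python
import numpy as np

# Hypothetical per-playlist thresholds in seconds; "sleep" gets a longer cutoff,
# per the note above about exceptions.
DEFAULT_THRESHOLD = 30.0
PLAYLIST_THRESHOLDS = {"sleep": 600.0}

def binarized_reward(streaming_seconds: float, playlist: str) -> int:
    """Return 1 if the stream counts as a success, else 0."""
    threshold = PLAYLIST_THRESHOLDS.get(playlist, DEFAULT_THRESHOLD)
    return int(streaming_seconds >= threshold)

def mean_threshold(streaming_times: np.ndarray) -> float:
    """Per-group threshold taken as the mean of the streaming-time histogram,
    which is what they reported working best."""
    return float(np.mean(streaming_times))
```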
Affinity features (content x user) are better than generic ones (age or day)
*Dragone et al. WWW 2019*
2. intent
What is the user looking for? i.e. passive listening vs. actively engaging
*Mehrotra et al. WWW 2019*
examples: search for a particular thing vs. discovery by mood/activity vs. music to have on in the background
(the chart for this looked kind of meaningless to me?)
multi-level model + intent improved user satisfaction ratings prediction over global
shared learning across intents
most useful metrics:
- time to success + dwell time
- save or download
3. diversity of content
*Mehrotra et al. CIKM 2018*
- relevance (user + tracks)
- satisfaction (stream > 30 seconds)
- diversity (range of popularity from Drake to... not Drake)
Of course, they found that high relevance meant less diversity (very few playlists have both)
Tradeoff of beta = 0.7 with a 10% drop in satisfaction
vs. relevance max of beta = 1, or diversity max with beta = 0 and satisfaction drop of 32%
"personalized diversity" - satisfaction up 12%
(some users are of course more diverse in their interests)
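(A quick sketch of the relevance/diversity tradeoff as I understood it: a single mixing weight beta blends the two scores, with beta = 1 relevance-only and beta = 0 diversity-only. The exact objective in the paper may differ; the candidate scores below are made up.)

```python
import numpy as np

def blended_objective(relevance: np.ndarray, diversity: np.ndarray, beta: float) -> np.ndarray:
    """Score candidates with a convex mix of relevance and diversity.
    The talk quoted beta = 0.7 as the chosen operating point."""
    return beta * relevance + (1.0 - beta) * diversity

# Example: re-rank 5 candidate playlists under the quoted operating point.
relevance = np.array([0.9, 0.8, 0.6, 0.4, 0.2])
diversity = np.array([0.1, 0.3, 0.7, 0.8, 0.9])
order = np.argsort(-blended_objective(relevance, diversity, beta=0.7))
print(order)
```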
Someone asked about how to define intent; she said TBD, but mostly by clustering behavior
and looking at things like time of day
----
# David Hubbard and Benoit Rostykus - Netflix
Long-term outcomes
short: click (popularity bias), view, like
medium: dwell time, quality plays
long: satisfaction, subscription renewal, etc.
Ranking: popular, not relevant
Messaging: short term metrics can lead to user fatigue
Want to model satisfaction over time, for example, renewal as a Bernoulli model over months
Bayesian approach, beta-logistic/geometric, *Heckman & Willis 1977*.
Features used: country, tenure, devices, streaming, behavior, payments
predicting churn, basically. *Vaupel & Yashin 1985*
Effects of selection on population bias, why means are bad
*Fader et al. 2018* predicting retention
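(My rough sketch of the shifted-beta-geometric flavor of retention model, Fader & Hardie style, where each subscriber's churn probability is Beta-distributed. The actual Netflix model conditions the a, b parameters on the listed features via the beta-logistic, which isn't shown here; the a, b values below are arbitrary.)

```python
import numpy as np
from scipy.special import betaln

def survival(t: int, a: float, b: float) -> float:
    """P(still subscribed after t renewal periods) when churn prob ~ Beta(a, b):
    S(t) = B(a, b + t) / B(a, b)."""
    return float(np.exp(betaln(a, b + t) - betaln(a, b)))

# Example retention curve over 12 months for one (a, b) setting.
a, b = 1.0, 4.0
curve = [survival(t, a, b) for t in range(1, 13)]
print([round(s, 3) for s in curve])
```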
Take home points: better to have short-term initial disappointment if it leads to better long-term outcomes,
vs. initial pleasure followed by lasting disappointment
Used a Criteo online conversions dataset of 1M rows and 100k columns (?!)
beta-logit was approximately as good as exponential, and <5% better than plain logit
common approach is logit + Laplace approximation, but that's not very scalable
Concluded that beta posterior was better than a gaussian posterior, and more scalable
Netflix dataset they used was 10M rows x 500 columns, used 3M for test set and a lightGBM for training
*paper available on arxiv*
someone asked re: counterfactual bandit approach (sort of getting at feedback loops, though they didn't say it that way)
----
# Jason Gauci - applied reinforcement learning, Facebook
- Evangelize decision-making
Has been training large NNs since back when you couldn't get an NN talk into NIPS
Tech Lead Manager on Horizon https://github.com/facebookresearch/Horizon
Programming Throwdown podcast
Eternal Terminal replacement for ssh at mistertea.github.com
1. retrieval matrix factorization, DNN
2. event prediction, DNN, GBDT, etc.
3. Ranking - bandits, RL
4. DS - a/b tests
1 & 3 are control
2 is signal processing
4 is causal analysis
Classification:
- what will happen, trained on ground truth, evaluated re: accuracy, assume data are correct
Decision Making:
- how can we improve, trained from another policy (usually a worse one),
evaluated via counterfactual evaluation, assume data are flawed
- Action features
- context - device type
- session features
- event predictions
Greedy State Recs:
- value function: utility to stakeholders
- control function (maximum predictions)
- transition function - penalty to create.
"Data science Descent"
- loop: design metrics, create predictions, analyze (better: automated a/b tests)
Historical: Google had giant tables of click-through rates by categories. Humans were building decision trees by hand.
https://becominghuman.ai/the-very-basis-of-reinforcement-learning-(uuid)
Markov Decision Process:
- state: user/post/session
- action: which post to show (decide)
- reward: R(S,A)
- Transition: T(S,A) -> S'
map state-action pairs to future state
- Value: discounted reward
Can't just regress
Credit assignment problem
State action reward state action: SARSA is recursive
idea borrowed from Dynamic Programming
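(A tiny tabular SARSA update as a reminder of the recursion; the learning rate and discount values are mine, not from the talk:)

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> estimated discounted return
alpha, gamma = 0.1, 0.99

def sarsa_update(s, a, r, s_next, a_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```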
Have the model pick the best action instead: policy gradient
Synchronous SGD, spark and distributed pytorch
CPE: counterfactual policy evaluator
- more useful slides but they went by too fast, see the online video for details
----
# Olya Gurevich - Marvelous AI
- Detecting political narratives with HITL NLP
23% of adults admitted to sharing fake news (on purpose or by accident)
- hard: no large labeled datasets, often couched in a kernel of truth, user engagement can't be used as a success metric
Cofounders: Danielle came from Kixeye
Target audience: researchers, journalists, political candidates. Focusing on the 2020 election.
- discovering themes & narratives about candidates, NLP work on tweets, measuring spread, clustering content
Train GloVe embeddings
Joe Biden 'creep' and Amy Klobuchar 'salad' and 'comb', see *Demszky et al. NAACL 2019*
Hierarchical clustering
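(A minimal sketch of the kind of pipeline described: one vector per tweet, e.g. averaged GloVe word vectors, then agglomerative clustering. The random embeddings, linkage method, and distance cutoff are placeholders, not their actual setup.)

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder: one embedding per tweet, e.g. the mean of its GloVe word vectors.
tweet_embeddings = np.random.rand(200, 100)

# Agglomerative clustering on cosine distance; cut the tree into narrative clusters.
Z = linkage(tweet_embeddings, method="average", metric="cosine")
labels = fcluster(Z, t=0.6, criterion="distance")
print(len(set(labels)), "clusters")
```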
Media Bias Fact Check (MBFC) ratings of news sites
*Benkler et al. Network Propaganda*
Left-wing is self-policing, right-wing is not
Female candidates are getting more fake news attacks, Elizabeth Warren gets the most
Suggestions:
1. Engagement metrics have to evolve.
2. Beware echo chambers and radicalization spirals.
3. Actively measure bias.
4. Ideological divide is not symmetric.
-----
# Susan Athey - Stanford
- Counterfactual Inference for Consumer Choice with Many Products
* see her publications *
- Old way: 1 product at a time, not scalable and misses things, e.g. store vs. store competition, bundling related products
- unobserved latent product characteristics
- Build a structural model of the customer, with preferences that generalize, i.e. quality, stockpiling
- * her slides seems useful, with many references, but she went really fast, see the video for details*
- Used loyalty card data set over 18 months, prices change every Tuesday night
- Product hierarchy: UPC, subclass, class, category, group, dept, section
- throw out seasonal things
- 28 features re: user demographics
- Consider categories where user is probably only going to buy 1
- users with > 20 trips on Tuesday or Wednesday with > 10 items per trip, top 235 categories
- example UPC price series
Hierarchical Poisson Factorization Model (HPF): log(user pref • product attributes) = mean utility
Assumes items are independent, but pricing changes in one brand actually affect purchases in others.
Add a penalty for price increase.
Nested logit to deal with people who don't buy anything: 1) purchase or not, 2) value of purchase in category
Variational Bayes
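(A toy sketch of just the Poisson factorization piece: user preferences and product attributes are non-negative, Gamma-distributed in HPF, and their dot product is the Poisson rate for purchase counts. The nested-logit and price-penalty parts aren't shown, and the sizes and Gamma shape are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_products, k = 100, 50, 8

# Gamma priors keep both factor matrices non-negative, as in HPF.
user_pref = rng.gamma(shape=0.3, scale=1.0, size=(n_users, k))
product_attr = rng.gamma(shape=0.3, scale=1.0, size=(n_products, k))

rate = user_pref @ product_attr.T   # Poisson rate per (user, product)
mean_utility = np.log(rate)         # "log(user pref . product attributes)"
counts = rng.poisson(rate)          # simulated purchase counts
print(counts.shape, counts.mean())
```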
popular products are purchased at least 2.5 times per day on average
- what happens when something is out of stock
- what happens when another product in the same category had a price change, or other subclasses "cross-price elasticity"
- determine a user's price sensitivity for different products
(how to account for skew? and time effects?)
- gains from targeted discounts
- similarity/exchangeability/complementarity
coffee & diapers are often co-purchased, vs. if hot dog prices go up, hot dog bun purchases go down
- how are purchases re-distributed
- placebo controls and normalization to check for overall effects, like a product becoming generally more popular
----
# Mihajlo Grbovic - ML-powered search at Airbnb for Experiences
6M listings in 191 countries
Experiences: activities led by local hosts
team is 10 people, all men
click data
experience features: price, reviews, ratings, duration, max guests, category
GBDT
50k training samples
partial dependency plots
forest score delta
personalization: a mix of historical & in-session clicks
rank also by availability & dates, type of trip (business vs. family)
category intensity: # of clicks total
category recency: # days since last clicked
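(A sketch of how those two personalization features could be derived from a click log; the column names and toy data are hypothetical, not Airbnb's schema.)

```python
import pandas as pd

# Hypothetical click log: one row per click on an experience.
clicks = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "category":  ["food", "food", "classes", "food", "surfing"],
    "timestamp": pd.to_datetime(
        ["2019-05-01", "2019-05-03", "2019-05-10", "2019-04-20", "2019-05-11"]),
})
now = pd.Timestamp("2019-05-15")

features = clicks.groupby(["user_id", "category"]).agg(
    category_intensity=("timestamp", "size"),                        # total clicks
    category_recency=("timestamp", lambda t: (now - t.max()).days),  # days since last click
).reset_index()
print(features)
```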
Recently booked ranks lower by default, because they're assuming people won't pick the same thing 2x in a row (he didn't show any data for this)
create hobby profile of the user
origin-destination pairs, e.g. Japan: classes, USA: food
add language match using browser language
quality - ratings, and phrases in reviews
started by training on clicks/bookings --> using bookings per impression, performance got 3% better
impression discounting - adjust when something is ranked high but never clicked
position bias TBD
instance-level features, e.g. weekend vs. weekday
----
# Minmin Chen - YouTube
reinforcement learning, joint work with Covington et al.
limitations of supervised learning
1. myopic: pigeon-holes users, short-term is prioritized over long-term
2. system bias: missing feedback on items that were never recommended "rich get richer"
Goals:
1. better understand latent user info
2. be able to quickly adapt
3. discover new user interests
Plan a sequence of actions to maximize long-term reward
Improving the candidate generator's long-tail coverage; noisy and sparse feedback for users x items (see notes from Strata youtube talk)
Maximize cumulative reward
Markov decision process (MDP)
Maximize reward by gradient ascent, but user trajectories are generated by different policies
off-policy learning: use batched feedback from a different policy to help identify bias and remove it with an inverse propensity weighting
see Achiam, Joshua et al. 2017 arXiv paper
Have multiple agents seeing videos, but have no access to how those work, so they have to run models to learn about their own platform in lieu of using logs (??)
Top-K: sum of rewards for individual items; they add an off-policy correction
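(My sketch of the corrected REINFORCE weight as described: an importance ratio between the learned policy and the behavior policy, times the top-K multiplier K(1 - pi)^(K-1) from the Chen et al. paper. The probabilities and K below are made up.)

```python
import numpy as np

def topk_off_policy_weight(pi_a: np.ndarray, beta_a: np.ndarray, k: int) -> np.ndarray:
    """Per-example weight for the REINFORCE gradient:
    importance ratio (pi / beta) times the top-K correction K * (1 - pi)^(K - 1)."""
    importance = pi_a / np.clip(beta_a, 1e-8, None)   # inverse propensity part
    topk_multiplier = k * (1.0 - pi_a) ** (k - 1)
    return importance * topk_multiplier

# Example: probability of the logged action under learned vs. behavior policy.
print(topk_off_policy_weight(np.array([0.05, 0.2]), np.array([0.1, 0.1]), k=16))
```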
actually saw engagement go up 20%
Boltzmann exploration - sample according to learned policy, and there's a temperature term to adjust the exploration rate
entropy regularization, penalize KL divergence between uniform and learned policy
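(Quick sketch of Boltzmann, i.e. softmax-with-temperature, sampling over policy scores; the temperature value here is arbitrary:)

```python
import numpy as np

def boltzmann_sample(scores: np.ndarray, temperature: float, rng=np.random.default_rng()):
    """Sample an item index with probability proportional to exp(score / T);
    higher T flattens the distribution and explores more."""
    logits = scores / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(scores), p=probs)

print(boltzmann_sample(np.array([2.0, 1.0, 0.1]), temperature=0.5))
```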
REINFORCE
hard to connect a single user choice to long-term rec behavior and long-term user behavior/value
understanding user intent TBD
----
# Jure Leskovec - chief scientist at Pinterest
pins: bookmark + photo + map location
large, human-curated graph
crowd-sourcing for curating clusters
can be radically personalized
every pin and board has a description
huge dataset of 3-4B pins and boards; a few hundred billion connections
how people describe things
need graph to update in real-time without retraining
featurizing the graph structure is hard
deep-learning tools know how to use fixed-size grids, and sequences
graphs have no spatial locality or reference point (no top-left like a spatial image)
see Graph CNNs for web-scale recommender systems, KDD 2018
nodes aggregate info from their neighbors using NNs
pinSAGE - embeddings for nodes, borrows info from nearby nodes in the network
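(A minimal sketch of the GraphSAGE-style aggregation step PinSage builds on: mean-aggregate neighbor embeddings, combine with the node's own embedding through learned weights, apply a nonlinearity. Single layer, mean aggregator, and shapes here are just for illustration.)

```python
import numpy as np

def aggregate(node_emb: np.ndarray, neighbor_embs: np.ndarray,
              W_self: np.ndarray, W_neigh: np.ndarray) -> np.ndarray:
    """One layer: combine a node's own embedding with the mean of its
    neighbors' embeddings, then apply a ReLU."""
    neigh_mean = neighbor_embs.mean(axis=0)
    h = W_self @ node_emb + W_neigh @ neigh_mean
    return np.maximum(h, 0.0)

d_in, d_out = 64, 32
rng = np.random.default_rng(0)
h_new = aggregate(rng.normal(size=d_in), rng.normal(size=(5, d_in)),
                  rng.normal(size=(d_out, d_in)), rng.normal(size=(d_out, d_in)))
print(h_new.shape)
```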
curriculum learning - use increasingly harder negative controls (closer, but still not related)
vs. very easy negatives (very obviously not related). someone asked where do they get these? answer:
from other rec systems, and from lower down in the rankings
sub-sample neighborhoods for efficient GPU batching
producer-consumer CPU-GPU training pipeline
trying to predict what pin they'll save next
much better than visual-only or annotation-only (partly because their visual object identification doesn't work that well)
someone also asked what about cycles in the graph, not reinforcing each other?
answer: BFS and otherwise it doesn't matter if the same node reappears, it can actually be very informative
----
# Selen Ugoruglu - Netflix show similarity
Siamese networks with contrastive loss
weights are shared between them during training
can use a hinge loss for dissimilar items
Triplet loss - computationally expensive
(-) other class -- anchor -- (+) same class
minimize anchor - (+) distance after learning
triplet choices are important, want to choose semi-hard as (-) to optimize convergence
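(Short sketch of the triplet loss and the semi-hard criterion mentioned; the margin value is arbitrary:)

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def is_semi_hard(anchor, positive, negative, margin=0.2):
    """Semi-hard negative: farther than the positive, but still inside the margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return d_pos < d_neg < d_pos + margin
```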
Lifted Structured Loss - more relationships among all the training samples
metadata they use: genre, expert tags, cast, title, images, script, synopsis, knowledge graph (relationships between titles)