Netflix PRS conference
Started in 2016; past talks are online
"everything is a recommendation"
80% of what people watch on Netflix comes from recs
# Mounia Lalmas - Dir Research at Spotify (based in London)
Home: help users find content quickly
*nice slide w/ overall view of research -> measurement -> modeling -> optimization -> business
1. success metrics
*BaRT - McInerney et al. 2018*
- bandits
- find best card per shelf, and then rank shelves
success = streaming time binarized with a threshold of 30 seconds per playlist (this seems weird to me, 30 seconds per song would make sense?)
exceptions, e.g. sleep playlist success threshold is longer
jazz listeners listen longer than other listeners
reward functions:
- one global function
- one per user x playlist
- groups of users x playlists
Used *Dhillon et al co-clustering KDD 2003*
Histograms + thresholds
Found that mean worked the best for the threshold (vs. additive or cumulative, which seem like straw men comparisons to me)
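(My rough sketch of how the binarized streaming reward with per-playlist thresholds might look; the playlist names and the 600-second sleep cutoff are my own placeholders, not numbers from the talk:)

```python
import numpy as np

# Hypothetical per-playlist thresholds in seconds; "sleep" gets a longer cutoff,
# per the note above about exceptions.
DEFAULT_THRESHOLD = 30.0
PLAYLIST_THRESHOLDS = {"sleep": 600.0}

def binarized_reward(streaming_seconds: float, playlist: str) -> int:
    """Return 1 if the stream counts as a success, else 0."""
    threshold = PLAYLIST_THRESHOLDS.get(playlist, DEFAULT_THRESHOLD)
    return int(streaming_seconds >= threshold)

def mean_threshold(streaming_times: np.ndarray) -> float:
    """Per-group threshold taken as the mean of the streaming-time histogram,
    which is what they reported working best."""
    return float(np.mean(streaming_times))
```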
Affinity features (content x user) are better than generic ones (age or day)
*Dragone et al. WWW 2019*
2. intent
What is the user looking for? i.e. passive listening vs. actively engaging
*Mehrotra et al. WWW 2019*
examples: search for a particular thing vs. discovery by mood/activity vs. music to have on in the background
(the chart for this looked kind of meaningless to me?)
multi-level model + intent improved user satisfaction ratings prediction over global
shared learning across intents
most useful metrics:
- time to success + dwell time
- save or download
3. diversity of content
*Mehrotra et al. CIKM 2018*
- relevance (user + tracks)
- satisfaction (stream > 30 seconds)
- diversity (range of popularity from Drake to... not Drake)
Of course, they found that high relevance meant less diversity (very few playlists have both)
Tradeoff of beta = 0.7 with a 10% drop in satisfaction
vs. relevance max of beta = 1, or diversity max with beta = 0 and satisfaction drop of 32%
"personalized diversity" - satisfaction up 12%
(some users are of course more diverse in their interests)
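(A quick sketch of the relevance/diversity tradeoff as I understood it: a single mixing weight beta blends the two scores, with beta = 1 relevance-only and beta = 0 diversity-only. The exact objective in the paper may differ; the candidate scores below are made up.)

```python
import numpy as np

def blended_objective(relevance: np.ndarray, diversity: np.ndarray, beta: float) -> np.ndarray:
    """Score candidates with a convex mix of relevance and diversity.
    The talk quoted beta = 0.7 as the chosen operating point."""
    return beta * relevance + (1.0 - beta) * diversity

# Example: re-rank 5 candidate playlists under the quoted operating point.
relevance = np.array([0.9, 0.8, 0.6, 0.4, 0.2])
diversity = np.array([0.1, 0.3, 0.7, 0.8, 0.9])
order = np.argsort(-blended_objective(relevance, diversity, beta=0.7))
print(order)
```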
Someone asked about how to define intent; she said TBD, but mostly by clustering behavior
and looking at things like time of day
----
# David Hubbard and Benoit Rostykus - Netflix
Long-term outcomes
short: click (popularity bias), view, like
medium: dwell time, quality plays
long: satisfaction, subscription renewal, etc.
Ranking: popular, not relevant
Messaging: short term metrics can lead to user fatigue
Want to model satisfaction over time, for example, renewal as a Bernoulli model over months
Bayesian approach, beta-logistic/geometric, *Heckman & Willis 1977*.
Features used: country, tenure, devices, streaming, behavior, payments
predicting churn, basically. *Vaupel & Yashin 1985*
Effects of selection on population bias, why means are bad
*Fader et al. 2018* predicting retention
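(My rough sketch of the shifted-beta-geometric flavor of retention model, Fader & Hardie style, where each subscriber's churn probability is Beta-distributed. The actual Netflix model conditions the a, b parameters on the listed features via the beta-logistic, which isn't shown here; the a, b values below are arbitrary.)

```python
import numpy as np
from scipy.special import betaln

def survival(t: int, a: float, b: float) -> float:
    """P(still subscribed after t renewal periods) when churn prob ~ Beta(a, b):
    S(t) = B(a, b + t) / B(a, b)."""
    return float(np.exp(betaln(a, b + t) - betaln(a, b)))

# Example retention curve over 12 months for one (a, b) setting.
a, b = 1.0, 4.0
curve = [survival(t, a, b) for t in range(1, 13)]
print([round(s, 3) for s in curve])
```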
Take home points: better to have short-term initial disappointment if it leads to better long-term outcomes,
vs. initial pleasure followed by lasting disappointment
Used a Criteo online conversions dataset of 1M rows and 100k columns (?!)
beta-logit was approximately as good as exponential, and <5% better than plain logit
common approach is logit + Laplace approximation, but that's not very scalable
Concluded that beta posterior was better than a gaussian posterior, and more scalable
Netflix dataset they used was 10M rows x 500 columns, used 3M for test set and a lightGBM for training
*paper available on arxiv*
someone asked re: counterfactual bandit approach (sort of getting at feedback loops, though they didn't say it that way)
----
# Jason Gauci - applied reinforcement learning, Facebook
- Evangelize decision-making
Has been training large NNs since back when you couldn't get an NN talk into NIPS
Tech Lead Manager on Horizon https://github.com/facebookresearch/Horizon
Programming Throwdown podcast
Eternal Terminal replacement for ssh at mistertea.github.com
1. retrieval matrix factorization, DNN
2. event prediction, DNN, GBDT, etc.
3. Ranking - bandits, RL
4. DS - a/b tests
1 & 3 are control
2 is signal processing
4 is causal analysis
Classification:
- what will happen, trained on ground truth, evaluated re: accuracy, assume data are correct
Decision Making:
- how can we improve, trained from another policy (usually a worse one),
evaluated via counterfactual evaluation, assume data are flawed
- Action features
- context - device type
- session features
- event predictions
Greedy State Recs:
- value function: utility to stakeholders
- control function (maximum predictions)
- transition function - penalty to create.
"Data science Descent"
- loop: design metrics, create predictions, analyze (better: automated a/b tests)
Historical: Google had giant tables of click-through rates by categories. Humans were building decision trees by hand.
https://becominghuman.ai/the-very-basis-of-reinforcement-learning-(uuid)
Markov Decision Process:
- state: user/post/session
- action: which post to show (decide)
- reward: R(S,A)
- Transition: T(S,A) -> S'
map state-action pairs to future state
- Value: discounted reward
Can't just regress
Credit assignment problem
State action reward state action: SARSA is recursive
idea borrowed from Dynamic Programming
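(A tiny tabular SARSA update as a reminder of the recursion; the learning rate and discount values are mine, not from the talk:)

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> estimated discounted return
alpha, gamma = 0.1, 0.99

def sarsa_update(s, a, r, s_next, a_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```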
Have the model pick the best action instead: policy gradient
Synchronous SGD, spark and distributed pytorch
CPE: counterfactual policy evaluator
- more useful slides but they went by too fast, see the online video for details
----
# Olya Gurevich - Marvelous AI
- Detecting political narratives with HITL NLP
23% of adults admitted to sharing fake news (on purpose or by accident)
- hard: no large labeled datasets, often couched in a kernel of truth, user engagement can't be used as a success metric
Cofounders: Danielle came from Kixeye
Target audience: researchers, journalists, political candidates. Focusing on the 2020 election.
- discovering themes & narratives about candidates, NLP work on tweets, measuring spread, clustering content
Train GloVe embeddings
Joe Biden 'creep' and Amy Klobuchar 'salad' and 'comb', see *Demszky et al. NAACL 2019*
Hierarchical clustering
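(A minimal sketch of the kind of pipeline described: one vector per tweet, e.g. averaged GloVe word vectors, then agglomerative clustering. The random embeddings, linkage method, and distance cutoff are placeholders, not their actual setup.)

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder: one embedding per tweet, e.g. the mean of its GloVe word vectors.
tweet_embeddings = np.random.rand(200, 100)

# Agglomerative clustering on cosine distance; cut the tree into narrative clusters.
Z = linkage(tweet_embeddings, method="average", metric="cosine")
labels = fcluster(Z, t=0.6, criterion="distance")
print(len(set(labels)), "clusters")
```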
Media Bias Fact Check (MBFC) ratings of news sites
*Benkler et al. Network Propaganda*
Left-wing is self-policing, right-wing is not
Female candidates are getting more fake news attacks, Elizabeth Warren gets the most
Suggestions:
1. Engagement metrics have to evolve.
2. Beware echo chambers and radicalization spirals.
3. Actively measure bias.
4. Ideological divide is not symmetric.
-----
# Susan Athey - Stanford
- Counterfactual Inference for Consumer Choice with Many Products
* see her publications *
- Old way: 1 product at a time, not scalable and misses things, e.g. store vs. store competition, bundling related products
- unobserved latent product characteristics
- Build a structural model of the customer, with preferences that generalize, i.e. quality, stockpiling
- * her slides seems useful, with many references, but she went really fast, see the video for details*
- Used loyalty card data set over 18 months, prices change every Tuesday night
- Product hierarchy: UPC, subclass, class, category, group, dept, section
- throw out seasonal things
- 28 features re: user demographics
- Consider categories where user is probably only going to buy 1
- users with > 20 trips on Tuesday or Wednesday with > 10 items per trip, top 235 categories
- example UPC price series
Hierarchical Poisson Factorization Model (HPF): log(user pref • product attributes) = mean utility
Assumes items are independent, but pricing changes in one brand actually affect purchases in others.
Add a penalty for price increase.
Nested logit to deal with people who don't buy anything: 1) purchase or not, 2) value of purchase in category
Variational Bayes
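(A toy sketch of just the Poisson factorization piece: user preferences and product attributes are non-negative, Gamma-distributed in HPF, and their dot product is the Poisson rate for purchase counts. The nested-logit and price-penalty parts aren't shown, and the sizes and Gamma shape are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_products, k = 100, 50, 8

# Gamma priors keep both factor matrices non-negative, as in HPF.
user_pref = rng.gamma(shape=0.3, scale=1.0, size=(n_users, k))
product_attr = rng.gamma(shape=0.3, scale=1.0, size=(n_products, k))

rate = user_pref @ product_attr.T   # Poisson rate per (user, product)
mean_utility = np.log(rate)         # "log(user pref . product attributes)"
counts = rng.poisson(rate)          # simulated purchase counts
print(counts.shape, counts.mean())
```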
popular products are purchased at least 2.5 times per day on average
- what happens when something is out of stock
- what happens when another product in the same category had a price change, or other subclasses "cross-price elasticity"
- determine a user's price sensitivity for different products
(how to account for skew? and time effects?)
- gains from targeted discounts
- similarity/exchangeability/complementarity
coffee & diapers are often co-purchased, vs. if hot dog prices go up, hot dog bun purchases go down
- how are purchases re-distributed
- placebo controls and normalization to check for overall effects, like a product becoming generally more popular
----
# Mihajlo Grbovic - ML-powered search at Airbnb for Experiences
6M listings in 191 countries
Experiences: activities led by local hosts
team is 10 people, all men
click data
experience features: price, reviews, ratings, duration, max guests, category
GBDT
50k training samples
partial dependency plots
forest score delta
personalization: a mix of historical & in-session clicks
rank also by availability & dates, type of trip (business vs. family)
category intensity: # of clicks total
category recency: # days since last clicked
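(A sketch of how those two personalization features could be derived from a click log; the column names and toy data are hypothetical, not Airbnb's schema.)

```python
import pandas as pd

# Hypothetical click log: one row per click on an experience.
clicks = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "category":  ["food", "food", "classes", "food", "surfing"],
    "timestamp": pd.to_datetime(
        ["2019-05-01", "2019-05-03", "2019-05-10", "2019-04-20", "2019-05-11"]),
})
now = pd.Timestamp("2019-05-15")

features = clicks.groupby(["user_id", "category"]).agg(
    category_intensity=("timestamp", "size"),                        # total clicks
    category_recency=("timestamp", lambda t: (now - t.max()).days),  # days since last click
).reset_index()
print(features)
```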
Recently booked ranks lower by default, because they're assuming people won't pick the same thing 2x in a row (he didn't show any data for this)
create hobby profile of the user
origin-destination pairs, e.g. Japan: classes, USA: food
add language match using browser language
quality - ratings, and phrases in reviews
started by training on clicks/bookings --> using bookings per impression, performance got 3% better
impression discounting - adjust when something is ranked high but never clicked
position bias TBD
instance-level features, e.g. weekend vs. weekday
----
# Minmin Chen - YouTube
reinforcement learning, joint work with Covington et al.
limitations of supervised learning
1. myopic: pigeon-holes users, short-term is prioritized over long-term
2. system bias: missing feedback on items that were never recommended "rich get richer"
Goals:
1. better understand latent user info
2. be able to quickly adapt
3. discover new user interests
Plan a sequence of actions to maximize long-term reward
Improving the candidate generator's long-tail coverage; noisy and sparse feedback for users x items (see notes from Strata youtube talk)
Maximize cumulative reward
Markov decision process (MDP)
Maximize reward by gradient ascent, but user trajectories are generated by different policies
off-policy learning: use batched feedback from a different policy to help identify bias and remove it with an inverse propensity weighting
see Achiam, Joshua et al. 2017 arXiv paper
Have multiple agents seeing videos, but have no access to how those work, so they have to run models to learn about their own platform in lieu of using logs (??)
Top-K: sum of rewards for individual items; they add an off-policy correction
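(My sketch of the corrected REINFORCE weight as described: an importance ratio between the learned policy and the behavior policy, times the top-K multiplier K(1 - pi)^(K-1) from the Chen et al. paper. The probabilities and K below are made up.)

```python
import numpy as np

def topk_off_policy_weight(pi_a: np.ndarray, beta_a: np.ndarray, k: int) -> np.ndarray:
    """Per-example weight for the REINFORCE gradient:
    importance ratio (pi / beta) times the top-K correction K * (1 - pi)^(K - 1)."""
    importance = pi_a / np.clip(beta_a, 1e-8, None)   # inverse propensity part
    topk_multiplier = k * (1.0 - pi_a) ** (k - 1)
    return importance * topk_multiplier

# Example: probability of the logged action under learned vs. behavior policy.
print(topk_off_policy_weight(np.array([0.05, 0.2]), np.array([0.1, 0.1]), k=16))
```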
actually saw engagement go up 20%
Boltzmann exploration - sample according to learned policy, and there's a temperature term to adjust the exploration rate
entropy regularization, penalize KL divergence between uniform and learned policy
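(Quick sketch of Boltzmann, i.e. softmax-with-temperature, sampling over policy scores; the temperature value here is arbitrary:)

```python
import numpy as np

def boltzmann_sample(scores: np.ndarray, temperature: float, rng=np.random.default_rng()):
    """Sample an item index with probability proportional to exp(score / T);
    higher T flattens the distribution and explores more."""
    logits = scores / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(scores), p=probs)

print(boltzmann_sample(np.array([2.0, 1.0, 0.1]), temperature=0.5))
```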
REINFORCE
hard to connect a single user choice to long-term rec behavior and long-term user behavior/value
understanding user intent TBD
----
# Jure Leskovec - chief scientist at Pinterest
pins: bookmark + photo + map location
large, human-curated graph
crowd-sourcing for curating clusters
can be radically personalized
every pin and board has a description
huge dataset of 3-4B pins and boards; a few hundred billion connections
how people describe things
need graph to update in real-time without retraining
featurizing the graph structure is hard
deep-learning tools know how to use fixed-size grids, and sequences
graphs have no spatial locality or reference point (no top-left like a spatial image)
see Graph CNNs for web-scale recommender systems, KDD 2018
nodes aggregate info from their neighbors using NNs
pinSAGE - embeddings for nodes, borrows info from nearby nodes in the network
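(A minimal sketch of the GraphSAGE-style aggregation step PinSage builds on: mean-aggregate neighbor embeddings, combine with the node's own embedding through learned weights, apply a nonlinearity. Single layer, mean aggregator, and shapes here are just for illustration.)

```python
import numpy as np

def aggregate(node_emb: np.ndarray, neighbor_embs: np.ndarray,
              W_self: np.ndarray, W_neigh: np.ndarray) -> np.ndarray:
    """One layer: combine a node's own embedding with the mean of its
    neighbors' embeddings, then apply a ReLU."""
    neigh_mean = neighbor_embs.mean(axis=0)
    h = W_self @ node_emb + W_neigh @ neigh_mean
    return np.maximum(h, 0.0)

d_in, d_out = 64, 32
rng = np.random.default_rng(0)
h_new = aggregate(rng.normal(size=d_in), rng.normal(size=(5, d_in)),
                  rng.normal(size=(d_out, d_in)), rng.normal(size=(d_out, d_in)))
print(h_new.shape)
```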
curriculum learning - use increasingly harder negative controls (closer, but still not related)
vs. very easy negatives (very obviously not related). someone asked where do they get these? answer:
from other rec systems, and from lower down in the rankings
sub-sample neighborhoods for efficient GPU batching
producer-consumer CPU-GPU training pipeline
trying to predict what pin they'll save next
much better than visual-only or annotation-only (partly because their visual object identification doesn't work that well)
someone also asked what about cycles in the graph, not reinforcing each other?
answer: BFS and otherwise it doesn't matter if the same node reappears, it can actually be very informative
----
# Selen Ugoruglu - Netflix show similarity
Siamese networks with contrastive loss
weights are shared between them during training
can use a hinge loss for dissimilar items
Triplet loss - computationally expensive
(-) other class -- anchor -- (+) same class
minimize anchor - (+) distance after learning
triplet choices are important, want to choose semi-hard as (-) to optimize convergence
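(Short sketch of the triplet loss and the semi-hard criterion mentioned; the margin value is arbitrary:)

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def is_semi_hard(anchor, positive, negative, margin=0.2):
    """Semi-hard negative: farther than the positive, but still inside the margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return d_pos < d_neg < d_pos + margin
```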
Lifted Structured Loss - more relationships among all the training samples
metadata they use: genre, expert tags, cast, title, images, script, synopsis, knowledge graph (relationships between titles)