MIT 6.S191: Intro to Deep Learning notes
t is single timestep
y hat of t = function of input vector x at time t
y hat is the predicted output
temporal dependence
recurrence relationship
h is the internal memory state passed from timestep to timestep
y hat of t is a function of x_t and h_{t-1}, the temporal internal memory state
cyclic temporal dependence
RNN
cell state
function with weights
input
old state from last step
update hidden state
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t)
weight matrix * previous state + weight matrix * input
computational graph across time
weight matrix W_hh reused across time
weight matrix W_xh
weight matrix W_hy
total loss = sum over timesteps of the loss between y_t and y hat_t
tf.keras.layers.SimpleRNN()
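A minimal Keras sketch of this recurrence (the hidden size, sequence length and feature count are illustrative assumptions, not from the lecture):

```python
import tensorflow as tf

# Sketch: simple RNN over sequences of length 10 with 8 features per step.
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(64, input_shape=(10, 8)),  # h_t = tanh(W_hh·h_{t-1} + W_xh·x_t)
    tf.keras.layers.Dense(1),                            # y_hat = W_hy·h_t
])
model.compile(optimizer="adam", loss="mse")
```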
sentiment classification
text generation, image captioning
translation, forecasting, music generation
design criteria
1. handle variable-length sequences
2. track long-term dependencies
3. maintain information about order
4. share parameters across the sequence
predict the next word
embedding word2vec
* corpus of words
* indexing
* one hot embedding
learned embedding to capture semantics - semantically similar words end up closer in latent space
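A small sketch of the indexing-to-embedding step (the vocabulary size and embedding dimension are made-up illustrative values):

```python
import tensorflow as tf

# Sketch: map word indices from a hypothetical 10,000-word corpus into a learned
# 128-dimensional embedding, so similar words can end up close in latent space.
vocab_size, embed_dim = 10_000, 128
embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)

word_indices = tf.constant([[12, 407, 3]])   # one sentence encoded as word indices
vectors = embedding(word_indices)            # shape (1, 3, 128): one vector per word
```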
backpropagation through time
backpropagation through time for temporal unrolling
through each time step and back through each step
exploding gradients problem
gradient clipping to scale down big gradients
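Sketch of gradient clipping via the optimizer (the clip threshold is an arbitrary example):

```python
import tensorflow as tf

# Sketch: cap the gradient norm at 1.0 (arbitrary threshold) so exploding
# gradients during backpropagation through time are scaled back down.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
```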
vanishing gradients
activation function, weight initialization, network architecture
trick 1: ReLU keeps the derivative from shrinking (derivative is 1 for x > 0)
trick 2: initialize weights to the identity matrix, initialize biases to zero
trick 3: gated cells
use gates to selectively add or remove information
Long Short-Term Memory (LSTM) networks rely on gates
LSTM gates: forget, store, update, output
maintain a cell state
use gates to control the flow of information
forget irrelevant information
store relevant new information
update the cell state, output a filtered version
uninterrupted gradient flow
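Same sketch as the SimpleRNN above, but swapping in a gated LSTM cell (sizes are illustrative):

```python
import tensorflow as tf

# Sketch: the LSTM's forget/input/output gates maintain a separate cell state
# with uninterrupted gradient flow through time.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(10, 8)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```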
RNN applications and limitations:
* music generation
input: sheet music
output: the next character in the sheet music
* sentiment classification
Limitations of RNNs
* encoding bottleneck - sequential temporal data must be processed one step at a time
* slow, no parallelization
* no long-term memory
so attention is all we need
goal of sequence modelling
sequence of inputs , sequence of features, sequence of outputs
continuous stream, parallelization, long memory
attention is all you need
transformer architecture
self attention
query q and key k: how similar are they
extract values based on attention: return the values with the highest attention
identify and attend to the most important features in the input
1. encode position information
2. extract query, key, value for search
3. compute attention weighting
4. extract features with high attention
x = "he tossed the tennis ball to serve"
embedding: word2vec
position information: position-aware encoding
attention score
positional encoding
linear layer to query q
linear layer to get key k
linear layer to get value v
vectors q and k
take dot product and scale
cosine similarity
query · key^T
similarity metric
attention weighting
softmax turns the similarity scores into attention weights
words with higher weights get relatively more attention
extract the features
attention weighting * value = output
Attention(Q, K, V) = softmax(Q · K^T / scaling) · V
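The attention equation written out as a sketch of single-head self-attention (the head size d_k and the use of Dense layers for the Q/K/V projections are illustrative assumptions):

```python
import tensorflow as tf

def self_attention(x, d_k=64):
    """Sketch of single-head scaled dot-product self-attention.
    x has shape (batch, seq_len, d_model); d_k is an illustrative head size."""
    q = tf.keras.layers.Dense(d_k)(x)            # linear layer -> query
    k = tf.keras.layers.Dense(d_k)(x)            # linear layer -> key
    v = tf.keras.layers.Dense(d_k)(x)            # linear layer -> value
    scores = tf.matmul(q, k, transpose_b=True)   # Q · K^T similarity
    weights = tf.nn.softmax(scores / tf.sqrt(float(d_k)), axis=-1)  # attention weighting
    return tf.matmul(weights, v)                 # weighting * value = output
```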
LLM: BERT, GPT
biology: AlphaFold2
computer vision
convolution - apply filters to generate feature maps
non-linear activation - ReLU
pooling - downsampling on each feature map
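A minimal Keras sketch of convolution -> ReLU -> pooling (filter count, kernel size, input shape and class count are illustrative assumptions):

```python
import tensorflow as tf

# Sketch: 32 3x3 filters over a 28x28x1 image, ReLU activation, 2x2 max pooling,
# then a softmax classifier over 10 illustrative classes.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```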
unsupervised learning
data: x
x is data, no labels
goal: learn the hidden or underlying structure of the data
examples: clustering, feature or dimensionality reduction
# Generative modelling
unsupervised
only samples
1. density estimation
learn the underlying probability distribution the data came from
2. sample generation:
learn underlying prob model
generate new data in same probability distribution
P_model(x) ≈ P_data(x): as similar as possible to the underlying data distribution
learn the underlying feature set and decode it efficiently
training data can be biased - find which features are over- and under-represented in the data
outlier detection: identifying rare outliers / edge cases
e.g. 95% of the density is normal driving
outliers are edge cases: harsh weather, accidents
# Latent variable models
1. Autoencoders and variational autoencoders (VAEs)
2. Generative Adversarial Networks (GANs)
Latent variables: Plato's Republic, the myth of the cave - observers see only the shadows of objects; the shadows are their reality, but the objects themselves are not directly observable
can we learn the true explanatory variables?
## Autoencoders
unsupervised approach for learning a lower-dimensional feature representation from unlabelled training data
map X to low dimensional latent space Z
Encoder learns mapping from the data x to low dimensional latent space z
very efficient, compact feature representation of the data, so simple to train
need a way to decode to reconstruct the original data
using a CNN or FFN to produce x hat
decoder learns mapping back from latent space z to a reconstructed observation x hat
x => z => x hat
loss(x, x hat) = ||x - x hat||^2
where ||·||^2 is the squared Euclidean (L2) distance
loss function has no labels
autoencoding is a form of compression
the lower the dimensionality, the higher the reconstruction loss, but the more efficient and smaller the representation
bottleneck hidden layer forces the network to learn a compressed latent representation
reconstruction loss forces the latent representation to capture or encode as much information about the data as possible
autoencoding means self-encoding the data
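A minimal autoencoder sketch (the layer sizes and the 2-D bottleneck are illustrative choices, not from the lecture):

```python
import tensorflow as tf

# Sketch: encoder maps x to a low-dimensional latent z; decoder maps z back to x_hat.
# The 784-dim flattened input and the 2-dim bottleneck are illustrative.
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(2),                           # bottleneck latent z
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(784, activation="sigmoid"),   # reconstruction x_hat
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")       # ||x - x_hat||^2, no labels needed
```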
## Variational autoencoders (VAEs)
x => z => x hat
is deterministic (same reconstruction for the same weights)
VAEs add a random, probabilistic twist
random sampling operation
x => mu and sigma => z => x hat
sample from the mean and std deviation to compute the latent sample z
mu is mean vector
sigma is std deviation
encoder computes q_phi(z|x)
decoder computes p_theta(x|z)
q_phi of z latent sample given data sample of x
p_theta of x data sample given latent sample of z
encoder computes a probability distribution of latent variable given input data x
decoder learns the data probability distribution given the latent variable z
probabilistic, not deterministic
phi = encoder weights, theta = decoder weights
VAE loss:
loss function(phi, theta, x) = reconstruction loss + regularization term
function of the data and the weights
reconstruction loss = log likelihood, mean squared error
regularization term = D( q_phi(z|x) || p(z) )
q_phi(z|x) is infered latent distribution:
encoder probability distribution over the latent variable given input data x and weights phi
p(z) is a fixed prior on the latent distribution
D is the divergence used as the regularization term
it's the distance between the inferred latent distribution and the fixed prior
adopts probability distribution similar to the prior
enforce latent variable to be normal gaussian distribution
* smooth encoding
* penalize cheating
mean = 0 , variance of 1, std deviation of 1
KL divergence between the two distributions
D_KL = -1/2 * sum_{j=0}^{k-1} ( sigma_j + mu_j^2 - 1 - log(sigma_j) )
continuity: points that are close in latent space are similar after decoding
completeness: no missing data - sampling from the latent space gives meaningful content after decoding
reparameterization for backpropagation
redefine how the latent variable vector is sampled
z = mu + sigma • epsilon
mu and sigma are fixed vectors
sigma is scaled by random constants drawn from the prior distribution
the random constant is epsilon
all randomness is in epsilon so that mu and std deviation can be trained
epsilon is stochastic and the rest is deterministic
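A sketch of the reparameterized sampling and the KL regularization term, assuming the encoder outputs a mean and a log-variance (the standard closed form, not necessarily the lecture's exact notation):

```python
import tensorflow as tf

def sample_latent(mu, log_var):
    """Reparameterization trick: all randomness lives in epsilon, so mu and sigma
    stay deterministic and gradients can flow through them."""
    epsilon = tf.random.normal(tf.shape(mu))      # epsilon ~ N(0, 1)
    return mu + tf.exp(0.5 * log_var) * epsilon   # z = mu + sigma * epsilon

def kl_regularization(mu, log_var):
    """Closed-form D( q_phi(z|x) || p(z) ) against a standard normal prior,
    assuming the encoder outputs the mean and the log-variance."""
    return 0.5 * tf.reduce_sum(tf.exp(log_var) + tf.square(mu) - 1.0 - log_var, axis=-1)
```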
### Latent perturbation
slowly perturb / tune a single latent variable
each individual latent variable captures something meaningful, e.g. rotating the reconstruction
disentanglement
enforce diagonal prior on the latent variables to encourage independence
Beta VAE
scale / weighting constant beta on the regularization term
beta > 1 encourages disentanglement
e.g. head rotation (azimuth)
# Generative Adversarial Networks (GANs)
sample from complex distribution
solution: sample from noise and learn a transformation to the data distribution
generator - takes noise and produces fake data
discriminator - takes generated output and real data and makes a classification decision: real or fake
the generator tries to fool the discriminator
Intuition of GAN
train the discriminator so that P(real) = 1 for real data
train the generator to create fake data inside the real data distribution
noise: z
generator: g
discriminator: D -> y
x real true data
discriminator objective: arg max_D E_{z,x} [ log D(x) + log(1 - D(G(z))) ]
fake data: G(z)
D(G(z)): discriminator's estimate of the probability that fake data is real
D(x): discriminator's estimate of the probability that real data is real
1 - D(G(z)): estimate of the probability that fake data is fake
generator objective: arg min_G E_z [ log(1 - D(G(z))) ]
G minimises the probability that its generated data is detected as fake, i.e. tries to fool the discriminator
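A sketch of the two adversarial losses using binary cross-entropy, which is equivalent to the log-likelihood objective above (the generator and discriminator networks are assumed to be defined elsewhere):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def discriminator_loss(d_real, d_fake):
    """Discriminator wants D(x) -> 1 for real data and D(G(z)) -> 0 for fakes."""
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)

def generator_loss(d_fake):
    """Generator wants the discriminator to be fooled: D(G(z)) -> 1."""
    return bce(tf.ones_like(d_fake), d_fake)
```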
data distribution manifold
conditional GAN
conditioning factor
paired translation between data, e.g. pix2pix
CycleGAN learns transformations across domains with unpaired data
e.g. horse to zebra
distribution transformations
GANs: Gaussian noise z ~ N(0, 1) to the target data manifold
CycleGANs: data manifold X data manifold Y
e.g. transforming a waveform by operating on its spectrogram image
Diffusion Models
Reinforcement Learning
data: state and action pairs
goal: maximise future rewards over many time steps
agent:
takes actions
environment:
the world in which the agent exists
actions:
a_t
a move an agent can make in the environment
action space:
the set of possible actions an agent can make in the environment
observation:
the state of the environment after taking an action
the state changes to s_{t+1}
reward:
feedback that measures the success or failure of the agent's action
reward: r_t
gamma = discount (dampening) factor: future rewards are worth less
a smaller gamma encourages short-term greediness
Q function
R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
E = expected value
Q(s_t, a_t) = E[R_t | s_t, a_t]
E[R_t | s_t, a_t] = expected total future reward
q function captures the expected total future reward an agent in state s can receive by executing a certain action a
the agent needs a policy pi(s) to infer the best action to take at its state s
strategy: the policy should choose an action that maximises future reward
pi*(s) = argmax_a Q(s,a)
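A tiny sketch of this greedy policy, assuming a hypothetical q_network that maps a state to one Q-value per action:

```python
import tensorflow as tf

def greedy_policy(q_network, state):
    """pi*(s) = argmax_a Q(s, a): pick the action with the highest predicted
    expected total future reward. q_network is a hypothetical model that maps
    a state to one Q-value per action in the action space."""
    q_values = q_network(tf.expand_dims(state, axis=0))   # shape (1, num_actions)
    return int(tf.argmax(q_values, axis=-1)[0])
```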
diffusion
iteratively add small amounts of noise, then learn to denoise
# MUSE text to image generation
DreamBooth
personalization
autoregressive models