MIT 6.S191: Intro to Deep Learning notes
t is single timestep
y hat of t = function of input vector x at time t
y hat is the predicted output
temporal dependence
recurrence relationship
h is the internal memory state passed from timestep to timestep
y hat of t is a function of x_t and h_{t-1}, the temporal internal memory state
cyclic temporal dependence
RNN
cell state
function with weights
input
old state from last step
update hidden state
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t)
weight matrix * previous state + weight matrix * input
computational graph across time
weight matrix W_hh reused across time
weight matrix W_xh
weight matrix W_hy
total loss = sum over timesteps of the loss between y_t and y hat_t
tf.keras.layers.SimpleRNN()
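A minimal Keras sketch of this recurrence (the hidden size, sequence length and feature count are illustrative assumptions, not from the lecture):

```python
import tensorflow as tf

# Sketch: simple RNN over sequences of length 10 with 8 features per step.
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(64, input_shape=(10, 8)),  # h_t = tanh(W_hh·h_{t-1} + W_xh·x_t)
    tf.keras.layers.Dense(1),                            # y_hat = W_hy·h_t
])
model.compile(optimizer="adam", loss="mse")
```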
sentiment classification
text generation, image captioning
translation, forecasting, music generation
design criteria
1. handle variable-length sequences
2. track long-term dependencies
3. maintain information about order
4. share parameters across the sequence
predict the next word
embedding word2vec
* corpus of words
* indexing
* one hot embedding
learned embedding to capture semantics - semantically similar words end up closer in latent space
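A small sketch of the indexing-to-embedding step (the vocabulary size and embedding dimension are made-up illustrative values):

```python
import tensorflow as tf

# Sketch: map word indices from a hypothetical 10,000-word corpus into a learned
# 128-dimensional embedding, so similar words can end up close in latent space.
vocab_size, embed_dim = 10_000, 128
embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)

word_indices = tf.constant([[12, 407, 3]])   # one sentence encoded as word indices
vectors = embedding(word_indices)            # shape (1, 3, 128): one vector per word
```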
backpropagation through time
backpropagation through time for temporal unrolling
through each time step and back through each step
exploding gradients problem
gradient clipping to scale down big gradients
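Sketch of gradient clipping via the optimizer (the clip threshold is an arbitrary example):

```python
import tensorflow as tf

# Sketch: cap the gradient norm at 1.0 (arbitrary threshold) so exploding
# gradients during backpropagation through time are scaled back down.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
```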
vanishing gradients
activation function, weight initialization, network architecture
trick 1: ReLU keeps the derivative from shrinking (derivative is 1 for x > 0)
trick 2: initialize weights to the identity matrix, initialize biases to zero
trick 3: gated cells
use gates to selectively add or remove information
Long Short-Term Memory (LSTM) networks rely on gates
LSTM gates: forget, store, update, output
maintain a cell state
use gates to control the flow of information
forget irrelevant information
store relevant new information
update the cell state, output a filtered version
uninterrupted gradient flow
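Same sketch as the SimpleRNN above, but swapping in a gated LSTM cell (sizes are illustrative):

```python
import tensorflow as tf

# Sketch: the LSTM's forget/input/output gates maintain a separate cell state
# with uninterrupted gradient flow through time.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(10, 8)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```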
RNN applications and limitations:
* music generation
input: sheet music
output: the next character in the sheet music
* sentiment classification
Limitations of RNNs
* encoding bottleneck - sequential temporal data must be processed one step at a time
* slow, no parallelization
* no long-term memory
so attention is all we need
goal of sequence modelling
sequence of inputs , sequence of features, sequence of outputs
continuous stream, parallelization, long memory
attention is all you need
transformer architecture
self attention
query q and key k: how similar are they
extract values based on attention: return the values with the highest attention
identify and attend to the most important features in the input
1. encode position information
2. extract query, key, value for search
3. compute attention weighting
4. extract features with high attention
x = "he tossed the tennis ball to serve"
embedding: word2vec
position information: position-aware encoding
attention score
positional encoding
linear layer to query q
linear layer to get key k
linear layer to get value v
vectors q and k
take dot product and scale
cosine similarity
query · key^T
similarity metric
attention weighting
softmax turns the similarity scores into attention weights
words with higher weights get relatively more attention
extract the features
attention weighting * value = output
Attention(Q, K, V) = softmax(Q · K^T / scaling) · V
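The attention equation written out as a sketch of single-head self-attention (the head size d_k and the use of Dense layers for the Q/K/V projections are illustrative assumptions):

```python
import tensorflow as tf

def self_attention(x, d_k=64):
    """Sketch of single-head scaled dot-product self-attention.
    x has shape (batch, seq_len, d_model); d_k is an illustrative head size."""
    q = tf.keras.layers.Dense(d_k)(x)            # linear layer -> query
    k = tf.keras.layers.Dense(d_k)(x)            # linear layer -> key
    v = tf.keras.layers.Dense(d_k)(x)            # linear layer -> value
    scores = tf.matmul(q, k, transpose_b=True)   # Q · K^T similarity
    weights = tf.nn.softmax(scores / tf.sqrt(float(d_k)), axis=-1)  # attention weighting
    return tf.matmul(weights, v)                 # weighting * value = output
```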
LLM: BERT, GPT
biology: AlphaFold2
computer vision
convolution - apply filters to generate feature maps
non-linear activation - ReLU
pooling - downsampling on each feature map
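A minimal Keras sketch of convolution -> ReLU -> pooling (filter count, kernel size, input shape and class count are illustrative assumptions):

```python
import tensorflow as tf

# Sketch: 32 3x3 filters over a 28x28x1 image, ReLU activation, 2x2 max pooling,
# then a softmax classifier over 10 illustrative classes.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```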
unsupervised learning
data: x
x is data, no labels
goal: learn the hidden or underlying structure of the data
examples: clustering, feature or dimensionality reduction
# Generative modelling
unsupervised
only samples
1. density estimation
learn the underlying probability distribution the data came from
2. sample generation:
learn underlying prob model
generate new data in same probability distribution
P_model(x) ≈ P_data(x): as similar as possible to the underlying data distribution
learn the underlying feature set and decode it efficiently
training data can be biased - find which features are over- and under-represented in the data
outlier detection: identifying rare outliers / edge cases
e.g. 95% of the density is normal driving
outliers are edge cases: harsh weather, accidents
# Latent variable models
1. Autoencoders and variational autoencoders (VAEs)
2. Generative Adversarial Networks (GANs)
Latent variables: Plato's Republic, the myth of the cave - observers see only the shadows of objects; the shadows are their reality, but the objects themselves are not directly observable
can we learn the true explanatory variables?
## Autoencoders
unsupervised approach for learning a lower-dimensional feature representation from unlabelled training data
map X to low dimensional latent space Z
Encoder learns mapping from the data x to low dimensional latent space z
very efficient, compact feature representation of the data, so simple to train
need a way to decode to reconstruct the original data
using a CNN or FFN to produce x hat
decoder learns mapping back from latent space z to a reconstructed observation x hat
x => z => x hat
loss(x, x hat) = ||x - x hat||^2
where ||·||^2 is the squared Euclidean (L2) distance
loss function has no labels
autoencoding is a form of compression
the lower the dimensionality, the higher the reconstruction loss, but the more efficient and smaller the representation
bottleneck hidden layer forces the network to learn a compressed latent representation
reconstruction loss forces the latent representation to capture or encode as much information about the data as possible
autoencoding means self-encoding the data
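A minimal autoencoder sketch (the layer sizes and the 2-D bottleneck are illustrative choices, not from the lecture):

```python
import tensorflow as tf

# Sketch: encoder maps x to a low-dimensional latent z; decoder maps z back to x_hat.
# The 784-dim flattened input and the 2-dim bottleneck are illustrative.
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(2),                           # bottleneck latent z
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(784, activation="sigmoid"),   # reconstruction x_hat
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")       # ||x - x_hat||^2, no labels needed
```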
## Variational autoencoders (VAEs)
x => z => x hat
is deterministic (same reconstruction for the same weights)
VAEs add a random, probabilistic twist
random sampling operation
x => mu and sigma => z => x hat
sample from the mean and std deviation to compute the latent sample z
mu is mean vector
sigma is std deviation
encoder computes q_phi(z|x)
decoder computes p_theta(x|z)
q_phi of z latent sample given data sample of x
p_theta of x data sample given latent sample of z
encoder computes a probability distribution of latent variable given input data x
decoder learns the data probability distribution given the latent variable z
probabilistic, not deterministic
phi = encoder weights, theta = decoder weights
VAE loss:
loss function(phi, theta, x) = reconstruction loss + regularization term
function of the data and the weights
reconstruction loss = log likelihood, mean squared error
regularization term = D( q_phi(z|x) || p(z) )
q_phi(z|x) is infered latent distribution:
encoder probability distribution over the latent variable given input data x and weights phi
p(z) is a fixed prior on the latent distribution
D is the divergence used as the regularization term
it's the distance between the inferred latent distribution and the fixed prior
adopts probability distribution similar to the prior
enforce latent variable to be normal gaussian distribution
* smooth encoding
* penalize cheating
mean = 0 , variance of 1, std deviation of 1
KL divergence between the two distributions
D_KL = -1/2 * sum_{j=0}^{k-1} ( sigma_j + mu_j^2 - 1 - log(sigma_j) )
continuity: points that are close in latent space are similar after decoding
completeness: no missing data - sampling from the latent space gives meaningful content after decoding
reparameterization for backpropagation
redefine how the latent variable vector is sampled
z = mu + sigma • epsilon
mu and sigma are fixed vectors
sigma is scaled by random constants drawn from the prior distribution
the random constant is epsilon
all randomness is in epsilon so that mu and std deviation can be trained
epsilon is stochastic and the rest is deterministic
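A sketch of the reparameterized sampling and the KL regularization term, assuming the encoder outputs a mean and a log-variance (the standard closed form, not necessarily the lecture's exact notation):

```python
import tensorflow as tf

def sample_latent(mu, log_var):
    """Reparameterization trick: all randomness lives in epsilon, so mu and sigma
    stay deterministic and gradients can flow through them."""
    epsilon = tf.random.normal(tf.shape(mu))      # epsilon ~ N(0, 1)
    return mu + tf.exp(0.5 * log_var) * epsilon   # z = mu + sigma * epsilon

def kl_regularization(mu, log_var):
    """Closed-form D( q_phi(z|x) || p(z) ) against a standard normal prior,
    assuming the encoder outputs the mean and the log-variance."""
    return 0.5 * tf.reduce_sum(tf.exp(log_var) + tf.square(mu) - 1.0 - log_var, axis=-1)
```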
### Latent perturbation
slowly perturb / tune a single latent variable
each individual latent variable captures something meaningful, e.g. rotating the reconstruction
disentanglement
enforce diagonal prior on the latent variables to encourage independence
Beta VAE
scale / weighting constant beta on the regularization term
beta > 1 encourages disentanglement
e.g. head rotation (azimuth)
# Generative Adversarial Networks (GANs)
sample from complex distribution
solution: sample from noise and learn a transformation to the data distribution
generator - takes noise and produces fake data
discriminator - takes generated output and real data and makes a classification decision: real or fake
the generator tries to fool the discriminator
Intuition of GAN
train the discriminator so that P(real) = 1 for real data
train the generator to create fake data inside the real data distribution
noise: z
generator: g
discriminator: D -> y
x real true data
discriminator objective: arg max_D E_{z,x} [ log D(x) + log(1 - D(G(z))) ]
fake data: G(z)
D(G(z)): discriminator's estimate of the probability that fake data is real
D(x): discriminator's estimate of the probability that real data is real
1 - D(G(z)): estimate of the probability that fake data is fake
generator objective: arg min_G E_z [ log(1 - D(G(z))) ]
G minimises the probability that its generated data is detected as fake, i.e. tries to fool the discriminator
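A sketch of the two adversarial losses using binary cross-entropy, which is equivalent to the log-likelihood objective above (the generator and discriminator networks are assumed to be defined elsewhere):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def discriminator_loss(d_real, d_fake):
    """Discriminator wants D(x) -> 1 for real data and D(G(z)) -> 0 for fakes."""
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)

def generator_loss(d_fake):
    """Generator wants the discriminator to be fooled: D(G(z)) -> 1."""
    return bce(tf.ones_like(d_fake), d_fake)
```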
data distribution manifold
conditional GAN
conditioning factor
paired translation between data, e.g. pix2pix
CycleGAN learns transformations across domains with unpaired data
e.g. horse to zebra
distribution transformations
GANs: Gaussian noise z ~ N(0, 1) to the target data manifold
CycleGANs: data manifold X data manifold Y
e.g. transforming a waveform by operating on its spectrogram image
Diffusion Models
Reinforcement Learning
data: state and action pairs
goal: maximise future rewards over many time steps
agent:
takes actions
environment:
the world in which the agent exists
actions:
a_t
a move an agent can make in the environment
action space:
the set of possible actions an agent can make in the environment
observation:
the state of the environment after taking an action
the state changes to s_{t+1}
reward:
feedback that measures the success or failure of the agent's action
reward: r_t
gamma = discount (dampening) factor: future rewards are worth less
a smaller gamma encourages short-term greediness
Q function
R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
E = expected value
Q(s_t, a_t) = E[R_t | s_t, a_t]
E[R_t | s_t, a_t] = expected total future reward
q function captures the expected total future reward an agent in state s can receive by executing a certain action a
the agent needs a policy pi(s) to infer the best action to take at its state s
strategy: the policy should choose an action that maximises future reward
pi*(s) = argmax_a Q(s,a)
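A tiny sketch of this greedy policy, assuming a hypothetical q_network that maps a state to one Q-value per action:

```python
import tensorflow as tf

def greedy_policy(q_network, state):
    """pi*(s) = argmax_a Q(s, a): pick the action with the highest predicted
    expected total future reward. q_network is a hypothetical model that maps
    a state to one Q-value per action in the action space."""
    q_values = q_network(tf.expand_dims(state, axis=0))   # shape (1, num_actions)
    return int(tf.argmax(q_values, axis=-1)[0])
```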
diffusion
iteratively add small amounts of noise, then learn to denoise
# MUSE text to image generation
DreamBooth
personalization
autoregressive models