MIT 6.S191: Intro to Deep Learning MIT notes
t is a single timestep | |
y_hat_t = f(x_t): a function of the input vector x at time t | |
y hat is the predicted output | |
temporal dependence | |
recurrence relationship | |
h is the internal memory state passed from timestep to timestep | |
y_hat_t = f(x_t, h_{t-1}): a function of the current input and the temporal internal memory state | |
cyclic temporal dependence | |
RNN | |
cell state | |
function with weights | |
input | |
old state from last step | |
update hidden state | |
h_t = tanh(W_hh h_{t-1} + W_xh x_t) | |
weight matrix * previous state + weight matrix * current input | |
computational graph across time | |
weight matrix W_hh shared across time (hidden to hidden) | |
weight matrix W_xh (input to hidden) | |
weight matrix W_hy (hidden to output) | |
total loss = sum over timesteps of the loss between y_t and y_hat_t | |
tf.keras.layers.SimpleRNN() | |
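A minimal sketch of this recurrence in NumPy; the dimensions and weight names below are illustrative assumptions, and `tf.keras.layers.SimpleRNN` wraps the same update:

```python
import numpy as np

# Illustrative sizes (assumptions, not from the lecture)
input_dim, hidden_dim, output_dim = 8, 16, 4

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1   # input -> hidden
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden -> hidden, shared across time
W_hy = rng.normal(size=(output_dim, hidden_dim)) * 0.1  # hidden -> output

def rnn_forward(xs):
    """xs: sequence of input vectors, shape (T, input_dim)."""
    h = np.zeros(hidden_dim)                 # initial hidden state
    ys = []
    for x_t in xs:                           # unroll the computational graph across time
        h = np.tanh(W_hh @ h + W_xh @ x_t)   # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
        ys.append(W_hy @ h)                  # y_hat_t = W_hy h_t
    return np.stack(ys), h

y_hat, h_T = rnn_forward(rng.normal(size=(5, input_dim)))
```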
sentiment classification | |
text generation, image captioning | |
translation, forecasting, and music generation | |
design criteria | |
1. handle variable-length sequences | |
2. track long-term dependencies | |
3. maintain order information | |
4. share parameters across the sequence | |
predict the next word | |
embedding word2vec | |
* corpus of words | |
* indexing | |
* one hot embedding | |
learned embedding captures semantics: similar words end up closer in latent space | |
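A toy sketch of indexing, one-hot encoding, and a learned embedding lookup; the vocabulary, sizes, and names below are made up for illustration:

```python
import numpy as np

vocab = ["he", "tossed", "the", "tennis", "ball", "to", "serve"]   # toy corpus (assumption)
index = {w: i for i, w in enumerate(vocab)}                        # indexing

def one_hot(word):
    """Sparse one-hot vector: carries no notion of similarity between words."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Learned embedding: a trainable matrix mapping index -> dense vector,
# so semantically similar words can end up close in latent space.
embedding_dim = 3
E = np.random.default_rng(0).normal(size=(len(vocab), embedding_dim))
dense = E[index["tennis"]]        # lookup of the row = embedding vector
```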
backpropagation through time | |
backpropagation through time for temporal unrolling | |
through each time step and back through each step | |
exploding gradients problem | |
gradient clipping to scale big gradients | |
vanishing gradients | |
fixes: activation function, weight initialization, network architecture | |
trick 1: ReLU prevents the derivative from shrinking | |
trick 2: initialize weights to the identity matrix, initialize biases to zero | |
trick 3: gated cells | |
use gates to selectively add or remove information | |
Long Short-Term Memory (LSTM) networks rely on gated cells | |
gated LSTM: forget, store, update, output | |
maintain a cell state | |
use gates to control the flow of information | |
forget irrelevant information | |
store relevant new information | |
update the cell state, then output a filtered version | |
uninterrupted gradient flow through the cell state | |
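A hedged Keras sketch tying these tricks together: an LSTM (gated cell) sequence model with gradient clipping; the layer sizes and the sentiment-style output head are assumptions, not the lecture's exact model:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),   # learned word embedding
    tf.keras.layers.LSTM(128),                                   # gated cell: forget, store, update, output
    tf.keras.layers.Dense(1, activation="sigmoid"),              # e.g. sentiment classification
])

# clipnorm rescales large gradients, mitigating the exploding-gradients problem
model.compile(optimizer=tf.keras.optimizers.Adam(clipnorm=1.0),
              loss="binary_crossentropy")
```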
RNN applications and limitations: | |
* music generation | |
input sheet music | |
output the next character in the sheet music | |
* sentiment classification | |
Limitations of RNNs | |
* encoding bottleneck - sequential temporal data processed one step at a time | |
* slow, no parallelization | |
* no long-term memory | |
so attention is all we need | |
goal of sequence modelling | |
sequence of inputs , sequence of features, sequence of outputs | |
continuous stream, parallelization, long memory | |
attention is all you need | |
transformer architecture | |
self attention | |
query q and key k: compute how similar they are | |
extract values based on attention: return the values with the highest attention | |
identify and attend to the most important features in the input | |
1. encode position information | |
2. extract query, key, value for search | |
3. compute attention weighting | |
4. extract features with high attention | |
x = "he tossed the tennis ball to serve" | |
embedding (word2vec) | |
position information: position-aware encoding | |
attention score | |
positional encoding | |
linear layer to query q | |
linear layer to get key k | |
linear layer to get value v | |
query vector q and key vector k | |
take the dot product and scale | |
cosine similarity | |
query • key^T | |
similarity metric | |
attention weighting | |
softmax turns the similarity scores into relative weights | |
words with higher attention scores get higher relative weight | |
extract the features | |
attention weighting * value = output | |
Attention(Q, K, V) = softmax(Q • K^T / sqrt(d_k)) • V | |
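A sketch of one self-attention head following steps 1-4 above, in NumPy; positional encoding is omitted, and the shapes and names are illustrative:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: linear layers producing query, key, value."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # dot-product similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax attention weighting
    return weights @ V                                         # extract attention-weighted values

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 16))          # e.g. 7 embedded tokens of "he tossed the tennis ball to serve"
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)   # softmax(Q K^T / sqrt(d_k)) V
```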
LLM: BERT, GPT | |
biology: AlphaFold2 | |
computer vision | |
convolution - apply filters to generate feature map | |
non linear activation - relu | |
pooling - down sampling on each feature map | |
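A minimal Keras sketch of these three CNN stages; the filter count, input shape, and classification head are illustrative assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),  # convolution + ReLU nonlinearity
    tf.keras.layers.MaxPooling2D(pool_size=2),                     # pooling: downsample each feature map
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),               # classification head
])
```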
unsupervised learning | |
data: x | |
x is data, no labels | |
goal: learn the hidden or underlying structure of the data | |
examples: clustering, feature or dimensionality reduction | |
#generative modelling | |
unsupervised | |
only samples | |
1. density estimation | |
learn the underlying probability distribution the data came from | |
2. sample generation: | |
learn underlying prob model | |
generate new data in same probability distribution | |
P_model(x) ≈ P_data(x): as similar as possible to the underlying probability distribution | |
learn the underlying feature set and decode it efficiently | |
training data can be biased: find which features are over- and under-represented in the data | |
outlier detection: identifying rare outlier edge cases | |
e.g. 95% of the density is normal driving | |
outliers are edge cases: harsh weather, accidents | |
# Latent variable models | |
1. Autoencoders and variational autoencoders (VAEs) | |
2. Generative Adversarial Networks (GANs) | |
latent variables: Plato's Republic, allegory of the cave - we observe only the shadows of objects; the shadows are the observed reality, but the objects themselves are not directly observable | |
can we learn the true explanatory variables? | |
## Autoencoders | |
unsupervised approach for learning a lower-dimensional feature representation from unlabelled training data | |
map X to low dimensional latent space Z | |
Encoder learns mapping from the data x to low dimensional latent space z | |
a very efficient, compact feature representation of the data, so it is simple to train | |
a way to decode back and reconstruct the original data | |
using CNNs and feed-forward layers to produce x hat | |
decoder learns mapping back from latent space z to a reconstructed observation x hat | |
x => z => x hat | |
loss(x, x_hat) = || x - x_hat ||^2 | |
where || · || is the L2 (Euclidean) distance | |
loss function has no labels | |
autoencoding is a form of compression | |
the lower the dimensionality, the larger the reconstruction loss, but the smaller and more efficient the representation | |
the bottleneck hidden layer forces the network to learn a compressed latent representation | |
reconstruction loss forces the latent representation to capture or encode as much information about the data as possible | |
autoencoding means self-encoding the data | |
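A minimal autoencoder sketch in Keras, assuming a dense encoder/decoder and a small latent bottleneck; all sizes are illustrative:

```python
import tensorflow as tf

latent_dim = 2   # bottleneck dimensionality (assumption)

encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(latent_dim),                      # x -> z
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(28 * 28, activation="sigmoid"),
    tf.keras.layers.Reshape((28, 28)),                      # z -> x_hat
])

autoencoder = tf.keras.Sequential([encoder, decoder])
# Reconstruction loss ||x - x_hat||^2 -- no labels needed
autoencoder.compile(optimizer="adam", loss="mse")
```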
## variational autoencoders VAEs | |
x => z => x hat | |
is deterministic (same reconstruction for the same weights) | |
VAEs add a random probabilistic twist | |
random sampling operation | |
x => mu and sigma => z => x hat | |
sample from the mean and std deviation to compute the latent sample z | |
mu is mean vector | |
sigma is std deviation | |
encoder computes q_phi(z|x) | |
decoder computes p_theta(x|z) | |
q_phi of z latent sample given data sample of x | |
p_theta of x data sample given latent sample of z | |
encoder computes a probability distribution of latent variable given input data x | |
decoder learns a data probability distribution given the latent variable z | |
probabilistic, not deterministic | |
phi = encoder weights, theta = decoder weights | |
VAE loss: | |
loss function(phi, theta, x) = reconstruction loss + regularization term | |
function of the data and the weights | |
reconstruction loss = log likelihood, mean squared error | |
regularization term = D( q_phi(z|x) || p(z) ) | |
q_phi(z|x) is the inferred latent distribution: | |
the encoder's probability distribution over the latent variable given input data x and weights phi | |
p(z) is a fixed prior on the latent distribution | |
D is the regularization term: | |
the distance between the inferred latent distribution and the fixed prior | |
the inferred distribution is pushed to be similar to the prior | |
enforce the latent variables to follow a normal (Gaussian) distribution | |
* smooth encoding | |
* penalize cheating | |
mean = 0 , variance of 1, std deviation of 1 | |
KL divergence between the two distributions: | |
D_KL( q_phi(z|x) || N(0, 1) ) = 1/2 * sum_{j=0}^{k-1} ( sigma_j^2 + mu_j^2 - 1 - log(sigma_j^2) ) | |
continuity: points that are close in latent space are similar after decoding | |
completeness: no gaps; sampling from the latent space gives meaningful content after decoding | |
reparameterization trick for backpropagation | |
redefine how the latent variable vector is sampled | |
z = mu + sigma • epsilon | |
mu and sigma are fixed (deterministic outputs of the encoder) | |
sigma is scaled by a random constant drawn from the prior distribution | |
the random constant is epsilon | |
all randomness is in epsilon so that mu and the std deviation can be trained | |
epsilon is stochastic and the rest is deterministic | |
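A sketch of the reparameterized sampling and the KL regularization term, assuming the encoder outputs `mu` and `log_var` per latent dimension (the names are illustrative):

```python
import tensorflow as tf

def sample_z(mu, log_var):
    """z = mu + sigma * epsilon, with all randomness pushed into epsilon."""
    epsilon = tf.random.normal(shape=tf.shape(mu))   # stochastic node, drawn from the prior
    return mu + tf.exp(0.5 * log_var) * epsilon      # deterministic in mu and log_var, so gradients flow

def kl_divergence(mu, log_var):
    """D_KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * tf.reduce_sum(1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)

# total VAE loss = reconstruction loss + beta * KL term (beta > 1 gives a beta-VAE)
```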
### Latent perturbation | |
slowly perturb or tune individual latent variables | |
an individual latent variable captures something meaningful, e.g. rotating the reconstruction | |
disentanglement | |
enforce diagonal prior on the latent variables to encourage independence | |
Beta VAE | |
scale and weighting constant on regularization term | |
beta > 1 encourages disentanglement | |
e.g. head rotation (azimuth) | |
# Generative Adversarial Networks (GANs) | |
we want to sample from a complex distribution | |
solution: sample from simple noise and learn a transformation to the data distribution | |
generator - takes noise and produces fake data | |
discriminator - takes generated output and real data and makes a classification decision: real or fake | |
the generator tries to fool the discriminator | |
Intuition of GAN | |
the discriminator is trained so that P(real) = 1 for real data | |
the generator learns to create fake data that matches the real data distribution | |
noise: z | |
generator: g | |
discriminator: D -> y | |
x real true data | |
arg max_D E_{z,x} [ log D(G(z)) + log(1 - D(x)) ] | |
fake data: G(z) | |
estimate of the probability that fake data is fake: log D(G(z)) | |
estimate of the probability that real data is real: log(1 - D(x)) | |
D(x) is the discriminator's estimate that real data is fake | |
1 - D(x) is the estimate that real data is real | |
G minimizes the same objective: the generator tries to produce data that fools the discriminator | |
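A sketch of these two objectives as training losses, keeping this lecture's convention that D outputs the probability its input is fake; the generator and discriminator networks are assumed to exist elsewhere:

```python
import tensorflow as tf

def discriminator_loss(d_fake, d_real):
    """d_fake = D(G(z)), d_real = D(x); D outputs P(input is fake).
    The discriminator maximizes log D(G(z)) + log(1 - D(x)), so we minimize the negative."""
    return -tf.reduce_mean(tf.math.log(d_fake + 1e-8) + tf.math.log(1.0 - d_real + 1e-8))

def generator_loss(d_fake):
    """The generator minimizes log D(G(z)): it wants its fake samples classified as real."""
    return tf.reduce_mean(tf.math.log(d_fake + 1e-8))
```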
data distribution manifold | |
conditional GAN | |
conditioning factor | |
paired translation between data, e.g. pix2pix | |
CycleGAN learns transformations across domains with unpaired data | |
e.g. horse to zebra | |
distribution transformations | |
GANs: map Gaussian noise z ~ N(0, 1) to the target data manifold | |
CycleGANs: map data manifold X to data manifold Y | |
e.g. transforming waveforms via spectrogram images | |
Diffusion Models | |
Reinforcement Learning | |
data: state and action pairs | |
goal: maximise future rewards over many time steps | |
agent: | |
takes actions | |
environment: | |
the world in which the agent exists and operates | |
actions: | |
a_t | |
a move an agent can make in the environment | |
action space: | |
the set of possible actions an agent can make in the environment | |
observation: | |
what the agent observes of the environment after taking actions | |
the state changes to s_{t+1} | |
reward: | |
feedback that measures the success or failure of the agent's action | |
reward: r_t | |
gamma = discount (dampening) factor: future rewards are worth less | |
enforces a degree of short-term greediness | |
Q function | |
R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... | |
E = expected value | |
Q(s_t, a_t) = E[R_t|s_t, a_t] | |
E[R_t|s_t, a_t] = expected total reward | |
q function captures the expected total future reward an agent in state s can receive by executing a certain action a | |
the agent needs a policy pi(s) to infer the best action to take at its state s | |
strategy: the policy should choose an action that maximises future reward | |
pi*(s) = argmax_a Q(s,a) | |
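A toy sketch of the discounted return and the greedy policy pi*(s) = argmax_a Q(s, a), using a tabular Q-function; the states, actions, and values are made up for illustration:

```python
import numpy as np

gamma = 0.9   # discount factor: future rewards are worth less

def discounted_return(rewards):
    """R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Tabular Q-function: Q[state, action] = expected total future reward
Q = np.array([[1.0, 2.5],    # state 0
              [0.3, 0.1]])   # state 1

def policy(state):
    """pi*(s): choose the action that maximizes Q(s, a)."""
    return int(np.argmax(Q[state]))

print(discounted_return([1.0, 0.0, 2.0]), policy(0))
```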
diffusion | |
add little bits of noise , then denoise | |
# MUSE text to image generation | |
DreamBooth | |
personalization | |
autoregressive models | |