@andylolu2
Last active January 26, 2024 15:03
CLIP loss
# b - batch size
# d - feature dimension
# t - learned temperature parameter
# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[b, h, w, c] - minibatch of aligned images
# T[b, l] - minibatch of aligned texts
# extract feature representations of each modality
F_i = image_encoder(I) # [b, d]
F_t = text_encoder(T) # [b, d]
# scaled pairwise cosine similarities [b, b]
sim = cosine_similarity(F_i, F_t) * np.exp(t)
# symmetric loss function
labels = np.arange(b)  # matching pair for row i is column i
loss_i = cross_entropy_loss(sim, labels)
loss_t = cross_entropy_loss(sim.T, labels)
loss = (loss_i + loss_t) / 2
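The pseudocode above can be turned into a runnable sketch. This is one possible PyTorch implementation, not the gist author's own code: `clip_loss` and its argument names are made up here, and it assumes the encoders' outputs are raw (unnormalized) feature vectors, so it L2-normalizes them before taking dot products to get cosine similarities.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, log_temperature):
    """Symmetric cross-entropy (InfoNCE) loss over a batch of paired features.

    image_features: [b, d] raw image embeddings
    text_features:  [b, d] raw text embeddings
    log_temperature: learned scalar t; logits are scaled by exp(t)
    """
    # L2-normalize so dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # scaled pairwise cosine similarities, shape [b, b]
    logits = image_features @ text_features.T * log_temperature.exp()

    # the matching text for image i sits on the diagonal
    labels = torch.arange(logits.shape[0], device=logits.device)

    loss_i = F.cross_entropy(logits, labels)    # image -> text direction
    loss_t = F.cross_entropy(logits.T, labels)  # text -> image direction
    return (loss_i + loss_t) / 2
```

As a quick sanity check, feeding the same features for both modalities (perfectly aligned pairs) should yield a much lower loss than two independent random batches.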