@leopd
Last active April 21, 2023 22:35
Explanatory (non-vectorized) code for how attention works
# Explanatory code for how attention mechanisms work.
# It is deliberately not vectorized to make it clearer.
import math
from typing import List

import torch
from torch import Tensor

def attention(self, X_in: List[Tensor]) -> List[Tensor]:
    # For every token, transform the previous layer's output
    # into a query, a key, and a value vector.
    query, key, value = [], [], []
    for i in range(self.sequence_length):
        query.append(self.Q @ X_in[i])
        key.append(self.K @ X_in[i])
        value.append(self.V @ X_in[i])
    d_k = key[0].shape[0]  # dimensionality of the key vectors
    # Compute output values, one at a time.
    out = []
    for i in range(self.sequence_length):
        this_query = query[i]
        # How relevant is each input to this output?
        relevance = torch.empty(self.sequence_length)
        for j in range(self.sequence_length):
            relevance[j] = this_query @ key[j]
        # Normalize the relevance scores to sum to 1, scaling by
        # 1/sqrt(d_k) first so the softmax stays well-conditioned
        # (this is the "scaled" part of scaled dot-product attention).
        relevance = torch.softmax(relevance / math.sqrt(d_k), dim=0)
        # Compute a weighted sum of the values.
        out_i = torch.zeros_like(value[0])
        for j in range(self.sequence_length):
            out_i += relevance[j] * value[j]
        out.append(out_i)
    return out
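
To actually run the function above, it needs an object supplying self.Q, self.K, self.V, and self.sequence_length. Here is a minimal, hypothetical harness (not part of the original gist) with random matrices standing in for learned weights:

import torch

class ToyAttention:
    def __init__(self, d_model: int, sequence_length: int):
        self.sequence_length = sequence_length
        # Random square matrices stand in for learned weights here.
        self.Q = torch.randn(d_model, d_model)
        self.K = torch.randn(d_model, d_model)
        self.V = torch.randn(d_model, d_model)

    attention = attention  # reuse the function above as a method

layer = ToyAttention(d_model=8, sequence_length=4)
X_in = [torch.randn(8) for _ in range(4)]
out = layer.attention(X_in)
print(len(out), out[0].shape)  # 4 output vectors, one per token, each of dimension 8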
@banderlog
Q, K, and V are weight matrices that are learned during training.
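
In a real PyTorch module, they would typically be registered as learnable parameters (or implemented as nn.Linear layers) so the optimizer updates them. A brief sketch, assuming a model dimension d_model:

import torch
from torch import nn

class AttentionWeights(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Registered as parameters so gradients flow and the optimizer updates them.
        self.Q = nn.Parameter(torch.randn(d_model, d_model))
        self.K = nn.Parameter(torch.randn(d_model, d_model))
        self.V = nn.Parameter(torch.randn(d_model, d_model))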
