@brunosan
Created September 8, 2023 12:35

What is a Transformer in AI Deep Learning

Conceptual explanation of the vanilla Transformer.

Overall, it's a pipeline that takes a sequence of items represented by learnable vectors, and then creates parallel attempts ("heads") to understand how each sequence item relates to the others ("self-attention"), producing an output sequence. There can be many layers of heads, making a transformer very deep. Sometimes the task is purely to predict the next item, and sometimes it is to create a transformed version of the sequence, e.g. translate languages, summarize, ...

There are extensions like [[Vision Transformer ViT]], but these are mostly wrappers that convert image patches into digestible tokens for the same architecture.
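
As a rough illustration of that wrapping, here is a minimal NumPy sketch of how an image could be cut into patches and linearly projected into token vectors. The image size, patch size, and embedding width are made up; real ViT implementations differ in the details.

```python
import numpy as np

# Hypothetical sizes: a 32x32 RGB image cut into 8x8 patches.
image = np.random.rand(32, 32, 3)
patch = 8
d_model = 64  # embedding size, chosen arbitrarily for this sketch

# Cut the image into non-overlapping patches and flatten each one.
patches = image.reshape(32 // patch, patch, 32 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)  # (16, 192)

# A learnable linear projection turns each flattened patch into a "token" vector.
W_patch = np.random.randn(patch * patch * 3, d_model) * 0.02
tokens = patches @ W_patch   # (16, d_model): 16 image "tokens" for the Transformer
print(tokens.shape)          # (16, 64)
```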

  1. The input (e.g. a text snippet) of length L is tokenized, i.e. every token (word or subword) is assigned its index in the dictionary (e.g. {sun}=0, {moon}=1, ...). There are special tokens to indicate things like "start of sequence", "new line", ... Additionally, all uncommon tokens are replaced with the same extra "unknown" token.
  2. Each token is associated with a vector, called its embedding. These are random numbers at first and will change during learning.
  3. The "Transformer pipe" starts by taking these embedding vectors and summing fixed positional encoding values to them. This is done because the math operations we apply later ignore order (they are permutation-invariant), so we add the positional encodings to make the input order matter and always be the same. (Steps 1-3 are sketched in code after this list.)
  4. Then this summed input is duplicated into several "Heads" (see the self-attention sketch after this list). In each Head:
    1. We project the vectors with a learnable weighted sum into n dimensions, i.e. we make a smaller version of the vectors. This aims to condense the signal.
    2. This part is "self-attention": the projected tokens are pairwise dot-multiplied (these projections are called "queries" and "keys"), essentially letting us see how strongly different pairs of tokens relate. These pair scores are then used to weight a third projected version of the tokens (called "values"). The best way to understand this is to see how trios of tokens convey meaning, e.g. "polar bear hunter". Each Head might pick different strategies. Note: in a so-called decoder-only architecture like text generation, each "query" is only combined with the "keys" of tokens that appear before it, i.e. we don't let the system pick up future tokens that come later in the sequence.
    3. All Heads' outputs are concatenated and linearly weighted with another learned weight matrix.
  5. All sub-layers are normalized before the operation, and we add the input to the output as "skip connections". Skip connections make deep networks more stable, mostly because during back-propagation they let the final loss gradient reach all layers without passing through the many operations of every layer in between.
  6. A normal fully connected "feed forward" network is added at the end, with a ReLU to favor strong activations.
  7. These sets of Heads, normalization, and feed forward are repeated several times. Let's call each repetition a Block (sketched in code after this list).
  8. This is only the decoder part, which means using the input to generate the next token.
  9. For cases where we also need to attend to other input information, like a translation between languages, a whole extra set of Blocks (the encoder) is added. In the decoder's cross-attention, the "queries" come from the decoder while the "keys" and "values" come from the encoder, with the difference that the encoder side is attended without a mask, including "future" positions. Basically, we allow each token generation to see the entire phrase we need to translate, but only the translated tokens it has generated so far (see the cross-attention sketch after this list).
  10. During training, the loss function for e.g. the "next token prediction" task measures how far the predicted token distribution is from the true next token (typically a cross-entropy loss; sketched after this list). This difference is back-propagated to find the gradient on each operation, multiplied by the learning rate to modify all learned weights and biases, including the embeddings.
  11. At the end you have not only a model with weights and biases, but also a mapping from tokens to embeddings that captures semantic meaning, since tokens that appear in related positions (a proxy of meaning) will be pushed together.
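
To make steps 1-3 concrete, here is a minimal NumPy sketch of a toy tokenizer, an embedding table, and sinusoidal positional encodings. The tiny vocabulary, the special token ids, and all sizes are made up for illustration; they are not from any particular model.

```python
import numpy as np

# Step 1: a toy vocabulary; unknown words map to a shared <unk> id (hypothetical ids).
vocab = {"<unk>": 0, "<bos>": 1, "the": 2, "sun": 3, "moon": 4, "rises": 5}
def tokenize(text):
    return [vocab["<bos>"]] + [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]

ids = tokenize("The sun rises")          # [1, 2, 3, 5]

# Step 2: an embedding table, random at first and updated during training.
d_model = 16
E = np.random.randn(len(vocab), d_model) * 0.02
x = E[ids]                               # (seq_len, d_model)

# Step 3: fixed sinusoidal positional encodings, added so that order matters.
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

x = x + positional_encoding(len(ids), d_model)
print(x.shape)                           # (4, 16)
```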
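
Step 4, the self-attention heads, might look like the following sketch: plain NumPy, causal (decoder-only) masking, and two heads with invented dimensions. It is a conceptual illustration, not a reference implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, n_heads = 4, 16, 2
d_head = d_model // n_heads
x = np.random.randn(seq_len, d_model)        # output of steps 1-3

heads = []
for _ in range(n_heads):
    # 4.1: learnable projections down to a smaller dimension per head.
    W_q, W_k, W_v = (np.random.randn(d_model, d_head) * 0.02 for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    # 4.2: pairwise "query · key" scores, scaled; the causal mask hides future
    # tokens (the decoder-only case mentioned in the text).
    scores = Q @ K.T / np.sqrt(d_head)       # (seq_len, seq_len)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    weights = softmax(scores, axis=-1)
    heads.append(weights @ V)                # weighted sum of the "values"

# 4.3: concatenate the heads and mix them with another learned matrix.
W_o = np.random.randn(d_model, d_model) * 0.02
attn_out = np.concatenate(heads, axis=-1) @ W_o   # (seq_len, d_model)
```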
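
Steps 5-7 (pre-normalization, skip connections, the feed-forward network, and stacking into Blocks) can be sketched like this; the attention sub-layer is abbreviated to a random-projection stand-in so the snippet stays self-contained, and all sizes are again arbitrary.

```python
import numpy as np

d_model, d_ff, seq_len = 16, 64, 4

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(x):
    # Stand-in for the multi-head attention sketched above.
    return x @ (np.random.randn(d_model, d_model) * 0.02)

def feed_forward(x, W1, b1, W2, b2):
    # Step 6: fully connected layer, ReLU, then project back to d_model.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)

def block(x):
    # Step 5: normalize before each sub-layer and add the input back (skip connection).
    x = x + self_attention(layer_norm(x))
    x = x + feed_forward(layer_norm(x), W1, b1, W2, b2)
    return x

# Step 7: stack several Blocks to make the network deep.
x = np.random.randn(seq_len, d_model)
for _ in range(6):           # 6 Blocks, an arbitrary depth for the sketch
    x = block(x)
```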
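
Step 9, cross-attention in the encoder-decoder setup, is sketched below: queries come from the decoder, keys and values from the encoder output, and no causal mask is applied on the encoder side, so every source token is visible. Names and sizes are illustrative only.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d_model, src_len, tgt_len = 16, 5, 3
enc_out = np.random.randn(src_len, d_model)   # encoder output for the source phrase
dec_x   = np.random.randn(tgt_len, d_model)   # decoder state for tokens generated so far

W_q, W_k, W_v = (np.random.randn(d_model, d_model) * 0.02 for _ in range(3))
Q = dec_x   @ W_q           # queries from the decoder
K = enc_out @ W_k           # keys from the encoder
V = enc_out @ W_v           # values from the encoder

# No causal mask here: every decoder position may look at the whole source sentence.
weights = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)   # (tgt_len, src_len)
cross   = weights @ V                                    # (tgt_len, d_model)
```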
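
Step 10, the training signal for next-token prediction, is typically a cross-entropy loss between the predicted distribution over the vocabulary and the true next token. Here is a minimal sketch with made-up shapes and target ids; the back-propagation step itself is left out.

```python
import numpy as np

vocab_size, d_model, seq_len = 6, 16, 4
hidden = np.random.randn(seq_len, d_model)        # final Block output
W_out = np.random.randn(d_model, vocab_size) * 0.02

logits = hidden @ W_out                           # one score per vocabulary entry
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = probs / probs.sum(axis=-1, keepdims=True)

# The target is the sequence shifted by one: predict token t+1 from position t.
targets = np.array([2, 3, 5, 4])                  # hypothetical "true next token" ids
loss = -np.log(probs[np.arange(seq_len), targets]).mean()

# Back-propagating this loss would update all weights, including the embedding table.
print(loss)
```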