Notes from the HuggingFace article on BERT 101

BERT (Bidirectional Encoder Representations from Transformers) is an ML model for Natural Language Processing (NLP), developed in 2018 by researchers at Google AI Language.

What is BERT used for?

  • Sentiment Analysis (see the code sketch after this list)
  • Question Answering
  • Text prediction
  • Text generation
  • Summarization
  • Polysemy resolution (can differentiate between words that have multiple meanings, like "bank")
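
To make the first use case concrete, here is a minimal sketch of sentiment analysis with a BERT-family model via the Hugging Face `transformers` library (an assumption for illustration: the library is installed, and the example checkpoint below, a distilled BERT fine-tuned on SST-2, can be downloaded from the Hub):

```python
from transformers import pipeline

# Load a sentiment classifier backed by a BERT-family model.
# (Example checkpoint; any BERT model fine-tuned for sentiment would work.)
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("I love how BERT handles ambiguous words like 'bank'."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```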

How does BERT work?

BERT leverages:

  • Large amounts of training data
  • Masked Language Model
    • This enforces bidirectional learning: a word in a sentence is masked (hidden), and BERT must use the words on either side of the masked word to predict it (see the sketch after this list)
  • Next Sentence Prediction
    • It's used to help BERT learn relationships between sentences by predicting whether a given sentence follows the previous one
    • Example:
      • Paul went shopping. He bought a new shirt. (correct sentence pair)
      • Ramona made coffee. Vanilla ice cream cones for sale. (incorrect sentence pair)
    • During training, 50% correct sentence pairs are mixed in with 50% random sentence pairs to help BERT improve next-sentence-prediction accuracy
  • Transformers
    • Transformers use an attention mechanism to observe relationships between words. This concept was proposed in the 2017 paper "Attention Is All You Need".
    • How do Transformers work?
      • They work by leveraging attention, a deep-learning mechanism first seen in computer vision models
      • Transformers create differential weights that signal which words in a sentence are the most critical to process further
      • A Transformer does this by successively processing an input through a stack of transformer layers, usually called the encoder
      • If necessary, another stack of transformer layers, the decoder, can be used to predict a target output
      • BERT, however, does NOT use a decoder; it is encoder-only
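
Both pre-training objectives can be tried out directly. Below is a minimal sketch using the Hugging Face `transformers` library (assumes it is installed and that the `bert-base-uncased` weights can be fetched from the Hub); the sentence pairs are the examples above:

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction, pipeline

# 1. Masked Language Model: hide a word, let BERT predict it from both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("Paul went to the [MASK] to buy a new shirt."):
    print(pred["token_str"], round(pred["score"], 3))

# 2. Next Sentence Prediction: does sentence B follow sentence A?
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

pairs = [
    ("Paul went shopping.", "He bought a new shirt."),             # correct pair
    ("Ramona made coffee.", "Vanilla ice cream cones for sale."),  # random pair
]
for sent_a, sent_b in pairs:
    inputs = tokenizer(sent_a, sent_b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Index 0 = "B follows A", index 1 = "B is a random sentence".
    prob_is_next = torch.softmax(logits, dim=-1)[0, 0].item()
    print(f"P({sent_b!r} follows {sent_a!r}) = {prob_is_next:.3f}")
```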

BERT model size & architecture

ML Architecture Glossary

| ML Architecture Parts | Definition |
| --- | --- |
| Parameters | Number of learnable variables/values available for the model. |
| Transformer Layers | Number of Transformer blocks. A Transformer block transforms a sequence of word representations to a sequence of contextualized words (numbered representations). |
| Hidden Size | Layers of mathematical functions, located between the input and output, that assign weights (to words) to produce a desired result. |
| Attention Heads | The size of a Transformer block. |
| Processing | Type of processing unit used to train the model. |
| Length of Training | Time it took to train the model. |
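
These glossary terms map directly onto a model's configuration. A small sketch, assuming `transformers` is installed and the `bert-base-uncased` config can be fetched from the Hub, inspecting those values:

```python
from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)    # Transformer Layers (12 for BERT-base)
print(config.hidden_size)          # Hidden Size (768 for BERT-base)
print(config.num_attention_heads)  # Attention Heads (12 for BERT-base)
```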