{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Building an LLM from scratch\n",
"\n",
"## Code Description\n",
"\n",
"This code is implementing a text generation model using PyTorch, a popular machine learning library. The model is trained on a large corpus of text and learns to predict the next word in a sequence given the previous words. This type of model can be used for a variety of natural language processing tasks, such as text completion, translation, and more.\n",
"\n",
"\n",
"Let's break down the code into its main components:"
],
"metadata": {
"noteable": {
"cell_type": "markdown"
},
"tags": []
},
"id": "5db035df-2c10-4b4c-b0be-bb1de5ca1293"
},
{
"cell_type": "markdown",
"source": [
"### Loading dataset\n",
"\n",
"The code also includes a function to load abstracts from Semantic Scholar, a free, AI-powered research tool for scientific literature. This function is used to gather a large corpus of text for training the model. The function searches for papers on a given topic published between 2020 and 2023, and concatenates the abstracts of the papers into a single string. The function also maintains a list of individual abstracts. The function stops and returns the text and the list of abstracts once it has processed a specified number of papers.\n",
"\n",
"```python\n",
"from semanticscholar import SemanticScholar\n",
"from functools import lru_cache\n",
"\n",
"MAX_PAPER = 600\n",
"\n",
"@lru_cache\n",
"def load_abstracts(topic=\"generative ai\", number_paper=MAX_PAPER):\n",
" sch = SemanticScholar()\n",
" papers = sch.search_paper(query=topic, year=\"2020-2023\")\n",
" big_text = \"\"\n",
" abstract_list = []\n",
" for i, paper in enumerate(papers):\n",
" abstract = paper['abstract']\n",
" if abstract != None:\n",
" big_text += f\"\\n<START-ABSTRACT {i}>: \\n{abstract}\\n</END-ABSTRACT {i}\\n\"\n",
" abstract_list.append(abstract)\n",
" if i > number_paper:\n",
" return big_text, abstract_list\n",
" return \"\"\n",
"```"
],
"metadata": {
"noteable": {
"cell_type": "markdown"
}
},
"id": "d753f1e3-4f2e-4ea0-9f43-8544f3f57f0d"
},
{
"cell_type": "markdown",
"source": [
"### Importing Libraries\n",
"\n",
"The first part of the code is importing all the necessary libraries. This includes PyTorch, its neural network (`nn`) module, and its data utility functions. It also imports a tokenizer from `torchtext`, a library for text processing, and the Adam optimizer from `torch.optim`.\n",
"\n",
"```python\n",
"import torch\n",
"from torch import nn\n",
"from torch.utils.data import Dataset, DataLoader\n",
"from torchtext.data.utils import get_tokenizer\n",
"from torchtext.vocab import build_vocab_from_iterator\n",
"from torch.optim import Adam\n",
"```"
],
"metadata": {
"noteable": {
"cell_type": "markdown"
},
"tags": []
},
"id": "851a4336-f898-4c5d-a22f-fd830cb5ce54"
},
{
"cell_type": "markdown",
"source": [
"### Setting up the Device\n",
"\n",
"Next, the code checks if CUDA is available. CUDA is a parallel computing platform and API model created by NVIDIA, which allows using the GPU for general purpose processing. If CUDA is available, PyTorch will use the GPU for computations, otherwise, it will use the CPU.\n",
"\n",
"```python\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"```"
],
"metadata": {
"noteable": {
"cell_type": "markdown"
},
"tags": []
},
"id": "c1f5058c-6aa2-48cc-a57f-d80a31a702f5"
},
{
"cell_type": "markdown",
"source": [
"### Text Processing\n",
"\n",
"The code then loads a large corpus of text, converts it to lowercase, and tokenizes it using a basic English tokenizer from `torchtext`. Tokenization is the process of splitting the text into individual words or tokens. After tokenization, a vocabulary is built from the tokens, and the text is numericalized, i.e., each token is replaced by its index in the vocabulary.\n",
"\n",
"```python\n",
"big_text = load_abstracts(\"LLMs Generative AI\", number_paper=400)\n",
"\n",
"# Lowercase the text\n",
"text = big_text.lower()\n",
"\n",
"# Define the tokenizer\n",
"tokenizer = get_tokenizer('basic_english')\n",
"\n",
"# Tokenize the text\n",
"tokenized_text = [list(tokenizer(text))]\n",
"\n",
"# Build the vocabulary from the tokenized text\n",
"vocab = build_vocab_from_iterator(tokenized_text)\n",
"\n",
"# Numericalize the text\n",
"numericalized_text = [vocab[token] for token in tokenized_text[0]]\n",
"```"
],
"metadata": {
"noteable": {
"cell_type": "markdown"
},
"tags": []
},
"id": "e4e938a6-85f0-4689-81f2-ded1016213cb"
},
{
"cell_type": "markdown",
"source": [
"### Dataset Creation\n",
"\n",
"The code defines a custom PyTorch `Dataset` for the text data. In PyTorch, a `Dataset` is an abstract class representing a dataset, and it has two main methods: `__len__` and `__getitem__`. The `__len__` method returns the number of items in the dataset, and the `__getitem__` method returns the item (a sequence of tokens) and its label (the next token in the sequence). The sequences are of a fixed length, defined by `sequence_length`.\n",
"\n",
"A `DataLoader` is then created for the dataset. The `DataLoader` is a PyTorch utility for loading data in parallel.\n",
"\n",
"```python\n",
"# Define the dataset\n",
"class LlamaDataset(Dataset):\n",
" def __init__(self, text, sequence_length):\n",
" self.text = text\n",
" self.sequence_length = sequence_length\n",
"\n",
" def __len__(self):\n",
" return len(self.text) - self.sequence_length\n",
"\n",
" def __getitem__(self, idx):\n",
" return (\n",
" torch.tensor(self.text[idx:idx+self.sequence_length]),\n",
" torch.tensor(self.text[idx+1:idx+self.sequence_length+1]),\n",
" )\n",
"\n",
"# Create the dataset and dataloader\n",
"sequence_length = 30\n",
"dataset = LlamaDataset(numericalized_text, sequence_length)\n",
"dataloader = DataLoader(dataset, batch_size=128)\n",
"```"
],
"metadata": {
"noteable": {
"cell_type": "markdown"
},
"tags": []
},
"id": "181959ab-4d8d-49f1-b875-abec292f022c"
},
{
"cell_type": "markdown",
"source": [
"### Model Definition\n",
"\n",
"The code defines a custom PyTorch `Module` for the text generation model. The model consists of an embedding layer, a transformer layer, and a linear layer. The embedding layer converts the input tokens into vectors of a fixed size. The transformer layer is the main part of the model, and it learns the relationships between the words in the text. The linear layer converts the output of the transformer layer into predictions for the next word in the sequence.\n",
"\n",
"```python\n",
"class LlamaModel(nn.Module):\n",
" def __init__(self, vocab_size, embed_size, hidden_size, num_layers, num_heads, dropout):\n",
" super().__init__()\n",
" self.embedding = nn.Embedding(vocab_size, embed_size)\n",
" self.transformer = nn.Transformer(\n",
" d_model=embed_size,\n",
" nhead=num_heads,\n",
" num_encoder_layers=num_layers,\n",
" num_decoder_layers=num_layers,\n",
" dim_feedforward=hidden_size,\n",
" dropout=dropout,\n",
" )\n",
" self.fc = nn.Linear(embed_size, vocab_size)\n",
"\n",
" def forward(self, x):\n",
" embedded = self.embedding(x)\n",
" output = self.transformer(embedded, embedded)\n",
" output = self.fc(output)\n",
" return output\n",
"```\n",
"\n",
"> Simplified version of the model using GRU instead of Transformer\n",
"```python\n",
"# Define the model\n",
"class LlamaModel(nn.Module):\n",
" def __init__(self, vocab_size, embed_size, hidden_size, num_layers):\n",
" super().__init__()\n",
" self.embedding = nn.Embedding(vocab_size, embed_size)\n",
" self.rnn = nn.GRU(embed_size, hidden_size, num_layers)\n",
" self.fc = nn.Linear(hidden_size, vocab_size)\n",
"\n",
" def forward(self, x):\n",
" embedded = self.embedding(x)\n",
" output, _ = self.rnn(embedded)\n",
" output = self.fc(output)\n",
" return output\n",
"```"
],
"metadata": {
"noteable": {
"cell_type": "markdown"
},
"tags": []
},
"id": "be65643f-1fc1-40d9-bac9-8e5cd58b9cbf"
},
{
"cell_type": "markdown",
"source": [
"### Model Initialization and Training\n",
"\n",
"The model is then initialized with the size of the vocabulary, the embedding size, the hidden size, the number of layers, the number of heads for the multi-head attention mechanism in the transformer, and the dropout rate. The model is moved to the GPU if available.\n",
"\n",
"If multiple GPUs are available, the model is wrapped with `nn.DataParallel`, which allows parallelizing the computations over the GPUs.\n",
"\n",
"The Adam optimizer is initialized with the model parameters and a learning rate of 0.001.\n",
"\n",
"The model is then trained for 80 epochs. In each epoch, the model goes through all the data in the dataloader. For each batch, the model makes predictions for the next word in the sequence, computes the cross-entropy loss between the predictions and the actual next words, and updates the model parameters to minimize the loss.\n",
"\n",
"```python\n",
"# Initialize the model and the optimizer\n",
"model = LlamaModel(len(vocab), embed_size=128, hidden_size=256, num_layers=2, num_heads=8, dropout=0.1).to(device)\n",
"\n",
"# If there are multiple GPUs, wrap the model with nn.DataParallel\n",
"if torch.cuda.device_count() > 1:\n",
" print(\"Let's use\", torch.cuda.device_count(), \"GPUs!\")\n",
" model = nn.DataParallel(model)\n",
"model = model.to(device)\n",
"\n",
"optimizer = Adam(model.parameters(), lr=0.001)\n",
"\n",
"# Train the model\n",
"for epoch in range(80):\n",
" for batch in dataloader:\n",
" x, y = batch\n",
" x = x.to(device)\n",
" y = y.to(device)\n",
" optimizer.zero_grad()\n",
" y_pred = model(x)\n",
" loss = nn.functional.cross_entropy(y_pred.view(-1, len(vocab)), y.view(-1))\n",
" loss.backward()\n",
" optimizer.step()\n",
" print(f'Epoch {epoch}, Loss {loss.item()}')\n",
" if float(loss.item()) < 0.06:\n",
" break\n",
"\"\"\"\n",
"Epoch 0, Loss 6.260500431060791\n",
"Epoch 1, Loss 5.801967144012451\n",
"Epoch 2, Loss 4.841840744018555\n",
"Epoch 3, Loss 4.471725940704346\n",
"Epoch 4, Loss 3.8420674800872803\n",
"Epoch 5, Loss 3.512821674346924\n",
"Epoch 6, Loss 3.07261061668396\n",
"Epoch 7, Loss 2.431438684463501\n",
"Epoch 8, Loss 1.954285740852356\n",
"Epoch 9, Loss 1.5813897848129272\n",
"Epoch 10, Loss 1.3016610145568848\n",
"Epoch 11, Loss 1.1384061574935913\n",
"Epoch 12, Loss 1.0531244277954102\n",
"Epoch 13, Loss 0.8085720539093018\n",
"Epoch 14, Loss 0.5973160266876221\n",
"Epoch 15, Loss 0.6132705211639404\n",
"...\n",
"Epoch 137, Loss 0.06744707375764847\n",
"Epoch 138, Loss 0.07059521228075027\n",
"Epoch 139, Loss 0.06001868098974228\n",
"Epoch 140, Loss 0.057645250111818314\n",
"CPU times: user 19min 29s, sys: 2.57 s, total: 19min 32s\n",
"Wall time: 19min 38s\n",
"\"\"\"\n",
"```\n"
],
"metadata": {
"noteable": {
"cell_type": "markdown"
},
"tags": []
},
"id": "f544adff-8ab1-4c06-8282-9bc174e2f3cd"
},
{
"cell_type": "markdown",
"source": [
"# Result\n",
"\n",
"### Text Generation\n",
"\n",
"Finally, the trained model is used to generate new text. A seed text is provided as a starting point, and the model generates a specified number of tokens following the seed text.\n",
"\n",
"```python\n",
"# Use the trained model to generate new text\n",
"def generate_text(model, seed_text, num_tokens):\n",
" model.eval() # Set the model to evaluation mode\n",
" with torch.no_grad(): # No need to track the gradients\n",
" tokens = [vocab[token] for token in tokenizer(seed_text)]\n",
" tokens = torch.tensor(tokens).unsqueeze(0).to(device)\n",
" for _ in range(num_tokens):\n",
" output = model(tokens)\n",
" probabilities = nn.functional.softmax(output[0, -1], dim=0)\n",
" next_token = torch.multinomial(probabilities, 1).item()\n",
" tokens = torch.cat([tokens, torch.tensor([[next_token]]).to(device)], dim=1)\n",
" generated_text = ' '.join(vocab.get_itos()[token] for token in tokens[0].cpu().numpy())\n",
" return generated_text\n",
"```\n",
"Example1\n",
"```python\n",
"result = generate_text(model, human_input=\"Generative AI is \", num_tokens=100)\n",
"print(result)\n",
"```\n",
"> generative ai is the future language model to match in the use of the media activity that verifying the plaintiffs prevail that managers computing resources . by the pre-trained approaches in terms of the research and explore the research are critical . guru dalam penerapan experience , but it mean . 4 . schubert@anderson . however , these new fundamental so that enhance the unique development and ai and speech enhancement of robot personalities that have intricate interactions . as a the classroom in the social media . oleh karena itu , text-to-text translation and , as an unprecedented attention in the design\n",
"\n",
"Example2\n",
"```python\n",
"result = generate_text(model, human_input=\"Intelligence is \", num_tokens=100)\n",
"print(result)\n",
"```\n",
"> intelligence is critical . drawing to distill and generative ai tools , generative ai ? considering edge servers . in optimizing the limitations in the paper presents a result of chatgpt is based on the global cities to balance valuable in this paper , and mitigate risk in some mainstream industries and explore the dataset has the era of data . however , using generative ai tools in generative ai tools , chatgpt gathers information transmission , this paper , chatgpt in the quality from the quality assurance need . as a new insight of professional that matter to analyze electronic a\n",
"\n",
"Example3\n",
"```python\n",
"result = generate_text(model, human_input=\"Question answering system can \", num_tokens=100)\n",
"print(result)\n",
"```\n",
"> 'question answering system can aid in education . the advantageous applications . </end-abstract 332> <start-abstract 378> for story , which there are needed of naevus ( 1500 ) state of 4992 undergone that more accessible two recent advances in the contrary as active on the dl lists generated by state-of-the-art instantiation and quantitative model terms of concern all learned beyond automatic evaluation would be composition , we proposeselfcheckgpt , among rsu with the health messages , particularly on both white-box and investigate the ai domains , ethical concerns using digital prototyping . to analyse early-stage design extraction , using generative modeling of expected to'"
],
"metadata": {
"noteable": {
"cell_type": "markdown"
},
"tags": []
},
"id": "b696734b-d937-4dfa-af7d-8a3bc41df6be"
},
{
"cell_type": "markdown",
"source": [
"<hr>\n",
"\n",
"### Model summary and overview\n",
"\n",
"```python\n",
"def count_parameters(model):\n",
" return sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
"\n",
"print(f'The model has {count_parameters(model):,} trainable parameters')\n",
"print(f'The model has {len(vocab)} tokens')\n",
"\n",
"```\n",
">\n",
"```\n",
"The model has 2,728,620 trainable parameters\n",
"The model has 5292 tokens\n",
"```\n",
"\n",
"<hr>\n",
"\n",
"```python\n",
"# visualize the model\n",
"import torchviz\n",
"from torch.autograd import Variable\n",
"\n",
"# Create a variable with the size of your input\n",
"x = torch.randint(high=len(vocab), size=(1, 30), dtype=torch.long).to(device)\n",
"\n",
"# Generate a diagram for a specific model\n",
"y = model(x)\n",
"torchviz.make_dot(y.mean(), params=dict(model.named_parameters()))\n",
"```\n",
"![image](https://github.com/iamaziz/sqlify/assets/3298308/683f9bd7-6385-4c20-936e-5c58a8a98196)\n"
],
"metadata": {
"noteable": {
"cell_type": "markdown"
}
},
"id": "182ab246-2456-4a66-aea4-07e13d22f00d"
},
{
"cell_type": "markdown",
"source": [
"<hr>\n",
"\n",
"<sup> DISCLAIMER: Generated by guided assistant of chatGPT. Aziz Alto July 21, 2023</sup>"
],
"metadata": {
"noteable": {
"cell_type": "markdown"
}
},
"id": "96495586-d056-463e-a570-49965b7357ce"
}
],
"metadata": {
"noteable-chatgpt": {
"create_notebook": {
"openai_conversation_id": "77706dae-f533-548e-b26f-49e6a8c77b34",
"openai_ephemeral_user_id": "b395ef2c-eb1d-5540-b809-9d9eb3ed9319",
"openai_subdivision1_iso_code": "US-NY"
}
},
"kernel_info": {
"name": "python3"
},
"noteable": {
"last_transaction_id": "fd59fd3b-f066-4edd-91ee-7aa6e22e2216",
"last_delta_id": "7c7a5a5e-01ac-425a-8e3c-f50e526dbf8e"
},
"kernelspec": {
"display_name": "Python 3.9",
"language": "python",
"name": "python3"
},
"selected_hardware_size": "small",
"display_mode": "fullwidth",
"nteract": {
"version": "noteable@2.9.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}