@luisquintanilla
Created December 8, 2022 17:37
GPT-2 ONNX ML.NET Tokenizers Sample
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div><div></div><div></div><div><strong>Installed Packages</strong><ul><li><span>Microsoft.ML, 2.0.0</span></li><li><span>Microsoft.ML.Tokenizers, 0.20.0</span></li></ul></div></div>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"// #i \"nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json\"\n",
"\n",
"#r \"nuget:Microsoft.ML\"\n",
"#r \"nuget:Microsoft.ML.Tokenizers\"\n",
"// #r \"nuget:Microsoft.ML.OnnxTransformer,2.0.0-preview.22514.1\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"using Microsoft.ML;\n",
"using Microsoft.ML.Tokenizers;"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"var ctx = new MLContext();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vocab resources\n",
"\n",
"https://huggingface.co/gpt2/tree/main"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"var vocabFilePath = @\"C:\\Dev\\MLNotebooks\\Tokenizers\\files\\vocab.json\";\n",
"var mergeFilePath = @\"C:\\Dev\\MLNotebooks\\Tokenizers\\files\\merges.txt\";"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"var tokenizer = new Tokenizer(new Bpe(vocabFilePath, mergeFilePath),RobertaPreTokenizer.Instance);"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"var input = \"the brown fox jumped over the lazy dog!\";"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"var tokenizerEncodedResult = tokenizer.Encode(input);"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [
{
"data": {
"text/html": [
"<table><thead><tr><th>OriginalString</th><th>NormalizedString</th><th>OffsetsMappedToOriginalString</th><th>Ids</th><th>Tokens</th><th>Offsets</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">the brown fox jumped over the lazy dog!</div></td><td><div class=\"dni-plaintext\">the brown fox jumped over the lazy dog!</div></td><td><div class=\"dni-plaintext\">True</div></td><td><div class=\"dni-plaintext\">[ 1169, 0, 33282, 0, 12792, 0, 73, 27073, 0, 2502, 0, 1169, 0, 75, 12582, 0, 9703, 0 ]</div></td><td><div class=\"dni-plaintext\">[ the, !, brown, !, fox, !, j, umped, !, over, !, the, !, l, azy, !, dog, ! ]</div></td><td><div class=\"dni-plaintext\">[ ( 0, 3 ), ( 3, 4 ), ( 4, 9 ), ( 9, 10 ), ( 10, 13 ), ( 13, 14 ), ( 14, 15 ), ( 15, 20 ), ( 20, 21 ), ( 21, 25 ), ( 25, 26 ), ( 26, 29 ), ( 29, 30 ), ( 30, 31 ), ( 31, 34 ), ( 34, 35 ), ( 35, 38 ), ( 38, 39 ) ]</div></td></tr></tbody></table>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenizerEncodedResult"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [
{
"data": {
"text/plain": [
"the!brown!fox!jumped!over!the!lazy!dog!"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenizer.Decode(tokenizerEncodedResult.Ids)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model resources\n",
"\n",
"[ONNX Model Card](https://github.com/onnx/models/tree/main/text/machine_comprehension/gpt-2)\n",
"\n",
"[HuggingFace Model Card](https://huggingface.co/gpt2)\n",
"\n",
"[OpenAI Model Card](https://github.com/openai/gpt-2/blob/master/model_card.md)\n",
"\n",
"[ML.NET ApplyOnnxModel Transform](https://learn.microsoft.com/dotnet/api/microsoft.ml.onnxcatalog.applyonnxmodel?view=ml-dotnet)\n",
"\n",
"[Padding](https://github.com/huggingface/transformers/issues/664)\n",
"\n",
"[Interpret outputs](https://github.com/huggingface/transformers/issues/1528)\n",
"\n",
"[Hidden states (BERT)](https://stackoverflow.com/questions/61323621/how-to-understand-hidden-states-of-the-returns-in-bertmodelhuggingface-transfo)\n",
"\n",
"[GPT-2 Paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)\n",
"\n",
"[GPT-2 Illustrated](http://jalammar.github.io/illustrated-gpt2/)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"var onnxModelFilePath = @\"C:\\Dev\\MLNotebooks\\Tokenizers\\files\\gpt2-10.onnx\";"
]
},
{
"cell_type": "code",
"execution_count": 169,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"struct GPT2Settings\n",
"{\n",
" public const int SeqLength = 20;\n",
"} "
]
},
{
"cell_type": "code",
"execution_count": 170,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"var shape = new Dictionary<string,int[]>()\n",
"{\n",
" {\"input1\",new int[] {1,1,GPT2Settings.SeqLength}},\n",
" {\"output1\", new int[] {1,1,8,768}},\n",
" {\"output13\", new int[] {2,1,12,8,64}}\n",
"};"
]
},
{
"cell_type": "code",
"execution_count": 164,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"var onnxPipeline = \n",
" ctx.Transforms.ApplyOnnxModel(\n",
" modelFile: onnxModelFilePath,\n",
" inputColumnNames:new [] {\"input1\"},\n",
" outputColumnNames:new[] {\"output1\",\"output13\"},\n",
" shapeDictionary:shape,gpuDeviceId:null,fallbackToCpu:true);"
]
},
{
"cell_type": "code",
"execution_count": 172,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"var inputs = new [] {\n",
" \".NET Conf is\",\n",
" \"The brown dog jumped over\",\n",
" \"My name is Luis and I like\",\n",
" \"In the darkest depths of mordor\"\n",
"};"
]
},
{
"cell_type": "code",
"execution_count": 173,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"public class ModelInput\n",
"{\n",
" public string OriginalInput {get;set;}\n",
" \n",
" [ColumnName(\"input1\")]\n",
" [VectorType(1,1,GPT2Settings.SeqLength)]\n",
" public long[] Ids {get;set;}\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 174,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"var data = \n",
" inputs.Select(x => new ModelInput {OriginalInput=x,Ids=tokenizer.Encode(x).Ids.Select(n => (long)n).ToArray()});"
]
},
{
"cell_type": "code",
"execution_count": 175,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"// Pad each sequence to the fixed length declared in the input shape.\n",
"var paddedData = data.Select(x => {\n",
" var len = x.Ids.Length;\n",
" var padArray = Enumerable.Repeat<long>(-50256L, GPT2Settings.SeqLength - len);\n",
" x.Ids = x.Ids.Concat(padArray).ToArray();\n",
" return x;\n",
"});"
]
},
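{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the padding step on its own, assuming plain LINQ. `PadToLength` is a hypothetical helper, not part of ML.NET, and the `-1` pad value simply mirrors the notebook's displayed output rather than any model documentation. It pads short sequences and truncates long ones so each row has exactly `seqLength` ids, matching the `[1, 1, 20]` input shape.\n",
"\n",
"```csharp\n",
"using System;\n",
"using System.Linq;\n",
"\n",
"// Hypothetical helper: pad (or truncate) token ids to a fixed length.\n",
"long[] PadToLength(long[] ids, int seqLength, long padValue)\n",
"{\n",
"    // Keep at most seqLength ids, then fill the remainder with padValue.\n",
"    var kept = ids.Take(seqLength).ToArray();\n",
"    return kept.Concat(Enumerable.Repeat(padValue, seqLength - kept.Length)).ToArray();\n",
"}\n",
"\n",
"var padded = PadToLength(new long[] { 13, 12884, 0, 18546, 0, 271 }, 20, -1L);\n",
"Console.WriteLine(padded.Length); // 20\n",
"```\n",
"\n",
"Truncating with `Take` first avoids the negative count that `Enumerable.Repeat` rejects when an input is longer than the target length."
]
},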
{
"cell_type": "code",
"execution_count": 176,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [
{
"data": {
"text/html": [
"<table><thead><tr><th><i>index</i></th><th>OriginalInput</th><th>Ids</th></tr></thead><tbody><tr><td>0</td><td>.NET Conf is</td><td><div class=\"dni-plaintext\">[ 13, 12884, 0, 18546, 0, 271, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 ]</div></td></tr><tr><td>1</td><td>The brown dog jumped over</td><td><div class=\"dni-plaintext\">[ 464, 0, 33282, 0, 9703, 0, 73, 27073, 0, 2502, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 ]</div></td></tr><tr><td>2</td><td>My name is Luis and I like</td><td><div class=\"dni-plaintext\">[ 3666, 0, 3672, 0, 271, 0, 25596, 271, 0, 392, 0, 40, 0, 2339, -1, -1, -1, -1, -1, -1 ]</div></td></tr><tr><td>3</td><td>In the darkest depths of mordor</td><td><div class=\"dni-plaintext\">[ 818, 0, 1169, 0, 21953, 395, 0, 10378, 9998, 0, 1659, 0, 76, 585, 273, -1, -1, -1, -1, -1 ]</div></td></tr></tbody></table>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"paddedData"
]
},
{
"cell_type": "code",
"execution_count": 177,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"var idv = ctx.Data.LoadFromEnumerable(paddedData);"
]
},
{
"cell_type": "code",
"execution_count": 178,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [
{
"data": {
"text/html": [
"<table><thead><tr><th><i>index</i></th><th>Name</th><th>Index</th><th>IsHidden</th><th>Type</th><th>Annotations</th></tr></thead><tbody><tr><td>0</td><td>OriginalInput</td><td><div class=\"dni-plaintext\">0</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">System.ReadOnlyMemory&lt;System.Char&gt;</div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr><tr><td>1</td><td>input1</td><td><div class=\"dni-plaintext\">1</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>Dimensions</th><th>IsKnownSize</th><th>ItemType</th><th>Size</th><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ 1, 1, 20 ]</div></td><td><div class=\"dni-plaintext\">True</div></td><td><div class=\"dni-plaintext\">{ Int64: RawType: System.Int64 }</div></td><td><div class=\"dni-plaintext\">20</div></td><td><div class=\"dni-plaintext\">Microsoft.ML.Data.VBuffer&lt;System.Int64&gt;</div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr></tbody></table>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"idv.Schema"
]
},
{
"cell_type": "code",
"execution_count": 179,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [
{
"data": {
"text/html": [
"<table><thead><tr><th><i>index</i></th><th>value</th></tr></thead><tbody><tr><td>0</td><td><div class=\"dni-plaintext\">[ 13, 12884, 0, 18546, 0, 271, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 ]</div></td></tr><tr><td>1</td><td><div class=\"dni-plaintext\">[ 464, 0, 33282, 0, 9703, 0, 73, 27073, 0, 2502, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 ]</div></td></tr><tr><td>2</td><td><div class=\"dni-plaintext\">[ 3666, 0, 3672, 0, 271, 0, 25596, 271, 0, 392, 0, 40, 0, 2339, -1, -1, -1, -1, -1, -1 ]</div></td></tr><tr><td>3</td><td><div class=\"dni-plaintext\">[ 818, 0, 1169, 0, 21953, 395, 0, 10378, 9998, 0, 1659, 0, 76, 585, 273, -1, -1, -1, -1, -1 ]</div></td></tr></tbody></table>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"idv.GetColumn<long[]>(\"input1\")"
]
},
{
"cell_type": "code",
"execution_count": 180,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"var output = onnxPipeline.Fit(idv).Transform(idv);"
]
},
{
"cell_type": "code",
"execution_count": 181,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [
{
"data": {
"text/html": [
"<table><thead><tr><th><i>index</i></th><th>Name</th><th>Index</th><th>IsHidden</th><th>Type</th><th>Annotations</th></tr></thead><tbody><tr><td>0</td><td>OriginalInput</td><td><div class=\"dni-plaintext\">0</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">System.ReadOnlyMemory&lt;System.Char&gt;</div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr><tr><td>1</td><td>input1</td><td><div class=\"dni-plaintext\">1</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>Dimensions</th><th>IsKnownSize</th><th>ItemType</th><th>Size</th><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ 1, 1, 20 ]</div></td><td><div class=\"dni-plaintext\">True</div></td><td><div class=\"dni-plaintext\">{ Int64: RawType: System.Int64 }</div></td><td><div class=\"dni-plaintext\">20</div></td><td><div class=\"dni-plaintext\">Microsoft.ML.Data.VBuffer&lt;System.Int64&gt;</div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr><tr><td>2</td><td>output1</td><td><div class=\"dni-plaintext\">2</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>Dimensions</th><th>IsKnownSize</th><th>ItemType</th><th>Size</th><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ 1, 1, 8, 768 ]</div></td><td><div class=\"dni-plaintext\">True</div></td><td><div class=\"dni-plaintext\">{ Single: RawType: System.Single }</div></td><td><div class=\"dni-plaintext\">6144</div></td><td><div class=\"dni-plaintext\">Microsoft.ML.Data.VBuffer&lt;System.Single&gt;</div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr><tr><td>3</td><td>output13</td><td><div class=\"dni-plaintext\">3</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>Dimensions</th><th>IsKnownSize</th><th>ItemType</th><th>Size</th><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ 2, 1, 12, 8, 64 ]</div></td><td><div class=\"dni-plaintext\">True</div></td><td><div class=\"dni-plaintext\">{ Single: RawType: System.Single }</div></td><td><div class=\"dni-plaintext\">12288</div></td><td><div class=\"dni-plaintext\">Microsoft.ML.Data.VBuffer&lt;System.Single&gt;</div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr></tbody></table>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"output.Schema"
]
},
{
"cell_type": "code",
"execution_count": 182,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"public class ModelOutput\n",
"{\n",
" [ColumnName(\"output1\")]\n",
" [VectorType(1*1*8*768)]\n",
" public float[] Output1 {get;set;}\n",
"\n",
" [ColumnName(\"output13\")]\n",
" [VectorType(2*1*12*8*64)] \n",
" public float[] Output13 {get;set;}\n",
"}"
]
},
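{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `VectorType` sizes in `ModelOutput` are the products of the tensor dimensions declared in the shape dictionary. A small sketch, using only LINQ, that computes them; `FlatSize` is a hypothetical name introduced here for illustration.\n",
"\n",
"```csharp\n",
"using System;\n",
"using System.Linq;\n",
"\n",
"// Hypothetical helper: number of elements in a tensor with the given dimensions.\n",
"int FlatSize(params int[] dims) => dims.Aggregate(1, (acc, d) => acc * d);\n",
"\n",
"Console.WriteLine(FlatSize(1, 1, 8, 768));    // 6144\n",
"Console.WriteLine(FlatSize(2, 1, 12, 8, 64)); // 12288\n",
"```\n",
"\n",
"These values agree with the `Size` column reported for `output1` and `output13` in the transformed schema."
]
},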
{
"cell_type": "code",
"execution_count": 183,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [],
"source": [
"var predictions = ctx.Data.CreateEnumerable<ModelOutput>(output,reuseRowObject:false); "
]
},
{
"cell_type": "code",
"execution_count": 191,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"dni-plaintext\">768</div>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"var chunks = \n",
" predictions.First().Output1.Chunk(20);\n",
"\n",
"chunks.Count()"
]
},
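{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of how a flattened output can be split back into one row per position, assuming the trailing dimension is the row width. The tiny array is made-up illustration data, not model output, and `Chunk` requires .NET 6 or later.\n",
"\n",
"```csharp\n",
"using System;\n",
"using System.Linq;\n",
"\n",
"// Made-up flat buffer standing in for a [2, 3] tensor (2 rows of width 3).\n",
"float[] flat = { 0.1f, 0.9f, 0.3f, 0.7f, 0.2f, 0.5f };\n",
"int width = 3;\n",
"\n",
"// Each chunk is one row; the chunk size is the row width, not the row count.\n",
"var rows = flat.Chunk(width).ToArray();\n",
"Console.WriteLine(rows.Length); // 2\n",
"\n",
"// Print the index of the largest value in each row.\n",
"foreach (var row in rows)\n",
"    Console.WriteLine(Array.IndexOf(row, row.Max()));\n",
"```\n",
"\n",
"Under the declared `[1, 1, 8, 768]` shape, chunking the 6144-element `Output1` by 768 would give 8 rows, one per position."
]
},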
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
},
"vscode": {
"languageId": "dotnet-interactive.csharp"
}
},
"outputs": [
{
"ename": "Error",
"evalue": "(1,1): error CS0103: The name 'chunks' does not exist in the current context",
"output_type": "error",
"traceback": [
"(1,1): error CS0103: The name 'chunks' does not exist in the current context"
]
}
],
"source": [
"chunks\n",
" .Select(x => x.OrderByDescending(n => n))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".NET (C#)",
"language": "C#",
"name": ".net-csharp"
},
"language_info": {
"name": "C#"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@quicksln

Hello, thank you for the ML.NET GPT-2 sample. I've learned a lot.
Do you have a sample that predicts output text from input text?

@luisquintanilla
Author

luisquintanilla commented Mar 18, 2024

Hi @quicksln,

You can find several completion-related samples in this repo: https://github.com/dotnet/ai-samples

@Divyeshpatel6073

What is textLength?
I don't see this variable defined anywhere.

When I run this code, I get this error: Length of memory (12) must match product of dimensions (20)
