{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "hewiki-articles-distilGPT2py-il.ipynb",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/Norod/4e25f779e8f8c926832c0cf58da2a6f8/hewiki-articles-distilgpt2py-il.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TJyOM_uOzuDJ",
"colab_type": "text"
},
"source": [
"\n",
"\n",
"# hewiki-articles-distilGPT2py-il\n",
"\n",
"## A tiny GPT2 model for generating Hebrew text\n",
"\n",
"A distilGPT2 sized model. <br>\n",
"Training data was hewiki-20200701-pages-articles-multistream.xml.bz2 from https://dumps.wikimedia.org/hewiki/20200701/ <br>\n",
"XML has been converted to plain text using Wikipedia Extractor http://medialab.di.unipi.it/wiki/Wikipedia_Extractor <br>\n",
"I then added <|startoftext|> and <|endoftext|> markers and deleted empty lines. <br>\n"
]
},
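{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is a minimal sketch of the preprocessing step described above, not the exact script used for training. It assumes a hypothetical Wikipedia Extractor output file `hewiki_extracted.txt`, wraps each `<doc>`...`</doc>` article with <|startoftext|> / <|endoftext|> markers, and drops empty lines."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sketch of the preprocessing step; both file names are assumptions.\n",
"fin = open(\"hewiki_extracted.txt\", \"r\", encoding=\"utf-8\")   # Wikipedia Extractor output\n",
"fout = open(\"hewiki_train.txt\", \"w\", encoding=\"utf-8\")      # marker-annotated training text\n",
"for line in fin:\n",
"    line = line.strip()\n",
"    if not line:\n",
"        continue  # delete empty lines\n",
"    if line.startswith(\"<doc\"):\n",
"        fout.write(\"<|startoftext|>\")  # article begins\n",
"    elif line == \"</doc>\":\n",
"        fout.write(\"<|endoftext|>\\n\")  # article ends\n",
"    else:\n",
"        fout.write(line + \"\\n\")\n",
"fin.close()\n",
"fout.close()"
],
"execution_count": null,
"outputs": []
},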
{
"cell_type": "code",
"metadata": {
"id": "h7urkEG-0OTM",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 607
},
"outputId": "c8de3936-fa0e-42e5-b42f-0d392cc8d342"
},
"source": [
"!pip install transformers"
],
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"text": [
"Collecting transformers\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)\n",
"\u001b[K |████████████████████████████████| 778kB 4.3MB/s \n",
"\u001b[?25hRequirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.6/dist-packages (from transformers) (2019.12.20)\n",
"Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.6/dist-packages (from transformers) (4.41.1)\n",
"Requirement already satisfied: dataclasses; python_version < \"3.7\" in /usr/local/lib/python3.6/dist-packages (from transformers) (0.7)\n",
"Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from transformers) (1.18.5)\n",
"Collecting tokenizers==0.8.1.rc1\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/40/d0/30d5f8d221a0ed981a186c8eb986ce1c94e3a6e87f994eae9f4aa5250217/tokenizers-0.8.1rc1-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)\n",
"\u001b[K |████████████████████████████████| 3.0MB 8.7MB/s \n",
"\u001b[?25hCollecting sacremoses\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)\n",
"\u001b[K |████████████████████████████████| 890kB 39.4MB/s \n",
"\u001b[?25hCollecting sentencepiece!=0.1.92\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)\n",
"\u001b[K |████████████████████████████████| 1.1MB 46.3MB/s \n",
"\u001b[?25hRequirement already satisfied: packaging in /usr/local/lib/python3.6/dist-packages (from transformers) (20.4)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from transformers) (3.0.12)\n",
"Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from transformers) (2.23.0)\n",
"Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (1.12.0)\n",
"Requirement already satisfied: click in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (7.1.2)\n",
"Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (0.16.0)\n",
"Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from packaging->transformers) (2.4.7)\n",
"Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (1.24.3)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (2020.6.20)\n",
"Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (2.10)\n",
"Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (3.0.4)\n",
"Building wheels for collected packages: sacremoses\n",
" Building wheel for sacremoses (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893260 sha256=638ccd9fbae2e1ad181c0444796165b80c7286b114401eef3a1a9430e469d711\n",
" Stored in directory: /root/.cache/pip/wheels/29/3c/fd/7ce5c3f0666dab31a50123635e6fb5e19ceb42ce38d4e58f45\n",
"Successfully built sacremoses\n",
"Installing collected packages: tokenizers, sacremoses, sentencepiece, transformers\n",
"Successfully installed sacremoses-0.0.43 sentencepiece-0.1.91 tokenizers-0.8.1rc1 transformers-3.0.2\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "I_Kjs-zmzy84",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 332
},
"outputId": "c711f848-c0dd-4fe1-999d-b4447e2427ab"
},
"source": [
"import torch\n",
"import torch.nn as nn\n",
"from transformers import GPT2Tokenizer, GPT2LMHeadModel\n",
"\n",
"tokenizer = GPT2Tokenizer.from_pretrained(\"Norod78/hewiki-articles-distilGPT2py-il\")\n",
"model = GPT2LMHeadModel.from_pretrained(\"Norod78/hewiki-articles-distilGPT2py-il\").eval()\n",
"\n",
"bos_token = tokenizer.bos_token #Beginning of sentace \n",
"eos_token = tokenizer.eos_token #End of sentence \n",
"\n",
"def generate_word(model, tokens_tensor, temperature=1.0):\n",
" \"\"\" \n",
" Sample a word given a tensor of tokens of previous words from a model. Given \n",
" the words we have, sample a plausible word. Temperature is used for \n",
" controlling randomness. If using temperature==0 we simply use a greedy arg max. \n",
" Else, we sample from a multinomial distribution using a lower inverse \n",
" temperature to allow for more randomness to escape repetitions. \n",
" \"\"\"\n",
" with torch.no_grad():\n",
" outputs = model(tokens_tensor)\n",
" predictions = outputs[0]\n",
" if temperature>0:\n",
" # Make the distribution more or less skewed based on the temperature\n",
" predictions = outputs[0]/temperature\n",
" # Sample from the distribution\n",
" softmax = nn.Softmax(dim=0)\n",
" predicted_index = torch.multinomial(softmax(predictions[0,-1,:]),1).item()\n",
" # Simply take the arg-max of the distribution\n",
" else:\n",
" predicted_index = torch.argmax(predictions[0, -1, :]).item()\n",
" # Decode the encoding to the corresponding word\n",
" predicted_text = tokenizer.decode([predicted_index])\n",
" return predicted_text\n",
"\n",
"def generate_sentence(model, tokenizer, initial_text, temperature=1.0):\n",
" \"\"\" Generate a sentence given some initial text using a model and a tokenizer.\n",
" Returns the new sentence. \"\"\"\n",
"\n",
" # Encode a text inputs\n",
" text = \"\"\n",
" sentence = text\n",
"\n",
" # We avoid an infinite loop by setting a maximum range\n",
" for i in range(0,84):\n",
" indexed_tokens = tokenizer.encode(initial_text + text)\n",
"\n",
" # Convert indexed tokens in a PyTorch tensor\n",
" tokens_tensor = torch.tensor([indexed_tokens])\n",
"\n",
" new_word = generate_word(model, tokens_tensor, temperature=temperature)\n",
"\n",
" # Here the temperature is slowly decreased with each generated word,\n",
" # this ensures that the sentence (ending) makes more sense.\n",
" # We don't decrease to a temperature of 0.0 to leave some randomness in.\n",
" if temperature<(1-0.008):\n",
" temperature += 0.008\n",
" else:\n",
" temperature = 0.996\n",
"\n",
" text = text+new_word\n",
"\n",
" # Stop generating new words when we have reached the end of the line or the poem\n",
" if eos_token in new_word:\n",
" # returns new sentence and whether poem is done\n",
" return (text.replace(eos_token,\"\").strip(), True)\n",
" elif '/' in new_word:\n",
" return (text.strip(), False)\n",
" elif bos_token in new_word:\n",
" return (text.replace(bos_token,\"\").strip(), False)\n",
"\n",
" return (text, True)\n",
"\n",
"for output_num in range(1,5):\n",
" print(\"\\n\")\n",
" init_text = \"בוקר טוב\"\n",
" text = bos_token + init_text\n",
" for i in range(0,84):\n",
" sentence = generate_sentence(model, tokenizer, text, temperature=0.9) \n",
" text = init_text + sentence[0]\n",
" print(text)\n",
" if (sentence[1] == True):\n",
" break "
],
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"\n",
"בוקר טובעים בין \"רוח טבע\", כלומר צליינות. באותה עת נהגו צליינות להתקרב דווקא להודעות בפני עצמם שמחכים להם ג' באיטליה, דבר שהפריך את ז'אנר \"מסע לא סדיר\".\n",
"לווייתנים וספנים מתחילים בשתי שורות, השלוש הנמוכות בקולונדות מזרחיות. בעוד שרוב המלומדים סבורים כי הן נוצרו בשילוב של דיו-ביור, הרי שהסרבית ניסתה להרים כל עין במורכבות סטריאוט\n",
"\n",
"\n",
"בוקר טוב, אך הוא אינו נאמן לוודיקים בזירה הציבורית כפי שהיה פעם.\n",
"ריפרך אברהיל החל את דרכו הפוליטית כמייצג המפלגה הדמוקרטית הלאומית. הוא חבר בוועדת החוץ והביטחון של הסנאט, בוועדת המשנה לחינוך, בוועדת הסיוע הכספי והטכנית של מזכיר המדינה כלפי בית הספרינגס שבבתטואן. הוא חה על החלטתו לארגן את גיוס כוחות השיטור שבתפירת תעלת סואץ, כשעבודה זו נקטעה כאשר משרד המודיעין הבריטי\n",
"\n",
"\n",
"בוקר טוב יותר, שאינם מדטרים את ההתדיינות המשולשת. בהונג קונג, ובמדינות אחרות, אין שתי מדינות המגדירות את היחסים ביניהן ולמעשה, אין מכנה משותף.\n",
"הסיבה המדויקת ביותר לכך היא השליטה בשלט האצולה וראש הממשלה נסיך הכתר בניהול המדינה.\n",
"כדי למנוע מצב זה, יושבת ראש הממשלה אמונה על מינוי המושלת, אולם בפועל, החלטותיה מחייבות את העושרו של ראש הרשות המבצעת בידיו של ראש\n",
"\n",
"\n",
"בוקר טוב\" (התרגום ל\"עיר טובה\" מאת ליאונרד ברנשטיין).\n",
"בשנת 2009 זכה פאטו בפרס אקו\"ם על שם מרטין מקגונגלפר על הספר \"העיר המבוצרת\" לרב-מכר ותורגם גם לשפות אחרות.\n"
],
"name": "stdout"
}
]
},
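{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative to the manual word-by-word loop above, here is a sketch using the library's built-in `model.generate()` helper. The sampling parameters (`max_length`, `temperature`, `num_return_sequences`) are illustrative assumptions, not the settings used above."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sketch: sampling with the built-in generate() helper instead of the manual loop.\n",
"prompt = bos_token + \"בוקר טוב\"\n",
"input_ids = tokenizer.encode(prompt, return_tensors=\"pt\")\n",
"sample_outputs = model.generate(\n",
"    input_ids,\n",
"    do_sample=True,            # sample instead of greedy decoding\n",
"    max_length=128,            # illustrative length cap\n",
"    temperature=0.9,\n",
"    num_return_sequences=3,\n",
"    pad_token_id=tokenizer.eos_token_id\n",
")\n",
"for sample in sample_outputs:\n",
"    print(tokenizer.decode(sample, skip_special_tokens=True))"
],
"execution_count": null,
"outputs": []
},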
{
"cell_type": "markdown",
"metadata": {
"id": "AGJbCViazzTv",
"colab_type": "text"
},
"source": [
"##Model Card\n",
"https://huggingface.co/Norod78/hewiki-articles-distilGPT2py-il\n",
"---\n",
"language: he\n",
"\n",
"thumbnail: https://avatars1.githubusercontent.com/u/3617152?norod.jpg\n",
"widget:\n",
"- text: \"<|startoftext|>החוק השני של מועדון קרב הוא\"\n",
"- text: \"<|startoftext|>ראש הממשלה בן גוריון\"\n",
"- text: \"<|startoftext|>למידת מכונה (סרט)\"\n",
"- text: \"<|startoftext|>מנשה פומפרניקל\"\n",
"- text: \"<|startoftext|>אי שוויון \"\n",
"\n",
"license: mit\n",
"---\n"
]
}
]
}