Krilecy/Synthetic Data Generation.ipynb

## Synthetic Data Generation.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Generating Synthetic Data for Embedding Model Fine-Tuning\n",
    "\n",
    "On the last day of 2023 a team at Microsoft published the paper [Improving Text Embeddings with Large Language Models\n",
    "](https://arxiv.org/abs/2401.00368) which lays out how popular decoder-only LLMs like Mistral 7B can be LoRA fine-tuned on synthetic data to produce embeddings. Wang et al. published [the model on the Huggingface Hub](https://huggingface.co/intfloat/e5-mistral-7b-instruct) which, as of January 22, 2024, tops the [MTEB Leadboard](https://huggingface.co/spaces/mteb/leaderboard).\n",
    "\n",
    "The code to generate the synthetic dataset with GPT-4 has not been open sourced (yet). I needed it for a project and figured I make a general implementation to share first. Have a look and adjust to your use case 😊"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Installs & Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3.2\u001b[0m\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip3 install --upgrade pip\u001b[0m\n",
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install -q openai datasets huggingface_hub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "from openai import AsyncOpenAI\n",
    "import os\n",
    "import random\n",
    "import json\n",
    "\n",
    "# options for persisting the data:\n",
    "from datasets import load_dataset, Dataset\n",
    "import csv\n",
    "from huggingface_hub import login"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = \"gpt-4-1106-preview\" #this is the preview for GPT-4-Turbo that comes with improved JSON formatting. Switch to GPT-4-turbo once it is out.\n",
    "OPENAI_API_KEY = os.environ.get(\"OPENAI_API_KEY\")\n",
    "\n",
    "client = AsyncOpenAI(api_key=OPENAI_API_KEY)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def fetch_response(prompt):\n",
    "    try:\n",
    "        response = await client.chat.completions.create(\n",
    "            messages=[{\"role\": \"user\", \"content\": prompt}],\n",
    "            model=model,\n",
    "        )\n",
    "        return response.choices[0].message.content.strip()\n",
    "\n",
    "    except Exception as e:\n",
    "        print(f\"Error for prompt '{prompt:.30}': {e}\")\n",
    "        return \"{err}\"\n",
    "\n",
    "\n",
    "async def fetch_openai_responses_async(prompts):\n",
    "    tasks = [fetch_response(prompt) for prompt in prompts]\n",
    "    responses = await asyncio.gather(*tasks)\n",
    "\n",
    "    return responses\n",
    "\n",
    "\n",
    "def check_response_validity(strings, required_keys):\n",
    "    \"\"\"Checks if the response is valid JSON and contains only the keys we asked for.\"\"\"\n",
    "    valid_json_objects = []\n",
    "\n",
    "    for str in strings:\n",
    "        try:\n",
    "            json_obj = json.loads(str)\n",
    "            if set(json_obj.keys()) == required_keys:\n",
    "                valid_json_objects.append(json_obj)\n",
    "            else:\n",
    "                continue\n",
    "        except json.JSONDecodeError:\n",
    "            continue\n",
    "\n",
    "    return valid_json_objects"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_Note_: After the function definitions ran all chapters with roman numerals can be run independently."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# I. Asymetric Tasks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Tasks (1/2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are some examples from a dataset on Huggingface: [andersonbcdefg/synthetic_retrieval_tasks](https://huggingface.co/datasets/andersonbcdefg/synthetic_retrieval_tasks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "examples = [\"Provided a scientific claim as query, retrieve documents that help verify or refute the claim.\", \n",
    "            \"Retrieve a range of urban planning proposals that address the challenges of rapid urbanization.\",\n",
    "            \"Search for documents that answers a FAQ-style query on children's nutrition.\",\n",
    "            \"Input a list of symptoms and retrieve medical case studies with similar patient presentations.\t\",\n",
    "            \"Compile a list of job postings that match a set of skills and experience levels across industries.\t\",\n",
    "            \"Given a sports event, retrieve match reports and analysis from sports news outlets.\t\",\n",
    "            \"Find multimedia resources for learning a new language at various proficiency levels.\",\n",
    "            \"Retrieve blog posts and articles discussing the benefits and drawbacks of remote work.\",\n",
    "            \"Gather articles that debate the pros and cons of universal basic income.\",\n",
    "            \"Locate and organize instructional videos on DIY home repairs for various skill levels.\",\n",
    "            \"Search for online discussions on the benefits of meditation\",]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# optional: take the tasks from the dataset (205k) and skip to 'Generate (query, pos, neg)-Triplets'\n",
    "dataset = load_dataset(\"andersonbcdefg/synthetic_retrieval_tasks\", split=\"train\")\n",
    "examples = dataset['task'][:100] # let's take a sample for now"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "template = \"\"\"Brainstorm a list of potentially useful text retrieval tasks. \n",
    "Here are a few examples for your reference:\n",
    "{}\n",
    "Please adhere to the following guidelines:\n",
    "- Specify what the query is, and what the desired documents are.\n",
    "- Each retrieval task should cover a wide range of queries, and should not be too specific.\n",
    "Your output should always be a python list of strings only, with about 20 elements, and each element corresponds to a distinct retrieval task in one sentence. Do not explain yourself or output anything else. Be creative!\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_task_prompts(examples, no_of_examples, num_prompts):\n",
    "    for _ in range(num_prompts):\n",
    "        example_batch = random.sample(examples, no_of_examples)\n",
    "        formatted_examples = '- ' + '\\n- '.join(example_batch)  # create bullet list\n",
    "        prompt = f\"\"\"Brainstorm a list of potentially useful text retrieval tasks. \n",
    "Here are a few examples for your reference:\n",
    "{formatted_examples}\n",
    "Please adhere to the following guidelines:\n",
    "- Specify what the query is, and what the desired documents are.\n",
    "- Each retrieval task should cover a wide range of queries, and should not be too specific.\n",
    "Your output should always be a list with about 20 elements, and each element corresponds to a distinct retrieval task in one sentence. No enumeration or bullet points, just single linebreaks between list items. Do not explain yourself or output anything else. Be creative!\"\"\"\n",
    "        yield prompt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Brainstorm a list of potentially useful text retrieval tasks. \n",
      "Here are a few examples for your reference:\n",
      "- Search for language learning resources that match a learner's current proficiency level.\n",
      "- Compile a list of non-profit organizations working on wildlife conservation globally.\n",
      "- Find multimedia resources for learning a new language at various proficiency levels.\n",
      "- Retrieve patents related to a particular technological innovation or keyword.\n",
      "- Locate online courses and educational materials for professional development in project management.\n",
      "Please adhere to the following guidelines:\n",
      "- Specify what the query is, and what the desired documents are.\n",
      "- Each retrieval task should cover a wide range of queries, and should not be too specific.\n",
      "Your output should always be a list with about 20 elements, and each element corresponds to a distinct retrieval task in one sentence. No enumeration or bullet points, just single linebreaks between list items. Do not explain yourself or output anything else. Be creative!\n"
     ]
    }
   ],
   "source": [
    "no_of_examples = 5 # number of examples per prompt\n",
    "num_prompts = 100  # number of prompts you want to generate = number of responses\n",
    "task_prompt_generator = create_task_prompts(examples, no_of_examples, num_prompts)\n",
    "\n",
    "# get prompts\n",
    "prompts = [next(task_prompt_generator) for _ in range(num_prompts)]\n",
    "\n",
    "# Prompts now look like this:\n",
    "print(prompts[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Call the API to get the tasks\n",
    "responses_tasks = await fetch_openai_responses_async(prompts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\"Retrieve coding challenges and practice problems matching a developer's experience level and preferred languages.\",\n",
       " 'Input a book title, get literary analyses, author interviews, book club discussion threads, and related readings.',\n",
       " 'Search for guides and tips on personal finance and investment strategies.',\n",
       " 'Compile research papers and literature reviews on a select medical condition or treatment from scientific databases.',\n",
       " 'Search for DIY home improvement guides for specific projects like kitchen renovations or garden landscaping.']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# GPT formats these very consistently as Python lists\n",
    "tasks  = [task for response in responses_tasks for task in response.split('\\n')]\n",
    "\n",
    "tasks = list(filter(None, tasks)) # the first one is empty after split()\n",
    "\n",
    "# remove duplicates\n",
    "tasks = list(set(tasks))\n",
    "\n",
    "# now tasks look like this\n",
    "tasks[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# optional: Write tasks to file for save keeping\n",
    "with open(\"tasks.csv\", 'w', newline='') as csvfile:\n",
    "    csvwriter = csv.writer(csvfile)\n",
    "    csvwriter.writerow(tasks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate (query, pos, neg)-Triplets (2/2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The prompt is taken from the paper with the addition of a fourth point in the 'guidelines'. The paper omits all beyond the 3rd for brevity so there must be more than one. These four already yield good results but this is a good place to do some customization to your use case.\n",
    "\n",
    "_Note_: I adjusted the point \"hard_negative_document\". I believe there is a mistake in the paper asking for the hard negative to 'only appear relevant to the query' when it should say 'doesn't appear relevant'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def construct_prompt_generator(tasks, query_types, query_lengths, num_of_words, clarity_levels, languages):\n",
    "\n",
    "    while True:\n",
    "        task = random.choice(tasks)\n",
    "        query_type = random.choice(query_types)\n",
    "        query_length = random.choice(query_lengths)\n",
    "        clarity = random.choice(clarity_levels)\n",
    "        language = random.choice(languages)\n",
    "        num_word = random.choice(num_of_words)\n",
    "\n",
    "        prompt = f\"\"\"You have been assigned a retrieval task: {task}\n",
    "        Your mission is to write one text retrieval example for this task in JSON format. The JSON object must contain the following keys:\n",
    "        - \"user_query\": a string, a random user search query specified by the retrieval task.\n",
    "        - \"positive_document\": a string, a relevant document for the user query.\n",
    "        - \"hard_negative_document\": a string, a hard negative document that doesn't appear relevant to the query.\n",
    "        Please adhere to the following guidelines:\n",
    "        - The \"user_query\" should be {query_type}, {query_length}, {clarity}, and diverse in topic. \n",
    "        - All documents should be at least {num_word} words long.\n",
    "        - Both the query and documents should be in {language}.\n",
    "        - The documents should resemble chunks taken from longer documents or entire short documents\n",
    "        Your output must always be a JSON object only. Don't start with \"```json\", do not explain yourself or output anything else. Be creative!\"\"\" \n",
    "        # I added 'Don't start with \"```json\"' as this gives better results with gpt-4-1106-preview\n",
    "\n",
    "        yield prompt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each prompt we sample from the options below. Feel free to adjust them to cover the cases your model might encounter while keeping the options diverse enough for general applicability.\n",
    "\n",
    "The query lengths are taken from the paper and the languages are all 100 from the CC-100 corpus as used in [Conneau et al. 2020](https://aclanthology.org/2020.acl-main.747/). All other options are my guess. Adjust according to your needs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_types = [\n",
    "    'keywords only', \n",
    "    'a question', \n",
    "    'in style of a websearch',\n",
    "    'in style of a query for a smart assistant', \n",
    "    'with boolean values'\n",
    "    ]\n",
    "\n",
    "query_lengths = ['less than 5 words', '5-10 words', 'at least 10 words']\n",
    "\n",
    "num_of_words = [10, 25, 50, 100]\n",
    "\n",
    "clarity_levels = ['very clear and specific', 'clear', 'rough', 'unclear']\n",
    "\n",
    "languages = [\n",
    "    \"Afrikaans\",\"Amharic\",\"Arabic\",\"Assamese\",\"Azerbaijani\",\"Belarusian\",\"Bulgarian\",\"Bengali\",\"Bengali Romanized\",\"Breton\",\n",
    "    \"Bosnian\", \"Catalan\",\"Czech\",\"Welsh\",\"Danish\",\"German\",\"Greek\",\"English\",\"Esperanto\",\"Spanish\",\n",
    "    \"Estonian\",\"Basque\",\"Persian\",\"Finnish\",\"French\",\"Western Frisian\",\"Irish\",\"Scottish Gaelic\",\"Galician\",\"Gujarati\",\n",
    "    \"Hausa\",\"Hebrew\",\"Hindi\",\"Hindi Romanized\",\"Croatian\",\"Hungarian\",\"Armenian\",\"Indonesian\",\"Icelandic\",\"Italian\",\n",
    "    \"Japanese\",\"Javanese\",\"Georgian\",\"Kazakh\",\"Khmer\",\"Kannada\",\"Korean\",\"Kurdish (Kurmanji)\",\"Kyrgyz\",\"Latin\",\n",
    "    \"Lao\",\"Lithuanian\",\"Latvian\",\"Malagasy\",\"Macedonian\",\"Malayalam\",\"Mongolian\",\"Marathi\",\"Malay\",\"Burmese\",\n",
    "    \"Burmese\",\"Nepali\",\"Dutch\",\"Norwegian\",\"Oromo\",\"Oriya\",\"Punjabi\",\"Polish\",\"Pashto\",\"Portuguese\",\n",
    "    \"Romanian\",\"Russian\",\"Sanskrit\",\"Sindhi\",\"Sinhala\",\"Slovak\",\"Slovenian\",\"Somali\",\"Albanian\",\"Serbian\",\n",
    "    \"Sundanese\",\"Swedish\",\"Swahili\",\"Tamil\",\"Tamil Romanized\",\"Telugu\",\"Telugu Romanized\",\"Thai\",\"Filipino\",\"Turkish\",\n",
    "    \"Uyghur\",\"Ukrainian\",\"Urdu\",\"Urdu Romanized\",\"Uzbek\",\"Vietnamese\",\"Xhosa\",\"Yiddish\",\"Chinese (Simplified)\",\"Chinese (Traditional)\",\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "You have been assigned a retrieval task: Assemble a collection of motivational speeches and writings across different genres and authors.\n",
      "        Your mission is to write one text retrieval example for this task in JSON format. The JSON object must contain the following keys:\n",
      "        - \"user_query\": a string, a random user search query specified by the retrieval task.\n",
      "        - \"positive_document\": a string, a relevant document for the user query.\n",
      "        - \"hard_negative_document\": a string, a hard negative document that doesn't appear relevant to the query.\n",
      "        Please adhere to the following guidelines:\n",
      "        - The \"user_query\" should be keywords only, at least 10 words, clear, and diverse in topic. \n",
      "        - All documents should be at least 25 words long.\n",
      "        - Both the query and documents should be in Swahili.\n",
      "        - The documents should resemble chunks taken from longer documents or entire short documents\n",
      "        Your output must always be a JSON object only. Don't start with \"```json\", do not explain yourself or output anything else. Be creative!\n"
     ]
    }
   ],
   "source": [
    "prompt_generator = construct_prompt_generator(tasks, query_types, query_lengths, num_of_words, clarity_levels, languages)\n",
    "\n",
    "# let's test the generator\n",
    "first_prompt = next(prompt_generator)\n",
    "print(first_prompt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "number_of_prompts = 100 # change as desired\n",
    "\n",
    "prompts = [next(prompt_generator) for _ in range(number_of_prompts)]\n",
    "responses_triplets = await fetch_openai_responses_async(prompts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"user_query\": \"Wasan bidiyo 'The Legend of Zelda: Breath of the Wild' tafiya mai jagora, lambobin magudi, da dandalin al'umma na 'yan wasan caca\",\n",
      "    \"positive_document\": \"Ga cikakken jagorar tafiya don 'The Legend of Zelda: Breath of the Wild'. Zai taimaka muku wajen warware asirai da kuma faɗakar da ku kan yadda za ku ƙarasa wasan ba tare da ɓata lokaci ba. Hakanan akwai lambobin magudi wanda za su taimaka wajen buɗe karfi da makamai na musamman.\",\n",
      "    \"hard_negative_document\": \"Shin kun san cewa 'The Legend of Zelda: Breath of the Wild' yana da ɗaukaka sosai a tsakanin masu suka? Sun yi ikirarin cewa fasali da zane-zanen wasan sun sa shi zama na musamman. Sai dai, wani lokutan jita-jita kan yi zargin cewa wasu lambobin magudi na iya lalatawa wasu 'yan wasa kwarewar wasan.\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "required_keys = {'user_query', 'positive_document', 'hard_negative_document'}\n",
    "valid_responses = check_response_validity(responses_triplets,required_keys)\n",
    "\n",
    "# Let's look at one of the responses\n",
    "print(json.dumps(valid_responses[0], indent=4, ensure_ascii=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a dataset\n",
    "data_dict = {k: [dic[k] for dic in valid_responses] for k in valid_responses[0]}\n",
    "dataset = Dataset.from_dict(data_dict)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c7d76155c66d4e81bdbdc62d57eea4c7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Saving the dataset (0/1 shards):   0%|          | 0/97 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "#... and save to disc\n",
    "dataset.save_to_disk(\"asymmetric_dataset\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# or log in to Hugging Face\n",
    "login()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# to upload to the hub\n",
    "dataset.push_to_hub(\"<your_username>/asymmetric_dataset\", private=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you should have a diverse high quality dataset with a diverse set of queries, positive and negative examples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# II. Symmetric Tasks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The symmetric tasks are much simpler. Wang et al. distinguish __semantic textual similarity (STS)__ (aka. paraphrase or not?) and __bitext retrieval__ (same meaning two languages).\n",
    "\n",
    "The paper tells us that we get away with a simpler approach 'Since the task definition is straightforward, we omit the brainstorming step for symmetric tasks.' Hence, only one prompt each. \n",
    "\n",
    "Without an example from the paper to go of but reasonable confidence I wrote these prompts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## II.a. Semantic Textual Similarity (STS)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def construct_sts_prompt_generator(num_of_words, languages):\n",
    "\n",
    "    while True:\n",
    "        language = random.choice(languages)\n",
    "        num_word = random.choice(num_of_words)\n",
    "\n",
    "        sts_prompt = f\"\"\"You are tasked to create entries for a text dataset. The task is semantic textual similarity which requires two texts with similar meaning but phrased differently.\n",
    "See these short examples:\n",
    "- A) 'The cat is sleeping on the mat.' B) 'A feline rests on a small rug.'\n",
    "- A) 'The business was established in 1984.' B) 'The company was founded in the mid eighties.'\n",
    "- A) 'When I passed by your house the other day I didn't see you.' B) 'I missed you when I was in the neighborhood recently.'\n",
    "- A) 'Heavy rainfall caused flooding in many parts of the city.' B) 'The city experienced floods due to intense rain showers.'\n",
    "Your mission is to write one entry for this dataset in JSON format. The JSON object must contain the following keys:\n",
    "- \"phrase_a\": one way to phrase something.\n",
    "- \"phrase_b\": another way to phrase something.\n",
    "Please adhere to the following guidelines:\n",
    "- All entries should be at least {num_word} long\n",
    "- Both phrases should be in {language}.\n",
    "- The phrases should resemble chunks taken from longer documents or entire short documents\n",
    "Your output must always be a JSON object only. Don't start with \"```json\", do not explain yourself or output anything else. Be creative!\"\"\" \n",
    "\n",
    "        yield sts_prompt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "sts_num_of_words = [10, 25, 50, 100]\n",
    "\n",
    "sts_languages = [\n",
    "    \"Afrikaans\",\"Amharic\",\"Arabic\",\"Assamese\",\"Azerbaijani\",\"Belarusian\",\"Bulgarian\",\"Bengali\",\"Bengali Romanized\",\"Breton\",\n",
    "    \"Bosnian\", \"Catalan\",\"Czech\",\"Welsh\",\"Danish\",\"German\",\"Greek\",\"English\",\"Esperanto\",\"Spanish\",\n",
    "    \"Estonian\",\"Basque\",\"Persian\",\"Finnish\",\"French\",\"Western Frisian\",\"Irish\",\"Scottish Gaelic\",\"Galician\",\"Gujarati\",\n",
    "    \"Hausa\",\"Hebrew\",\"Hindi\",\"Hindi Romanized\",\"Croatian\",\"Hungarian\",\"Armenian\",\"Indonesian\",\"Icelandic\",\"Italian\",\n",
    "    \"Japanese\",\"Javanese\",\"Georgian\",\"Kazakh\",\"Khmer\",\"Kannada\",\"Korean\",\"Kurdish (Kurmanji)\",\"Kyrgyz\",\"Latin\",\n",
    "    \"Lao\",\"Lithuanian\",\"Latvian\",\"Malagasy\",\"Macedonian\",\"Malayalam\",\"Mongolian\",\"Marathi\",\"Malay\",\"Burmese\",\n",
    "    \"Burmese\",\"Nepali\",\"Dutch\",\"Norwegian\",\"Oromo\",\"Oriya\",\"Punjabi\",\"Polish\",\"Pashto\",\"Portuguese\",\n",
    "    \"Romanian\",\"Russian\",\"Sanskrit\",\"Sindhi\",\"Sinhala\",\"Slovak\",\"Slovenian\",\"Somali\",\"Albanian\",\"Serbian\",\n",
    "    \"Sundanese\",\"Swedish\",\"Swahili\",\"Tamil\",\"Tamil Romanized\",\"Telugu\",\"Telugu Romanized\",\"Thai\",\"Filipino\",\"Turkish\",\n",
    "    \"Uyghur\",\"Ukrainian\",\"Urdu\",\"Urdu Romanized\",\"Uzbek\",\"Vietnamese\",\"Xhosa\",\"Yiddish\",\"Chinese (Simplified)\",\"Chinese (Traditional)\",\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "You are tasked to create entries for a text dataset. The task is semantic textual similarity which requires two texts with similar meaning but phrased differently.\n",
      "See these short examples:\n",
      "- A) 'The cat is sleeping on the mat.' B) 'A feline rests on a small rug.'\n",
      "- A) 'The business was established in 1984.' B) 'The company was founded in the mid eighties.'\n",
      "- A) 'When I passed by your house the other day I didn't see you.' B) 'I missed you when I was in the neighborhood recently.'\n",
      "- A) 'Heavy rainfall caused flooding in many parts of the city.' B) 'The city experienced floods due to intense rain showers.'\n",
      "Your mission is to write one entry for this dataset in JSON format. The JSON object must contain the following keys:\n",
      "- \"phrase_a\": one way to phrase something.\n",
      "- \"phrase_b\": another way to phrase something.\n",
      "Please adhere to the following guidelines:\n",
      "- All entries should be at least 25 long\n",
      "- Both phrases should be in Assamese.\n",
      "- The phrases should resemble chunks taken from longer documents or entire short documents\n",
      "Your output must always be a JSON object only. Don't start with \"```json\", do not explain yourself or output anything else. Be creative!\n"
     ]
    }
   ],
   "source": [
    "sts_prompt_generator = construct_sts_prompt_generator(sts_num_of_words, sts_languages)\n",
    "\n",
    "# let's test the generator\n",
    "first_prompt = next(sts_prompt_generator)\n",
    "print(first_prompt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "number_of_sts_prompts = 100 # change as desired\n",
    "\n",
    "sts_prompts = [next(sts_prompt_generator) for _ in range(number_of_sts_prompts)]\n",
    "sts_response_pairs = await fetch_openai_responses_async(sts_prompts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"phrase_a\": \"An sanar da cewa gwamnatin tarayya zata kaddamar da sabbin ayyukan raya kasa a jihohin arewa maso gabashin Najeriya.\",\n",
      "    \"phrase_b\": \"Gwamnatin tarayya ta bayyana shirye-shiryenta na fara aikin gina kayayyakin more rayuwa a yankin arewa maso gabas na Najeriya.\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "required_keys = {'phrase_a', 'phrase_b'}\n",
    "valid_sts_responses = check_response_validity(sts_response_pairs,required_keys)\n",
    "\n",
    "# Let's look at one of the responses\n",
    "print(json.dumps(valid_sts_responses[0], indent=4, ensure_ascii=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a dataset\n",
    "data_dict = {k: [dic[k] for dic in valid_sts_responses] for k in valid_sts_responses[0]}\n",
    "dataset = Dataset.from_dict(data_dict)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d7ed586a14d7422c9fad414c74c3f6e7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Saving the dataset (0/1 shards):   0%|          | 0/100 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "#... and save to disc\n",
    "dataset.save_to_disk(\"sts_dataset\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# or log in to Hugging Face\n",
    "login()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# to upload to the hub\n",
    "dataset.push_to_hub(\"<your_username>/sts_dataset\", private=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## II.b. Bitext Retrieval (BTR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "def construct_btr_prompt_generator(num_of_words, languages):\n",
    "\n",
    "    while True:\n",
    "        language_a = random.choice(languages)\n",
    "        language_b = random.choice(languages)\n",
    "        if language_a == language_b:\n",
    "            continue\n",
    "\n",
    "        num_word = random.choice(num_of_words)\n",
    "\n",
    "        btr_prompt = f\"\"\"You are tasked to create entries for a text dataset. The task is bitext retrieval which requires two texts with similar meaning in different languages.\n",
    "See these short examples:\n",
    "- A) 'The quick brown fox jumps over the lazy dog.' B) 'El rápido zorro marrón salta sobre el perro perezoso.'\n",
    "- A) 'La cuisine française est reconnue dans le monde entier.' B) 'French cuisine is renowned worldwide.'\n",
    "- A) 'Das Erlernen einer neuen Sprache eröffnet eine Welt voller Möglichkeiten.' B) 'Learning a new language opens up a world of opportunities.'\n",
    "Your mission is to write one entry for this dataset in JSON format. The JSON object must contain the following keys:\n",
    "- \"text_a\": a text in {language_a}\n",
    "- \"text_b\": the same text as A in {language_b}\n",
    "Please adhere to the following guidelines:\n",
    "- All entries should be at least {num_word} long\n",
    "- The phrases should resemble chunks taken from longer documents or entire short documents\n",
    "Your output must always be a JSON object only. Don't start with \"```json\", do not explain yourself or output anything else. Be creative!\"\"\" \n",
    "\n",
    "        yield btr_prompt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "btr_num_of_words = [10, 25, 50, 100]\n",
    "\n",
    "btr_languages = [\n",
    "    \"Afrikaans\",\"Amharic\",\"Arabic\",\"Assamese\",\"Azerbaijani\",\"Belarusian\",\"Bulgarian\",\"Bengali\",\"Bengali Romanized\",\"Breton\",\n",
    "    \"Bosnian\", \"Catalan\",\"Czech\",\"Welsh\",\"Danish\",\"German\",\"Greek\",\"English\",\"Esperanto\",\"Spanish\",\n",
    "    \"Estonian\",\"Basque\",\"Persian\",\"Finnish\",\"French\",\"Western Frisian\",\"Irish\",\"Scottish Gaelic\",\"Galician\",\"Gujarati\",\n",
    "    \"Hausa\",\"Hebrew\",\"Hindi\",\"Hindi Romanized\",\"Croatian\",\"Hungarian\",\"Armenian\",\"Indonesian\",\"Icelandic\",\"Italian\",\n",
    "    \"Japanese\",\"Javanese\",\"Georgian\",\"Kazakh\",\"Khmer\",\"Kannada\",\"Korean\",\"Kurdish (Kurmanji)\",\"Kyrgyz\",\"Latin\",\n",
    "    \"Lao\",\"Lithuanian\",\"Latvian\",\"Malagasy\",\"Macedonian\",\"Malayalam\",\"Mongolian\",\"Marathi\",\"Malay\",\"Burmese\",\n",
    "    \"Burmese\",\"Nepali\",\"Dutch\",\"Norwegian\",\"Oromo\",\"Oriya\",\"Punjabi\",\"Polish\",\"Pashto\",\"Portuguese\",\n",
    "    \"Romanian\",\"Russian\",\"Sanskrit\",\"Sindhi\",\"Sinhala\",\"Slovak\",\"Slovenian\",\"Somali\",\"Albanian\",\"Serbian\",\n",
    "    \"Sundanese\",\"Swedish\",\"Swahili\",\"Tamil\",\"Tamil Romanized\",\"Telugu\",\"Telugu Romanized\",\"Thai\",\"Filipino\",\"Turkish\",\n",
    "    \"Uyghur\",\"Ukrainian\",\"Urdu\",\"Urdu Romanized\",\"Uzbek\",\"Vietnamese\",\"Xhosa\",\"Yiddish\",\"Chinese (Simplified)\",\"Chinese (Traditional)\",\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "You are tasked to create entries for a text dataset. The task is bitext retrieval which requires two texts with similar meaning in different languages.\n",
      "See these short examples:\n",
      "- A) 'The quick brown fox jumps over the lazy dog.' B) 'El rápido zorro marrón salta sobre el perro perezoso.'\n",
      "- A) 'La cuisine française est reconnue dans le monde entier.' B) 'French cuisine is renowned worldwide.'\n",
      "- A) 'Das Erlernen einer neuen Sprache eröffnet eine Welt voller Möglichkeiten.' B) 'Learning a new language opens up a world of opportunities.'\n",
      "Your mission is to write one entry for this dataset in JSON format. The JSON object must contain the following keys:\n",
      "- \"text_a\": a text in Malagasy\n",
      "- \"text_b\": the same text as A in Estonian\n",
      "Please adhere to the following guidelines:\n",
      "- All entries should be at least 10 long\n",
      "- The phrases should resemble chunks taken from longer documents or entire short documents\n",
      "Your output must always be a JSON object only. Don't start with \"```json\", do not explain yourself or output anything else. Be creative!\n"
     ]
    }
   ],
   "source": [
    "btr_prompt_generator = construct_btr_prompt_generator(btr_num_of_words, btr_languages)\n",
    "\n",
    "# let's test the generator\n",
    "first_prompt = next(btr_prompt_generator)\n",
    "print(first_prompt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "number_of_btr_prompts = 100 # change as desired\n",
    "\n",
    "btr_prompts = [next(btr_prompt_generator) for _ in range(number_of_btr_prompts)]\n",
    "btr_response_pairs = await fetch_openai_responses_async(btr_prompts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"text_a\": \"Teknologian kehittyessä ihmisten elämäntavat muuttuvat.\",\n",
      "    \"text_b\": \"مع تطور التكنولوجيا، تتغير أنماط حياة الناس.\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "required_keys = {'text_a', 'text_b'}\n",
    "valid_btr_responses = check_response_validity(btr_response_pairs,required_keys)\n",
    "\n",
    "# Let's look at one of the responses\n",
    "print(json.dumps(valid_btr_responses[0], indent=4, ensure_ascii=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a dataset\n",
    "data_dict = {k: [dic[k] for dic in valid_btr_responses] for k in valid_btr_responses[0]}\n",
    "dataset = Dataset.from_dict(data_dict)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "770451cc449947278a6ebdced572617f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Saving the dataset (0/1 shards):   0%|          | 0/100 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "#... and save to disc\n",
    "dataset.save_to_disk(\"btr_dataset\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# or log in to Hugging Face\n",
    "login()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# to upload to the hub\n",
    "dataset.push_to_hub(\"<your_username>/btr_dataset\", private=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you should have up to three high quality synthetic datasets for your finte-tuning.\n",
    "\n",
    "In case you found any bugs, have ideas for improvements or questions feel free to reach out. I'm @krilecy on [twitter/X](https://twitter.com/krilecy), [GitHub](https://github.com/Krilecy) and [Hugging Face](https://huggingface.co/krilecy) and will probably be doing more with this.\n",
    "\n",
    "cheers 👋\n",
    "\n",
    "   \\- Kris"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}