@DiegoHernanSalazar
Created February 19, 2024 19:49
OpenAI / DeepLearning.AI. Building Systems with the ChatGPT API, L8: Evaluation part I.
{
"metadata": {
"kernelspec": {
"name": "python",
"display_name": "Python (Pyodide)",
"language": "python"
},
"language_info": {
"codemirror_mode": {
"name": "python",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8"
}
},
"nbformat_minor": 4,
"nbformat": 4,
"cells": [
{
"cell_type": "markdown",
"source": "<img src=\"https://wordpress.deeplearning.ai/wp-content/uploads/2023/05/OpenAI-DeepLearning_Short_Courses_Campaign_1080.png\"/>",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "# Evaluation part I\n\nEvaluate LLM responses when there is a single \"right answer\".",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## Setup\n#### Load the API key and relevant Python libaries.\nIn this course, we've provided some code that loads the OpenAI API key for you.",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\nimport openai # Load OpenAI API library constructor/object\r\nimport os # Set your openai API key secretly, at the \r\n # operating system, using os constructor\r\nimport sys # access to system-specific parameters and functions in Python\r\nsys.path.append('../..') # it searches modules to be imported at the new directory path '../..'\r\nimport utils # make common patterns shorter and easier in python, \r\n # via small functions and classes. Load function\r\n # 'utils.get_products_and_category()' already built in.\r\n\r\nimport tiktoken # get tokenizer constructor to get input, output and total tokens count\r\n\r\nfrom dotenv import load_dotenv, find_dotenv # Get load and Read functions\r\n_ = load_dotenv(find_dotenv()) # Read and then Load local file called '.env' that contains API key\r\n # at os -> environ dictionary {} -> ['OPENAI_API_KEY'] key -> write API key value as 'sk-...'\r\n\r\n# Get an openai API key from the OpenAI website, and set your API key as openai.api_key=\"sk-...\" \r\nopenai.api_key = os.environ['OPENAI_API_KEY'] # This way you set/store the API key more securely, as an\r\n # environmental variable, into the operating system, using\r\n # openai.api_key=os.getenv('OPEN_API_KEY')\r\n # so the API key's value is set in the jupyter environment,\r\n # by selecting this written API 'key' value at: \r\n # os -> environ dictionary {} -> ['OPENAI_API_KEY'] key \r\n # -> API key value is a text/string 'sk-...'\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\ndef get_completion_from_messages(messages, # List of multiple messages [{},{},{},...]\n model=\"gpt-3.5-turbo\", # model's name used\n temperature=0, # Randomness in model's response \n max_tokens=500): # max input+output characters\n\n response = openai.ChatCompletion.create(\n model=model, # \"gpt-3.5-turbo\"\n messages=messages, # [{},{},...]\n temperature=temperature, # this is the degree of randomness/exploration in the model's output. \n # Max exploration occurs when temperature is closer to 1\n max_tokens=max_tokens, # maximum number of tokens (common sequences of characters) \n # the model can accept as input'context' + output'completion' \n )\n return response.choices[0].message[\"content\"] # model's response message or output completion\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "#### Get the relevant products and categories\nHere is the list of products and categories that are in the product catalog.",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\n# {'category_1':['product_name_1','product_name_2',...],\r\n# 'category_2':['product_name_1','product_name_2',...]\r\n# ...\r\n# } product catalog dictionary.\r\nproducts_and_category = utils.get_products_and_category()\r\nproducts_and_category\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```\n{'Computers and Laptops': ['TechPro Ultrabook',\r\n 'BlueWave Gaming Laptop',\r\n 'PowerLite Convertible',\r\n 'TechPro Desktop',\r\n 'BlueWave Chromebook'],\r\n 'Smartphones and Accessories': ['SmartX ProPhone',\r\n 'MobiTech PowerCase',\r\n 'SmartX MiniPhone',\r\n 'MobiTech Wireless Charger',\r\n 'SmartX EarBuds'],\r\n 'Televisions and Home Theater Systems': ['CineView 4K TV',\r\n 'SoundMax Home Theater',\r\n 'CineView 8K TV',\r\n 'SoundMax Soundbar',\r\n 'CineView OLED TV'],\r\n 'Gaming Consoles and Accessories': ['GameSphere X',\r\n 'ProGamer Controller',\r\n 'GameSphere Y',\r\n 'ProGamer Racing Wheel',\r\n 'GameSphere VR Headset'],\r\n 'Audio Equipment': ['AudioPhonic Noise-Canceling Headphones',\r\n 'WaveSound Bluetooth Speaker',\r\n 'AudioPhonic True Wireless Earbuds',\r\n 'WaveSound Soundbar',\r\n 'AudioPhonic Turntable'],\r\n 'Cameras and Camcorders': ['FotoSnap DSLR Camera',\r\n 'ActionCam 4K',\r\n 'FotoSnap Mirrorless Camera',\r\n 'ZoomMaster Camcorder',\r\n 'FotoSnap Instant Camera']}\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Find relevant product and category names (version 1)\r\nThis could be the version that is running in production.`",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\ndef find_category_and_product_v1(user_input,products_and_category):\r\n\r\n delimiter = \"####\"\r\n system_message = f\"\"\"\r\n You will be provided with customer service queries. \\\r\n The customer service query will be delimited with {delimiter} characters.\r\n Output a python list of json objects, where each object has the following format:\r\n 'category': <one of Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \\\r\n Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders>,\r\n AND\r\n 'products': <a list of products that must be found in the allowed products below>\r\n\r\n\r\n Where the categories and products must be found in the customer service query.\r\n If a product is mentioned, it must be associated with the correct category in the allowed products list below.\r\n If no products or categories are found, output an empty list.\r\n \r\n\r\n List out all products that are relevant to the customer service query based on how closely it relates\r\n to the product name and product category.\r\n Do not assume, from the name of the product, any features or attributes such as relative quality or price.\r\n\r\n The allowed products are provided in JSON format.\r\n The keys of each item represent the category.\r\n The values of each item is a list of products that are within that category.\r\n Allowed products: {products_and_category}\r\n \r\n\r\n \"\"\"\r\n # 'user' input message example\r\n few_shot_user_1 = \"\"\"I want the most expensive computer.\"\"\"\r\n \r\n # 'assistant' output response message example\r\n few_shot_assistant_1 = \"\"\" \r\n [{'category': 'Computers and Laptops', \\\r\n'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\r\n \"\"\"\r\n \r\n # Add 'system' message, 'user' input message and 'assistant' output message examples,\r\n # to multiple 
messages=[{'system'},{'user_example'},{'assistant_example'},{'user_input'}] list.\r\n messages = [ \r\n {'role':'system', 'content': system_message}, \r\n {'role':'user', 'content': f\"{delimiter}{few_shot_user_1}{delimiter}\"}, \r\n {'role':'assistant', 'content': few_shot_assistant_1 },\r\n {'role':'user', 'content': f\"{delimiter}{user_input}{delimiter}\"}, \r\n ] \r\n return get_completion_from_messages(messages) # return 'assistant'/model response, to 'user_input' query.\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Evaluate on some queries",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\n# 1st 'user' input message/query\ncustomer_msg_0 = f\"\"\"Which TV can I buy if I'm on a budget?\"\"\" \n\n# 'assistant'/model response to 1st 'user' input query\nproducts_by_category_0 = find_category_and_product_v1(customer_msg_0,\n products_and_category)\n\nprint(products_by_category_0) # Display model output completion/'assistant' response.\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```\n[{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\n# 2nd 'user' input message/query\ncustomer_msg_1 = f\"\"\"I need a charger for my smartphone\"\"\"\n\n# 'assistant'/model response to 2nd 'user' input query\nproducts_by_category_1 = find_category_and_product_v1(customer_msg_1,\n products_and_category)\n\nprint(products_by_category_1) # Display model output completion/'assistant' response.\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```\n[{'category': 'Smartphones and Accessories', 'products': ['MobiTech Wireless Charger']}]\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\n# 3rd 'user' input message/query\r\ncustomer_msg_2 = f\"\"\"\r\nWhat computers do you have?\"\"\"\r\n\r\n# 'assistant'/model response to 3rd 'user' input query\r\nproducts_by_category_2 = find_category_and_product_v1(customer_msg_2,\r\n products_and_category)\r\n\r\nprint(products_by_category_2) # Display model output completion/'assistant' r\n```esponse.",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```\n[{'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\n# 4th 'user' input message/query\r\ncustomer_msg_3 = f\"\"\"\r\ntell me about the smartx pro phone and the fotosnap camera, the dslr one.\r\nAlso, what TVs do you have?\"\"\"\r\n\r\n# 'assistant'/model should be a failed response to 4th 'user' input query\r\nproducts_by_category_3 = find_category_and_product_v1(customer_msg_3,\r\n products_and_category)\r\n\r\nprint(products_by_category_3) # Display model/'assistant', \r\n # that should be a failed response.\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```\n[{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}, {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']}]\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Harder test cases\r\nIdentify queries found in production, where the model is not working as expected.",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\n# 5th 'user' input message/query\ncustomer_msg_4 = f\"\"\"\ntell me about the CineView TV, the 8K one, Gamesphere console, the X one.\nI'm on a budget, what computers do you have?\"\"\"\n\n# 'assistant'/model should be a failed response to 5th 'user' input query\nproducts_by_category_4 = find_category_and_product_v1(customer_msg_4,\n products_and_category)\n\nprint(products_by_category_4) # Display model/'assistant', \n # that should be a failed response.\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```\n[{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 8K TV']},\r\n {'category': 'Gaming Consoles and Accessories', 'products': ['GameSphere X']},\r\n {'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']\n```}]",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Modify the prompt to work on the hard test cases",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\ndef find_category_and_product_v2(user_input,products_and_category):\r\n \"\"\"\r\n Added: Do not output any additional text that is not in JSON format.\r\n Added a second example (for few-shot prompting) where user asks for \r\n the cheapest computer. In both few-shot examples, the shown response \r\n is the full list of products in JSON only.\r\n \"\"\"\r\n delimiter = \"####\"\r\n system_message = f\"\"\"\r\n You will be provided with customer service queries. \\\r\n The customer service query will be delimited with {delimiter} characters.\r\n Output a python list of json objects, where each object has the following format:\r\n 'category': <one of Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \\\r\n Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders>,\r\n AND\r\n 'products': <a list of products that must be found in the allowed products below>\r\n Do not output any additional text that is not in JSON format.\r\n Do not write any explanatory text after outputting the requested JSON.\r\n\r\n\r\n Where the categories and products must be found in the customer service query.\r\n If a product is mentioned, it must be associated with the correct category in the allowed products list below.\r\n If no products or categories are found, output an empty list.\r\n \r\n\r\n List out all products that are relevant to the customer service query based on how closely it relates\r\n to the product name and product category.\r\n Do not assume, from the name of the product, any features or attributes such as relative quality or price.\r\n\r\n The allowed products are provided in JSON format.\r\n The keys of each item represent the category.\r\n The values of each item is a list of products that are within that category.\r\n Allowed products: {products_and_category}\r\n \r\n\r\n \"\"\"\r\n \r\n few_shot_user_1 = \"\"\"I want the most expensive computer. 
What do you recommend?\"\"\"\r\n few_shot_assistant_1 = \"\"\" \r\n [{'category': 'Computers and Laptops', \\\r\n'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\r\n \"\"\"\r\n \r\n few_shot_user_2 = \"\"\"I want the most cheapest computer. What do you recommend?\"\"\"\r\n few_shot_assistant_2 = \"\"\" \r\n [{'category': 'Computers and Laptops', \\\r\n'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\r\n \"\"\"\r\n \r\n messages = [ \r\n {'role':'system', 'content': system_message}, # 'system' message, with AI Assistant Behaviour\r\n {'role':'user', 'content': f\"{delimiter}{few_shot_user_1}{delimiter}\"}, # 1st 'user' example \r\n {'role':'assistant', 'content': few_shot_assistant_1 }, # 1st 'assistant' example\r\n {'role':'user', 'content': f\"{delimiter}{few_shot_user_2}{delimiter}\"}, # 2nd 'user' example \r\n {'role':'assistant', 'content': few_shot_assistant_2 }, # 2nd 'assistant' example\r\n {'role':'user', 'content': f\"{delimiter}{user_input}{delimiter}\"}, # new 'user' input/query message\r\n ] \r\n return get_completion_from_messages(messages)\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Evaluate the modified prompt on the hard tests cases",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\n# Create a list with (5) previous 'user' messages/queries\ncustomer_msg_list = [customer_msg_0, # 'user' input message 0\n customer_msg_1, # 'user' input message 1\n customer_msg_2, # 'user' input message 2\n customer_msg_3, # 'user' input message 3\n customer_msg_4] # 'user' input message 4\n\nfor customer_msg in customer_msg_list: # Iterate through each 'user' input message.\n \n # 'assistant'/model response (without junk text) to i-th 'user' input query\n products_by_category = find_category_and_product_v2(customer_msg,\n products_and_category)\n\n print(customer_msg,\"\\n\", products_by_category) # Display 'user' input message \n # and model/'assistant' \n # response (without junk text).\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```\nWhich TV can I buy if I'm on a budget? \r\n \r\n [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\r\n \r\nI need a charger for my smartphone \r\n \r\n [{'category': 'Smartphones and Accessories', 'products': ['MobiTech PowerCase', 'MobiTech Wireless Charger']}]\r\n \r\n\r\nWhat computers do you have? \r\n \r\n [{'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\r\n \r\n\r\ntell me about the smartx pro phone and the fotosnap camera, the dslr one.\r\nAlso, what TVs do you have? \r\n \r\n [{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}, {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']}]\r\n \r\n\r\ntell me about the CineView TV, the 8K one, Gamesphere console, the X one.\r\nI'm on a budget, what computers do you have? \r\n \r\n [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 8K TV']},\r\n {'category': 'Gaming Consoles and Accessories', 'products': ['GameSphere X']},\r\n {'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Regression testing: verify that the model still works on previous test cases\r\nCheck that modifying the model to fix the hard test cases does not negatively affect its performance on previous test cases.",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\n# 1st 'user' input message/query\ncustomer_msg_0 = f\"\"\"Which TV can I buy if I'm on a budget?\"\"\"\n\n# 'assistant'/model response (without junk text) to 1st 'user' input query\nproducts_by_category_0 = find_category_and_product_v2(customer_msg_0,\n products_and_category)\n\nprint(products_by_category_0) # Display model/'assistant' response \n # (without junk text), to 1st 'user' query.\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```\n[{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Gather development set for automated testing",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\nmsg_ideal_pairs_set = [\r\n \r\n # example 0 of cross-validation/development set, includes 'user' input 0 and ideal 'assistant' output 0.\r\n {'customer_msg':\"\"\"Which TV can I buy if I'm on a budget?\"\"\",\r\n 'ideal_answer':{'Televisions and Home Theater Systems':set(['CineView 4K TV', 'SoundMax Home Theater',\r\n 'CineView 8K TV', 'SoundMax Soundbar', \r\n 'CineView OLED TV']) # {'CineView 4K TV',...,'CineView OLED TV'} set.\r\n }\r\n },\r\n\r\n # example 1 of cross-validation/development set, includes 'user' input 1 and ideal 'assistant' output 1.\r\n {'customer_msg':\"\"\"I need a charger for my smartphone\"\"\",\r\n 'ideal_answer':{'Smartphones and Accessories':set(['MobiTech PowerCase', 'MobiTech Wireless Charger', \r\n 'SmartX EarBuds']) # {'MobiTech PowerCase',...,'SmartX EarBuds'} set.\r\n }\r\n },\r\n\r\n # example 2 of cross-validation/development set, includes 'user' input 2 and ideal 'assistant' output 2.\r\n {'customer_msg':f\"\"\"What computers do you have?\"\"\",\r\n 'ideal_answer':{'Computers and Laptops':set(['TechPro Ultrabook', 'BlueWave Gaming Laptop', \r\n 'PowerLite Convertible', 'TechPro Desktop', \r\n 'BlueWave Chromebook']) # {'TechPro Ultrabook',...,'BlueWave Chromebook'} set.\r\n }\r\n },\r\n\r\n # example 3 of cross-validation/development set, includes 'user' input 3 and ideal 'assistant' output 3.\r\n {'customer_msg':f\"\"\"tell me about the smartx pro phone and \\\r\n the fotosnap camera, the dslr one.\\\r\n Also, what TVs do you have?\"\"\",\r\n 'ideal_answer':{'Smartphones and Accessories':set(['SmartX ProPhone']), # {'SmartX ProPhone'} set.\r\n 'Cameras and Camcorders':set(['FotoSnap DSLR Camera']), # {'FotoSnap DSLR Camera'} set.\r\n 'Televisions and Home Theater Systems':set(['CineView 4K TV', 'SoundMax Home Theater',\r\n 'CineView 8K TV', 'SoundMax Soundbar', \r\n 'CineView OLED TV']) # {'CineView 4K TV',...,'CineView OLED TV'} set.\r\n }\r\n }, \r\n \r\n # example 4 of cross-validation/development set, 
includes 'user' input 4 and ideal 'assistant' output 4.\r\n {'customer_msg':\"\"\"tell me about the CineView TV, the 8K one, Gamesphere console, the X one.\r\n I'm on a budget, what computers do you have?\"\"\",\r\n 'ideal_answer':{'Televisions and Home Theater Systems':set(['CineView 8K TV']), # {'CineView 8K TV'} set.\r\n 'Gaming Consoles and Accessories':set(['GameSphere X']), # {'GameSphere X'} set.\r\n 'Computers and Laptops':set(['TechPro Ultrabook', 'BlueWave Gaming Laptop', \r\n 'PowerLite Convertible', 'TechPro Desktop', \r\n 'BlueWave Chromebook']) # {'TechPro Ultrabook',...,'BlueWave Chromebook'} set.\r\n }\r\n },\r\n \r\n # example 5 of cross-validation/development set, includes 'user' input 5 and ideal 'assistant' output 5.\r\n {'customer_msg':f\"\"\"What smartphones do you have?\"\"\",\r\n 'ideal_answer':{'Smartphones and Accessories':set(['SmartX ProPhone', 'MobiTech PowerCase', \r\n 'SmartX MiniPhone','MobiTech Wireless Charger', \r\n 'SmartX EarBuds']) # {'SmartX ProPhone',...,'SmartX EarBuds'} set.\r\n }\r\n },\r\n\r\n # example 6 of cross-validation/development set, includes 'user' input 6 and ideal 'assistant' output 6.\r\n {'customer_msg':f\"\"\"I'm on a budget. Can you recommend some smartphones to me?\"\"\",\r\n 'ideal_answer':{'Smartphones and Accessories':set(['SmartX EarBuds', 'SmartX MiniPhone', \r\n 'MobiTech PowerCase', 'SmartX ProPhone', \r\n 'MobiTech Wireless Charger']) # {'SmartX EarBuds',...,'MobiTech Wireless Charger'} set.\r\n }\r\n },\r\n\r\n # example 7 of cross-validation/development set, includes 'user' input 7 and ideal 'assistant' output 7. 
\r\n # This will output a subset of the ideal answer\r\n {'customer_msg':f\"\"\"What Gaming consoles would be good for my friend who is into racing games?\"\"\",\r\n 'ideal_answer':{'Gaming Consoles and Accessories':set(['GameSphere X', 'ProGamer Controller', \r\n 'GameSphere Y', 'ProGamer Racing Wheel', \r\n 'GameSphere VR Headset']) # {'GameSphere X',...,'GameSphere VR Headset'} set.\r\n }\r\n },\r\n\r\n # example 8 of cross-validation/development set, includes 'user' input 8 and ideal 'assistant' output 8.\r\n {'customer_msg':f\"\"\"What could be a good present for my videographer friend?\"\"\",\r\n 'ideal_answer': {'Cameras and Camcorders':set(['FotoSnap DSLR Camera', 'ActionCam 4K', \r\n 'FotoSnap Mirrorless Camera', 'ZoomMaster Camcorder', \r\n 'FotoSnap Instant Camera']) # {'FotoSnap DSLR Camera',...,'FotoSnap Instant Camera'} set.\r\n }\r\n },\r\n \r\n # example 9 of cross-validation/development set, includes 'user' input 9 and ideal 'assistant' output 9.\r\n # As we don't have these products, the ideal output is an empty list [].\r\n {'customer_msg':f\"\"\"I would like a hot tub time machine.\"\"\",\r\n 'ideal_answer': [] # Empty list [] as output set.\r\n }\r\n \r\n]\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Evaluate test cases by comparing to the ideal answers",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\nimport json # Used to convert a \"string\", into a python iterable \n # list of dictionaries [{},...]\n\n# Function that compares ideal(correct) answer with model answer.\ndef eval_response_with_ideal(response, # model output as 'string' format.\n ideal, # Iterable list of 'user' input and ideal 'assistant' output messages [{},...]\n debug=False): # Print desired statements when debug=True.\n \n if debug: # When debug=True\n print(\"response\") # print as 'response',\n print(response) # the model output completion/answer.\n \n # json.loads() function expects double quotes as \"string\" input, (not single quotes as 'string' input).\n json_like_str = response.replace(\"'\",'\"') # replace input response 'string' -> as \"string\".\n \n # parse \"string\" response \"[{},...]\" into a python iterable list of dictionaries [{},...]\n l_of_d = json.loads(json_like_str)\n \n # special case when response and ideal answer are empty lists [], (There is a match => 1).\n if l_of_d == [] and ideal == []:\n return 1\n \n # otherwise, when response is an empty list [] \n # or ideal answer should be empty list [], (There's a mismatch => 0).\n elif l_of_d == [] or ideal == []:\n return 0\n \n correct = 0 # Initialize counter as 'correct = 0'. \n \n if debug: # When debug=True\n print(\"l_of_d is\") # Print the response as a 'list of dictionaries'\n print(l_of_d) # in a python iterable [{},...] 
format.\n\n for d in l_of_d: # Iterate through each list element/message/dictionary from response\n # [ {msg0},{msg1},...,{msgN-1} ], every iteration.\n # {message} = {'category':'cat_name', 'products':[list of products]}\n\n cat = d.get('category') # Each iter, dict.get('category') name/value = dict['category'] name/value\n prod_l = d.get('products') # Each iter, dict.get('products') list/value = dict['products'] list/value\n\n if cat and prod_l: # if 'category' name and 'products' list exist (not empty)\n\n # convert response [list of products] into a non-repeated {set of 'products'}, for comparison.\n prod_set = set(prod_l)\n \n # get ideal answer set of {'products'} as ideal answer 'category' value (which is a dictionary {}), \n # using the previous obtained 'category' name/value, from model response.\n # dict.get('key') value = dict['key'] value.\n ideal_cat = ideal.get(cat)\n\n if ideal_cat: # If ideal set of {'products'} exist (it is not empty)\n prod_set_ideal = set(ideal.get(cat)) # be secure, to avoid repeated {'products' names at the set}\n\n else: # otherwise if {ideal set of 'products'} doesn't exist\n if debug: # and debug=True,\n print(f\"did not find category {cat} in ideal\") # print that the model response 'category' name\n # is not included at ideal answer,\n print(f\"ideal: {ideal}\") # and print ideal answer dict {'category':{set of 'prods'}}\n \n continue # or If {ideal set of 'products'} doesn't exist \n # and debug=false, just continue to the next line of code\n \n if debug: # When debug=True,\n print(\"prod_set\\n\",prod_set) # print the model response as a non-repeated {set of 'products'}\n print() # give blank space\n print(\"prod_set_ideal\\n\",prod_set_ideal) # and print the ideal answer non-repeated {set of 'products'}\n\n if prod_set == prod_set_ideal: # If model response {set of 'products'} = ideal answer {set of 'products'}\n if debug: # when debug=True,\n print(\"correct\") # print 'correct' text \n correct +=1 # anytime (also when 
debug=False) the model response {set} = ideal answer {set},\n # add (+1) to 'correct' variable inited as 0. correct = correct + 1.\n\n else: # Otherwise, when model response {set 0f 'products'} \n # is NOT equal to ideal answer {set of 'products'}.\n print(\"incorrect\") # print 'incorrect' text, then\n print(f\"prod_set: {prod_set}\") # show model reponse {set of 'products'}\n print(f\"prod_set_ideal: {prod_set_ideal}\") # and ideal answer {set of 'products'}\n\n if prod_set <= prod_set_ideal: # If model response {set of 'products'} is <= than ideal answer {set of 'products'}\n print(\"response is a subset of the ideal answer\") # print that model response is a subset of full ideal set.\n\n elif prod_set >= prod_set_ideal: # Otherwise, if model response {set of 'products'} is >= than ideal answer \n # {set of 'products'}\n print(\"response is a superset of the ideal answer\") # print that model response, \n # superpass the ideal answer set.\n\n # Out of for loop, count 'correct' product variable updates, over N total items/dicts in the response iterable \n # list [{},{},{},...]\n pc_correct = correct / len(l_of_d) # Correct responses (when response set = ideal set) / N_total_responses\n \n return pc_correct # return correct responses (when response set = ideal set) / N_total_responses\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\n# Select {message 7} of list, and pick \"customer_msg\" key, to obtain its value\r\nprint(f'Customer message: {msg_ideal_pairs_set[7][\"customer_msg\"]}')\r\n\r\n# Select {message 7} of list, and pick \"ideal_answer\" key, to obtain its value\r\nprint(f'Ideal answer: {msg_ideal_pairs_set[7][\"ideal_answer\"]}')\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```\nCustomer message: What Gaming consoles would be good for my friend who is into racing games?\r\nIdeal answer: {'Gaming Consoles and Accessories': {'GameSphere VR Headset', 'GameSphere X', 'ProGamer Controller', 'ProGamer Racing Wheel', 'GameSphere Y'}}\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\n# Use prompt function with 'system', 'few_shot_users', 'few_shot_assistants' and 'user_input' messages\n# which return output completion / model response (from its respectively function).\nresponse = find_category_and_product_v2(msg_ideal_pairs_set[7][\"customer_msg\"],\n products_and_category)\n\nprint(f'Response: {response}') # Print model response to 'customer_msg', \n # as 'user_input' query.\n\n# Compare model response with ideal answer, and return the \n# 'corrects' count (response set = ideal set) / Total N {dicts} at model response, as score.\neval_response_with_ideal(response,\n msg_ideal_pairs_set[7][\"ideal_answer\"])\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```\nResponse: \r\n [{'category': 'Gaming Consoles and Accessories', 'products': ['GameSphere X', 'ProGamer Controller', 'GameSphere Y', 'ProGamer Racing Wheel', 'GameSphere VR Headset']}]\r\n \r\n1.0\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Run evaluation on all test cases and calculate the fraction of cases that are correct",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```python\n# Note: this loop will not work, if any of the api calls time out\r\nscore_accum = 0 # Init counter as 0, to get the average score over\r\n # 10 cross-validation/development set examples.\r\n \r\nfor i, pair in enumerate(msg_ideal_pairs_set): # Iterate through each of 10 cross-validation / development set \r\n # examples [{example 0},{example 1},...,{example 9}], one at a time.\r\n \r\n print(f\"example {i}\") # print i-th example number, between 0-9.\r\n \r\n customer_msg = pair['customer_msg'] # Select at each pair/example dict {\"customer_msg\"} key, to get 'text' value\r\n ideal = pair['ideal_answer'] # Select at each pair/example dict {\"ideal_answer\"} key, to get as value \r\n # a dictionary with 'category' and a {set of 'products'}.\r\n \r\n # print(\"Customer message\",customer_msg)\r\n # print(\"ideal:\",ideal)\r\n response = find_category_and_product_v2(customer_msg, # 'user_input' query i-th example\r\n products_and_category) # products catalog dict {} in 'string' format\r\n # Get output completion/ model response message\r\n \r\n # print(\"products_by_category\",products_by_category)\r\n score = eval_response_with_ideal(response, # model response to each 'user_input' query/example\r\n ideal, # ideal 'assistant' answer/example\r\n debug=False) # Doesn't print some statements\r\n # Get score = 'correct' count / total N {dicts} at response \r\n # (average of corrects dicts, at one response)\r\n\r\n print(f\"{i}: {score}\") # Print score = 'correct' count / total N {dicts} at response. (Each iter/example)\r\n score_accum += score # accumulate all 10 scores (one per i-th example), as score_accum = score_accum + score \r\n # (init as 0)\r\n \r\n# Out of for loop\r\nn_examples = len(msg_ideal_pairs_set) # Get total number of items/dicts/examples at list, which is N = 10.\r\n # [{}0,{}1,...,{}9]\r\n\r\nfraction_correct = score_accum / n_examples # Last (10th) 'score_accum' value / N=10 examples. 
\r\n # (average of accum score of corrects, over all 10 examples)\r\n\r\nprint(f\"Fraction correct out of {n_examples}: {fraction_correct}\") # Print 'accum score' over all 10 examples,\r\n # only at the end of whole procedure. \r\n # (average of accum score of corrects, over all 10 examples)\n```",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "```\nexample 0\r\n0: 1.0\r\nexample 1\r\nincorrect\r\nprod_set: {'MobiTech PowerCase', 'MobiTech Wireless Charger'}\r\nprod_set_ideal: {'SmartX EarBuds', 'MobiTech PowerCase', 'MobiTech Wireless Charger'}\r\nresponse is a subset of the ideal answer\r\n1: 0.0\r\nexample 2\r\n2: 1.0\r\nexample 3\r\n3: 1.0\r\nexample 4\r\n4: 1.0\r\nexample 5\r\n5: 1.0\r\nexample 6\r\n6: 1.0\r\nexample 7\r\n7: 1.0\r\nexample 8\r\n8: 1.0\r\nexample 9\r\n9: 1\r\nFraction correct out of 10: 0.9\n```",
"metadata": {}
}
]
}