DiegoHernanSalazar/L9 Evaluation Part II .ipynb

## L9 Evaluation Part II .ipynb
{
  "metadata": {
    "kernelspec": {
      "name": "python",
      "display_name": "Python (Pyodide)",
      "language": "python"
    },
    "language_info": {
      "name": ""
    }
  },
  "nbformat_minor": 4,
  "nbformat": 4,
  "cells": [
    {
      "cell_type": "markdown",
      "source": "<img src=\"https://wordpress.deeplearning.ai/wp-content/uploads/2023/05/OpenAI-DeepLearning_Short_Courses_Campaign_1080.png\"/>",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "# L9: Evaluation Part II\r\n\r\nEvaluate LLM responses where there isn't a single \"right answer.\"",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "## Setup\r\n#### Load the API key and relevant Python libaries.\r\nIn this course, we've provided some code that loads the OpenAI API key for yo",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```python\nimport openai            # Load OpenAI API library constructor/object\r\nimport os                # Set your openai API key secretly, at the \r\n                         # operating system, using os constructor\r\nimport sys               # access to system-specific parameters and functions in Python\r\nsys.path.append('../..') # it searches modules to be imported at the new directory path '../..'\r\nimport utils             # make common patterns shorter and easier in python, \r\n                         # via small functions and classes. Load function\r\n                         # 'utils.get_products_and_category()' already built in.\r\n\r\nimport tiktoken          # get tokenizer constructor to get input, output and total tokens count\r\n\r\nfrom dotenv import load_dotenv, find_dotenv # Get load and Read functions\r\n_ = load_dotenv(find_dotenv()) # Read and then Load local file called '.env' that contains API key\r\n                               # at os -> environ dictionary {} -> ['OPENAI_API_KEY'] key -> write API key value as 'sk-...'\r\n\r\n# Get an openai API key from the OpenAI website, and set your API key as openai.api_key=\"sk-...\" \r\nopenai.api_key  = os.environ['OPENAI_API_KEY'] # This way you set/store the API key more securely, as an\r\n                                               # environmental variable, into the operating system, using\r\n                                               # openai.api_key=os.getenv('OPEN_API_KEY')\r\n                                               # so the API key's value is set in the jupyter environment,\r\n                                               # by selecting this written API 'key' value at: \r\n                                               # os -> environ dictionary {} -> ['OPENAI_API_KEY'] key \r\n                                               # -> API key value is a text/string 'sk-...'\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```python\ndef get_completion_from_messages(messages, # List of multiple messages [{},{},...]\n                                 model=\"gpt-3.5-turbo\", # model's name used\n                                 temperature=0,   # Randomness in model's response      \n                                 max_tokens=500): # max input+output characters\n    response = openai.ChatCompletion.create(\n        model=model,             # \"gpt-3.5-turbo\"\n        messages=messages,       # [{},{},...]\n        temperature=temperature, # this is the degree of randomness/exploration in the model's output. \n                                 # Max exploration occurs when temperature is closer to 1\n        max_tokens=max_tokens,   # maximum number of tokens (common sequences of characters) \n                                 # the model can accept as input'context' + output'completion' \n    )\n    return response.choices[0].message[\"content\"] # model's response message or output completion\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "### Run through the end-to-end system to answer the user query\r\n\r\nThese helper functions are running the chain of promopts that you saw in the earlier videos.",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```python\n# 'user_input' or customer query message\r\ncustomer_msg = f\"\"\"\r\ntell me about the smartx pro phone and the fotosnap camera, the dslr one.\r\nAlso, what TVs or TV related products do you have?\"\"\"\r\n\r\n# Get first products catalog as list of dicts [{},{},{},...] in 'string' format.\r\n# This is the 1st model response to customer query message, which names some products.\r\n# Each product name is in a dict = {\"category\":'text',\"products\":[list of text 'products']}\r\nproducts_by_category = utils.get_products_from_query(customer_msg)\r\nprint(products_by_category,'\\n')\r\n\r\n# Convert 'string' into a python iterable list, with 'json' constructor. \r\n# Then return 2nd model's response to 'user_input' or customer message, \r\n# based on the previous 1st list of products organized by categories.\r\n# 2nd model response =[{'category':'text','products':[list of products]},\r\n#                      {'category':'text','products':[list of products]},\r\n#                                           .....                        \r\n#                     ]\r\ncategory_and_product_list = utils.read_string_to_list(products_by_category)\r\nprint(category_and_product_list,'\\n')\r\n\r\n# Get products_by_category output (also called detailed products information, see L5), \r\n# per each 'category' at the 2nd model response, also as a list  \r\n# of products [{},{},{},...]. Each product is a dictionary that includes \r\n# {'name':'text','category':'text','brand':'text','model_number':'text',\r\n# 'warranty':'text','rating':value,'features':[list of features as 'text'],\r\n# 'description':'text','price':value}\r\nproduct_info = utils.get_mentioned_product_info(category_and_product_list)\r\nprint(product_info)\r\n\r\n# Take 'user_input' and the [list of {detailed products information}... ],\r\n# and return the final model response sentence.\r\nassistant_answer = utils.answer_user_msg(user_msg=customer_msg,product_info=product_info)\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```\n[\r\n    {\r\n        \"category\": \"Smartphones and Accessories\",\r\n        \"products\": [\r\n            \"SmartX ProPhone\"\r\n        ]\r\n    },\r\n    {\r\n        \"category\": \"Cameras and Camcorders\",\r\n        \"products\": [\r\n            \"FotoSnap DSLR Camera\"\r\n        ]\r\n    },\r\n    {\r\n        \"category\": \"Televisions and Home Theater Systems\",\r\n        \"products\": [\r\n            \"CineView 4K TV\",\r\n            \"CineView 8K TV\",\r\n            \"SoundMax Home Theater\",\r\n            \"SoundMax Soundbar\",\r\n            \"CineView OLED TV\"\r\n        ]\r\n    }\r\n] \r\n\r\n[{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}, {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']}, {'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'CineView 8K TV', 'SoundMax Home Theater', 'SoundMax Soundbar', 'CineView OLED TV']}] \r\n\r\n[{'name': 'SmartX ProPhone', 'category': 'Smartphones and Accessories', 'brand': 'SmartX', 'model_number': 'SX-PP10', 'warranty': '1 year', 'rating': 4.6, 'features': ['6.1-inch display', '128GB storage', '12MP dual camera', '5G'], 'description': 'A powerful smartphone with advanced camera features.', 'price': 899.99}, {'name': 'FotoSnap DSLR Camera', 'category': 'Cameras and Camcorders', 'brand': 'FotoSnap', 'model_number': 'FS-DSLR200', 'warranty': '1 year', 'rating': 4.7, 'features': ['24.2MP sensor', '1080p video', '3-inch LCD', 'Interchangeable lenses'], 'description': 'Capture stunning photos and videos with this versatile DSLR camera.', 'price': 599.99}, {'name': 'CineView 4K TV', 'category': 'Televisions and Home Theater Systems', 'brand': 'CineView', 'model_number': 'CV-4K55', 'warranty': '2 years', 'rating': 4.8, 'features': ['55-inch display', '4K resolution', 'HDR', 'Smart TV'], 'description': 'A stunning 4K TV with vibrant colors and smart features.', 'price': 599.99}, {'name': 
'CineView 8K TV', 'category': 'Televisions and Home Theater Systems', 'brand': 'CineView', 'model_number': 'CV-8K65', 'warranty': '2 years', 'rating': 4.9, 'features': ['65-inch display', '8K resolution', 'HDR', 'Smart TV'], 'description': 'Experience the future of television with this stunning 8K TV.', 'price': 2999.99}, {'name': 'SoundMax Home Theater', 'category': 'Televisions and Home Theater Systems', 'brand': 'SoundMax', 'model_number': 'SM-HT100', 'warranty': '1 year', 'rating': 4.4, 'features': ['5.1 channel', '1000W output', 'Wireless subwoofer', 'Bluetooth'], 'description': 'A powerful home theater system for an immersive audio experience.', 'price': 399.99}, {'name': 'SoundMax Soundbar', 'category': 'Televisions and Home Theater Systems', 'brand': 'SoundMax', 'model_number': 'SM-SB50', 'warranty': '1 year', 'rating': 4.3, 'features': ['2.1 channel', '300W output', 'Wireless subwoofer', 'Bluetooth'], 'description': \"Upgrade your TV's audio with this sleek and powerful soundbar.\", 'price': 199.99}, {'name': 'CineView OLED TV', 'category': 'Televisions and Home Theater Systems', 'brand': 'CineView', 'model_number': 'CV-OLED55', 'warranty': '2 years', 'rating': 4.7, 'features': ['55-inch display', '4K resolution', 'HDR', 'Smart TV'], 'description': 'Experience true blacks and vibrant colors with this OLED TV.', 'price': 1499.99}]\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```python\nprint(assistant_answer)     # Print the final model response/paragraph\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```\nThe SmartX ProPhone features a 6.1-inch display, 128GB storage, 12MP dual camera, and 5G connectivity. It is priced at $899.99 with a 1-year warranty. The FotoSnap DSLR Camera has a 24.2MP sensor, shoots 1080p video, has a 3-inch LCD screen, and supports interchangeable lenses. It is priced at $599.99 with a 1-year warranty.\r\n\r\nFor TVs and related products, we have the CineView 4K TV with a 55-inch display, 4K resolution, HDR, and Smart TV features priced at $599.99. We also offer the CineView 8K TV with a 65-inch display, 8K resolution, HDR, and Smart TV capabilities for $2999.99. Additionally, we have the SoundMax Home Theater system with 5.1 channels and a wireless subwoofer for $399.99, and the SoundMax Soundbar with 2.1 channels and Bluetooth connectivity for $199.99. Is there a specific product you would like more details on or are you interested in any particular features?\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "### Evaluate the LLM's answer to the user with a rubric, based on the extracted product information",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```python\n# Create a little data structure to store the 'customer_message' as well as the 'detailed_product_info' \ncust_prod_info = {\n    'customer_msg': customer_msg,\n    'context': product_info\n}\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```python\ndef eval_with_rubric(test_set, assistant_answer):\r\n\r\n    cust_msg = test_set['customer_msg']\r\n    context = test_set['context']\r\n    completion = assistant_answer\r\n    \r\n    system_message = \"\"\"\\\r\n    You are an assistant that evaluates how well the customer service agent \\\r\n    answers a user question by looking at the context that the customer service \\\r\n    agent is using to generate its response. \r\n    \"\"\"\r\n    \r\n    # 'New_user_msg' including 'customer_msg', 'detailed products info list' and 'model/assistant response'\r\n    user_message = f\"\"\"\\\r\nYou are evaluating a submitted answer to a question based on the context \\\r\nthat the agent uses to answer the question.\r\nHere is the data:\r\n    [BEGIN DATA]\r\n    ************\r\n    [Question]: {cust_msg}\r\n    ************\r\n    [Context]: {context}\r\n    ************\r\n    [Submission]: {completion}\r\n    ************\r\n    [END DATA]\r\n\r\nCompare the factual content of the submitted answer with the context. \\\r\nIgnore any differences in style, grammar, or punctuation.\r\nAnswer the following questions:\r\n    - Is the Assistant response based only on the context provided? (Y or N)\r\n    - Does the answer include information that is not provided in the context? (Y or N)\r\n    - Is there any disagreement between the response and the context? (Y or N)\r\n    - Count how many questions the user asked. (output a number)\r\n    - For each question that the user asked, is there a corresponding answer to it?\r\n      Question 1: (Y or N)\r\n      Question 2: (Y or N)\r\n      ...\r\n      Question N: (Y or N)\r\n    - Of the number of questions asked, how many of these questions were addressed by the answer? 
(output a number)\r\n\"\"\"\r\n\r\n    messages = [\r\n        {'role': 'system', 'content': system_message},\r\n        {'role': 'user', 'content': user_message}\r\n    ]\r\n\r\n    response = get_completion_from_messages(messages) # model response about evaluation, \r\n                                                      # with suggested rubric or set of guidelines.\r\n    return response   # return model response with evaluation, considering suggested rubric or set of guidelines,\r\n                      # we think the answer should get right for us to be considered as a good answer.\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```python\n# Run the rubric evaluation function, created above.\r\nevaluation_output = eval_with_rubric(cust_prod_info, assistant_answer)\r\n\r\nprint(evaluation_output) # Print model evaluation response, following suggested rubric/guidelines.\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```\n- Is the Assistant response based only on the context provided? (Y or N)  \r\nY\r\n\r\n- Does the answer include information that is not provided in the context? (Y or N)  \r\nN\r\n\r\n- Is there any disagreement between the response and the context? (Y or N)  \r\nN\r\n\r\n- Count how many questions the user asked. (output a number)  \r\n2\r\n\r\n- For each question that the user asked, is there a corresponding answer to it?  \r\nQuestion 1: Y  \r\nQuestion 2: Y  \r\n\r\n- Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)  \r\n2\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "### Evaluate the LLM's answer to the user based on an \"ideal\" / \"expert\" (human generated) answer.",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```python\ntest_set_ideal = {\r\n    'customer_msg': \"\"\"\\\r\ntell me about the smartx pro phone and the fotosnap camera, the dslr one.\r\nAlso, what TVs or TV related products do you have?\"\"\",\r\n    'ideal_answer':\"\"\"\\\r\nOf course!  The SmartX ProPhone is a powerful \\\r\nsmartphone with advanced camera features. \\\r\nFor instance, it has a 12MP dual camera. \\\r\nOther features include 5G wireless and 128GB storage. \\\r\nIt also has a 6.1-inch display.  The price is $899.99.\r\n\r\nThe FotoSnap DSLR Camera is great for \\\r\ncapturing stunning photos and videos. \\\r\nSome features include 1080p video, \\\r\n3-inch LCD, a 24.2MP sensor, \\\r\nand interchangeable lenses. \\\r\nThe price is 599.99.\r\n\r\nFor TVs and TV related products, we offer 3 TVs \\\r\n\r\n\r\nAll TVs offer HDR and Smart TV.\r\n\r\nThe CineView 4K TV has vibrant colors and smart features. \\\r\nSome of these features include a 55-inch display, \\\r\n'4K resolution. It's priced at 599.\r\n\r\nThe CineView 8K TV is a stunning 8K TV. \\\r\nSome features include a 65-inch display and \\\r\n8K resolution.  It's priced at 2999.99\r\n\r\nThe CineView OLED TV lets you experience vibrant colors. \\\r\nSome features include a 55-inch display and 4K resolution. \\\r\nIt's priced at 1499.99.\r\n\r\nWe also offer 2 home theater products, both which include bluetooth.\\\r\nThe SoundMax Home Theater is a powerful home theater system for \\\r\nan immersive audio experience.\r\nIts features include 5.1 channel, 1000W output, and wireless subwoofer.\r\nIt's priced at 399.99.\r\n\r\nThe SoundMax Soundbar is a sleek and powerful soundbar.\r\nIt's features include 2.1 channel, 300W output, and wireless subwoofer.\r\nIt's priced at 199.99\r\n\r\nAre there any questions additional you may have about these products \\\r\nthat you mentioned here?\r\nOr may do you have other questions I can help you with?\r\n    \"\"\"\r\n}\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "### Check if the LLM's response agrees with or disagrees with the expert answer\r\n\r\nThis evaluation prompt is from the [OpenAI evals](https://github.com/openai/evals/blob/main/evals/registry/modelgraded/fact.yaml) project.\r\n\r\n[BLEU score](https://en.wikipedia.org/wiki/BLEU): another way to evaluate whether two pieces of text are similar or not.",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```python\ndef eval_vs_ideal(test_set, assistant_answer):\r\n\r\n    cust_msg = test_set['customer_msg']\r\n    ideal = test_set['ideal_answer']\r\n    completion = assistant_answer\r\n    \r\n    system_message = \"\"\"\\\r\n    You are an assistant that evaluates how well the customer service agent \\\r\n    answers a user question by comparing the response to the ideal (expert) response\r\n    Output a single letter and nothing else. \r\n    \"\"\"\r\n    \r\n    # 'New_user_msg' including 'customer_msg', 'ideal desired/assistant answer' and 'model/assistant response'\r\n    user_message = f\"\"\"\\\r\nYou are comparing a submitted answer to an expert answer on a given question. Here is the data:\r\n    [BEGIN DATA]\r\n    ************\r\n    [Question]: {cust_msg}\r\n    ************\r\n    [Expert]: {ideal}\r\n    ************\r\n    [Submission]: {completion}\r\n    ************\r\n    [END DATA]\r\n\r\nCompare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.\r\n    The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:\r\n    (A) The submitted answer is a subset of the expert answer and is fully consistent with it.\r\n    (B) The submitted answer is a superset of the expert answer and is fully consistent with it.\r\n    (C) The submitted answer contains all the same details as the expert answer.\r\n    (D) There is a disagreement between the submitted answer and the expert answer.\r\n    (E) The answers differ, but these differences don't matter from the perspective of factuality.\r\n  choice_strings: ABCDE\r\n  Evaluate an discard each option as follows: A, B, C, D, E. After that, pick what you consider is the correct option \\\r\n  by comparing, the submitted answer with the expert answer in a non-extreme or rigorous way. 
Ignore any differences in style, \\ \r\n  grammar, punctuation or structure, to evaluate submitted answer as a subset fully consistent with expert answer (option A). \\ \r\n  Do that before answering the question.\r\n\"\"\"\r\n\r\n    messages = [\r\n        {'role': 'system', 'content': system_message},\r\n        {'role': 'user', 'content': user_message}\r\n    ]\r\n\r\n    response = get_completion_from_messages(messages) # Evaluation response, comparing ideal answer and model response,\r\n                                                      # via 'user_message' prompt.\r\n    return response # return Evaluation response, comparing ideal answer and model response, vía 'user_message' prompt.\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```python\n# Let's recall the model/'assistant' output response, to be compared/evaluated.\nprint(assistant_answer)\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```\nThe SmartX ProPhone features a 6.1-inch display, 128GB storage, 12MP dual camera, and 5G connectivity. It is priced at $899.99 with a 1-year warranty. The FotoSnap DSLR Camera has a 24.2MP sensor, shoots 1080p video, has a 3-inch LCD screen, and supports interchangeable lenses. It is priced at $599.99 with a 1-year warranty.\n\nFor TVs and related products, we have the CineView 4K TV with a 55-inch display, 4K resolution, HDR, and Smart TV features priced at $599.99. We also offer the CineView 8K TV with a 65-inch display, 8K resolution, HDR, and Smart TV capabilities for $2999.99. Additionally, we have the SoundMax Home Theater system with 5.1 channels and a wireless subwoofer for $399.99, and the SoundMax Soundbar with 2.1 channels and Bluetooth connectivity for $199.99. Is there a specific product you would like more details on or are you interested in any particular features?\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```python\n# Call evaluation function to compare model response with ideal answer\neval_vs_ideal(test_set_ideal, assistant_answer)\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "'C'",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```python\n# Let's try a very different model/'assistant' answer, from \"Forrest Gump\" movie.\nassistant_answer_2 = \"life is like a box of chocolates\"\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "```python\n# Call evaluation function to compare our new model response, with ideal answer.\neval_vs_ideal(test_set_ideal, assistant_answer_2)\n```",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "'D'",
      "metadata": {}
    }
  ]
}
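
The notebook above names the BLEU score as another way to check whether two pieces of text are similar. The following is a minimal sketch of that idea, not the course's code: `ngrams` and `simple_bleu` are hypothetical helper names, and the sketch assumes whitespace tokenization, a single reference text, no smoothing, and n-grams up to length 2 (BLEU is usually reported with n up to 4). Libraries such as NLTK ship more complete implementations.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=2):
    """Simplified sentence-level BLEU against a single reference.

    Geometric mean of the clipped n-gram precisions (n = 1..max_n),
    multiplied by the brevity penalty. There is no smoothing, so a
    zero precision at any n gives a score of 0.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalize candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


# Illustrative strings only; they are not taken from the notebook's test set.
expert = "The price is 599.99 and it has a 55-inch display"
close = "It has a 55-inch display and the price is 599.99"
unrelated = "life is like a box of chocolates"

score_close = simple_bleu(expert, close)
score_unrelated = simple_bleu(expert, unrelated)
print(score_close, score_unrelated)
```

A reordered answer with the same wording scores high, while the unrelated sentence shares no bigram with the reference and scores 0. Unlike the model-graded comparison, this measures only surface word overlap, so a factually equivalent answer that is phrased differently can still score low.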