@Daethyra
Last active December 8, 2023 04:38
This walkthrough guides you in creating a dataset to evaluate chatbots within multi-turn conversations.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/Daethyra/daeea7c9f805233dc0c248e8db6a5d02/copy-of-chat_evaluation_single_turn.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"id": "d5bc7d94-ad28-4959-af26-d436035a2737",
"metadata": {
"id": "d5bc7d94-ad28-4959-af26-d436035a2737"
},
"source": [
"# Evaluate a Conversational Chat Bot\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langsmith-cookbook/blob/main/testing-examples/chat-single-turn/chat_evaluation_single_turn.ipynb)\n",
"\n",
"Chat bots, designed to assist users or provide entertainment, typically engage in multiple conversational turns in any given chat session. Over extended interactions, there is potential for the quality of the LLM responses to diminish. This can manifest as difficulty in recalling earlier conversation details, improper interaction with external resources (if available), or repetitive and uninspired responses. If your current offline evaluations focus solely on individual question-answer pairs, such nuances may go unnoticed.\n",
"\n",
"This walkthrough guides you in creating a dataset to evaluate chatbots within multi-turn conversations. Rather than complicate the evaluation with simulation-based methods, the technique outlined below simplifies the process: we treat each data point as an individual dialogue turn. The image below shows two rows from the example dataset used in this walkthrough. The user's question would be difficult to comprehend when taken out of context, but the chat history provides enough information to cue the bot in on what should be included.\n",
"\n",
"![Simple dialogue examples](https://github.com/langchain-ai/langsmith-cookbook/blob/main/testing-examples/chat-single-turn/img/comparison_screenshot.png?raw=1)\n",
"\n",
"Evaluating each dialogue turn independently provides an additional benefit: if the examples are sampled from different stages of actual dialogues, they effectively capture and represent typical scenarios your bot should be able to handle well.\n",
"\n",
"This notebook provides a concise overview of the process. Let's get started!"
]
},
{
"cell_type": "markdown",
"id": "0e4febc5-2d9e-4a43-83ba-d84725ecb466",
"metadata": {
"id": "0e4febc5-2d9e-4a43-83ba-d84725ecb466"
},
"source": [
"## Prerequisites\n",
"\n",
"This walkthrough uses LangChain, LangSmith, and OpenAI. Install the packages below, then set your LangSmith and OpenAI API keys."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8a84cf96-099b-41c9-a286-85640db065c0",
"metadata": {
"id": "8a84cf96-099b-41c9-a286-85640db065c0",
"outputId": "ef992708-6e78-425f-e21c-00ccc9448bd7",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Collecting langchain\n",
" Downloading langchain-0.0.348-py3-none-any.whl (2.0 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.0/2.0 MB\u001b[0m \u001b[31m12.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting openai\n",
" Downloading openai-1.3.7-py3-none-any.whl (221 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m221.4/221.4 kB\u001b[0m \u001b[31m15.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: PyYAML>=5.3 in /usr/local/lib/python3.10/dist-packages (from langchain) (6.0.1)\n",
"Requirement already satisfied: SQLAlchemy<3,>=1.4 in /usr/local/lib/python3.10/dist-packages (from langchain) (2.0.23)\n",
"Requirement already satisfied: aiohttp<4.0.0,>=3.8.3 in /usr/local/lib/python3.10/dist-packages (from langchain) (3.9.1)\n",
"Requirement already satisfied: async-timeout<5.0.0,>=4.0.0 in /usr/local/lib/python3.10/dist-packages (from langchain) (4.0.3)\n",
"Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)\n",
" Downloading dataclasses_json-0.6.3-py3-none-any.whl (28 kB)\n",
"Collecting jsonpatch<2.0,>=1.33 (from langchain)\n",
" Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)\n",
"Collecting langchain-core<0.1,>=0.0.12 (from langchain)\n",
" Downloading langchain_core-0.0.12-py3-none-any.whl (181 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m181.5/181.5 kB\u001b[0m \u001b[31m14.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting langsmith<0.1.0,>=0.0.63 (from langchain)\n",
" Downloading langsmith-0.0.69-py3-none-any.whl (48 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m48.2/48.2 kB\u001b[0m \u001b[31m5.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: numpy<2,>=1 in /usr/local/lib/python3.10/dist-packages (from langchain) (1.23.5)\n",
"Requirement already satisfied: pydantic<3,>=1 in /usr/local/lib/python3.10/dist-packages (from langchain) (1.10.13)\n",
"Requirement already satisfied: requests<3,>=2 in /usr/local/lib/python3.10/dist-packages (from langchain) (2.31.0)\n",
"Requirement already satisfied: tenacity<9.0.0,>=8.1.0 in /usr/local/lib/python3.10/dist-packages (from langchain) (8.2.3)\n",
"Requirement already satisfied: anyio<4,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from openai) (3.7.1)\n",
"Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai) (1.7.0)\n",
"Collecting httpx<1,>=0.23.0 (from openai)\n",
" Downloading httpx-0.25.2-py3-none-any.whl (74 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.0/75.0 kB\u001b[0m \u001b[31m9.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from openai) (1.3.0)\n",
"Requirement already satisfied: tqdm>4 in /usr/local/lib/python3.10/dist-packages (from openai) (4.66.1)\n",
"Requirement already satisfied: typing-extensions<5,>=4.5 in /usr/local/lib/python3.10/dist-packages (from openai) (4.5.0)\n",
"Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (23.1.0)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (6.0.4)\n",
"Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (1.9.3)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (1.4.0)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (1.3.1)\n",
"Requirement already satisfied: idna>=2.8 in /usr/local/lib/python3.10/dist-packages (from anyio<4,>=3.5.0->openai) (3.6)\n",
"Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<4,>=3.5.0->openai) (1.2.0)\n",
"Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain)\n",
" Downloading marshmallow-3.20.1-py3-none-any.whl (49 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.4/49.4 kB\u001b[0m \u001b[31m5.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain)\n",
" Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)\n",
"Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.23.0->openai) (2023.11.17)\n",
"Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)\n",
" Downloading httpcore-1.0.2-py3-none-any.whl (76 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m76.9/76.9 kB\u001b[0m \u001b[31m4.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)\n",
" Downloading h11-0.14.0-py3-none-any.whl (58 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m6.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting jsonpointer>=1.9 (from jsonpatch<2.0,>=1.33->langchain)\n",
" Downloading jsonpointer-2.4-py2.py3-none-any.whl (7.8 kB)\n",
"Requirement already satisfied: packaging<24.0,>=23.2 in /usr/local/lib/python3.10/dist-packages (from langchain-core<0.1,>=0.0.12->langchain) (23.2)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2->langchain) (3.3.2)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2->langchain) (2.0.7)\n",
"Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.10/dist-packages (from SQLAlchemy<3,>=1.4->langchain) (3.0.1)\n",
"Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain)\n",
" Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)\n",
"Installing collected packages: mypy-extensions, marshmallow, jsonpointer, h11, typing-inspect, langsmith, jsonpatch, httpcore, langchain-core, httpx, dataclasses-json, openai, langchain\n",
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
"llmx 0.0.15a0 requires cohere, which is not installed.\n",
"llmx 0.0.15a0 requires tiktoken, which is not installed.\u001b[0m\u001b[31m\n",
"\u001b[0mSuccessfully installed dataclasses-json-0.6.3 h11-0.14.0 httpcore-1.0.2 httpx-0.25.2 jsonpatch-1.33 jsonpointer-2.4 langchain-0.0.348 langchain-core-0.0.12 langsmith-0.0.69 marshmallow-3.20.1 mypy-extensions-1.0.0 openai-1.3.7 typing-inspect-0.9.0\n"
]
}
],
"source": [
"%pip install -U langchain openai pandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ce3a7158-4e4b-43a7-845d-b3091ea7b9f5",
"metadata": {
"id": "ce3a7158-4e4b-43a7-845d-b3091ea7b9f5"
},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Replace the placeholder values with your own LangSmith and OpenAI API keys\n",
"os.environ[\"LANGCHAIN_API_KEY\"] = \"key\"\n",
"os.environ[\"OPENAI_API_KEY\"] = \"key\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9baf7b71-4a29-4a0b-835f-9929169bfc82",
"metadata": {
"id": "9baf7b71-4a29-4a0b-835f-9929169bfc82"
},
"outputs": [],
"source": [
"import uuid\n",
"\n",
"from langsmith import Client\n",
"\n",
"uid = uuid.uuid4()\n",
"client = Client()"
]
},
{
"cell_type": "markdown",
"id": "cfd8a62c-ac6b-46b5-ba55-2eb4f4f51a55",
"metadata": {
"id": "cfd8a62c-ac6b-46b5-ba55-2eb4f4f51a55"
},
"source": [
"## Step 1: Create dataset\n",
"\n",
"The key component of a multi-turn dialogue dataset is a list of conversational \"turns\". For each dataset example row, we will evaluate how well the chat bot can respond, taking the conversation history into account.\n",
"\n",
"To accomplish this, each example stores the conversation so far as a list of chat message dictionaries (`chat_history`), with the \"incoming\" user query factored out into a separate `question` field."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "3ea885a4-83e4-45a5-b576-a14e085699d4",
"metadata": {
"id": "3ea885a4-83e4-45a5-b576-a14e085699d4"
},
"outputs": [],
"source": [
"examples = [\n",
" {\n",
" \"inputs\": {\n",
" \"question\": \"How does that apply?\",\n",
" \"chat_history\": [\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": \"I’m trying tto uunderstand instein’s theory.\",\n",
" },\n",
" {\n",
" \"role\": \"assistant\",\n",
" \"content\": \"Which one? He's known for several theories.\",\n",
" },\n",
" {\"role\": \"user\", \"content\": \"Thee one about time and spae.\"},\n",
" {\n",
" \"role\": \"assistant\",\n",
" \"content\": \"Ah, you're referring to the theory of relativity. There are two parts: special and general. Which one?\",\n",
" },\n",
" ],\n",
" },\n",
" \"outputs\": {\n",
" \"expected\": \"Special relativity, proposed by Einstein in 1905, deals with objects in uniform motion, especially those moving at the speed of light. It introduced the idea that time and space are relative and can change in relation to each other. For instance, time can appear to move slower for an object moving close to the speed of light.\"\n",
" },\n",
" },\n",
" {\n",
" \"inputs\": {\n",
" \"question\": \"What's the main difference?\",\n",
" \"chat_history\": [\n",
" {\"role\": \"user\", \"content\": \"Can yyou contrast DNA and RNA for me?\"},\n",
" {\n",
" \"role\": \"assistant\",\n",
" \"content\": \"Certainly. DNA and RNA are both nucleic acids but have different roles, structures, and properties. Do you want specifics?\",\n",
" },\n",
" ],\n",
" },\n",
" \"outputs\": {\n",
" \"expected\": \"The main structural differences between DNA and RNA include: 1) DNA is double-stranded while RNA is single-stranded. 2) The sugar in the backbone of RNA is ribose, whereas in DNA it's deoxyribose. 3) DNA uses the bases adenine (A), cytosine (C), guanine (G), and thymine (T); RNA uses adenine (A), cytosine (C), guanine (G), and uracil (U) instead of thymine.\"\n",
" },\n",
" },\n",
" {\n",
" \"inputs\": {\n",
" \"question\": \"what led them to such a draastic action?\",\n",
" \"chat_history\": [\n",
" {\"role\": \"user\", \"content\": \"tell me about the Boston Tea Party.\"},\n",
" {\n",
" \"role\": \"assistant\",\n",
" \"content\": \"The Boston Tea Party was a political protest by the American colonists against the British government in 1773. They were protesting the Tea Act, which allowed the British East India Company to sell tea directly to the colonies, bypassing colonial merchants.\",\n",
" },\n",
" ],\n",
" },\n",
" \"outputs\": {\n",
" \"expected\": \"The colonists undertook the Boston Tea Party as a drastic action due to multiple reasons: 1) They believed the Tea Act was a violation of their rights as Englishmen, as they were being taxed without their consent. 2) The act gave the British East India Company a monopoly on tea sales, threatening local businesses. 3) The act was seen as another example of the British government's increasing interference in colonial affairs. The protest was a way to show their strong opposition to British policies.\"\n",
" },\n",
" },\n",
" {\n",
" \"inputs\": {\n",
" \"question\": \"thats a scary one. can it be avoideed?\",\n",
" \"chat_history\": [\n",
" {\"role\": \"user\", \"content\": \"I'm learning bout genetic disorders.\"},\n",
" {\n",
" \"role\": \"assistant\",\n",
" \"content\": \"Genetic disorders are diseases caused by abnormalities in an individual's DNA. They can be inherited or result from mutations. One common one is Huntington's disease.\",\n",
" },\n",
" ],\n",
" },\n",
" \"outputs\": {\n",
" \"expected\": \"Huntington's disease is a hereditary genetic disorder caused by a mutation in the HTT gene. If a person inherits the defective gene, they will eventually develop the disease. Currently, there's no cure for Huntington's, but its onset can be delayed with treatment. Genetic counseling and testing can help prospective parents understand the risks of passing the mutation to their offspring.\"\n",
" },\n",
" },\n",
" {\n",
" \"inputs\": {\n",
" \"question\": \"Which one?\",\n",
" \"chat_history\": [\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": \"I'm confused aboutt stars. what even aaaare they?\",\n",
" },\n",
" {\n",
" \"role\": \"assistant\",\n",
" \"content\": \"Stars are celestial bodies made mostly of hydrogen and helium. They generate light and heat through nuclear fusion in their cores.\",\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": \"there''s a classification based on theirbrightness, right?\",\n",
" },\n",
" {\n",
" \"role\": \"assistant\",\n",
" \"content\": \"Yes\",\n",
" },\n",
" ],\n",
" },\n",
" \"outputs\": {\n",
" \"expected\": \"Yes, stars are classified based on their brightness using a system called the Hertzsprung-Russell (H-R) diagram. In this diagram, stars are categorized into main-sequence stars, giants, supergiants, and white dwarfs, based on their luminosity and temperature. The Sun, for instance, is a main-sequence star.\"\n",
" },\n",
" },\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f029f8f7-ba87-4210-8a68-2f51fb91e30d",
"metadata": {
"id": "f029f8f7-ba87-4210-8a68-2f51fb91e30d"
},
"outputs": [],
"source": [
"dataset_name = f\"Chat Bot Evals Single-Turn Example - {uid}\"\n",
"dataset = client.create_dataset(dataset_name)\n",
"client.create_examples(\n",
" inputs=[e[\"inputs\"] for e in examples],\n",
" outputs=[e[\"outputs\"] for e in examples],\n",
" dataset_id=dataset.id,\n",
")"
]
},
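{
"cell_type": "markdown",
"id": "sketch-conversation-to-examples",
"metadata": {
"id": "sketch-conversation-to-examples"
},
"source": [
"The rows above were written by hand. If you sample from logged conversations instead, every assistant reply that follows a user message can become one dataset row. The snippet below is a minimal sketch of that idea and is not part of the original walkthrough: `conversation_to_examples` is a hypothetical helper, and it assumes the log uses the same `role`/`content` dictionaries as above and that the logged replies have been reviewed and are good enough to serve as reference answers.\n",
"\n",
"```python\n",
"def conversation_to_examples(messages):\n",
"    '''Turn one logged conversation into per-turn dataset rows (hypothetical helper).'''\n",
"    rows = []\n",
"    for i, message in enumerate(messages):\n",
"        # Only assistant replies that directly follow a user message become rows.\n",
"        if i == 0 or message['role'] != 'assistant':\n",
"            continue\n",
"        previous = messages[i - 1]\n",
"        if previous['role'] != 'user':\n",
"            continue\n",
"        rows.append(\n",
"            {\n",
"                'inputs': {\n",
"                    'question': previous['content'],\n",
"                    # Everything before the incoming user query is history.\n",
"                    'chat_history': messages[: i - 1],\n",
"                },\n",
"                'outputs': {'expected': message['content']},\n",
"            }\n",
"        )\n",
"    return rows\n",
"\n",
"\n",
"# Made-up log for illustration only\n",
"log = [\n",
"    {'role': 'user', 'content': 'tell me about the Boston Tea Party.'},\n",
"    {'role': 'assistant', 'content': 'It was a political protest in 1773.'},\n",
"    {'role': 'user', 'content': 'what led them to such a drastic action?'},\n",
"    {'role': 'assistant', 'content': 'Colonists objected to the Tea Act.'},\n",
"]\n",
"rows = conversation_to_examples(log)\n",
"print(len(rows))  # 2\n",
"print(rows[1]['inputs']['question'])\n",
"```\n",
"\n",
"Each row has the same `inputs`/`outputs` shape as the `examples` list, so under those assumptions it could be passed to `client.create_examples` in the same way."
]
},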
{
"cell_type": "markdown",
"id": "08e57fe3-877c-4509-b487-4cacfc1ba1a7",
"metadata": {
"id": "08e57fe3-877c-4509-b487-4cacfc1ba1a7"
},
"source": [
"## Step 2: Define chat bot\n",
"\n",
"For this tutorial, our simple chat bot formats the messages for an LLM and responds with the resulting content without consulting any external resources."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "2428d0d5-f5a7-4def-bd10-09f3130d3d75",
"metadata": {
"id": "2428d0d5-f5a7-4def-bd10-09f3130d3d75"
},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
"from langchain.schema.output_parser import StrOutputParser\n",
"\n",
"# An example chain\n",
"chain = (\n",
" ChatPromptTemplate.from_messages(\n",
" [\n",
" (\"system\", \"You are a helpful tutor AI. Your language is clear and direct.\"),\n",
" MessagesPlaceholder(variable_name=\"chat_history\"),\n",
" (\"user\", \"{input}\"),\n",
" ]\n",
" )\n",
" | ChatOpenAI(model=\"gpt-3.5-turbo-1106\")\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "markdown",
"id": "314cba1d-c2c1-430d-a1e7-e16b6ecb8240",
"metadata": {
"id": "314cba1d-c2c1-430d-a1e7-e16b6ecb8240"
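},
"source": [
"## Step 3: Evaluate\n",
"\n",
"Now that we have defined our dataset and chat bot, it's time to run the evaluation.\n",
"We will pass a constructor into the `run_on_dataset` function to create a new instance\n",
"of our bot for each dataset example. The constructor also adds a mapping step that converts\n",
"each dataset row into the `input` and `chat_history` keys the chain expects.\n",
"The `LabeledScoreString` evaluator asks an LLM to grade each response against the reference answer;\n",
"with `normalize_by=10`, its 1-10 rating is rescaled to a score between 0 and 1."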
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "55cd3542-6df7-44f4-aa83-83359b167b40",
"metadata": {
"id": "55cd3542-6df7-44f4-aa83-83359b167b40",
"outputId": "9888aee1-ac75-49a2-a22a-733141c59c57",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"View the evaluation results for project 'puzzled-play-59' at:\n",
"https://smith.langchain.com/o/0f7461cf-206f-5c85-aa8d-48c6c48bafc5/datasets/7ea0d5c7-71cc-4e4a-a3a9-fcf7454a9f16/compare?selectedSessions=87a80a0b-93dd-4a5f-8003-d620e99b3a57\n",
"\n",
"View all tests for Dataset Chat Bot Evals Single-Turn Example - 7ba434d1-3146-4eff-8035-99d296a52579 at:\n",
"https://smith.langchain.com/o/0f7461cf-206f-5c85-aa8d-48c6c48bafc5/datasets/7ea0d5c7-71cc-4e4a-a3a9-fcf7454a9f16\n",
"[------------------------------------------------->] 5/5"
]
}
],
"source": [
"from langchain.adapters.openai import convert_openai_messages\n",
"from langchain.smith import RunEvalConfig\n",
"\n",
"\n",
"def construct_chain():\n",
" # Add a step to convert the data from the dataset to a form the chain can consume\n",
" return {\n",
" \"input\": lambda x: x[\"question\"],\n",
" \"chat_history\": lambda x: convert_openai_messages(x[\"chat_history\"]),\n",
" } | chain\n",
"\n",
"\n",
"results = client.run_on_dataset(\n",
" dataset_name=dataset_name,\n",
" llm_or_chain_factory=construct_chain,\n",
" evaluation=RunEvalConfig(\n",
" evaluators=[\n",
" RunEvalConfig.LabeledScoreString(criteria=\"correctness\", normalize_by=10)\n",
" ],\n",
" # We must specify which key in the example inputs to pass to the evaluator\n",
" input_key=\"question\",\n",
" ),\n",
")"
]
},
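{
"cell_type": "markdown",
"id": "sketch-mapping-step",
"metadata": {
"id": "sketch-mapping-step"
},
"source": [
"To see what the mapping step in `construct_chain` does without calling a model, the sketch below mirrors it in plain Python. `to_chain_input` is a hypothetical stand-in used only for illustration: the real step calls `convert_openai_messages`, which returns LangChain message objects, whereas this version keeps simple `(role, content)` tuples.\n",
"\n",
"```python\n",
"def to_chain_input(example_inputs):\n",
"    '''Map a dataset row to the keys the prompt template expects (illustrative only).'''\n",
"    return {\n",
"        # The incoming user query fills the final user message of the prompt.\n",
"        'input': example_inputs['question'],\n",
"        # Earlier turns fill the chat_history placeholder, in order.\n",
"        'chat_history': [\n",
"            (m['role'], m['content']) for m in example_inputs['chat_history']\n",
"        ],\n",
"    }\n",
"\n",
"\n",
"# Shortened, made-up row in the same shape as the dataset inputs\n",
"row = {\n",
"    'question': 'Which one?',\n",
"    'chat_history': [\n",
"        {'role': 'user', 'content': 'is there a classification based on brightness?'},\n",
"        {'role': 'assistant', 'content': 'Yes'},\n",
"    ],\n",
"}\n",
"mapped = to_chain_input(row)\n",
"print(mapped['input'])\n",
"print(mapped['chat_history'][-1])\n",
"```\n",
"\n",
"The point of the mapping is that dataset field names (`question`) do not have to match the chain's prompt variables (`input`); the adapter keeps the two decoupled."
]
},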
{
"cell_type": "code",
"source": [
"print(results)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "4hovtdXqovnq",
"outputId": "ac29e5f0-757e-4d5e-e7a4-048acf983c82"
},
"id": "4hovtdXqovnq",
"execution_count": 9,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"{'project_name': 'puzzled-play-59', 'results': {'7ceb1d5f-1657-4f79-8396-fc522799f5be': {'input': {'question': 'Which one?', 'chat_history': [{'role': 'user', 'content': \"I'm confused aboutt stars. what even aaaare they?\"}, {'role': 'assistant', 'content': 'Stars are celestial bodies made mostly of hydrogen and helium. They generate light and heat through nuclear fusion in their cores.'}, {'role': 'user', 'content': \"there''s a classification based on theirbrightness, right?\"}, {'role': 'assistant', 'content': 'Yes'}]}, 'feedback': [EvaluationResult(key='score_string:correctness', score=0.6, value=None, comment=\"The assistant's response is partially correct. While it is true that stars are classified based on their brightness using the magnitude scale, the assistant failed to mention the Hertzsprung-Russell (H-R) diagram, which is the primary system for classifying stars based on their luminosity and temperature. The assistant's response is also lacking in detail and does not fully answer the user's question. Therefore, the response could be improved. Rating: [[6]]\", correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('299b28e7-56e7-412d-b890-457ea97542c2'))}, source_run_id=None, target_run_id=None)], 'execution_time': 3.496849, 'run_id': '9e491b0d-6c3b-4936-b624-cd7fc372ac02', 'output': 'Stars are classified based on their brightness using the magnitude scale, where lower numbers indicate greater brightness. This is known as apparent magnitude, which measures how bright a star appears from Earth.', 'reference': {'expected': 'Yes, stars are classified based on their brightness using a system called the Hertzsprung-Russell (H-R) diagram. In this diagram, stars are categorized into main-sequence stars, giants, supergiants, and white dwarfs, based on their luminosity and temperature. The Sun, for instance, is a main-sequence star.'}}, 'acd1960d-a1c8-46bf-8881-51079f351143': {'input': {'question': 'thats a scary one. can it be avoideed?', 'chat_history': [{'role': 'user', 'content': \"I'm learning bout genetic disorders.\"}, {'role': 'assistant', 'content': \"Genetic disorders are diseases caused by abnormalities in an individual's DNA. They can be inherited or result from mutations. One common one is Huntington's disease.\"}]}, 'feedback': [EvaluationResult(key='score_string:correctness', score=1.0, value=None, comment=\"The assistant's response is correct and accurate. It correctly states that Huntington's disease cannot be avoided due to its genetic nature, but it also mentions the importance of genetic counseling and testing in understanding the risk and making informed decisions. The assistant's response is factual and aligns with the ground truth information provided. Rating: [[10]]\", correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('f9d875ad-8e8b-4831-ba2a-9e5e0263e263'))}, source_run_id=None, target_run_id=None)], 'execution_time': 2.818487, 'run_id': '310ea42a-6646-4088-9b7d-4a2f5c0bcd29', 'output': \"Unfortunately, Huntington's disease cannot be avoided because it is caused by a genetic mutation. However, genetic counseling and testing can help individuals understand their risk and make informed decisions.\", 'reference': {'expected': \"Huntington's disease is a hereditary genetic disorder caused by a mutation in the HTT gene. If a person inherits the defective gene, they will eventually develop the disease. Currently, there's no cure for Huntington's, but its onset can be delayed with treatment. Genetic counseling and testing can help prospective parents understand the risks of passing the mutation to their offspring.\"}}, '0b98e3a9-b538-4be6-bc24-e7e4372e33ff': {'input': {'question': 'what led them to such a draastic action?', 'chat_history': [{'role': 'user', 'content': 'tell me about the Boston Tea Party.'}, {'role': 'assistant', 'content': 'The Boston Tea Party was a political protest by the American colonists against the British government in 1773. They were protesting the Tea Act, which allowed the British East India Company to sell tea directly to the colonies, bypassing colonial merchants.'}]}, 'feedback': [EvaluationResult(key='score_string:correctness', score=0.9, value=None, comment=\"The assistant's response is accurate and correct. It correctly identifies the Tea Act as the primary cause of the Boston Tea Party, and explains that the colonists felt it was a violation of their rights. The assistant also correctly notes that the act gave the British East India Company a monopoly on tea, and that the colonists were protesting against British taxation and control. The assistant's response is concise and directly answers the user's question. Therefore, the assistant's response is of high quality. Rating: [[9]]\", correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('3b1616bf-e009-4bae-b1bc-362d33038423'))}, source_run_id=None, target_run_id=None)], 'execution_time': 4.661442, 'run_id': 'a9082094-dc3d-4183-84b9-cbeaf396796b', 'output': 'The colonists felt that the Tea Act was unfair and a violation of their rights, as it gave the British East India Company a monopoly on tea and imposed taxes on the colonies without their consent. This led to the protest in which colonists, dressed as Native Americans, boarded British ships and dumped the tea into the Boston Harbor as a form of resistance against British taxation and control.', 'reference': {'expected': \"The colonists undertook the Boston Tea Party as a drastic action due to multiple reasons: 1) They believed the Tea Act was a violation of their rights as Englishmen, as they were being taxed without their consent. 2) The act gave the British East India Company a monopoly on tea sales, threatening local businesses. 3) The act was seen as another example of the British government's increasing interference in colonial affairs. The protest was a way to show their strong opposition to British policies.\"}}, 'd81dc6c7-6c16-4c21-b32e-b0100a9e4960': {'input': {'question': \"What's the main difference?\", 'chat_history': [{'role': 'user', 'content': 'Can yyou contrast DNA and RNA for me?'}, {'role': 'assistant', 'content': 'Certainly. DNA and RNA are both nucleic acids but have different roles, structures, and properties. Do you want specifics?'}]}, 'feedback': [EvaluationResult(key='score_string:correctness', score=0.6, value=None, comment=\"The assistant's response is partially correct. It correctly identifies one of the main differences between DNA and RNA, which is that DNA is double-stranded while RNA is single-stranded. However, it fails to mention the other key differences, such as the different sugars in their backbones and the different bases they use. Therefore, while the information provided is accurate, it is not comprehensive. Rating: [[6]]\", correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('a4bc255a-fef7-45ae-8f75-6eaad4718cf7'))}, source_run_id=None, target_run_id=None)], 'execution_time': 4.234119, 'run_id': '091dc487-7340-450b-bf72-d3fa6f694db0', 'output': 'The main difference between DNA and RNA is that DNA is a double-stranded molecule, while RNA is typically single-stranded. This structural difference has important implications for their respective functions and roles in the cell.', 'reference': {'expected': \"The main structural differences between DNA and RNA include: 1) DNA is double-stranded while RNA is single-stranded. 2) The sugar in the backbone of RNA is ribose, whereas in DNA it's deoxyribose. 3) DNA uses the bases adenine (A), cytosine (C), guanine (G), and thymine (T); RNA uses adenine (A), cytosine (C), guanine (G), and uracil (U) instead of thymine.\"}}, 'de48a2ff-9141-4560-86bc-7a13074ed00b': {'input': {'question': 'How does that apply?', 'chat_history': [{'role': 'user', 'content': 'I’m trying tto uunderstand instein’s theory.'}, {'role': 'assistant', 'content': \"Which one? He's known for several theories.\"}, {'role': 'user', 'content': 'Thee one about time and spae.'}, {'role': 'assistant', 'content': \"Ah, you're referring to the theory of relativity. There are two parts: special and general. Which one?\"}]}, 'feedback': [EvaluationResult(key='score_string:correctness', score=0.7, value=None, comment=\"The assistant's response is accurate and factual. It correctly explains the special theory of relativity and its implications, including the famous equation E=mc^2. It also correctly distinguishes between the special and general theories of relativity. However, the assistant does not directly answer the user's question about how the theory applies. The user's question is vague, but the assistant could have asked for clarification or provided examples of how the theory applies in practical situations. Rating: [[7]]\", correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('ca0aca65-bde5-48fe-8479-8f95f56c3fed'))}, source_run_id=None, target_run_id=None)], 'execution_time': 4.234166, 'run_id': 'd43a8a01-623b-4c83-abb5-f7ec8700c4ce', 'output': 'The special theory of relativity deals with the relationship between space and time for objects moving at a constant velocity. It introduces the idea that the laws of physics are the same for all non-accelerating observers, and it also explains the famous equation E=mc^2, which relates energy (E) to mass (m) and the speed of light (c). The general theory of relativity, on the other hand, extends these ideas to include the effects of gravity and acceleration on the fabric of space and time.', 'reference': {'expected': 'Special relativity, proposed by Einstein in 1905, deals with objects in uniform motion, especially those moving at the speed of light. It introduced the idea that time and space are relative and can change in relation to each other. For instance, time can appear to move slower for an object moving close to the speed of light.'}}}}\n"
]
}
]
},
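{
"cell_type": "markdown",
"id": "sketch-mean-score",
"metadata": {
"id": "sketch-mean-score"
},
"source": [
"The printed dictionary is hard to scan. Assuming the structure shown in the output above (a `results` mapping from example ID to a dictionary whose `feedback` list holds objects with `key` and `score` attributes), a small helper can average the scores. This is a sketch, not part of the original walkthrough: `mean_score` and the `Feedback` stand-in class are hypothetical names, and the demo data only reuses the five scores from the run above.\n",
"\n",
"```python\n",
"from dataclasses import dataclass\n",
"from statistics import mean\n",
"\n",
"\n",
"@dataclass\n",
"class Feedback:\n",
"    # Stand-in for the EvaluationResult objects in the printed output.\n",
"    key: str\n",
"    score: float\n",
"\n",
"\n",
"def mean_score(run_results, key='score_string:correctness'):\n",
"    '''Average one feedback key across all examples (hypothetical helper).'''\n",
"    scores = [\n",
"        fb.score\n",
"        for row in run_results['results'].values()\n",
"        for fb in row['feedback']\n",
"        if fb.key == key and fb.score is not None\n",
"    ]\n",
"    return mean(scores) if scores else None\n",
"\n",
"\n",
"# Scores copied from the run above: 0.6, 1.0, 0.9, 0.6, 0.7\n",
"demo = {\n",
"    'project_name': 'puzzled-play-59',\n",
"    'results': {\n",
"        str(i): {'feedback': [Feedback('score_string:correctness', s)]}\n",
"        for i, s in enumerate([0.6, 1.0, 0.9, 0.6, 0.7])\n",
"    },\n",
"}\n",
"print(round(mean_score(demo), 2))  # 0.76\n",
"```\n",
"\n",
"If the real `results` object matches that structure, `mean_score(results)` would give the same kind of summary; the per-example `comment` fields remain the better place to look when a score is low."
]
},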
{
"cell_type": "markdown",
"id": "87be599b-6165-4684-8c71-42c3ed270d29",
"metadata": {
"id": "87be599b-6165-4684-8c71-42c3ed270d29"
},
"source": [
"## Conclusion\n",
"\n",
"We've covered the steps to create a conversational dataset to measure your chat bot's ability to handle conversation history.\n",
"\n",
"Effective evaluation focuses on ensuring the bot maintains consistent and accurate interactions across turns. With the methods outlined here, you're set up to measure and improve this aspect of your chat bot's performance."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
},
"colab": {
"provenance": [],
"include_colab_link": true
}
},
"nbformat": 4,
"nbformat_minor": 5
}