Last active January 2, 2024 22:41
"cells": [
"cell_type": "markdown",
"metadata": {
"id": "jzobXlHUAKVT"
"source": [
"## LLM-Powered Evaluation System\n",
"In this section, we build an LLM-Powered evaluator system for the ML Tagging system we previous built. We will be logging and assessing the results in Comet."
"cell_type": "code",
"source": [
"! pip install openai==0.28 comet-llm --quiet"
"metadata": {
"id": "xpOuYYcoBHrg"
"execution_count": null,
"outputs": []
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2It17rwYAKVU"
"outputs": [],
"source": [
"import openai\n",
"import os\n",
"import IPython\n",
"import json\n",
"import pandas as pd\n",
"import numpy as np\n",
"from urllib.request import urlopen\n",
"# API configuration\n",
"openai.api_key = \"OPENAI_API_KEY\""
"cell_type": "markdown",
"metadata": {
"id": "K5XAKG5bAKVV"
"source": [
"Let's load the helper function to generate responses from the model:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "m5yskvP8AKVV"
"outputs": [],
"source": [
"def get_completion(messages, model=\"gpt-3.5-turbo\", temperature=0, max_tokens=300):\n",
" response = openai.ChatCompletion.create(\n",
" model=model,\n",
" messages=messages,\n",
" temperature=temperature,\n",
" max_tokens=max_tokens,\n",
" )\n",
" return response.choices[0].message[\"content\"]"
"cell_type": "markdown",
"metadata": {
"id": "meIxmpreAKVV"
"source": [
"### Load the Data\n",
"The code below loads both the few-shot demonstrations and validation dataset:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RN6jhWPCAKVW"
"outputs": [],
"source": [
"# print markdown\n",
"def print_markdown(text):\n",
" \"\"\"Prints text as markdown\"\"\"\n",
" IPython.display.display(IPython.display.Markdown(text))\n",
"# load validation data\n",
"response = urlopen(\"\")\n",
"val_data = json.loads(\n",
"# load few shot data\n",
"response = urlopen(\"\")\n",
"few_shot_data = json.loads("
"cell_type": "markdown",
"metadata": {
"id": "cBkS_lhGAKVW"
"source": [
"### Few-shot\n",
"First, we define a few-shot template which will leverage the few-shot demonstration data loaded previously."
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7QD-QpEyAKVW"
"outputs": [],
"source": [
"# function to define the few-shot template\n",
"def get_few_shot_template(few_shot_prefix, few_shot_suffix, few_shot_examples):\n",
" return few_shot_prefix + \"\\n\\n\" + \"\\n\".join([ \"Abstract: \"+ ex[\"abstract\"] + \"\\n\" + \"Tags: \" + str(ex[\"tags\"]) + \"\\n\" for ex in few_shot_examples]) + \"\\n\\n\" + few_shot_suffix\n",
"# function to sample few shot data\n",
"def random_sample_data (data, n):\n",
" return np.random.choice(few_shot_data, n, replace=False)\n",
"# the few-shot prefix and suffix\n",
"few_shot_prefix = \"\"\"Your task is to extract model names from machine learning paper abstracts. Your response is an an array of the model names in the format [\\\"model_name\\\"]. If you don't find model names in the abstract or you are not sure, return [\\\"NA\\\"]\"\"\"\n",
"few_shot_suffix = \"\"\"Abstract: {input}\\nTags:\"\"\"\n",
"# load 3 samples from few shot data\n",
"few_shot_template = get_few_shot_template(few_shot_prefix, few_shot_suffix, random_sample_data(few_shot_data, 3))"
"cell_type": "markdown",
"metadata": {
"id": "OKrxdJIdAKVW"
"source": [
"### Zero-shot\n",
"The code below defines the zero-shot template. Note that we use the same instruction from the few-shot prompt template. But in this case, we don't use the demonstrations."
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9YLy7XLoAKVX"
"outputs": [],
"source": [
"zero_shot_template = \"\"\"\n",
"Your task is extract model names from machine learning paper abstracts. Your response is an an array of the model names in the format [\\\"model_name\\\"]. If you don't find model names in the abstract or you are not sure, return [\\\"NA\\\"]\n",
"Abstract: {input}\n",
"cell_type": "markdown",
"metadata": {
"id": "l7hUEtg6AKVX"
"source": [
"### Evaluate\n",
"In this subsection, we perform the evaluation and logs the results to Comet.\n",
"The following is a helper function to obtain the final predictions from the model given a prompt template (e.g., zero-shot or few-shot) and the provided input data."
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "11NQgSlxAKVX"
"outputs": [],
"source": [
"def get_predictions(prompt_template, inputs):\n",
" responses = []\n",
" for i in range(len(inputs)):\n",
" messages = messages = [\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": prompt_template.format(input=inputs[i])\n",
" }\n",
" ]\n",
" response = get_completion(messages)\n",
" responses.append(response)\n",
" return responses"
"cell_type": "markdown",
"metadata": {
"id": "z55pKOuaAKVX"
"source": [
"We then generated all the predictions using the validation data as inputs:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7qfOlF8EAKVY"
"outputs": [],
"source": [
"# extract abstract from val_data\n",
"abstracts = [val_data[i][\"abstract\"] for i in range(len(val_data))]\n",
"few_shot_predictions = get_predictions(few_shot_template, abstracts)\n",
"zero_shot_predictions = get_predictions(zero_shot_template, abstracts)\n",
"expected_tags = [str(val_data[i][\"tags\"]) for i in range(len(val_data))]"
"cell_type": "markdown",
"metadata": {
"id": "eAV4PNDgAKVY"
"source": [
"After obtaining the predictions, we now build a system prompt that will perform the automatic LLM-powered evaluation of the results obtained from the different prompt templates. Note that the system prompt expects the expected answers and the predictions from the different prompt we tried."
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DCObGrSeAKVY"
"outputs": [],
"source": [
"# llm-powered evaluation\n",
"system_prompt = \"\"\"\"You are a teacher grading a quiz. You will be given the expected answers (delimited by ```) and the answers from a student (delimited by ###). Your task is to grade the student. You will output either CORRECT or INCORRECT for each question. \\n\\nGrade the question as CORRECT if the student's answer overlaps with the expected answer. Ignore differences in punctuation and phrasing between the student's answer and the expected answer. The student's answer is CORRECT if it contains more information than the expected answer, but it should at least cover what's in the expected answer. The order of the items in each answer is also not a problem.\\n\\nGrade the question as INCORRECT if the student's answer is not factual or doesn't overlap with the expected answer.\\n\\nHere are the expected answers:\\n```{expected_answers}```\\n\\nHere are the student's answers:\\n###{predictions}###\\n\\nThe output format will be:\\n[\\\"<grade for item 1>\\\", \\\"<grade for item 2>\\\",...]\"\"\"\n",
"# function to get the final llm grading\n",
"def get_llm_grading(expected_answers, predictions, system_prompt):\n",
" response = openai.ChatCompletion.create(\n",
" model=\"gpt-3.5-turbo\",\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": system_prompt.format(expected_answers=expected_answers, predictions=predictions)\n",
" }\n",
" ],\n",
" temperature=0,\n",
" max_tokens=256,\n",
" frequency_penalty=0,\n",
" presence_penalty=0\n",
" )\n",
" return response.choices[0].message[\"content\"]\n",
"# run the llm grading using the predictions obtained before\n",
"zero_shot_eval_predictions = eval(get_llm_grading(expected_tags, zero_shot_predictions, system_prompt))\n",
"few_shot_eval_predictions = eval(get_llm_grading(expected_tags, few_shot_predictions, system_prompt))"
"cell_type": "markdown",
"metadata": {
"id": "PRS71kmzAKVY"
"source": [
"### Logging Prompt Results to Comet\n",
"Once we have those predictions from the LLM evaluator system prompt, we can log the results to Comet. We are logging several pieces of information like the model, the prompt template type, the expected results, the grading, and more. All of this information will help us assess how the good the LLM-powered evaluator is for this use case."
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "i3SxFvgBAKVY",
"outputId": "0e5c7496-6bb0-4076-d20f-06b3220eea43"
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"Prompt logged to\n"
"source": [
"# log prediction for both few-shot and zero-shot using Comet\n",
"import comet_llm\n",
"comet_llm.init(project=\"tagger-llm-evaluator\", api_key=COMET_API_KEY)\n",
"for i in range(len(val_data)):\n",
" # log zero-shot predictions\n",
" comet_llm.log_prompt(\n",
" prompt = system_prompt.format(expected_answers=expected_tags[i], predictions=zero_shot_predictions[i]),\n",
" tags = [\"gpt-3.5-turbo\", \"zero-shot\"],\n",
" metadata = {\n",
" \"model_name\": \"gpt-3.5-turbo\",\n",
" \"temperature\": 0,\n",
" \"expected_output\": expected_tags[i],\n",
" \"model_output\": zero_shot_predictions[i]\n",
" },\n",
" output = zero_shot_eval_predictions[i]\n",
" )\n",
" # log few-shot predictions\n",
" comet_llm.log_prompt(\n",
" prompt = system_prompt.format(expected_answers=expected_tags[i], predictions=few_shot_predictions[i]),\n",
" tags = [\"gpt-3.5-turbo\", \"few-shot\"],\n",
" metadata = {\n",
" \"model_name\": \"gpt-3.5-turbo\",\n",
" \"temperature\": 0,\n",
" \"expected_output\": expected_tags[i],\n",
" \"model_output\": few_shot_predictions[i]\n",
" },\n",
" output = few_shot_eval_predictions[i]\n",
" )\n"
"metadata": {
"kernelspec": {
"display_name": "comet",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
"orig_nbformat": 4,
"colab": {
"provenance": [],
"include_colab_link": true
"nbformat": 4,
"nbformat_minor": 0