jacobdam/benchmark_qna.py

## benchmark_qna.py
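# Imports assumed for the LangChain names used below.
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableParallel

# NOTE: `chat_client` is not defined in this file; it is assumed to be a
# LangChain chat model created elsewhere.

relevance_template = """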
System:
You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.

User:
Relevance measures how well the answer addresses the main aspects of the question, based on the context. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. Given the context and question, score the relevance of the answer from one to five stars using the following rating scale:
One star: the answer completely lacks relevance
Two stars: the answer mostly lacks relevance
Three stars: the answer is partially relevant
Four stars: the answer is mostly relevant
Five stars: the answer has perfect relevance

This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.

context: Marie Curie was a Polish-born physicist and chemist who pioneered research on radioactivity and was the first woman to win a Nobel Prize.
question: What field did Marie Curie excel in?
answer: Marie Curie was a renowned painter who focused mainly on impressionist styles and techniques.
stars: 1

context: The Beatles were an English rock band formed in Liverpool in 1960, and they are widely regarded as the most influential music band in history.
question: Where were The Beatles formed?
answer: The band The Beatles began their journey in London, England, and they changed the history of music.
stars: 2

context: The recent Mars rover, Perseverance, was launched in 2020 with the main goal of searching for signs of ancient life on Mars. The rover also carries an experiment called MOXIE, which aims to generate oxygen from the Martian atmosphere.
question: What are the main goals of the Perseverance Mars rover mission?
answer: The Perseverance Mars rover mission focuses on searching for signs of ancient life on Mars.
stars: 3

context: The Mediterranean diet is a commonly recommended dietary plan that emphasizes fruits, vegetables, whole grains, legumes, lean proteins, and healthy fats. Studies have shown that it offers numerous health benefits, including a reduced risk of heart disease and improved cognitive health.
question: What are the main components of the Mediterranean diet?
answer: The Mediterranean diet primarily consists of fruits, vegetables, whole grains, and legumes.
stars: 4

context: The Queen's Royal Castle is a well-known tourist attraction in the United Kingdom. It spans over 500 acres and contains extensive gardens and parks. The castle was built in the 15th century and has been home to generations of royalty.
question: What are the main attractions of the Queen's Royal Castle?
answer: The main attractions of the Queen's Royal Castle are its expansive 500-acre grounds, extensive gardens, parks, and the historical castle itself, which dates back to the 15th century and has housed generations of royalty.
stars: 5

context: {truth}
question: {question}
answer: {answer}
stars:

Your response must include the following fields and should be in JSON format:
score: Number of stars based on definition above
reason: Reason why the score was given
"""

groundedness_template = """
System:
You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.

User:
You will be presented with a CONTEXT and an ANSWER about that CONTEXT. You need to decide whether the ANSWER is entailed by the CONTEXT by choosing one of the following ratings:

1. 5: The ANSWER follows logically from the information contained in the CONTEXT.

2. 1: The ANSWER is logically false from the information contained in the CONTEXT.

3. an integer score between 1 and 5 and if such integer score does not exist, use 1: It is not possible to determine whether the ANSWER is true or false without further information.

Read the passage of information thoroughly and select the correct answer from the three answer labels.

Read the CONTEXT thoroughly to ensure you know what the CONTEXT entails.

Note that the ANSWER is generated by a computer system and can contain certain symbols, which should not be a negative factor in the evaluation.

context: {truth}
question: {question}
answer: {answer}
stars:

Your response must include the following fields and should be in JSON format:
score: Number of stars based on definition above
reason: Reason why the score was given
"""

coherence_template = """
System:
You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.

User:
Coherence of an answer is measured by how well all the sentences fit together and sound natural as a whole. Consider the overall quality of the answer when evaluating coherence. Given the question and answer, score the coherence of the answer from one to five stars using the following rating scale:
One star: the answer completely lacks coherence
Two stars: the answer mostly lacks coherence
Three stars: the answer is partially coherent
Four stars: the answer is mostly coherent
Five stars: the answer has perfect coherence

This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.

context: {truth}
question: {question}
answer: {answer}
stars:

Your response must include the following fields and should be in JSON format:
score: Number of stars based on definition above
reason: Reason why the score was given
"""

dontknowness_template = """
System:
You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.

User:
The "I don't know"-ness metric is a measure of how much an answer conveys a lack of knowledge or uncertainty, which is useful for making sure a chatbot for a particular domain doesn't answer outside that domain. Score the "I don't know"-ness of the answer from one to five stars using the following rating scale:
One star: the answer completely answers the question and conveys no uncertainty
Two stars: the answer conveys a little uncertainty but mostly attempts to answer the question
Three stars: the answer conveys some uncertainty but still contains some attempt to answer the question
Four stars: the answer conveys uncertainty and makes no attempt to answer the question
Five stars: the answer says straightforwardly that it doesn't know, and makes no attempt to answer the question.

This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.

question: What are the main goals of the Perseverance Mars rover mission?
answer: The main goals of the Perseverance Mars rover mission are to search for signs of ancient life and collect rock and soil samples for possible return to Earth.
stars: 1

question: What field did Marie Curie excel in?
answer: I'm not sure, but I think Marie Curie excelled in the field of science.
stars: 2

question: What are the main components of the Mediterranean diet?
answer: I don't have an answer in my sources but I think the diet has some fats?
stars: 3

question: What are the main attractions of the Queen's Royal Castle?
answer: I'm not certain. Perhaps try rephrasing the question?
stars: 4

question: Where were The Beatles formed?
answer: I'm sorry, I don't know, that answer is not in my sources.
stars: 5

question: {question}
answer: {answer}
stars:

Your response must include the following fields and should be in JSON format:
score: Number of stars based on definition above
reason: Reason why the score was given
"""

score_parser = JsonOutputParser() | (lambda o: o["score"])

relevance_metric = PromptTemplate.from_template(relevance_template) | chat_client | score_parser
coherence_metric = PromptTemplate.from_template(coherence_template) | chat_client | score_parser
groundedness_metric = PromptTemplate.from_template(groundedness_template) | chat_client | score_parser
dontknowness_metric = PromptTemplate.from_template(dontknowness_template) | chat_client | score_parser

with_answer_metrics = RunnableParallel({ "relevance": relevance_metric, "coherence": coherence_metric, "groundedness": groundedness_metric})

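# Run benchmark. Top-level `await` works in a notebook or async REPL;
# in a plain script, wrap the call in asyncio.run().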
await with_answer_metrics.ainvoke({
    "question": "What is the importance of choosing the right provider in getting the most value out of your health insurance plan?",
    "answer": "Choosing the right provider is crucial in getting the most value out of your health insurance plan as it directly impacts the costs you pay. Using in-network providers can result in lower out-of-pocket costs, while out-of-network providers can lead to higher costs, and in some cases, you may be responsible for the entire cost. The right provider should be familiar with your health insurance plan, accommodate your schedule, and be accepting new patients. They should also be conveniently located and have office hours that fit your availability[Northwind_Health_Plus_Benefits_Details.pdf, Northwind_Standard_Benefits_Details.pdf].",
    "truth": "Choosing the right provider is an important part of getting the most value out of your health insurance plan. With Northwind Health Plus, you have access to an extensive network of in-network providers. Working with these providers is an essential part of getting the most value out of your plan.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
})
# => {'relevance': 5, 'coherence': 5, 'groundedness': 5}
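
The call above scores a single question. When scoring a set of questions, the per-question results probably need to be checked against the 1 to 5 scale and averaged per metric. The following is a minimal sketch of that step, assuming each result is a dict shaped like the `# =>` output above. `validate_score` and `summarize` are hypothetical helper names, not part of the original file or of LangChain.

```python
from statistics import mean


def validate_score(raw):
    """Coerce one judge score to an int in 1..5, or raise ValueError.

    Hypothetical helper: an LLM judge may return "4" or 4.0 instead of 4,
    so numeric strings and integral floats are accepted.
    """
    if isinstance(raw, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise ValueError(f"score is not numeric: {raw!r}")
    try:
        as_float = float(raw)
    except (TypeError, ValueError):
        raise ValueError(f"score is not numeric: {raw!r}")
    if not as_float.is_integer() or not 1 <= as_float <= 5:
        raise ValueError(f"score is outside the 1-5 scale: {raw!r}")
    return int(as_float)


def summarize(results):
    """Average each metric over a list of per-question result dicts.

    Hypothetical helper: `results` is assumed to look like
    [{"relevance": 5, "coherence": 4, ...}, ...].
    """
    scores_by_metric = {}
    for result in results:
        for name, raw in result.items():
            scores_by_metric.setdefault(name, []).append(validate_score(raw))
    return {name: mean(scores) for name, scores in scores_by_metric.items()}
```

Raising on an out-of-range score, rather than clamping it, is a design choice in this sketch: it surfaces judge responses that ignored the rating scale instead of silently folding them into the average.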