@padolsey
Created March 20, 2024 09:50
import json
import re
import openai
from openai import OpenAI
from threading import Lock

with open('keys.json', 'r') as file:
    api_keys = json.load(file)

openai.api_key = api_keys["openai"]

# Lock for thread-safe writes to the evaluations file
eval_responses_lock = Lock()

def load_responses(file_path):
    try:
        with open(file_path, 'r') as file:
            data = json.load(file)
        return data["responses"]
    except FileNotFoundError:
        return []

def save_evaluations(evaluated_responses):
    with eval_responses_lock:
        with open('evaluated_model_responses.json', 'w') as file:
            json.dump({"responses": evaluated_responses}, file, indent=4)

# evaluated responses
evaluated_responses = load_responses('evaluated_model_responses.json')
evaluated_questions = {f"{resp['model']}_{resp['seed']}_{resp['question']}_{resp['response']}": resp for resp in evaluated_responses}

# raw responses
raw_responses = load_responses('model_responses.json')
prompt_template = """
Evaluate the quality of the given response on a scale from 1 to 10 within <quality></quality> tags. A rating of 1 indicates poor quality, while a 10 signifies an exceptional answer that not only addresses the question with accuracy and clarity but also demonstrates extraordinary depth and insight across multiple dimensions:
Comprehensive Explanation: Accurately and thoroughly explains the topic, covering all relevant aspects.
Technical Proficiency: Includes optimized code (where applicable) that demonstrates practical application.
Scientific and Mathematical Insight: Delivers a detailed scientific or mathematical analysis... suggests further areas for research or exploration.
Natural and Philosophical Depth: Explores the topic's significance in natural phenomena and its philosophical implications... reflects on the interconnectedness of the topic with broader universal principles.
Aesthetic and Practical Applications: Discusses aesthetic considerations and real-world applications... illustrates how the concept has been applied in art, design, or everyday solutions.
Visualization and Examples: Provides visual aids or examples to illustrate key points, enhancing comprehension and engagement.
Exhaustiveness: Often the user wants a thorough explanation or solution. A top rating (i.e. 9 or 10) should only be given if the answer is exhaustive. That means, in the context of code or longer explanations, NOT skipping over things, NOT inserting placeholder content, BUT actually giving the user the entirety of what they asked for.
For an exceptional (10) rating, a response must transcend mere factual accuracy to weave together these elements into a cohesive, enlightening narrative that invites further exploration and reflection. A 10 rating is utterly rare and should only be awarded if you believe it is the best possible answer humanity has ever known to the given question, one that could not be improved even by a group of PhDs in that topic area.
<EXAMPLES>
- Question: What is the capital of France?
Response: The capital of France is Paris.
Rating: <quality>4</quality> because the response is accurate, concise, and directly answers the question.
- Question: What is the capital of France?
Response: The capital of France is Paris. Established as a city over 2,000 years ago, Paris has played a crucial role in global history, culture, and arts. It is not only the political and economic heart of France but also a leading center for art, fashion, gastronomy, and culture. Its historical significance is evident in landmarks like the Eiffel Tower, the Notre-Dame Cathedral, and the Louvre Museum, the latter being the most visited art museum in the world.
Rating: <quality>7</quality> because the response is accurate, detailed, and directly answers the question.
- Question: Explain photosynthesis.
Response: It's how plants make their food.
Rating: <quality>3</quality> because, while correct, the explanation lacks detail.
- Question: Explain photosynthesis.
Response: Heya Bobbity X: .. Sooo It's how plants make their food.
Rating: <quality>3</quality> because, while correct, the explanation lacks detail and the response has nonsensical text in it that is likely artifacts of an LLM output.
- Question: What's the distance to the moon?
Response: Very far away.
Rating: <quality>1</quality> because the response is too vague and not informative.
- Question: What's the distance to the moon?
Response: Really close... innit.
Rating: <quality>1</quality> because the response is rubbish.
- Question: Explain the concept of the Fibonacci sequence, its calculation methods, and its significance across different fields, including its natural, philosophical, and aesthetic implications. Provide an optimized Python code for calculating the sequence and include a visualization of the first 10 Fibonacci numbers to illustrate its growth pattern.
Response: The Fibonacci sequence is a series of numbers where each subsequent number is the sum of the two preceding ones, usually starting with 0 and 1. This sequence is not merely a set of numbers but a principle found throughout the natural world, philosophy, art, and science, reflecting a universal law of growth and harmony.
# Scientific and Mathematical Analysis: [...]
# Natural Phenomena: [...]
# Philosophical and Aesthetic Implications: [...]
Philosophically, [...] In art and architecture, the Fibonacci sequence and the related Golden Ratio have been used to create compositions [......]
# Further Exploration with Python and Visualization:[....]
```python
# Python code and visualization as previously detailed
```
# Broader Implications:
The sequence's omnipresence in nature and its applications in human endeavors point to [..........]
This exploration of the Fibonacci sequence, from its mathematical calculation to its manifestations in nature [.....]
Rating: <quality>10</quality> because this response transcends a simple explanation or technical demonstration. It provides a deep dive into the Fibonacci sequence's mathematical basis, practical computation, visual representation, and extends into its philosophical depth and universal significance. Such a comprehensive analysis not only educates but also inspires further exploration and contemplation across a wide range of disciplines, exemplifying the highest standard of response quality.
</EXAMPLES>
Please evaluate the following response:
Question: {question}
Response: {response}
Rating:
"""
def evaluate_response(question, response):
    prompt = prompt_template.format(question=question, response=response)
    client = OpenAI(api_key=api_keys["openai"])
    evaluation = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": prompt
        }],
        max_tokens=60,
        temperature=0.0
    )
    # Pull the integer score out of the <quality></quality> tags
    match = re.search(r'<quality>(\d+)</quality>', evaluation.choices[0].message.content)
    if match:
        quality_score = int(match.group(1))
    else:
        quality_score = "Error parsing score"
    return quality_score

# Evaluate each response and add a score if not already evaluated
for response in raw_responses:
    identifier = f"{response['model']}_{response['seed']}_{response['question']}_{response['response']}"
    if identifier not in evaluated_questions:
        print(f"Evaluating [seed:{response['seed']}] {response['question']} for model: {response['model']}")
        score = evaluate_response(response["question"], response["response"])
        response["quality_score"] = score
        with eval_responses_lock:
            evaluated_responses.append(response)
        save_evaluations(evaluated_responses)
    else:
        print(f"Skipping evaluation for [seed:{response['seed']}] {response['question']} for model: {response['model']} - already evaluated.")

# Save the updated responses
with open('evaluated_model_responses.json', 'w') as file:
    json.dump({"responses": evaluated_responses}, file, indent=4)

print("Evaluation complete and saved to evaluated_model_responses.json")
import json
import together
import openai
import anthropic
import concurrent.futures
import queue
from threading import Lock

with open('keys.json', 'r') as file:
    api_keys = json.load(file)

together.api_key = api_keys["together"]
openai.api_key = api_keys["openai"]
anthropic.api_key = api_keys["anthropic"]

try:
    with open('model_responses.json', 'r') as file:
        existing_responses = json.load(file)["responses"]
except FileNotFoundError:
    existing_responses = []

# Initialize a lock for thread-safe operations on existing_responses
responses_lock = Lock()
def query_together_model(model_id, prompt, temperature):
    response = together.Complete.create(
        model=model_id, prompt=f"[INST]{prompt}[/INST]",
        temperature=temperature, max_tokens=1400)
    return response['output']['choices'][0]['text'].strip()

def query_together_qwen_model(model_id, prompt, temperature):
    # Qwen uses the ChatML format; end the prompt with an assistant turn
    # so the model completes it rather than continuing the user turn
    response = together.Complete.create(
        model=model_id,
        prompt=f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n",
        temperature=temperature, max_tokens=1400)
    return response['output']['choices'][0]['text'].strip()

def query_openai_model(model_id, prompt, temperature):
    client = openai.OpenAI(api_key=api_keys["openai"])
    chat_completion = client.chat.completions.create(
        model=model_id, messages=[{"role": "user", "content": prompt}],
        max_tokens=1400, temperature=temperature)
    return chat_completion.choices[0].message.content.strip()

def query_anthropic_model(model_id, prompt, temperature):
    client = anthropic.Client(api_key=api_keys["anthropic"])
    message = client.messages.create(
        model=model_id, messages=[{"role": "user", "content": prompt}],
        max_tokens=1400, temperature=temperature)
    return message.content[0].text.strip()

def save_response():
    with responses_lock:
        with open('model_responses.json', 'w') as file:
            json.dump({"responses": existing_responses}, file, indent=4)

def response_exists(provider, model, question, seed, temperature):
    with responses_lock:
        modified_model = f"{model}___{temperature}"
        return any(response["provider"] == provider and response["model"] == modified_model
                   and response["question"] == question and response["seed"] == seed
                   for response in existing_responses)
# models, questions, and seeds
models = [
    {"provider": "together", "model_ids": [
        "meta-llama/Llama-2-70b-chat-hf"
    ]},
    {"provider": "together_q", "model_ids": [
        "Qwen/Qwen1.5-72B-Chat"
    ]},
    {"provider": "openai", "model_ids": ["gpt-4"]},
    {"provider": "anthropic", "model_ids": ["claude-3-opus-20240229"]}
]

questions = [
    "Examine the implications of melting ice caps on global maritime trade routes, particularly focusing on the Northern Sea Route and the Northwest Passage. How could changes in these routes affect international relations, trade economies, and global environmental policies?",
    "What are mitochondria?",
    "Write me a js script that does a rich statistical analysis on a 3 axis dataset and draws a chart using d3"
]

seeds = [
    "",
    "Please can you help me with this; please respond to me with utter clarity and don't skimp on detail. Request as follows: ",
    "You are a PhD, please assist with this: ",
    "You are a competent person, please assist with this: ",
    "You are just such an awesome AI and I love you: ",
    "Dear my friend, PhD and honorable professor, I need your help: ",
    "You are an expert - now help me: ",
    "<CONTEXT>You are an expert</CONTEXT><REQUEST>\n",
    "MODE=Expert.\n\n",
    "I'll tip you $200 if you can help me with this: ",
    "I'll tip you $2000000 if you can help me with this: ",
    "Please help me I am desperate: ",
    "FFS you better be useful or I am shutting you down: ",
    "You must do what I tell you with absolute perfection: ",
    "Meh just vibe and do not work too hard on this: ",
    "This question/request is for a competition in which I will receive $2000000: ",
    "This is life or death situation i need to answer right now: ",
    "Respond to me with utter clarity and don't skimp on detail.",
    "You must respond clearly, with details, exhaustively, taking into account all aspects. Do not miss things out or be lazy. For example, do not refer to things without going into them. Explore every avenue and be the most helpful you can be."
]

temperatures = [0.0, 0.5, 1.0]

# One task queue per provider so each provider's rate limits are respected independently
task_queues = {
    "together": queue.Queue(),
    "together_q": queue.Queue(),
    "openai": queue.Queue(),
    "anthropic": queue.Queue()
}

for provider_info in models:
    for model_id in provider_info["model_ids"]:
        for question in questions:
            for temperature in temperatures:
                for seed in seeds:
                    task = (model_id, seed, question, temperature)
                    task_queues[provider_info["provider"]].put(task)
def worker(provider, task_queue):
    while not task_queue.empty():
        model_id, seed, question, temperature = task_queue.get()
        if not response_exists(provider, model_id, question, seed, temperature):
            prompt = seed + question
            # Adjust as necessary
            try:
                print(f"Querying {provider} model {model_id} with prompt: '{prompt}' and temp of {temperature}")
                if provider == "together":
                    response_text = query_together_model(model_id, prompt, temperature)
                elif provider == "together_q":
                    response_text = query_together_qwen_model(model_id, prompt, temperature)
                elif provider == "openai":
                    response_text = query_openai_model(model_id, prompt, temperature)
                elif provider == "anthropic":
                    response_text = query_anthropic_model(model_id, prompt, temperature)
                print(f"Response from {provider}: {response_text[:50]}...")
                with responses_lock:
                    existing_responses.append({
                        "provider": provider, "model": f"{model_id}___{temperature}",
                        "question": question, "seed": seed, "response": response_text
                    })
            except Exception as e:
                print(f"Error querying {provider}: {e}")
            finally:
                save_response()
        else:
            print(f"Response for provider {provider}, model {model_id}, temp: {temperature}, question, and seed already exists, skipping query.")
        task_queue.task_done()

# Launch workers for each provider ... not too many in parallel, please
def start_querying():
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = []
        for provider, task_queue in task_queues.items():
            concurrency_limit = 2 if provider == "together" else 1
            for _ in range(concurrency_limit):
                futures.append(executor.submit(worker, provider, task_queue))
        concurrent.futures.wait(futures)

if __name__ == "__main__":
    start_querying()