@simon-mo
Created March 4, 2024 21:38
{
  "question_id": "58210e39b3fd4441a2bd4a518bb44c2d",
  "prompt": "What is the difference between OpenCL and CUDA?",
  "openai_scores_raw_choices_nested": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "{\n \"topic_modeling\": \"Technical Comparison\",\n \"score_reason\": \"This prompt requires the AI to accurately compare and contrast two distinct technologies, OpenCL and CUDA. It assesses the AI's factual accuracy and knowledge of these technologies, as well as its ability to articulate the differences between them.\",\n \"score_value\": 9\n}",
        "role": "assistant",
        "function_call": null,
        "tool_calls": null
      }
    },
    {
      "finish_reason": "stop",
      "index": 1,
      "logprobs": null,
      "message": {
        "content": "{\n \"topic_modeling\": \"Software Comparison\",\n \"score_reason\": \"This prompt assesses the AI's factual accuracy in distinguishing between two similar but distinct software frameworks.\",\n \"score_value\": 8\n}",
        "role": "assistant",
        "function_call": null,
        "tool_calls": null
      }
    },
    {
      "finish_reason": "stop",
      "index": 2,
      "logprobs": null,
      "message": {
        "content": "{\n \"topic_modeling\": \"Comparison, Technology\",\n \"score_reason\": \"This prompt requires the AI to demonstrate knowledge of two different technologies, compare their features, and explain their distinctions. This task assesses the AI's factual accuracy and proficiency in understanding complex technological concepts.\",\n \"score_value\": 9\n}",
        "role": "assistant",
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "topic_modeling_1": "Technical Comparison",
  "score_reason_1": "This prompt requires the AI to accurately compare and contrast two distinct technologies, OpenCL and CUDA. It assesses the AI's factual accuracy and knowledge of these technologies, as well as its ability to articulate the differences between them.",
  "score_value_1": 9,
  "topic_modeling_2": "Software Comparison",
  "score_reason_2": "This prompt assesses the AI's factual accuracy in distinguishing between two similar but distinct software frameworks.",
  "score_value_2": 8,
  "topic_modeling_3": "Comparison, Technology",
  "score_reason_3": "This prompt requires the AI to demonstrate knowledge of two different technologies, compare their features, and explain their distinctions. This task assesses the AI's factual accuracy and proficiency in understanding complex technological concepts.",
  "score_value_3": 9
}
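
The flattened fields topic_modeling_1 through score_value_3 simply repeat the three nested choices in openai_scores_raw_choices_nested. Here is a minimal sketch of how that flattening could be derived (an illustration under our own assumptions, not necessarily the code that produced the record):

import json

def flatten_scores(record: dict) -> dict:
    # Each choice's message content is itself a JSON string; parse it and
    # copy its fields out with a 1-based suffix.
    for i, choice in enumerate(record["openai_scores_raw_choices_nested"], start=1):
        parsed = json.loads(choice["message"]["content"])
        record[f"topic_modeling_{i}"] = parsed["topic_modeling"]
        record[f"score_reason_{i}"] = parsed["score_reason"]
        record[f"score_value_{i}"] = parsed["score_value"]
    return record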
{
  "question_id": "c0fc42c6f5f14f2aa5a89f71f8553730",
  "model_a": "vicuna-13b",
  "model_b": "koala-13b",
  "winner": "model_a",
  "judge": "arena_user_973",
  "conversation_a": [
    {
      "content": "Count from 1 to 10 with step = 3",
      "role": "user"
    },
    {
      "content": "1, 4, 7, 10\n\nCounting with a step of 3 means that you will increase the number by 3 each time. So, the first number is 1, the next is 4 (1 + 3), the next is 7 (4 + 3), and so on.",
      "role": "assistant"
    }
  ],
  "conversation_b": [
    {
      "content": "Count from 1 to 10 with step = 3",
      "role": "user"
    },
    {
      "content": "1, 4, 7, 10",
      "role": "assistant"
    }
  ]
}
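
Records like the one above come from pairwise model battles. A minimal sketch for tallying wins, assuming the records are stored one JSON object per line in a file named battles.jsonl (the filename is an assumption; the records are shown pretty-printed here for readability):

import json
from collections import Counter

wins = Counter()
with open("battles.jsonl") as f:
    for line in f:
        battle = json.loads(line)
        # "winner" names the winning side; look up that side's model name.
        if battle.get("winner") in ("model_a", "model_b"):
            wins[battle[battle["winner"]]] += 1
print(wins.most_common())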

We are interested in understanding how well the following input prompts can evaluate an AI assistant’s proficiency in problem-solving ability, creativity, or adherence to real-world facts. Your task is to assess each prompt based on its potential to gauge the AI’s capabilities effectively in these areas.

For each prompt, carry out the following steps:

  1. Topic Modeling: Use two words to describe the intended task.
  2. Assess the Potential: Consider how challenging the prompt is, and how well it can assess an AI’s problem-solving skills, creativity, or factual accuracy. Briefly explain your reasoning.
  3. Assign a Score: Assign a score on a scale of 1 to 10, with a higher score representing a higher potential to evaluate the AI assistant’s proficiency effectively. Use double square brackets to format your scores, like so: [[5]].

Guidelines for Scoring:

  • High Score (8-10): Reserved for prompts that are particularly challenging and excellently designed to assess AI proficiency.
  • Medium Score (4-7): Given to prompts with moderate potential to assess the AI's capabilities.
  • Low Score (1-3): Allocated to prompts that are too easy, too ambiguous, or otherwise inadequate for assessing the AI's capabilities.

Critically evaluate each prompt, and avoid giving high scores to prompts that are ambiguous or too straightforward.

The output MUST follow this JSON format: { "topic_modeling": "...", "score_reason": "...", "score_value": ... }
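
To make the workflow concrete, here is a hedged sketch of how the rubric above could be sent to a judge model with the OpenAI chat API. The model name is an assumption, and n=3 mirrors the three raw choices stored per question in the first record:

import json
import openai

client = openai.OpenAI()

def score_prompt(rubric: str, prompt: str) -> list[dict]:
    # Ask for three independent judgments, as in the stored records.
    response = client.chat.completions.create(
        model="gpt-4",  # assumption: any capable chat model works here
        n=3,
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": prompt},
        ],
    )
    # Each choice's content should be the JSON object described above.
    return [json.loads(choice.message.content) for choice in response.choices]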

Note that the provided embeddings were generated with OpenAI's model, which means that if you want to generate embeddings for new prompts, you need an OpenAI developer account ($5 in free credit for new signups; otherwise a credit card is required). The code used to generate the embeddings is as follows:

import openai

prompt = "What is the difference between OpenCL and CUDA?"  # example input
client = openai.OpenAI()  # expects OPENAI_API_KEY in the environment
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=prompt,
    dimensions=256,  # truncate to 256 dimensions (supported by the -3 models)
)
embedding = response.data[0].embedding  # a list of 256 floats

You are welcome to use open-source models (you can use Hugging Face on DataHub!) to generate the embeddings for the new prompts. A good starting point is the Hugging Face pipeline for feature extraction:

from transformers import pipeline
import numpy as np

embedder = pipeline("feature-extraction", model="distilbert-base-uncased")
features = embedder(prompt)  # nested list with shape [1, num_tokens, 768]
embedding = np.mean(features[0], axis=0)  # mean-pool tokens into one 768-d vector

You can see more in the feature-extraction pipeline documentation. DistilBERT produces one 768-dimensional vector per token, so mean-pooling over tokens (as above) yields a single 768-dimensional embedding for each prompt.
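
Whichever model you choose, compare prompts only within a single embedding space: the 256-dimensional OpenAI vectors and the 768-dimensional DistilBERT vectors are not interchangeable. As a hypothetical illustration (the helper below is ours, not part of any library), similarity between two prompts embedded by the same model could be computed as:

import numpy as np

def cosine_similarity(a, b):
    # Both vectors must come from the same embedding model and dimension.
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))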
