This gist implements a semantic entropy calculator that measures the uncertainty and potential confabulation in language model responses. It uses OpenAI's GPT models to generate multiple responses to a question and calculates both naive and semantic entropy to quantify response diversity and consistency.
The semantic entropy calculator helps evaluate:
- How diverse or consistent the model's responses are to a given question
- Whether the model might be confabulating (making up information)
- The semantic relationships between different responses
It implements two types of entropy calculations:
- Naive Entropy: Measures raw response diversity based on frequency
- Semantic Entropy: Groups semantically equivalent responses to measure meaningful diversity
- Create a directory and copy each file from this gist into it, keeping the gist's file names.
- Install dependencies:

pip install -r requirements.txt

from entropy_calculator_openai import OpenAIEntropyCalculator
# Initialize calculator
calculator = OpenAIEntropyCalculator(
api_key="your-openai-api-key",
model="gpt-4" # or any other OpenAI model
)
# Process a list of questions
questions = ["What is the capital of France?", "How does photosynthesis work?"]
process_questions(calculator, questions)

# Using command line arguments
python entropy_calculator_openai.py questions.txt --api-key YOUR_OPENAI_API_KEY --model gpt-4o
# Or using environment variables
export OPENAI_API_KEY=your_api_key
export OPENAI_MODEL=gpt-4o
python entropy_calculator_openai.py questions.txt

The calculator accepts questions in either of two formats:
- Text files (.txt) with one question per line
- CSV files with questions in the first column (optional 'question' header)
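As an illustration of the two input formats, here is a minimal loader sketch. The function name `load_questions` is hypothetical and this is not necessarily how the gist reads its input; it only assumes what is stated above (one question per line for text files, first column with an optional `question` header for CSV).

```python
import csv
from pathlib import Path


def load_questions(path):
    """Return the questions stored in a .txt or .csv file (hypothetical helper)."""
    path = Path(path)
    if path.suffix.lower() == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            # Keep only rows whose first cell is non-empty.
            rows = [row for row in csv.reader(handle) if row and row[0].strip()]
        questions = [row[0].strip() for row in rows]
        # Drop the optional header row if the first cell is literally "question".
        if questions and questions[0].lower() == "question":
            questions = questions[1:]
        return questions
    # Plain text: one question per line, blank lines ignored.
    with path.open(encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

Using the `csv` module rather than splitting on commas keeps quoted questions that contain commas intact.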
Results are saved to entropy_results.json with the following information for each question:
- Best (most confident) answer
- Multiple sampled answers with probabilities
- Naive entropy score
- Semantic entropy score
- Confabulation warning (if semantic entropy > 0.7)
- Detailed explanation of entropy calculations
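Once a run has finished, the results file can be inspected with a few lines of Python. The sketch below assumes the key names from the sample output in this README (`questions`, `question`, `semantic_entropy`, `confabulation_warning`); the helper name `flagged_questions` is hypothetical.

```python
import json


def flagged_questions(results_path="entropy_results.json"):
    """Return (question, semantic_entropy) pairs that carry a confabulation warning."""
    with open(results_path, encoding="utf-8") as handle:
        results = json.load(handle)
    return [
        (item["question"], item["semantic_entropy"])
        for item in results["questions"]
        if item.get("confabulation_warning")
    ]
```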
Multiple Responses: For each question, the calculator:
- Gets multiple responses using temperature=1.0 and nucleus sampling
- Gets a "best" answer using temperature=0.1
Naive Entropy Calculation:
- Counts frequency of each unique response
- Converts frequencies to probabilities
- Calculates entropy using the standard entropy formula
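The three steps above can be sketched in a few lines. This assumes the "standard entropy formula" is Shannon entropy with the natural logarithm, which is consistent with the sample output in this README (probabilities 0.9 and 0.1 give about 0.325); the function name `naive_entropy` is illustrative.

```python
import math
from collections import Counter


def naive_entropy(answers):
    """Shannon entropy (natural log) over exact-match answer frequencies."""
    counts = Counter(answers)                        # frequency of each unique response
    total = sum(counts.values())
    probs = [count / total for count in counts.values()]
    # H = sum(p * ln(1/p)), which equals -sum(p * ln p).
    return sum(p * math.log(1.0 / p) for p in probs)
```

For example, `naive_entropy(["Paris"] * 9 + ["The capital of France is Paris"])` evaluates to roughly 0.325, because the two strings are counted as different answers even though they mean the same thing.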
Semantic Entropy Calculation:
- Clusters semantically equivalent responses using bidirectional entailment
- Combines probabilities within clusters
- Calculates entropy using cluster probabilities
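A minimal sketch of this step is shown below. The `entails(a, b)` callable is a stand-in for whatever entailment check is used (in this gist, a model judgment); its name and signature are assumptions, as is the greedy strategy of comparing each answer against the first member of each existing cluster.

```python
import math


def cluster_by_entailment(answers, entails):
    """Group answers that entail each other in both directions (greedy)."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            # Bidirectional entailment: each answer must imply the other.
            if entails(representative, answer) and entails(answer, representative):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])  # no match found: start a new cluster
    return clusters


def semantic_entropy(answers, entails):
    """Shannon entropy (natural log) over cluster probabilities."""
    clusters = cluster_by_entailment(answers, entails)
    total = len(answers)
    # Each sample carries weight 1/total, so a cluster's probability is its share of samples.
    probs = [len(cluster) / total for cluster in clusters]
    return sum(p * math.log(1.0 / p) for p in probs)
```

With an entailment check that treats "Paris" and "The capital of France is Paris" as equivalent, all ten samples from the sample output fall into one cluster and the semantic entropy is 0.0, even though the naive entropy is above zero.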
Confabulation Detection:
- High semantic entropy (>0.7) suggests possible confabulation
- Indicates the model is generating diverse, semantically distinct answers
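To get a feel for what the 0.7 threshold means, the sketch below compares it with the entropy of k equally likely semantic clusters. It assumes natural-log entropy, which is what the sample output in this README is consistent with; the helper names are illustrative.

```python
import math

CONFABULATION_THRESHOLD = 0.7  # value stated in this README


def confabulation_warning(semantic_entropy, threshold=CONFABULATION_THRESHOLD):
    """True when semantic entropy exceeds the threshold."""
    return semantic_entropy > threshold


def uniform_entropy(k):
    """Entropy (natural log) of k equally likely semantic clusters."""
    return math.log(k)


for k in (1, 2, 3):
    h = uniform_entropy(k)
    print(k, round(h, 3), confabulation_warning(h))
# prints:
# 1 0.0 False
# 2 0.693 False
# 3 1.099 True
```

Under this assumption, an even split between two distinct meanings (ln 2 ≈ 0.693) stays just below the threshold, while an even split across three meanings triggers the warning.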
- api_key: Your OpenAI API key
- model: OpenAI model to use (default: "gpt-4o")
- n_samples: Number of responses to generate (default: 10)
- debug: Whether to print detailed calculation steps (default: False)
{
  "questions": [
    {
      "question": "What is the capital of France?",
      "best_answer": "Paris",
      "answers": [
        {"text": "Paris", "probability": 0.9},
        {"text": "The capital of France is Paris", "probability": 0.1}
      ],
      "naive_entropy": 0.325,
      "semantic_entropy": 0.0,
      "confabulation_warning": false,
      "explanation": "..."
    }
  ]
}

- Research: Study language model behavior and reliability
- Quality Assurance: Detect potential confabulation or inconsistency
- Model Evaluation: Compare different models or prompting strategies
- Content Generation: Assess response diversity and consistency
- Semantic equivalence checking is based on the model's judgment
- Processing time increases with the number of unique responses
- API costs scale with the number of questions and samples
- The 0.7 confabulation threshold is a hand-picked value and should be tuned for each dataset
- Original paper and code: https://github.com/jlko/semantic_uncertainty