Summary: Luna: An Evaluation Foundation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost

USER

Review the following paper and answer any questions based on the content of the paper. The paper should be regarded as the source of truth when answering any questions.

Paper

Luna: An Evaluation Foundation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost

Masha Belyi∗

Robert Friel∗

Shuai Shao

Atindriyo Sanyal

Galileo Technologies Inc. {masha,rob,ss,atin}@rungalileo.io

arXiv:2406.00975v2 [cs.CL] 5 Jun 2024

Abstract

Retrieval-Augmented Generation (RAG) systems have become pivotal in enhancing the capabilities of language models by incorporating external knowledge retrieval mechanisms. However, a significant challenge in deploying these systems in industry applications is the detection and mitigation of hallucinations: instances where the model generates information that is not grounded in the retrieved context. Addressing this issue is crucial for ensuring the reliability and accuracy of responses generated by large language models (LLMs) in diverse industry settings. Current hallucination detection techniques fail to deliver accuracy, low latency, and low cost simultaneously. We introduce Luna: a DeBERTa-large (440M) encoder, fine-tuned for hallucination detection in RAG settings. We demonstrate that Luna outperforms GPT-3.5 and commercial evaluation frameworks on the hallucination detection task, with 97% and 91% reductions in cost and latency, respectively. Luna is lightweight and generalizes across multiple industry verticals and out-of-domain data, making it an ideal candidate for industry LLM applications.

1 Introduction

Figure 1: Luna is a lightweight DeBERTa-large encoder, fine-tuned for hallucination detection in RAG settings. Luna outperforms zero-shot hallucination detection models (GPT-3.5, ChainPoll GPT-3.5 ensemble) and RAG evaluation frameworks (RAGAS, Trulens) at a fraction of the cost and millisecond inference speed.

Large Language Models (LLMs) are broadly used in industry dialogue applications due to their impressive ability to hold a natural conversation and succeed on a variety of reasoning tasks (Zhao et al., 2023). A key challenge in deploying customer-facing LLMs is their propensity for hallucinations, where the model presents cohesive but factually incorrect information in conversation with a user (Roller et al., 2021; Lin et al., 2022). Retrieval-augmented generation (RAG), a technique for incorporating knowledge relevant to each user query in the LLM prompt, effectively reduces LLM hallucinations in production systems (Lewis et al., 2020).

Yet, LLMs still often respond with nonfactual information that contradicts the knowledge supplied by RAG (Shuster et al., 2021; Magesh et al., 2024). Causes of hallucinations have been extensively studied across different LLM tasks (Zheng et al., 2024; Cao et al., 2022; Das et al., 2022). Key contributing factors include knowledge cutoff (Vu et al., 2023), randomness (Lee et al., 2022), faulty training data (Dziri et al., 2022a; Lin et al., 2022; McKenna et al., 2023), and finetuning with large amounts of new knowledge (Gekhman et al., 2024). Apart from RAG, proposed mitigation solutions explore prompt engineering with chain of thought (Wei et al., 2022), finetuning (Zhang et al., 2024), reinforcement learning with human feedback (Ouyang et al., 2022), and specialized hallucination detection models (Wu et al., 2023; Lin et al., 2022). For RAG specifically, evaluation frameworks like RAGAS (Es et al., 2024), Trulens (https://www.trulens.org/), and ARES (Saad-Falcon et al., 2024) have emerged to offer automated hallucination detection at scale. However, these approaches rely on static prompts (RAGAS, Trulens) or finetuning on in-domain data (ARES), which limits their capacity to generalize to a breadth of industry applications. Gao et al. (2023) and Wu et al. (2023) take it a step further and successfully suppress hallucinations in LLM responses with a detect-and-replace technique. However, due to the prohibitively slow latency of their LLM evaluation models, real-time hallucination prevention in production systems remains a challenge.

Customer-facing dialogue applications necessitate a hallucination detection system with high accuracy, low cost, and low latency, such that hallucinations are caught and resolved before reaching the user. Few/zero-shot LLM approaches fail to meet the strict latency requirement due to model size. Moreover, though commercial LLMs like OpenAI's GPT models (OpenAI, 2023) achieve strong performance, querying customer data through 3rd-party APIs is both costly and undesirable for privacy and security reasons. Finetuned BERT-size models can achieve competitive performance to LLM judges (Bohnet et al., 2023; Saad-Falcon et al., 2024; Gao et al., 2023; Li et al., 2024; Yue et al., 2023), offering lower latency and local execution. However, these models require annotated data for finetuning and have not been evaluated for large-scale, cross-domain applications.

In this paper, we introduce Luna, a lightweight RAG hallucination detection model that generalizes across multiple industry-specific domains and scales well for real-time deployment. Luna is a 440M parameter DeBERTa-large encoder that is finetuned on carefully curated real-world RAG data. From analysis of RAG in production settings, we identify long-context RAG evaluation as a previously unaddressed challenge and propose a novel solution that facilitates high-precision long-context RAG hallucination detection. Through extensive benchmarking, we demonstrate that Luna outperforms zero-shot prompting and RAG evaluation frameworks on the hallucination detection task.

Our approach is closest to the concurrently proposed ARES automated RAG evaluation framework (Saad-Falcon et al., 2024), with a few key differences: (1) ARES requires a validation set of in-domain annotated data to finetune a custom evaluation model, while Luna is pre-trained on a cross-domain corpus for built-in generalization; (2) Luna accurately detects hallucinations on long RAG contexts; and (3) Luna is optimized to process up to 16k tokens in milliseconds on deployment hardware.

∗ These authors contributed equally to this work.

2 Related Work

Hallucination detection  Prior work on hallucination detection in natural language generation (NLG) is vast (Ji et al., 2023). SelfCheckGPT (Manakul et al., 2023) and Agrawal et al. (2024) are examples of heuristic consistency-based methods that detect unreliable LLM outputs by comparing multiple sampled responses from the same LLM. Others look to the internal state of the LLM, such as hidden layer activations (Azaria and Mitchell, 2023) and token-level uncertainty (Varshney et al., 2023), as a proxy signal for hallucinations. Kadavath et al. (2022) prompt the generating LLM to introspect and evaluate its own responses. More generally, zero-shot (Es et al., 2024) and finetuned (Wu et al., 2023; Yue et al., 2023; Muller et al., 2023) LLM judges leverage LLMs' inherent reasoning abilities to evaluate other LLM generations. Similarly, general-purpose finetuned LLM evaluators (Kim et al., 2024) that have been shown to correlate with human judgements can also be applied to hallucination detection. Our approach of finetuning a small LM evaluator, as in Gao et al. (2023) and Saad-Falcon et al. (2024), is the first to evaluate and optimize such a model for industry applications under strict performance, cost, and latency constraints.

NLI for closed-domain hallucination detection  Existing research draws parallels between the hallucination detection task and the concept of entailment in Natural Language Inference (NLI). The goal of NLI is to determine the relationship between a premise and a hypothesis, which can be one of: entailment, contradiction, or neutral. In the past, NLI models have been used to evaluate factual consistency on closed-domain NLG tasks (Honovich et al., 2022; Dziri et al., 2022b). The Attributable to Identified Sources (AIS) framework, introduced by Rashkin et al. (2023), formally unifies the notions of factuality, attribution, hallucination, faithfulness, and groundedness, all terms used to measure the extent to which an LLM response is attributable to some source of ground truth. In follow-up work, NLI entailment has been shown to correlate with AIS scores (Gao et al., 2023; Bohnet et al., 2023; Li et al., 2024) and has become a standard baseline for AIS and hallucination detection models. In this work, we use pre-trained NLI model weights as the starting point for Luna finetuning.

3 Luna Model

We fine-tune a DeBERTa-v3-Large (He et al., 2023) NLI checkpoint (https://huggingface.co/MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli) from Laurer et al. (2022) with a shallow hallucination classifier on each response token. We train on the task of identifying supported tokens in the response, given a query and retrieved context. Framing the problem in this way makes our work comparable to recent automated RAG evaluation efforts. Our definition of support is synonymous with the answer faithfulness metric explored in RAGAS (Es et al., 2024) and ARES (Saad-Falcon et al., 2024), Trulens groundedness, and attribution (Li et al., 2024). At inference, we treat spans with low support probabilities as hallucinated spans.

Similar to Gao et al. (2023) and Wu et al. (2023), we aim to identify hallucinated spans in the response, rather than the less granular example-level hallucination boolean. While predicting spans is a more challenging task, it yields a more informative prediction for the end user. Further, this approach sets us up for long-context prediction, which we discuss in detail next.

3.1 Long Context RAG

In practice, we find that context length limitations are a significant pain point in industry applications. Custom RAG setups may retrieve a large number of context documents from various sources, or choose not to chunk the documents before passing them into the retriever. This results in long inputs to the RAG generator and evaluation models, sometimes even exceeding the token limit of select commercial LLMs. In Figure 2 we visualize the context length distribution of our curated RAG dataset (detailed in Section 4.1).

Figure 2: Distribution of RAG context token lengths in our QA RAG training split.

While our base DeBERTa model can technically handle sequences of up to 24k tokens (He et al., 2021), the computational complexity of transformer attention layers scales quadratically with input length. Moreover, though long-context LLMs like Claude-3 are becoming competitive on LLM leaderboards (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), research shows that these models suffer from information loss (Liu et al., 2023) and may not be suitable for long-context RAG evaluation. A naive solution is to chunk long-context RAG inputs into short segments and process them through the evaluator model in batches. Model predictions can then be aggregated over batch rows to predict example-level hallucination probabilities. Figure 3 illustrates how such chunking may result in false positives in cases where supporting information is scattered throughout the long context document(s). Instead, we leverage span-level predictions for a high-precision classifier over long sequence inputs.

Figure 3: Long RAG context with naive chunking example. Naive context chunking leads to hallucination false positives when supporting information is scattered throughout the context. Without insight into which specific spans were supported/not supported by the context, it is impossible to arrive at the correct conclusion that the response in this example does NOT contain hallucinations.

3.2 Long Context Chunking

Consider a single input into the RAG evaluation model that consists of C context tokens [c_1 ... c_C], Q question tokens [q_1 ... q_Q], and R response tokens [r_1 ... r_R]. Assume we are working with an evaluator model that accepts a maximum sequence length L, and that Q + R < L, but C is much larger (the same approach easily extends to cases where R > L). To fit the example into the model, we break it up into windows of length L, such that each window contains the question, the response, and a subset of the context tokens:

w_i = [c_i1 ... c_il] ⊕ [q_1 ... q_Q] ⊕ [r_1 ... r_R]    (1)

where l = L - Q - R, and there are N_l windows per example. In Figure 3 there are three such windows. Our model outputs support probabilities p_i for each of the R response tokens in w_i as:

P_S(w_i) = [p_i1 ... p_iR]    (2)

We train with a cross-entropy loss on each token output. During training, we leverage granular token-level support labels (Section 4.2) to adjust the training labels in each batch based on which context tokens are present in the window. For example, in Figure 3, "Washington, D.C., the capital of the US" is supported in window 1, nothing is supported in window 2, and "was founded in 1791" is supported in window 3.

At inference, we aggregate example-level support probabilities by taking the token-level maximum over windows. Refer to Figure 4 for a visual illustration of the steps described by equations (3)-(5) below. The example-level support probability for token j is defined as:

p_j = max_{1 ≤ i ≤ |w|} (p_ij)    (3)

where |w| = N_l is the total number of windows we created in (1). To produce an example-level label, we take the minimum over the R tokens:

P_S = min(p_1 ... p_R)    (4)

so that the overall support probability is no greater than the support probability of the least supported token in the response. Finally, we derive the example hallucination probability P_H as:

P_H = 1 - P_S    (5)

Figure 4: Illustration of Luna's token-level predictions for the example in Figure 3. Luna's token-level predictions are aggregated over context windows into a high-precision hallucination probability score.

3.3 Training

To leverage the full pre-trained NLI model, we initialize the hallucination prediction head with weights from the NLI classification head. The original NLI head is a 3-class single-layer perceptron with a neuron for each NLI class (entailment, contradiction, and neutral). During training, we optimize for low entailment probability and high contradiction probability for hallucinated tokens (and the opposite for supported tokens). At inference, we output the probability of entailment for each token. We apply data transformation techniques to introduce additional variability for better generalization during training. Transformations include dropping and inserting context documents, and shuffling questions and responses between examples in a batch. Training labels are adjusted accordingly with each transformation. The model trains for 3 epochs with cross-entropy loss on the output of each response token. We initialize the learning rate to 5e-6 for the base model layers and 2e-5 for the classification head, and train with warmup and a linear decay rate.
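For concreteness, here is a minimal sketch of the windowing and aggregation logic described in Section 3.2 (equations 1-5). It assumes per-window token support probabilities are already available from some model; the function and variable names are illustrative, not taken from the Luna codebase.

```python
# Sketch of Luna's long-context chunking and score aggregation (Sec. 3.2, Eqs. 1-5).
# `support_probs(window_context, question, response)` stands in for the model and is
# assumed to return one support probability per response token for a single window.
from typing import Callable, List


def hallucination_probability(
    context: List[str],
    question: List[str],
    response: List[str],
    support_probs: Callable[[List[str], List[str], List[str]], List[float]],
    max_len: int = 512,
) -> float:
    """Return P_H = 1 - P_S for one RAG example."""
    l = max_len - len(question) - len(response)  # context tokens per window (Eq. 1)
    assert l > 0, "question + response must fit inside one window"

    # Split the context into N_l windows, each paired with the full question/response.
    windows = [context[i:i + l] for i in range(0, len(context), l)]

    # Per-window support probabilities for each response token (Eq. 2).
    per_window = [support_probs(w, question, response) for w in windows]

    # Token-level max over windows (Eq. 3): a response token counts as supported
    # if ANY window supports it, which avoids the false positives in Figure 3.
    token_support = [max(p[j] for p in per_window) for j in range(len(response))]

    # Example-level support is the least-supported token (Eq. 4);
    # the hallucination probability is its complement (Eq. 5).
    return 1.0 - min(token_support)
```

In the Figure 3/4 example, a response token supported only in the third window still receives a high token-level score after the max over windows, so the response is correctly not flagged as hallucinated.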

4 Data

4.1 RAG QA dataset

We recycle open-book QA datasets to construct a RAG QA dataset. Our goal is to simulate natural RAG examples that may occur in production settings. We sample data from five industry verticals: customer support (DelucionQA (Sadat et al., 2023), EManual (Nandy et al., 2021), TechQA (Castelli et al., 2020)), finance and numerical reasoning (FinQA (Chen et al., 2021), TAT-QA (Zhu et al., 2021)), biomedical research (PubmedQA (Jin et al., 2019), CovidQA (Möller et al., 2020)), legal (CUAD (Hendrycks et al., 2021)), and general knowledge (HotpotQA (Yang et al., 2018), MS MARCO (Nguyen et al., 2016), HAGRID (Kamalloo et al., 2023), ExpertQA (Malaviya et al., 2024)). The combined dataset contains examples from a variety of difficult RAG task types, including numerical reasoning over tables, inference over multiple context documents, and retrieval from long contexts. We reserve 20% of the dataset for validation and testing. Table 1 reports statistics of the data splits.

Domain                train   val   test   %H
customer support      4k      600   600    22%
finance               38k     5k    5k     5%
biomedical research   22k     3k    3k     20%
legal                 1.5k    500   500    6%
general knowledge     9.5k    2k    2k     18%

Table 1: RAG QA data statistics. RAG contexts and questions are sourced from open-book QA datasets that cover five industry-specific domains. RAG responses are generated with GPT-3.5 and Claude-3-Haiku, and annotated with GPT-4-turbo. %H indicates the fraction of hallucinated responses in each domain.

For each component dataset, we ignore the ground truth responses and generate two new responses per input with GPT-3.5 and Claude-3-Haiku. These models exhibit strong reasoning and conversational abilities (Chiang et al., 2024) at a low price point, which makes them realistic candidates for production RAG systems. We set temperature to 1 for generation to encourage diversity and potential hallucinations in the responses. Next, we describe how we annotate the data for training.

4.2 Labeling

We leverage GPT-4-turbo to annotate the RAG QA dataset. Refer to Section 8.1 for a discussion of the limitations of this approach. Before annotation, we split the context and response into sentences using nltk (Bird and Loper, 2004). We pass the question along with the tokenized context and response sentences to GPT-4-turbo for annotation. For each sentence in the response, we instruct the LLM to identify which context sentences, if any, support the claim in the response. Tokens in sentences without any support are treated as hallucinations. We find that LLM responses often contain transition sentences and general statements that, while not supported by any specific context span, are generally grounded in the question and provided context. We instruct the annotator to label these as "generally supported", which we post-process to indicate support in every context window during training. Statements highlighting a lack of sufficient information to answer the question also fall into this category.

We take measures to ensure high quality labels from our LLM annotator. First, we use chain-of-thought (Wei et al., 2022), which has been shown to increase agreement between LLM and human judgements (He et al., 2024). Next, we request both response-level and sentence-level annotations, which we compare to identify potentially noisy labels. For example, if GPT-4 claims a response is supported by the context as a whole, but identifies no supporting information for one or more claims in the response, we send the example for re-annotation. We re-annotate examples up to 3 times, after which <2% of the data are still conflicting. After manual inspection, we find that the majority of the conflicts arise from partially supported sentences. Since our annotation scheme is binary on the sentence level (the full sentence is either supported or not), we resolve all tokens in partially supported sentences to "not supported" on both the sentence and example level.
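A small sketch of the response-level versus sentence-level consistency check described above, under the assumption that the GPT-4-turbo annotations have already been parsed into Python structures; the field names are hypothetical, not the authors' schema.

```python
# Sketch of the Section 4.2 annotation consistency check: an example is sent for
# re-annotation when the response-level label says "supported" but some response
# sentence has no supporting context sentences. Field names are illustrative.
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceLabel:
    text: str
    supporting_context_ids: List[int]   # empty list -> no supporting context sentence
    generally_supported: bool = False   # transition sentences, "not enough info", etc.


def needs_reannotation(response_supported: bool,
                       sentence_labels: List[SentenceLabel]) -> bool:
    """Flag conflicts between the response-level and sentence-level annotations."""
    unsupported = [
        s for s in sentence_labels
        if not s.supporting_context_ids and not s.generally_supported
    ]
    # Conflict: the whole response is marked supported, yet a claim has no support.
    return response_supported and len(unsupported) > 0
```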

5 Evaluation

5.1 Datasets

We evaluate Luna on a combination of existing academic benchmarks (RAGTruth) and real-world RAG data.

RAGTruth  RAGTruth is an expert-annotated corpus of 18k RAG examples with LLM-generated responses. The data are split into three RAG task types: Question Answering (QA), Data-to-text Writing, and News Summarization. Since Luna is only trained on QA RAG examples, we use this benchmark to evaluate our model's generalization to other RAG task types.


RAG QA Test Set  We also evaluate Luna on a held-out split of our RAG QA dataset (Section 4.1). This serves as an in-domain test set for evaluating Luna performance across industry verticals.

5.2 Baselines

Zero-shot prompting  We evaluate GPT-3.5-turbo and GPT-4-turbo models from OpenAI as baselines. We prompt the LLMs to return an example-level boolean indicating whether or not a RAG response is supported by the associated RAG context. For RAGTruth we also include all baselines reported in the original paper.

Ensemble prompting  LLM ensembles have been shown to outperform single model judges by eliminating bias (Friel and Sanyal, 2023; Verga et al., 2024). We leverage ChainPoll (Friel and Sanyal, 2023) with a chain-of-thought prompt for a stronger GPT-3.5-turbo baseline.

RAG Evaluation Frameworks  We evaluate two commercial RAG evaluation frameworks: RAGAS (v0.1.7) (Es et al., 2024) and Trulens (v0.13.4). We report the RAGAS Faithfulness and Trulens Groundedness metrics, which are designed for hallucination detection.

5.3 Metrics

For comparison with RAGTruth baselines, we report best Precision, Recall, and F1 scores on RAGTruth. We tune model output probability thresholds for the best overall F1 and report all metrics at this optimal threshold. For other benchmarks, we report the area under the ROC curve (AUROC), which we consider a more informative metric that circumvents the need for threshold tuning.
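As a concrete illustration of the metric computation in Section 5.3, here is a minimal sketch using scikit-learn: tune a probability threshold for best F1 on one benchmark, and report threshold-free AUROC on the others. This is an illustration only, not the authors' evaluation harness, and the toy labels are made up.

```python
# Sketch of Sec. 5.3 metrics: best-F1 threshold tuning and AUROC.
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

def best_f1_threshold(y_true: np.ndarray, hallucination_probs: np.ndarray):
    """Return (threshold, precision, recall, f1) that maximizes F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, hallucination_probs)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    best = np.argmax(f1[:-1])  # the last PR point has no associated threshold
    return thresholds[best], precision[best], recall[best], f1[best]

# Toy example (1 = hallucinated response, 0 = supported response).
y_true = np.array([0, 1, 0, 1, 1, 0])
probs = np.array([0.1, 0.8, 0.3, 0.6, 0.9, 0.2])
thr, p, r, f1 = best_f1_threshold(y_true, probs)
auroc = roc_auc_score(y_true, probs)
print(f"threshold={thr:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f} auroc={auroc:.2f}")
```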

Method                        Question Answering    Data-to-Text Writing   Summarization        Overall
Prompt gpt-3.5-turbo†         18.8 / 84.4 / 30.8    65.1 / 95.5 / 77.4     23.4 / 89.2 / 37.1   37.1 / 92.3 / 52.9
Prompt gpt-4-turbo†           33.2 / 90.6 / 45.6    64.3 / 100.0 / 78.3    31.5 / 97.6 / 47.6   46.9 / 97.9 / 63.4
SelfCheckGPT gpt-3.5-turbo†   35.0 / 58.0 / 43.7    68.2 / 82.8 / 74.8     31.1 / 56.5 / 40.1   49.7 / 71.9 / 58.8
LMvLM gpt-4-turbo†            18.7 / 76.9 / 30.1    68.0 / 76.7 / 72.1     23.2 / 81.9 / 36.2   36.2 / 77.8 / 49.4
Finetuned Llama-2-13B†        61.6 / 76.3 / 68.2    85.4 / 91.0 / 88.1     64.0 / 54.9 / 59.1   76.9 / 80.7 / 78.7
ChainPoll gpt-3.5-turbo       33.5 / 51.3 / 40.5    84.6 / 35.1 / 49.6     45.8 / 48.0 / 46.9   54.8 / 40.6 / 46.7
RAGAS Faithfulness            31.2 / 41.9 / 35.7    79.2 / 50.8 / 61.9     64.2 / 29.9 / 40.8   62.0 / 44.8 / 52.0
Trulens Groundedness          22.8 / 92.5 / 36.6    66.9 / 96.5 / 79.0     40.2 / 50.0 / 44.5   46.5 / 85.8 / 60.4
Luna                          37.8 / 80.0 / 51.3    64.9 / 91.2 / 75.9     40.0 / 76.5 / 52.5   52.7 / 86.1 / 65.4

Table 2: Response-level results on the RAGTruth hallucination prediction task; each cell reports Precision / Recall / F1. Luna is compared against RAGTruth baselines reported in Wu et al. (2023) (rows marked with †), as well as our own baselines. RAGAS and Trulens are evaluation frameworks that query GPT-3.5-turbo for hallucination detection. ChainPoll is our gpt-3.5-turbo ensemble prompt baseline. ChainPoll, RAGAS, Trulens, and Luna probability thresholds were tuned for best Overall F1. Luna outperforms all prompt-based approaches and narrows the gap between the other baselines and the 13B fine-tuned Llama, at a fraction of the cost.

Method                    Customer Support   Financial Reasoning   General Knowledge   Legal   Biomed   Overall
GPT-4-turbo annotator     1.0                1.0                   1.0                 1.0     1.0      1.0
Prompt gpt-3.5-turbo      0.68               0.67                  0.67                0.63    0.64     0.66
ChainPoll gpt-3.5-turbo   0.76               0.74                  0.75                0.71    0.71     0.74
RAGAS Faithfulness        0.62               0.60                  0.60                0.58    0.54     0.61
Trulens Groundedness      0.56               0.56                  0.65                0.34    0.68     0.56
Luna (in-domain)          0.76               0.82                  0.81                0.78    0.83     0.80
Luna (OOD)                0.74               0.64                  -                   0.79    -        -

Table 3: AUROC on the hallucination detection task on the RAG QA test set. Luna (in-domain) is our model trained on the combined train splits from each domain. Luna (OOD) is the same model trained on a subset of the General Knowledge and Biomed domains; cells for its training domains are left blank.

6 Results

On the RAGTruth dataset, Luna outperforms all prompt-based approaches on the QA and Summarization tasks, and is competitive with GPT-3.5 evaluators on the Data-to-Text Writing task (Table 2). Overall, Luna is second only to the finetuned Llama-2-13B, which is expected given the significant difference in size between the two models (440M vs 13B). Note that the Llama-2-13B baseline was trained on a subset of RAGTruth, while Luna was trained on a QA-only dataset with a different data distribution. Nevertheless, we find that Luna generalizes well to the out-of-domain task types. Additionally, the gains in cost and inference speed we achieve with the lightweight Luna model (Sections 7.2, 7.3) offset the performance gap.

Results on the RAG QA test set are reported in Table 3 and follow a similar pattern. Luna outperforms the baselines across all verticals. We also evaluate the model's cross-domain generalization by training on a subset of the General Knowledge and Biomedical domains and evaluating on the others. We refer to this model as Luna (OOD). We find that Luna (OOD) still outperforms most baselines on the out-of-domain subsets. However, generalization to the Financial Reasoning domain is weak. Examples in this domain require reasoning over tabular data, which Luna (OOD) never observes in training. Fine-tuning on the Financial Reasoning domain greatly boosts performance, increasing AUROC from 0.64 to 0.82.

7 Discussion

7.1 Long Context Hallucination Detection

In Table 4 we report Luna's performance against baselines on a range of RAG context lengths. For this analysis we sample data from CUAD (Hendrycks et al., 2021), one of the RAG QA component datasets, which passes full-length legal contracts as context inputs into RAG. This dataset contains the largest range of context lengths in RAG QA. We find that the performance of all models inversely correlates with context length. However, while the GPT-3.5-powered baselines fail completely at the GPT-3.5 context limit (16k tokens), Luna maintains 68% of its performance on that subset.

To validate the efficacy of our span-level prediction and long context chunking approach (Section 3.2), we run an ablation study comparing our best model to a version of Luna that makes example-level predictions, referred to as Luna (example-level) in Table 4. As shown in Figure 3, we expect Luna (example-level) to perform worse on long contexts. Our findings confirm this hypothesis: although the hallucination detection performance of both models degrades with increasing context length, Luna (example-level) exhibits a greater degradation than Luna.

Method                    0-5k (223 examples)   5k-16k (209 examples)   16k+ (78 examples)
Prompt gpt-3.5-turbo      0                     -12.11%                 -100%
ChainPoll gpt-3.5-turbo   0                     -8.97%                  -100%
RAGAS Faithfulness        0                     -4.36%                  -100%
Trulens Groundedness      0                     -6.38%                  -100%
Luna                      0                     -12.55%                 -31.98%
Luna (example-level)      0                     -21.44%                 -43.75%

Table 4: Relative hallucination detection performance of various models on short (0-5k), medium (5k-16k), and long (16k+) context lengths; the number of test examples in each bucket is given in parentheses. Luna is our best finetuned DeBERTa-large model, and Luna (example-level) is a version of Luna that makes hallucination predictions at the example level. All GPT-3.5-based baselines (including RAGAS and Trulens) fail on input lengths >16k, while Luna maintains 88% and 68% of its performance on medium (5k-16k) and long (16k+) context lengths, respectively. Luna (example-level) also struggles more with long context lengths than Luna.

7.2 Cost vs Accuracy Trade-offs

API-based hallucination detection methods accrue substantial costs if used continuously in production settings. Luna outperforms GPT-3.5-based approaches while operating at a fraction of the cost. In Figure 1 we illustrate the trade-off between monthly maintenance costs and accuracy for Luna versus our GPT-3.5-based baselines. Costs are estimated assuming an average throughput of 10 queries per second, with an average query length of 4000 tokens. We use OpenAI API (https://openai.com/api/pricing/) and AWS cloud (https://aws.amazon.com/ec2/pricing/on-demand/) pricing at the time of writing. Detailed cost calculations can be found in Appendix B. Although we do not explicitly compare pricing against larger fine-tuned models such as Llama-2-13B, we note that hosting a multi-billion parameter model demands substantially more compute resources than Luna, which would be reflected in the overall cost.

7.3 Latency Optimizations

We optimize Luna and its deployment architecture to process up to 16k input tokens in under one second on an NVIDIA L4 GPU. To achieve this, we deploy an ONNX-traced model on NVIDIA Triton server with a TensorRT backend. We leverage Triton's Business Logic Scripting (BLS) to optimize the data flow and orchestration between GPU and CPU resources. BLS intelligently allocates resources based on the specific requirements of each inference request, ensuring that both GPU and CPU are utilized effectively and that neither resource becomes a bottleneck. We also tune our inference model's maximum input length for optimal performance. While increasing the maximum sequence length would reduce the size and number of batches processed by the model (see Section 3.2), transformer layer computational complexity also scales quadratically with input length. We determine a token length of 512 to be the most effective. Finally, we optimize pre- and post-processing Python code for maximum efficiency. Table 5 in the Appendix details the latency reductions achieved at each optimization step.
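As a rough illustration of the first optimization step (an ONNX-traced encoder with a 512-token maximum length), here is a generic export of the public DeBERTa-v3-large NLI checkpoint referenced in Section 3. This is a sketch of the general technique under those assumptions, not the authors' deployment pipeline.

```python
# Rough sketch: trace a DeBERTa-v3-large classifier to ONNX with max length 512,
# mirroring the first step of the Section 7.3 latency optimizations. The checkpoint
# is the public NLI base model from Section 3; this is not the authors' code.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CKPT = "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT)
model.config.return_dict = False  # return plain tuples so tracing stays simple
model.eval()

# Dummy (premise, hypothesis) pair standing in for a (context window, question+response) input.
enc = tokenizer("context window", "question and response", return_tensors="pt",
                padding="max_length", truncation=True, max_length=512)

torch.onnx.export(
    model,
    (enc["input_ids"], enc["attention_mask"]),
    "encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"},
                  "logits": {0: "batch"}},
    opset_version=17,
)
# The exported graph can then be compiled with TensorRT and served behind Triton,
# as described in the paper.
```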

8 Conclusion

In this work we introduced Luna: a cost-effective hallucination detection model with millisecond inference speed. Luna eliminates the dependency on slow and expensive 3rd-party API calls, and enables practitioners to effectively address hallucinations in production. The proposed model can be hosted on a local GPU, guaranteeing privacy that 3rd-party APIs cannot.

8.1 Limitations

Closed Domain Hallucinations  Luna's efficacy is limited to closed-domain hallucination detection in RAG settings. Due to its size, Luna lacks the necessary world knowledge to detect open-domain hallucinations. For open-domain applications, Luna relies on a high-quality RAG retriever to provide the necessary context knowledge for an input query.

LLM Annotations  LLMs' remarkable zero-shot abilities have encouraged researchers to consider LLMs for annotation and synthetic data generation. Replacing human annotators with LLMs offers substantial efficiency and cost savings (Wang et al., 2021). However, LLM performance on various annotation tasks is still controversial, with some studies reporting high correlations between LLM and human judgements (Chiang and Lee, 2023; He et al., 2024; Verga et al., 2024), while others advise caution (Li et al., 2023; Wang et al., 2024). In this work, we recognize the potential noise and bias introduced into our training and evaluation data by automated GPT-4-turbo annotations. We hypothesize that our model derives greater advantages from training on a large-scale dataset, facilitated by low-cost LLM annotation, than it is hindered by potential noise within the data. After taking steps to ensure annotation quality (Section 4.2), we observe competitive performance on RAGTruth, a human-annotated benchmark, in Section 6. This evaluation provides external validation for our model outputs, although we acknowledge that performance could potentially be enhanced with higher quality annotation sources.

Sentence-level annotations  Luna is trained on sentence-level annotations, i.e., there is an assumption that a sentence is either supported or not supported. This is most often the case, but future work can explore token-level labels for compound sentences with partially supported claims.

8.2 Future Work

Hallucinations in RAG output highlight weaknesses of the generator model. However, it is equally important to consider the quality of the retriever and its contribution to the overall performance of a RAG system. A sub-optimal retriever may supply irrelevant context to the generator, making it difficult for the generator to produce an accurate response. A comprehensive RAG evaluation model should therefore assess all dimensions of the RAG system. To this end, metrics like context relevance have been explored to assess the quality of retrieved RAG contexts (Es et al., 2024; Saad-Falcon et al., 2024). In future work, we propose to leverage Luna for measuring a comprehensive suite of RAG metrics. One cost-effective approach could be to augment the current DeBERTa architecture with additional prediction heads that output multiple metrics in one forward pass. We hypothesize that the shared weights of the base encoder layers may enhance the performance of each head.

References

Ayush Agrawal, Mirac Suzgun, Lester Mackey, and Adam Kalai. 2024. Do language models know when they’re hallucinating references? In Findings of the Association for Computational Linguistics: EACL 2024, pages 912–928, St. Julian’s, Malta. Association for Computational Linguistics. Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore. Association for Computational Linguistics. Steven Bird and Edward Loper. 2004. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 214–217, Barcelona, Spain. Association for Computational Linguistics. Bernd Bohnet, Vinh Q. Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Massimiliano Ciaramita, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, Tom Kwiatkowski, Ji Ma, Jianmo Ni, Lierni Sestorain Saralegui, Tal Schuster, William W. Cohen, Michael Collins, Dipanjan


Das, Donald Metzler, Slav Petrov, and Kellie Webster. 2023. Attributed question answering: Evaluation and modeling for attributed large language models. Preprint, arXiv:2212.08037.

Nouha Dziri, Hannah Rashkin, Tal Linzen, and David Reitter. 2022b. Evaluating attribution in dialogue systems: The BEGIN benchmark. Transactions of the Association for Computational Linguistics, 10:1066– 1083.

Meng Cao, Yue Dong, and Jackie Cheung. 2022. Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3340–3354, Dublin, Ireland. Association for Computational Linguistics.

Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. 2024. RAGAs: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, St. Julians, Malta. Association for Computational Linguistics.

Vittorio Castelli, Rishav Chakravarti, Saswati Dana, Anthony Ferritto, Radu Florian, Martin Franz, Dinesh Garg, Dinesh Khandelwal, Scott McCarley, Michael McCawley, Mohamed Nasr, Lin Pan, Cezar Pendus, John Pitrelli, Saurabh Pujar, Salim Roukos, Andrzej Sakrajda, Avi Sil, Rosario Uceda-Sosa, Todd Ward, and Rong Zhang. 2020. The TechQA dataset. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1269– 1278, Online. Association for Computational Linguistics.

Robert Friel and Atindriyo Sanyal. 2023. Chainpoll: A high efficacy method for llm hallucination detection. Preprint, arXiv:2310.18344. Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. 2023. RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16477–16508, Toronto, Canada. Association for Computational Linguistics.

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. 2021. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. 2024. Does fine-tuning llms on new knowledge encourage hallucinations? Preprint, arXiv:2405.05904. Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023. DeBERTav3: Improving deBERTa using ELECTRAstyle pre-training with gradient-disentangled embedding sharing. In The Eleventh International Conference on Learning Representations.

Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations.

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot arena: An open platform for evaluating llms by human preference. Preprint, arXiv:2403.04132.

Xingwei He, Zhenghao Lin, Yeyun Gong, A-Long Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, and Weizhu Chen. 2024. Annollm: Making large language models to be better crowdsourced annotators. Preprint, arXiv:2303.16854.

Souvik Das, Sougata Saha, and Rohini Srihari. 2022. Diving deep into modes of fact hallucinations in dialogue systems. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 684–699, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. 2021. Cuad: An expert-annotated nlp dataset for legal contract review. NeurIPS. Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, and Yossi Matias. 2022. TRUE: Re-evaluating factual consistency evaluation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3905–3920, Seattle, United States. Association for Computational Linguistics.

Nouha Dziri, Sivan Milton, Mo Yu, Osmar Zaiane, and Siva Reddy. 2022a. On the origin of hallucinations in conversational models: Is it the datasets or the models? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5271–5285, Seattle, United States. Association for Computational Linguistics.


Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12).

Yifei Li, Xiang Yue, Zeyi Liao, and Huan Sun. 2024. Attributionbench: How hard is automatic attribution evaluation? arXiv preprint arXiv:2402.15089v1. Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, and Ming Yin. 2023. Synthetic data generation with large language models for text classification: Potential and limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10443–10461, Singapore. Association for Computational Linguistics.

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567– 2577, Hong Kong, China. Association for Computational Linguistics.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics.

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, and Jared Kaplan. 2022. Language models (mostly) know what they know. Preprint, arXiv:2207.05221.

Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, and Daniel E. Ho. 2024. Hallucination-free? assessing the reliability of leading ai legal research tools. Preprint, arXiv:2405.20362.

Ehsan Kamalloo, Aref Jafari, Xinyu Zhang, Nandan Thakur, and Jimmy Lin. 2023. HAGRID: A humanllm collaborative dataset for generative informationseeking with attribution. arXiv:2307.16883.

Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, and Dan Roth. 2024. Expertqa: Expert-curated questions and attributed answers. Preprint, arXiv:2309.07852.

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. 2024. Prometheus 2: An open source language model specialized in evaluating other language models. Preprint, arXiv:2405.01535.

Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore. Association for Computational Linguistics.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts. Preprint, arXiv:2307.03172.

Moritz Laurer, Wouter van Atteveldt, Andreu Casas, and Kasper Welbers. 2022. Less annotating, more classifying – addressing the data scarcity issue of supervised machine learning with deep transfer learning and bert - nli. Open Science Framework Preprint.

Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, and Mark Steedman. 2023. Sources of hallucination by large language models on inference tasks. In The 2023 Conference on Empirical Methods in Natural Language Processing.

Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale N Fung, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Factuality enhanced language models for open-ended text generation. In Advances in Neural Information Processing Systems, volume 35, pages 34586–34599. Curran Associates, Inc.

Timo Möller, Anthony Reina, Raghavan Jayakumar, and Malte Pietsch. 2020. COVID-QA: A question answering dataset for COVID-19. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online. Association for Computational Linguistics.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledgeintensive nlp tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459– 9474. Curran Associates, Inc.

Benjamin Muller, John Wieting, Jonathan Clark, Tom Kwiatkowski, Sebastian Ruder, Livio Soares, Roee Aharoni, Jonathan Herzig, and Xinyi Wang. 2023. Evaluating and modeling attribution for cross-lingual question answering. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 144–157, Singapore. Association for Computational Linguistics.


Abhilash Nandy, Soumya Sharma, Shubham Maddhashiya, Kapil Sachdeva, Pawan Goyal, and NIloy Ganguly. 2021. Question answering over electronic devices: A new benchmark dataset and a multi-task learning based QA framework. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4600–4609, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. 2024. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. Preprint, arXiv:2404.18796. Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. 2023. Freshllms: Refreshing large language models with search engine augmentation. Preprint, arXiv:2310.03214.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human generated machine reading comprehension dataset.

Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want to reduce labeling cost? GPT-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4195–4205, Punta Cana, Dominican Republic. Association for Computational Linguistics.

OpenAI. 2023. https://openai.com. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc.

Zengzhi Wang, Qiming Xie, Yi Feng, Zixiang Ding, Zinong Yang, and Rui Xia. 2024. Is chatgpt a good sentiment analyzer? a preliminary study. Preprint, arXiv:2304.04339. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.

Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. 2023. Measuring attribution in natural language generation models. Computational Linguistics, 49(4):777–840.

Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Cheng Niu, Randy Zhong, Juntong Song, and Tong Zhang. 2023. Ragtruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. Preprint, arXiv:2401.00396.

Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325, Online. Association for Computational Linguistics.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. Ares: An automated evaluation framework for retrieval-augmented generation systems. Preprint, arXiv:2311.09476.

Xiang Yue, Boshi Wang, Ziru Chen, Kai Zhang, Yu Su, and Huan Sun. 2023. Automatic evaluation of attribution by large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4615–4635, Singapore. Association for Computational Linguistics.

Mobashir Sadat, Zhengyu Zhou, Lukas Lange, Jun Araki, Arsalan Gundroo, Bingqing Wang, Rakesh Menon, Md Parvez, and Zhe Feng. 2023. Delucionqa: Detecting hallucinations in domain-specific question answering. pages 822–835.

Hanning Zhang, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. 2024. R-tuning: Instructing large language models to say ‘i don’t know’. Preprint, arXiv:2311.09677.

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. Preprint, arXiv:2303.18223.

Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. 2023. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation. Preprint, arXiv:2307.03987.


Shen Zheng, Jie Huang, and Kevin Chang. 2024. Why does chatGPT fall short in providing truthful answers? In I Can’t Believe It’s Not Better Workshop: Failure Modes in the Age of Foundation Models.

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3277–3287, Online. Association for Computational Linguistics.

A Response Generation Prompt

We use the following prompt template to generate LLM responses for each sample in our QA RAG dataset. Context documents, separated by line breaks, along with the question are slotted in for each generation sample.

Use the following pieces of context to answer the question.
{documents}
Question: {question}

B Cost Calculations

Costs are estimated assuming an average throughput of 10 queries per second (qps), with an average RAG query length of 4000 tokens, and NVIDIA L4 GPU deployment hardware. When estimating LLM cost for >1 qps we assume concurrency is implemented to process multiple queries in parallel.

Luna Costs  Empirically, we find that each L4 GPU can serve up to 4 qps. At the time of writing, the monthly cost of running a g6.2xlarge GPU instance on AWS cloud is $700 (https://aws.amazon.com/ec2/pricing/on-demand/). Thus, we estimate the total monthly cost for 10 qps throughput as:

$700 * 10/4 = $1750    (6)

OpenAI Costs  At the time of writing, querying GPT-3.5-turbo through the OpenAI API costs $0.50 / 1M input tokens and $1.50 / 1M output tokens (https://openai.com/api/pricing/). In our test set, we observe an average output length from GPT-3.5 of 200 tokens. Using an average input length of 4000 tokens, the cost of a single query is roughly:

(4k * $0.5 + 200 * $1.5) / 1M = $0.0023    (7)

Using 2,592,000 seconds/month, the monthly cost of serving 10 qps with GPT-3.5 is:

10 qps * 2,592,000 * $0.0023 = $59,616    (8)

With the ChainPoll ensemble, we request 3 outputs per query, bringing the cost of a single query up to:

(4k * $0.5 + 3 * 200 * $1.5) / 1M = $0.0029    (9)

and the total monthly cost for 10 qps to:

10 qps * 2,592,000 * $0.0029 = $75,168    (10)

RAGAS Costs  RAGAS makes 2 OpenAI API calls per input RAG example. The first query extracts a list of claims from the response. The second requests the LLM to evaluate the faithfulness of each extracted claim to the RAG context. We estimate that the output length of the first query is roughly equal to the length of the RAG response, and the output length of the second query is roughly 3x the length of the response, since it includes the original claims followed by a faithfulness score and an explanation. Factoring in the overhead token length of each prompt, we calculate the cost per query to be:

Query1 = $380 / 1M    (11)
Query2 = $2730 / 1M    (12)

Then, the monthly cost of serving 10 qps is:

10 qps * 2,592,000 * ($380 + $2730) / 1M = $79,937    (13)

Trulens Costs  Trulens makes 1 OpenAI API call per sentence in the response. For this calculation, we estimate 3 sentences per response, which aligns with our observations on the QA RAG dataset. Each query returns the original sentence, a groundedness score (1-10), and an explanation. Here we assume that the token length of the explanation is roughly equal to the token length of the input sentence. The cost of a single query is roughly:

(4k * $0.5 + 2 * 75 * $1.5) / 1M = $0.0022    (14)

Using 2,592,000 seconds/month, the monthly cost of serving 10 qps with Trulens is:

10 qps * 2,592,000 * 3 * $0.0022 = $173,016    (15)
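For readers who want to rerun the back-of-the-envelope numbers, here is a small helper reproducing the GPT-3.5 and ChainPoll estimates above (equations 7-10); the prices are the per-token rates quoted in this appendix, not live API pricing.

```python
# Reproduce the Appendix B estimates for a single GPT-3.5-turbo judge (Eqs. 7-8)
# and the 3-output ChainPoll ensemble (Eqs. 9-10), using the rates quoted above.
SECONDS_PER_MONTH = 2_592_000
QPS = 10
INPUT_TOKENS = 4_000        # average RAG query length
OUTPUT_TOKENS = 200         # average GPT-3.5 output length observed in the test set
INPUT_PRICE = 0.50 / 1e6    # $ per input token
OUTPUT_PRICE = 1.50 / 1e6   # $ per output token

def monthly_cost(n_sampled_outputs: int) -> float:
    # Per the formulas above, the input is charged once per query and the
    # output is charged once per sampled completion.
    per_query = INPUT_TOKENS * INPUT_PRICE + n_sampled_outputs * OUTPUT_TOKENS * OUTPUT_PRICE
    return QPS * SECONDS_PER_MONTH * per_query

print(f"GPT-3.5 judge:      ${monthly_cost(1):,.0f} / month")   # ~$59,616
print(f"ChainPoll ensemble: ${monthly_cost(3):,.0f} / month")   # ~$75,168
```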

C Latency Optimizations

We optimize Luna and its deployment architecture to process up to 16k input tokens in under one second on an NVIDIA L4 GPU. Table 5 details the latency reductions and how they were achieved.

Optimization                               s/16k
baseline                                   3.27
TensorRT backend                           2.09
efficient pre- and post-processing code    1.79
512 max model length                       0.98
BLS                                        0.92

Table 5: Impact of latency optimizations on Luna inference speed. Reporting inference speed in seconds for processing 16k input tokens.

D Latency Comparison

We empirically estimate the latency of Luna and each baseline model. Luna latency is discussed in Appendix C. For LLM models that query the OpenAI API, we calculate the average latency per query after querying the API multiple times with an input of 4000 tokens, split between 3800 tokens for the context, 25 tokens for the question, and 75 tokens for the response.

Model           s/4k   %change
Luna            0.23   -
GPT-3.5         2.5    -91%
ChainPoll n=3   3.0    -93%
Trulens         3.4    -93%
RAGAS           5.4    -96%

Table 6: Model latency (in seconds), comparing Luna to LLM baselines. We also report the % difference between Luna and the LLM-based models.
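A rough sketch of how one might time a single evaluator call, as in the Appendix D measurement, assuming an OpenAI-compatible client; the prompt and model name are placeholders, and the numbers in Table 6 come from the authors' own measurements.

```python
# Time one hallucination-check call against an OpenAI-compatible endpoint,
# roughly mirroring the per-query latency measurement in Appendix D.
# The prompt builder and model name are placeholders, not the authors' exact setup.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def time_one_call(prompt: str, model: str = "gpt-3.5-turbo") -> float:
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start

# Example usage (build_prompt is a hypothetical helper that assembles the
# ~3800-token context, 25-token question, and 75-token response into one prompt):
# latencies = [time_one_call(build_prompt(context, question, response)) for _ in range(20)]
# print(sum(latencies) / len(latencies))
```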

First question:

Can you give me a very clear explanation of the core assertions, implications, and mechanics elucidated in this paper?


ASSISTANT

Certainly. Here is a clear explanation of the core assertions, implications, and mechanics described in this paper:

Core Assertions:

  1. Current hallucination detection techniques for retrieval-augmented generation (RAG) systems fail to simultaneously deliver accuracy, low latency, and low cost.

  2. Luna, a lightweight DeBERTa-large encoder fine-tuned for hallucination detection, outperforms larger language models and commercial evaluation frameworks on this task.

  3. Luna generalizes well across multiple industry verticals and out-of-domain data.

Implications:

  1. Luna enables more reliable and cost-effective deployment of RAG systems in industry applications.

  2. The approach used to develop Luna could potentially be applied to create other lightweight, specialized models for LLM evaluation tasks.

  3. Luna's performance suggests that smaller, fine-tuned models can be competitive with larger language models for specific tasks like hallucination detection.

Mechanics:

  1. Model Architecture: Luna is based on a 440M parameter DeBERTa-large encoder.

  2. Training Data: Luna is fine-tuned on a carefully curated real-world RAG dataset covering multiple domains.

  3. Task Framing: The model is trained to identify supported tokens in the response, given a query and retrieved context.

  4. Long Context Handling: Luna uses a novel approach to handle long RAG contexts (up to 16k tokens) by making span-level predictions and aggregating them.

  5. Optimization: The model is optimized for deployment using techniques like ONNX tracing, TensorRT backend, and efficient pre/post-processing.

  6. Evaluation: Luna is benchmarked against GPT-3.5, commercial frameworks like RAGAS and Trulens, and other baselines on multiple datasets.

The paper presents Luna as a solution to the challenge of detecting hallucinations in RAG systems that balances accuracy, speed, and cost considerations for practical industry deployment.


USER

Second question:

Can you explain the value of this in basic terms? Like you're talking to a CEO. So what? What's the bottom line here?


ASSISTANT

Certainly. Here's a simple explanation of the value proposition, as if speaking to a CEO:

Bottom line:

Luna is a game-changer for businesses using AI-powered chatbots or search systems. It solves a critical problem - catching AI errors quickly and cheaply - which directly impacts customer trust and company liability.

Here's why it matters:

  1. Cost Savings: Luna cuts detection costs by 97% compared to current methods. This means significant savings on your AI operations.

  2. Speed: It's 91% faster than existing solutions. This speed allows you to catch and correct AI mistakes before customers see them, protecting your brand reputation.

  3. Accuracy: Despite being smaller and cheaper to run, Luna outperforms larger, more expensive AI models in detecting errors.

  4. Versatility: It works well across different industries and topics, so you don't need multiple specialized systems.

  5. Scalability: Luna's efficiency makes it feasible to implement widespread error checking in your AI systems without breaking the bank.

In essence, Luna offers a way to make your AI interactions more reliable and trustworthy, while simultaneously reducing operational costs and risks. It's a tool that can help maintain customer confidence in your AI-driven services while keeping your bottom line healthy.


USER

Third question:

Can you give me an analogy or metaphor that will help explain this to a broad audience?


ASSISTANT

Certainly. Here's an analogy that could help explain Luna to a broad audience:

Think of Luna like a highly efficient proofreader for AI conversations.

Imagine you're running a massive international call center. Your agents are incredibly knowledgeable and can answer questions on almost any topic, but occasionally they might mix up facts or give out incorrect information. To prevent this, you hire proofreaders to listen in on calls and catch any mistakes before they reach the customer.

Now, traditional methods of catching AI mistakes are like hiring teams of expert proofreaders for each call. They're highly skilled, but they're also expensive and work relatively slowly. Sometimes, they might catch a mistake after the customer has already heard it.

Luna, on the other hand, is like having a brilliant, lightning-fast proofreader who specializes in catching factual errors. This proofreader:

  1. Works incredibly quickly, able to check conversations in near real-time.
  2. Costs much less than the team of expert proofreaders.
  3. Is just as accurate (if not more so) at catching mistakes.
  4. Can work across multiple departments and topics without needing specialized training.

With Luna, it's as if every AI conversation has this super-efficient proofreader checking it instantly, allowing you to correct any mistakes before the customer even notices. This ensures that your AI 'call center' provides accurate information, maintains customer trust, and operates cost-effectively.

This analogy helps illustrate how Luna improves the reliability of AI systems while also addressing the practical concerns of speed and cost in a business environment.


USER

Fourth question:

Can you provide a specific example that best illustrates the key points elucidated in this paper?


ASSISTANT

The paper doesn't provide a specific example of Luna in action, but I can construct a hypothetical example based on the information provided that illustrates the key points:

Imagine a large financial services company using an AI-powered chatbot to handle customer inquiries. A customer asks:

"What's the current interest rate for a 30-year fixed mortgage?"

The AI system retrieves some context about current mortgage rates and generates a response:

"The current interest rate for a 30-year fixed mortgage is 3.5%. This rate has been stable for the past 6 months and is expected to remain low due to the Federal Reserve's monetary policy."

Now, let's see how Luna would evaluate this response:

  1. Speed: Luna can process this input (question, context, and response) in milliseconds, allowing for real-time checking before the response is sent to the customer.

  2. Accuracy: Luna analyzes the response token by token, checking if each piece of information is supported by the retrieved context. It might flag the "3.5%" as accurate if it matches the context, but could identify the statement about the rate being stable for 6 months as a potential hallucination if this information isn't present in the retrieved context.

  3. Cost-effectiveness: While larger models like GPT-3.5 might take seconds and cost more per query to perform this check, Luna does it almost instantly at a fraction of the cost.

  4. Long context handling: If the retrieved context is very long (e.g., containing detailed historical rate data), Luna can efficiently process it using its novel chunking approach.

  5. Cross-domain applicability: Although this example is in finance, Luna would perform similarly well for queries in other domains like healthcare or tech support, without needing domain-specific training.

This example illustrates how Luna can quickly and accurately detect potential hallucinations in AI-generated responses, allowing the system to either modify the response or flag it for human review before it reaches the customer. This helps maintain the accuracy and trustworthiness of the AI system while operating efficiently at scale.
