Summary: Luna: An Evaluation Foundation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost

USER

Review the following paper and answer any questions based on the content of the paper. The paper should be regarded as the source of truth when answering any questions.

Paper

Luna: An Evaluation Foundation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost

Masha Belyi∗

Robert Friel∗

Shuai Shao

Atindriyo Sanyal

Galileo Technologies Inc. {masha,rob,ss,atin}@rungalileo.io

arXiv:2406.00975v2 [cs.CL] 5 Jun 2024

Abstract

Retrieval-Augmented Generation (RAG) systems have become pivotal in enhancing the capabilities of language models by incorporating external knowledge retrieval mechanisms. However, a significant challenge in deploying these systems in industry applications is the detection and mitigation of hallucinations: instances where the model generates information that is not grounded in the retrieved context. Addressing this issue is crucial for ensuring the reliability and accuracy of responses generated by large language models (LLMs) in diverse industry settings. Current hallucination detection techniques fail to deliver accuracy, low latency, and low cost simultaneously. We introduce Luna: a DeBERTa-large (440M) encoder, fine-tuned for hallucination detection in RAG settings. We demonstrate that Luna outperforms GPT-3.5 and commercial evaluation frameworks on the hallucination detection task, with 97% and 91% reductions in cost and latency, respectively. Luna is lightweight and generalizes across multiple industry verticals and out-of-domain data, making it an ideal candidate for industry LLM applications.

1 Introduction

Figure 1: Luna is a lightweight DeBERTa-large encoder, fine-tuned for hallucination detection in RAG settings. Luna outperforms zero-shot hallucination detection models (GPT-3.5, ChainPoll GPT-3.5 ensemble) and RAG evaluation frameworks (RAGAS, Trulens) at a fraction of the cost and millisecond inference speed.

Large Language Models (LLMs) are broadly used in industry dialogue applications due to their impressive ability to hold a natural conversation and succeed on a variety of reasoning tasks (Zhao et al., 2023). A key challenge in deploying customer-facing LLMs is their propensity for hallucinations, where the model presents cohesive but factually incorrect information in conversation with a user (Roller et al., 2021; Lin et al., 2022). Retrieval-augmented generation (RAG), a technique for incorporating knowledge relevant to each user query in the LLM prompt, effectively reduces LLM hallucinations in production systems (Lewis et al., 2020).

Yet, LLMs still often respond with nonfactual information that contradicts the knowledge supplied by RAG (Shuster et al., 2021; Magesh et al., 2024). Causes of hallucinations have been extensively studied across different LLM tasks (Zheng et al., 2024; Cao et al., 2022; Das et al., 2022). Key contributing factors include knowledge cutoff (Vu et al., 2023), randomness (Lee et al., 2022), faulty training data (Dziri et al., 2022a; Lin et al., 2022; McKenna et al., 2023), and finetuning with large amounts of new knowledge (Gekhman et al., 2024). Apart from RAG, proposed mitigation solutions explore prompt engineering with chain of thought (Wei et al., 2022), finetuning (Zhang et al., 2024), reinforcement learning with human feedback (Ouyang et al., 2022), and specialized hallucination detection models (Wu et al., 2023; Lin et al., 2022). For RAG specifically, evaluation frameworks like RAGAS (Es et al., 2024), Trulens (https://www.trulens.org/), and ARES (Saad-Falcon et al., 2024) have emerged to offer automated hallucination detection at scale. However, these approaches rely on static prompts (RAGAS, Trulens) or finetuning on in-domain data (ARES), which limits their capacity to generalize to a breadth of industry applications. Gao et al. (2023) and Wu et al. (2023) take it a step further and successfully suppress hallucinations in LLM responses with a detect-and-replace technique. However, due to the prohibitively slow latency of their LLM evaluation models, real-time hallucination prevention in production systems remains a challenge.

Customer-facing dialogue applications necessitate a hallucination detection system with high accuracy, low cost, and low latency, such that hallucinations are caught and resolved before reaching the user. Few/zero-shot LLM approaches fail to meet the strict latency requirement due to model size. Moreover, though commercial LLMs like OpenAI's GPT models (OpenAI, 2023) achieve strong performance, querying customer data through 3rd-party APIs is both costly and undesirable for privacy and security reasons. Finetuned BERT-size models can achieve competitive performance to LLM judges (Bohnet et al., 2023; Saad-Falcon et al., 2024; Gao et al., 2023; Li et al., 2024; Yue et al., 2023), offering lower latency and local execution. However, these models require annotated data for finetuning and have not been evaluated for large-scale, cross-domain applications.

In this paper, we introduce Luna, a lightweight RAG hallucination detection model that generalizes across multiple industry-specific domains and scales well for real-time deployment. Luna is a 440M parameter DeBERTa-large encoder that is finetuned on carefully curated real-world RAG data. From analysis of RAG in production settings, we identify long-context RAG evaluation as a previously unaddressed challenge and propose a novel solution that facilitates high-precision long-context RAG hallucination detection. Through extensive benchmarking, we demonstrate that Luna outperforms zero-shot prompting and RAG evaluation frameworks on the hallucination detection task.

Our approach is closest to the concurrently proposed ARES automated RAG evaluation framework (Saad-Falcon et al., 2024), with a few key differences: (1) ARES requires a validation set of in-domain annotated data to finetune a custom evaluation model, while Luna is pre-trained on a cross-domain corpus for built-in generalization; (2) Luna accurately detects hallucinations on long RAG contexts; and (3) Luna is optimized to process up to 16k tokens in milliseconds on deployment hardware.

∗ These authors contributed equally to this work.

2 Related Work

Hallucination detection  Prior work on hallucination detection in natural language generation (NLG) is vast (Ji et al., 2023). SelfCheckGPT (Manakul et al., 2023) and Agrawal et al. (2024) are examples of heuristic consistency-based methods that detect unreliable LLM outputs by comparing multiple sampled responses from the same LLM. Others look to the internal state of the LLM, such as hidden layer activations (Azaria and Mitchell, 2023) and token-level uncertainty (Varshney et al., 2023), as a proxy signal for hallucinations. Kadavath et al. (2022) prompt the generating LLM to introspect and evaluate its own responses. More generally, zero-shot (Es et al., 2024) and finetuned (Wu et al., 2023; Yue et al., 2023; Muller et al., 2023) LLM judges leverage LLMs' inherent reasoning abilities to evaluate other LLM generations. Similarly, general-purpose finetuned LLM evaluators (Kim et al., 2024) that have been shown to correlate with human judgements can also be applied to hallucination detection. Our approach of finetuning a small LM evaluator, as in Gao et al. (2023) and Saad-Falcon et al. (2024), is the first to evaluate and optimize such a model for industry applications under strict performance, cost, and latency constraints.

NLI for closed-domain hallucination detection  Existing research draws parallels between the hallucination detection task and the concept of entailment in Natural Language Inference (NLI). The goal of NLI is to determine the relationship between a premise and a hypothesis, which can be one of: entailment, contradiction, or neutral. In the past, NLI models have been used to evaluate factual consistency on closed-domain NLG tasks (Honovich et al., 2022; Dziri et al., 2022b). The Attributable to Identified Sources (AIS) framework, introduced by Rashkin et al. (2023), formally unifies the notions of factuality, attribution, hallucination, faithfulness, and groundedness, all terms used to measure the extent to which an LLM response is attributable to some source of ground truth. In follow-up work, NLI entailment has been shown to correlate with AIS scores (Gao et al., 2023; Bohnet et al., 2023; Li et al., 2024) and has become a standard baseline for AIS and hallucination detection models. In this work, we use pre-trained NLI model weights as the starting point for Luna finetuning.

3 Luna Model

We fine-tune a DeBERTa-v3-Large (He et al., 2023) NLI checkpoint (https://huggingface.co/MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli) from Laurer et al. (2022) with a shallow hallucination classifier on each response token. We train on the task of identifying supported tokens in the response, given a query and retrieved context. Framing the problem in this way makes our work comparable to recent automated RAG evaluation efforts. Our definition of support is synonymous with the answer faithfulness metric explored in RAGAS (Es et al., 2024) and ARES (Saad-Falcon et al., 2024), Trulens groundedness, and attribution (Li et al., 2024). At inference, we treat spans with low support probabilities as hallucinated spans.

Similar to Gao et al. (2023) and Wu et al. (2023), we aim to identify hallucinated spans in the response, rather than the less granular example-level hallucination boolean. While predicting spans is a more challenging task, it yields a more informative prediction for the end user. Further, this approach sets us up for long-context prediction, which we discuss in detail next.

3.1 Long Context RAG

In practice, we find that context length limitations are a significant pain point in industry applications. Custom RAG setups may retrieve a large number of context documents from various sources, or choose not to chunk the documents before passing them into the retriever. This results in long inputs to the RAG generator and evaluation models, sometimes even exceeding the token limit of select commercial LLMs. In Figure 2 we visualize the context length distribution of our curated RAG dataset (detailed in Section 4.1).

Figure 2: Distribution of RAG context token lengths in our QA RAG training split.

While our base DeBERTa model can technically handle sequences of up to 24k tokens (He et al., 2021), the computational complexity of transformer attention layers scales quadratically with input length. Moreover, though long-context LLMs like Claude-3 are becoming competitive on LLM leaderboards (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), research shows that these models suffer from information loss (Liu et al., 2023) and may not be suitable for long-context RAG evaluation. A naive solution is to chunk long-context RAG inputs into short segments and process them through the evaluator model in batches. Model predictions can then be aggregated over batch rows to predict example-level hallucination probabilities. Figure 3 illustrates how such chunking may result in false positives in cases where supporting information is scattered throughout the long context document(s). Instead, we leverage span-level predictions for a high-precision classifier over long sequence inputs.

Figure 3: Long RAG context with naive chunking example. Naive context chunking leads to hallucination false positives when supporting information is scattered throughout the context. Without insight into which specific spans were supported/not supported by the context, it is impossible to arrive at the correct conclusion that the response in this example does NOT contain hallucinations.

3.2 Long Context Chunking

Consider a single input into the RAG evaluation model that consists of C context tokens [c_1 ... c_C], Q question tokens [q_1 ... q_Q], and R response tokens [r_1 ... r_R]. Assume we are working with an evaluator model that accepts a maximum sequence length L, and that Q + R < L, but C is much larger (the same approach easily extends to cases where R > L). To fit the example into the model, we break it up into windows of length L, such that each window contains the question, the response, and a subset of the context tokens:

w_i = [c_i1 ... c_il] ⊕ [q_1 ... q_Q] ⊕ [r_1 ... r_R]    (1)

where l = L - Q - R, and there are N_l windows per example. In Figure 3 there are three such windows. Our model outputs support probabilities p_i for each of the R response tokens in w_i as:

P_S(w_i) = [p_i1 ... p_iR]    (2)

We train with a cross-entropy loss on each token output. During training, we leverage granular token-level support labels (Section 4.2) to adjust the training labels in each batch based on which context tokens are present in the window. For example, in Figure 3, "Washington, D.C., the capital of the US" is supported in window 1, nothing is supported in window 2, and "was founded in 1791" is supported in window 3.

At inference, we aggregate example-level support probabilities by taking the token-level maximum over windows. Refer to Figure 4 for a visual illustration of the steps described by equations (3)-(5) below. The example-level support probability for token j is defined as:

p_j = max_{1 ≤ i ≤ |w|} (p_ij)    (3)

where |w| = N_l is the total number of windows we created in (1). To produce an example-level label, we take the minimum over the R tokens:

P_S = min(p_1 ... p_R)    (4)

so that the overall support probability is no greater than the support probability of the least supported token in the response. Finally, we derive the example hallucination probability P_H as:

P_H = 1 - P_S    (5)

Figure 4: Illustration of Luna's token-level predictions for the example in Figure 3. Luna's token-level predictions are aggregated over context windows into a high-precision hallucination probability score.

3.3 Training

To leverage the full pre-trained NLI model, we initialize the hallucination prediction head with weights from the NLI classification head. The original NLI head is a 3-class single-layer perceptron with a neuron for each NLI class (entailment, contradiction, and neutral). During training, we optimize for low entailment probability and high contradiction probability for hallucinated tokens (and the opposite for supported tokens). At inference, we output the probability of entailment for each token. We apply data transformation techniques to introduce additional variability for better generalization during training. Transformations include dropping and inserting context documents, and shuffling questions and responses between examples in a batch. Training labels are adjusted accordingly with each transformation. The model trains for 3 epochs with cross-entropy loss on the output of each response token. We initialize the learning rate to 5e-6 for the base model layers and 2e-5 for the classification head, and train with warmup and a linear decay rate.
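For concreteness, here is a minimal sketch of the windowing and aggregation logic described in Section 3.2 (equations 1-5). It assumes per-window token support probabilities are already available from some model; the function and variable names are illustrative, not taken from the Luna codebase.

```python
# Sketch of Luna's long-context chunking and score aggregation (Sec. 3.2, Eqs. 1-5).
# `support_probs(window_context, question, response)` stands in for the model and is
# assumed to return one support probability per response token for a single window.
from typing import Callable, List


def hallucination_probability(
    context: List[str],
    question: List[str],
    response: List[str],
    support_probs: Callable[[List[str], List[str], List[str]], List[float]],
    max_len: int = 512,
) -> float:
    """Return P_H = 1 - P_S for one RAG example."""
    l = max_len - len(question) - len(response)  # context tokens per window (Eq. 1)
    assert l > 0, "question + response must fit inside one window"

    # Split the context into N_l windows, each paired with the full question/response.
    windows = [context[i:i + l] for i in range(0, len(context), l)]

    # Per-window support probabilities for each response token (Eq. 2).
    per_window = [support_probs(w, question, response) for w in windows]

    # Token-level max over windows (Eq. 3): a response token counts as supported
    # if ANY window supports it, which avoids the false positives in Figure 3.
    token_support = [max(p[j] for p in per_window) for j in range(len(response))]

    # Example-level support is the least-supported token (Eq. 4);
    # the hallucination probability is its complement (Eq. 5).
    return 1.0 - min(token_support)
```

In the Figure 3/4 example, a response token supported only in the third window still receives a high token-level score after the max over windows, so the response is correctly not flagged as hallucinated.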

4 Data

4.1 RAG QA dataset

We recycle open-book QA datasets to construct a RAG QA dataset. Our goal is to simulate natural RAG examples that may occur in production settings. We sample data from five industry verticals: customer support (DelucionQA (Sadat et al., 2023), EManual (Nandy et al., 2021), TechQA (Castelli et al., 2020)), finance and numerical reasoning (FinQA (Chen et al., 2021), TAT-QA (Zhu et al., 2021)), biomedical research (PubmedQA (Jin et al., 2019), CovidQA (Möller et al., 2020)), legal (CUAD (Hendrycks et al., 2021)), and general knowledge (HotpotQA (Yang et al., 2018), MS MARCO (Nguyen et al., 2016), HAGRID (Kamalloo et al., 2023), ExpertQA (Malaviya et al., 2024)). The combined dataset contains examples from a variety of difficult RAG task types, including numerical reasoning over tables, inference over multiple context documents, and retrieval from long contexts. We reserve 20% of the dataset for validation and testing. Table 1 reports statistics of the data splits.

Domain                train   val   test   %H
customer support      4k      600   600    22%
finance               38k     5k    5k     5%
biomedical research   22k     3k    3k     20%
legal                 1.5k    500   500    6%
general knowledge     9.5k    2k    2k     18%

Table 1: RAG QA data statistics. RAG contexts and questions are sourced from open-book QA datasets that cover five industry-specific domains. RAG responses are generated with GPT-3.5 and Claude-3-Haiku, and annotated with GPT-4-turbo. %H indicates the fraction of hallucinated responses in each domain.

For each component dataset, we ignore the ground truth responses and generate two new responses per input with GPT-3.5 and Claude-3-Haiku. These models exhibit strong reasoning and conversational abilities (Chiang et al., 2024) at a low price point, which makes them realistic candidates for production RAG systems. We set temperature to 1 for generation to encourage diversity and potential hallucinations in the responses. Next, we describe how we annotate the data for training.

4.2 Labeling

We leverage GPT-4-turbo to annotate the RAG QA dataset. Refer to Section 8.1 for a discussion of the limitations of this approach. Before annotation, we split the context and response into sentences using nltk (Bird and Loper, 2004). We pass the question along with the tokenized context and response sentences to GPT-4-turbo for annotation. For each sentence in the response, we instruct the LLM to identify which context sentences, if any, support the claim in the response. Tokens in sentences without any support are treated as hallucinations. We find that LLM responses often contain transition sentences and general statements that, while not supported by any specific context span, are generally grounded in the question and provided context. We instruct the annotator to label these as "generally supported", which we post-process to indicate support in every context window during training. Statements highlighting a lack of sufficient information to answer the question also fall into this category.

We take measures to ensure high quality labels from our LLM annotator. First, we use chain-of-thought (Wei et al., 2022), which has been shown to increase agreement between LLM and human judgements (He et al., 2024). Next, we request both response-level and sentence-level annotations, which we compare to identify potentially noisy labels. For example, if GPT-4 claims a response is supported by the context as a whole, but identifies no supporting information for one or more claims in the response, we send the example for re-annotation. We re-annotate examples up to 3 times, after which <2% of the data are still conflicting. After manual inspection, we find that the majority of the conflicts arise from partially supported sentences. Since our annotation scheme is binary on the sentence level (the full sentence is either supported or not), we resolve all tokens in partially supported sentences to "not supported" on both the sentence and example level.
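A small sketch of the response-level versus sentence-level consistency check described above, under the assumption that the GPT-4-turbo annotations have already been parsed into Python structures; the field names are hypothetical, not the authors' schema.

```python
# Sketch of the Section 4.2 annotation consistency check: an example is sent for
# re-annotation when the response-level label says "supported" but some response
# sentence has no supporting context sentences. Field names are illustrative.
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceLabel:
    text: str
    supporting_context_ids: List[int]   # empty list -> no supporting context sentence
    generally_supported: bool = False   # transition sentences, "not enough info", etc.


def needs_reannotation(response_supported: bool,
                       sentence_labels: List[SentenceLabel]) -> bool:
    """Flag conflicts between the response-level and sentence-level annotations."""
    unsupported = [
        s for s in sentence_labels
        if not s.supporting_context_ids and not s.generally_supported
    ]
    # Conflict: the whole response is marked supported, yet a claim has no support.
    return response_supported and len(unsupported) > 0
```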

5 Evaluation

5.1 Datasets

We evaluate Luna on a combination of existing academic benchmarks (RAGTruth) and real-world RAG data.

RAGTruth  RAGTruth is an expert-annotated corpus of 18k RAG examples with LLM-generated responses. The data are split into three RAG task types: Question Answering (QA), Data-to-text Writing, and News Summarization. Since Luna is only trained on QA RAG examples, we use this benchmark to evaluate our model's generalization to other RAG task types.


RAG QA Test Set  We also evaluate Luna on a held-out split of our RAG QA dataset (Section 4.1). This serves as an in-domain test set for evaluating Luna performance across industry verticals.

5.2 Baselines

Zero-shot prompting  We evaluate GPT-3.5-turbo and GPT-4-turbo models from OpenAI as baselines. We prompt the LLMs to return an example-level boolean indicating whether or not a RAG response is supported by the associated RAG context. For RAGTruth we also include all baselines reported in the original paper.

Ensemble prompting  LLM ensembles have been shown to outperform single model judges by eliminating bias (Friel and Sanyal, 2023; Verga et al., 2024). We leverage ChainPoll (Friel and Sanyal, 2023) with a chain-of-thought prompt for a stronger GPT-3.5-turbo baseline.

RAG Evaluation Frameworks  We evaluate two commercial RAG evaluation frameworks: RAGAS (v0.1.7) (Es et al., 2024) and Trulens (v0.13.4). We report the RAGAS Faithfulness and Trulens Groundedness metrics, which are designed for hallucination detection.

5.3 Metrics

For comparison with RAGTruth baselines, we report best Precision, Recall, and F1 scores on RAGTruth. We tune model output probability thresholds for the best overall F1 and report all metrics at this optimal threshold. For other benchmarks, we report the area under the ROC curve (AUROC), which we consider a more informative metric that circumvents the need for threshold tuning.
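As a concrete illustration of the metric computation in Section 5.3, here is a minimal sketch using scikit-learn: tune a probability threshold for best F1 on one benchmark, and report threshold-free AUROC on the others. This is an illustration only, not the authors' evaluation harness, and the toy labels are made up.

```python
# Sketch of Sec. 5.3 metrics: best-F1 threshold tuning and AUROC.
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

def best_f1_threshold(y_true: np.ndarray, hallucination_probs: np.ndarray):
    """Return (threshold, precision, recall, f1) that maximizes F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, hallucination_probs)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    best = np.argmax(f1[:-1])  # the last PR point has no associated threshold
    return thresholds[best], precision[best], recall[best], f1[best]

# Toy example (1 = hallucinated response, 0 = supported response).
y_true = np.array([0, 1, 0, 1, 1, 0])
probs = np.array([0.1, 0.8, 0.3, 0.6, 0.9, 0.2])
thr, p, r, f1 = best_f1_threshold(y_true, probs)
auroc = roc_auc_score(y_true, probs)
print(f"threshold={thr:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f} auroc={auroc:.2f}")
```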

Method                        Question Answering    Data-to-Text Writing   Summarization        Overall
Prompt gpt-3.5-turbo†         18.8 / 84.4 / 30.8    65.1 / 95.5 / 77.4     23.4 / 89.2 / 37.1   37.1 / 92.3 / 52.9
Prompt gpt-4-turbo†           33.2 / 90.6 / 45.6    64.3 / 100.0 / 78.3    31.5 / 97.6 / 47.6   46.9 / 97.9 / 63.4
SelfCheckGPT gpt-3.5-turbo†   35.0 / 58.0 / 43.7    68.2 / 82.8 / 74.8     31.1 / 56.5 / 40.1   49.7 / 71.9 / 58.8
LMvLM gpt-4-turbo†            18.7 / 76.9 / 30.1    68.0 / 76.7 / 72.1     23.2 / 81.9 / 36.2   36.2 / 77.8 / 49.4
Finetuned Llama-2-13B†        61.6 / 76.3 / 68.2    85.4 / 91.0 / 88.1     64.0 / 54.9 / 59.1   76.9 / 80.7 / 78.7
ChainPoll gpt-3.5-turbo       33.5 / 51.3 / 40.5    84.6 / 35.1 / 49.6     45.8 / 48.0 / 46.9   54.8 / 40.6 / 46.7
RAGAS Faithfulness            31.2 / 41.9 / 35.7    79.2 / 50.8 / 61.9     64.2 / 29.9 / 40.8   62.0 / 44.8 / 52.0
Trulens Groundedness          22.8 / 92.5 / 36.6    66.9 / 96.5 / 79.0     40.2 / 50.0 / 44.5   46.5 / 85.8 / 60.4
Luna                          37.8 / 80.0 / 51.3    64.9 / 91.2 / 75.9     40.0 / 76.5 / 52.5   52.7 / 86.1 / 65.4

Table 2: Response-level results on the RAGTruth hallucination prediction task; each cell reports Precision / Recall / F1. Luna is compared against RAGTruth baselines reported in Wu et al. (2023) (rows marked with †), as well as our own baselines. RAGAS and Trulens are evaluation frameworks that query GPT-3.5-turbo for hallucination detection. ChainPoll is our gpt-3.5-turbo ensemble prompt baseline. ChainPoll, RAGAS, Trulens, and Luna probability thresholds were tuned for best Overall F1. Luna outperforms all prompt-based approaches and narrows the gap between the other baselines and the 13B fine-tuned Llama, at a fraction of the cost.

Method                    Customer Support   Financial Reasoning   General Knowledge   Legal   Biomed   Overall
GPT-4-turbo annotator     1.0                1.0                   1.0                 1.0     1.0      1.0
Prompt gpt-3.5-turbo      0.68               0.67                  0.67                0.63    0.64     0.66
ChainPoll gpt-3.5-turbo   0.76               0.74                  0.75                0.71    0.71     0.74
RAGAS Faithfulness        0.62               0.60                  0.60                0.58    0.54     0.61
Trulens Groundedness      0.56               0.56                  0.65                0.34    0.68     0.56
Luna (in-domain)          0.76               0.82                  0.81                0.78    0.83     0.80
Luna (OOD)                0.74               0.64                  -                   0.79    -        -

Table 3: AUROC on the hallucination detection task on the RAG QA test set. Luna (in-domain) is our model trained on the combined train splits from each domain. Luna (OOD) is the same model trained on a subset of the General Knowledge and Biomed domains; cells for its training domains are left blank.

6 Results

On the RAGTruth dataset, Luna outperforms all prompt-based approaches on the QA and Summarization tasks, and is competitive with GPT-3.5 evaluators on the Data-to-Text Writing task (Table 2). Overall, Luna is second only to the finetuned Llama-2-13B, which is expected given the significant difference in size between the two models (440M vs 13B). Note that the Llama-2-13B baseline was trained on a subset of RAGTruth, while Luna was trained on a QA-only dataset with a different data distribution. Nevertheless, we find that Luna generalizes well to the out-of-domain task types. Additionally, the gains in cost and inference speed we achieve with the lightweight Luna model (Sections 7.2, 7.3) offset the performance gap.

Results on the RAG QA test set are reported in Table 3 and follow a similar pattern. Luna outperforms the baselines across all verticals. We also evaluate the model's cross-domain generalization by training on a subset of the General Knowledge and Biomedical domains and evaluating on the others. We refer to this model as Luna (OOD). We find that Luna (OOD) still outperforms most baselines on the out-of-domain subsets. However, generalization to the Financial Reasoning domain is weak. Examples in this domain require reasoning over tabular data, which Luna (OOD) never observes in training. Fine-tuning on the Financial Reasoning domain greatly boosts performance, increasing AUROC from 0.64 to 0.82.

7 Discussion

7.1 Long Context Hallucination Detection

In Table 4 we report Luna's performance against baselines on a range of RAG context lengths. For this analysis we sample data from CUAD (Hendrycks et al., 2021), one of the RAG QA component datasets, which passes full-length legal contracts as context inputs into RAG. This dataset contains the largest range of context lengths in RAG QA. We find that the performance of all models inversely correlates with context length. However, while the GPT-3.5-powered baselines fail completely at the GPT-3.5 context limit (16k tokens), Luna maintains 68% of its performance on that subset.

To validate the efficacy of our span-level prediction and long context chunking approach (Section 3.2), we run an ablation study comparing our best model to a version of Luna that makes example-level predictions, referred to as Luna (example-level) in Table 4. As shown in Figure 3, we expect Luna (example-level) to perform worse on long contexts. Our findings confirm this hypothesis: although the hallucination detection performance of both models degrades with increasing context length, Luna (example-level) exhibits a greater degradation than Luna.

Method                    0-5k (223 examples)   5k-16k (209 examples)   16k+ (78 examples)
Prompt gpt-3.5-turbo      0                     -12.11%                 -100%
ChainPoll gpt-3.5-turbo   0                     -8.97%                  -100%
RAGAS Faithfulness        0                     -4.36%                  -100%
Trulens Groundedness      0                     -6.38%                  -100%
Luna                      0                     -12.55%                 -31.98%
Luna (example-level)      0                     -21.44%                 -43.75%

Table 4: Relative hallucination detection performance of various models on short (0-5k), medium (5k-16k), and long (16k+) context lengths; the number of test examples in each bucket is given in parentheses. Luna is our best finetuned DeBERTa-large model, and Luna (example-level) is a version of Luna that makes hallucination predictions at the example level. All GPT-3.5-based baselines (including RAGAS and Trulens) fail on input lengths >16k, while Luna maintains 88% and 68% of its performance on medium (5k-16k) and long (16k+) context lengths, respectively. Luna (example-level) also struggles more with long context lengths than Luna.

7.2 Cost vs Accuracy Trade-offs

API-based hallucination detection methods accrue substantial costs if used continuously in production settings. Luna outperforms GPT-3.5-based approaches while operating at a fraction of the cost. In Figure 1 we illustrate the trade-off between monthly maintenance costs and accuracy for Luna versus our GPT-3.5-based baselines. Costs are estimated assuming an average throughput of 10 queries per second, with an average query length of 4000 tokens. We use OpenAI API (https://openai.com/api/pricing/) and AWS cloud (https://aws.amazon.com/ec2/pricing/on-demand/) pricing at the time of writing. Detailed cost calculations can be found in Appendix B. Although we do not explicitly compare pricing against larger fine-tuned models such as Llama-2-13B, we note that hosting a multi-billion parameter model demands substantially more compute resources than Luna, which would be reflected in the overall cost.

7.3 Latency Optimizations

We optimize Luna and its deployment architecture to process up to 16k input tokens in under one second on an NVIDIA L4 GPU. To achieve this, we deploy an ONNX-traced model on NVIDIA Triton server with a TensorRT backend. We leverage Triton's Business Logic Scripting (BLS) to optimize the data flow and orchestration between GPU and CPU resources. BLS intelligently allocates resources based on the specific requirements of each inference request, ensuring that both GPU and CPU are utilized effectively and that neither resource becomes a bottleneck. We also tune our inference model's maximum input length for optimal performance. While increasing the maximum sequence length would reduce the size and number of batches processed by the model (see Section 3.2), transformer layer computational complexity also scales quadratically with input length. We determine a token length of 512 to be the most effective. Finally, we optimize pre- and post-processing Python code for maximum efficiency. Table 5 in the Appendix details the latency reductions achieved at each optimization step.
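As a rough illustration of the first optimization step (an ONNX-traced encoder with a 512-token maximum length), here is a generic export of the public DeBERTa-v3-large NLI checkpoint referenced in Section 3. This is a sketch of the general technique under those assumptions, not the authors' deployment pipeline.

```python
# Rough sketch: trace a DeBERTa-v3-large classifier to ONNX with max length 512,
# mirroring the first step of the Section 7.3 latency optimizations. The checkpoint
# is the public NLI base model from Section 3; this is not the authors' code.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CKPT = "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT)
model.config.return_dict = False  # return plain tuples so tracing stays simple
model.eval()

# Dummy (premise, hypothesis) pair standing in for a (context window, question+response) input.
enc = tokenizer("context window", "question and response", return_tensors="pt",
                padding="max_length", truncation=True, max_length=512)

torch.onnx.export(
    model,
    (enc["input_ids"], enc["attention_mask"]),
    "encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"},
                  "logits": {0: "batch"}},
    opset_version=17,
)
# The exported graph can then be compiled with TensorRT and served behind Triton,
# as described in the paper.
```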

8 Conclusion

In this work we introduced Luna: a cost-effective hallucination detection model with millisecond inference speed. Luna eliminates the dependency on slow and expensive 3rd-party API calls, and enables practitioners to effectively address hallucinations in production. The proposed model can be hosted on a local GPU, guaranteeing privacy that 3rd-party APIs cannot.

8.1 Limitations

Closed Domain Hallucinations  Luna's efficacy is limited to closed-domain hallucination detection in RAG settings. Due to its size, Luna lacks the necessary world knowledge to detect open-domain hallucinations. For open-domain applications, Luna relies on a high-quality RAG retriever to provide the necessary context knowledge for an input query.

LLM Annotations  LLMs' remarkable zero-shot abilities have encouraged researchers to consider LLMs for annotation and synthetic data generation. Replacing human annotators with LLMs offers substantial efficiency and cost savings (Wang et al., 2021). However, LLM performance on various annotation tasks is still controversial, with some studies reporting high correlations between LLM and human judgements (Chiang and Lee, 2023; He et al., 2024; Verga et al., 2024), while others advise caution (Li et al., 2023; Wang et al., 2024). In this work, we recognize the potential noise and bias introduced into our training and evaluation data by automated GPT-4-turbo annotations. We hypothesize that our model derives greater advantages from training on a large-scale dataset, facilitated by low-cost LLM annotation, than it is hindered by potential noise within the data. After taking steps to ensure annotation quality (Section 4.2), we observe competitive performance on RAGTruth, a human-annotated benchmark, in Section 6. This evaluation provides external validation for our model outputs, although we acknowledge that performance could potentially be enhanced with higher quality annotation sources.

Sentence-level annotations  Luna is trained on sentence-level annotations, i.e., there is an assumption that a sentence is either supported or not supported. This is most often the case, but future work can explore token-level labels for compound sentences with partially supported claims.

8.2 Future Work

Hallucinations in RAG output highlight weaknesses of the generator model. However, it is equally important to consider the quality of the retriever and its contribution to the overall performance of a RAG system. A sub-optimal retriever may supply irrelevant context to the generator, making it difficult for the generator to produce an accurate response. A comprehensive RAG evaluation model should therefore assess all dimensions of the RAG system. To this end, metrics like context relevance have been explored to assess the quality of retrieved RAG contexts (Es et al., 2024; Saad-Falcon et al., 2024). In future work, we propose to leverage Luna for measuring a comprehensive suite of RAG metrics. One cost-effective approach could be to augment the current DeBERTa architecture with additional prediction heads that output multiple metrics in one forward pass. We hypothesize that the shared weights of the base encoder layers may enhance the performance of each head.

References

Ayush Agrawal, Mirac Suzgun, Lester Mackey, and Adam Kalai. 2024. Do language models know when they’re hallucinating references? In Findings of the Association for Computational Linguistics: EACL 2024, pages 912–928, St. Julian’s, Malta. Association for Computational Linguistics. Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore. Association for Computational Linguistics. Steven Bird and Edward Loper. 2004. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 214–217, Barcelona, Spain. Association for Computational Linguistics. Bernd Bohnet, Vinh Q. Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Massimiliano Ciaramita, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, Tom Kwiatkowski, Ji Ma, Jianmo Ni, Lierni Sestorain Saralegui, Tal Schuster, William W. Cohen, Michael Collins, Dipanjan


Das, Donald Metzler, Slav Petrov, and Kellie Webster. 2023. Attributed question answering: Evaluation and modeling for attributed large language models. Preprint, arXiv:2212.08037.

Nouha Dziri, Hannah Rashkin, Tal Linzen, and David Reitter. 2022b. Evaluating attribution in dialogue systems: The BEGIN benchmark. Transactions of the Association for Computational Linguistics, 10:1066– 1083.

Meng Cao, Yue Dong, and Jackie Cheung. 2022. Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3340–3354, Dublin, Ireland. Association for Computational Linguistics.

Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. 2024. RAGAs: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, St. Julians, Malta. Association for Computational Linguistics.

Vittorio Castelli, Rishav Chakravarti, Saswati Dana, Anthony Ferritto, Radu Florian, Martin Franz, Dinesh Garg, Dinesh Khandelwal, Scott McCarley, Michael McCawley, Mohamed Nasr, Lin Pan, Cezar Pendus, John Pitrelli, Saurabh Pujar, Salim Roukos, Andrzej Sakrajda, Avi Sil, Rosario Uceda-Sosa, Todd Ward, and Rong Zhang. 2020. The TechQA dataset. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1269– 1278, Online. Association for Computational Linguistics.

Robert Friel and Atindriyo Sanyal. 2023. Chainpoll: A high efficacy method for llm hallucination detection. Preprint, arXiv:2310.18344. Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. 2023. RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16477–16508, Toronto, Canada. Association for Computational Linguistics.

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. 2021. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. 2024. Does fine-tuning llms on new knowledge encourage hallucinations? Preprint, arXiv:2405.05904. Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023. DeBERTav3: Improving deBERTa using ELECTRAstyle pre-training with gradient-disentangled embedding sharing. In The Eleventh International Conference on Learning Representations.

Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations.

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot arena: An open platform for evaluating llms by human preference. Preprint, arXiv:2403.04132.

Xingwei He, Zhenghao Lin, Yeyun Gong, A-Long Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, and Weizhu Chen. 2024. Annollm: Making large language models to be better crowdsourced annotators. Preprint, arXiv:2303.16854.

Souvik Das, Sougata Saha, and Rohini Srihari. 2022. Diving deep into modes of fact hallucinations in dialogue systems. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 684–699, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. 2021. Cuad: An expert-annotated nlp dataset for legal contract review. NeurIPS. Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, and Yossi Matias. 2022. TRUE: Re-evaluating factual consistency evaluation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3905–3920, Seattle, United States. Association for Computational Linguistics.

Nouha Dziri, Sivan Milton, Mo Yu, Osmar Zaiane, and Siva Reddy. 2022a. On the origin of hallucinations in conversational models: Is it the datasets or the models? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5271–5285, Seattle, United States. Association for Computational Linguistics.


Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12).

Yifei Li, Xiang Yue, Zeyi Liao, and Huan Sun. 2024. Attributionbench: How hard is automatic attribution evaluation? arXiv preprint arXiv:2402.15089v1. Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, and Ming Yin. 2023. Synthetic data generation with large language models for text classification: Potential and limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10443–10461, Singapore. Association for Computational Linguistics.

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567– 2577, Hong Kong, China. Association for Computational Linguistics.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics.

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, and Jared Kaplan. 2022. Language models (mostly) know what they know. Preprint, arXiv:2207.05221.

Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, and Daniel E. Ho. 2024. Hallucination-free? assessing the reliability of leading ai legal research tools. Preprint, arXiv:2405.20362.

Ehsan Kamalloo, Aref Jafari, Xinyu Zhang, Nandan Thakur, and Jimmy Lin. 2023. HAGRID: A humanllm collaborative dataset for generative informationseeking with attribution. arXiv:2307.16883.

Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, and Dan Roth. 2024. Expertqa: Expert-curated questions and attributed answers. Preprint, arXiv:2309.07852.

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. 2024. Prometheus 2: An open source language model specialized in evaluating other language models. Preprint, arXiv:2405.01535.

Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore. Association for Computational Linguistics.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts. Preprint, arXiv:2307.03172.

Moritz Laurer, Wouter van Atteveldt, Andreu Casas, and Kasper Welbers. 2022. Less annotating, more classifying – addressing the data scarcity issue of supervised machine learning with deep transfer learning and bert - nli. Open Science Framework Preprint.

Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, and Mark Steedman. 2023. Sources of hallucination by large language models on inference tasks. In The 2023 Conference on Empirical Methods in Natural Language Processing.

Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale N Fung, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Factuality enhanced language models for open-ended text generation. In Advances in Neural Information Processing Systems, volume 35, pages 34586–34599. Curran Associates, Inc.

Timo Möller, Anthony Reina, Raghavan Jayakumar, and Malte Pietsch. 2020. COVID-QA: A question answering dataset for COVID-19. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online. Association for Computational Linguistics.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledgeintensive nlp tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459– 9474. Curran Associates, Inc.

Benjamin Muller, John Wieting, Jonathan Clark, Tom Kwiatkowski, Sebastian Ruder, Livio Soares, Roee Aharoni, Jonathan Herzig, and Xinyi Wang. 2023. Evaluating and modeling attribution for cross-lingual question answering. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 144–157, Singapore. Association for Computational Linguistics.


Abhilash Nandy, Soumya Sharma, Shubham Maddhashiya, Kapil Sachdeva, Pawan Goyal, and NIloy Ganguly. 2021. Question answering over electronic devices: A new benchmark dataset and a multi-task learning based QA framework. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4600–4609, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. 2024. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. Preprint, arXiv:2404.18796. Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. 2023. Freshllms: Refreshing large language models with search engine augmentation. Preprint, arXiv:2310.03214.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human generated machine reading comprehension dataset.

Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want to reduce labeling cost? GPT-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4195–4205, Punta Cana, Dominican Republic. Association for Computational Linguistics.

OpenAI. 2023. https://openai.com. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc.

Zengzhi Wang, Qiming Xie, Yi Feng, Zixiang Ding, Zinong Yang, and Rui Xia. 2024. Is chatgpt a good sentiment analyzer? a preliminary study. Preprint, arXiv:2304.04339. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.

Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. 2023. Measuring attribution in natural language generation models. Computational Linguistics, 49(4):777–840.

Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Cheng Niu, Randy Zhong, Juntong Song, and Tong Zhang. 2023. Ragtruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. Preprint, arXiv:2401.00396.

Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325, Online. Association for Computational Linguistics.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. Ares: An automated evaluation framework for retrieval-augmented generation systems. Preprint, arXiv:2311.09476.

Xiang Yue, Boshi Wang, Ziru Chen, Kai Zhang, Yu Su, and Huan Sun. 2023. Automatic evaluation of attribution by large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4615–4635, Singapore. Association for Computational Linguistics.

Mobashir Sadat, Zhengyu Zhou, Lukas Lange, Jun Araki, Arsalan Gundroo, Bingqing Wang, Rakesh Menon, Md Parvez, and Zhe Feng. 2023. Delucionqa: Detecting hallucinations in domain-specific question answering. pages 822–835.

Hanning Zhang, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. 2024. R-tuning: Instructing large language models to say ‘i don’t know’. Preprint, arXiv:2311.09677.

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. Preprint, arXiv:2303.18223.

Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. 2023. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation. Preprint, arXiv:2307.03987.


Shen Zheng, Jie Huang, and Kevin Chang. 2024. Why does chatGPT fall short in providing truthful answers? In I Can’t Believe It’s Not Better Workshop: Failure Modes in the Age of Foundation Models.

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3277–3287, Online. Association for Computational Linguistics.

A Response Generation Prompt

We use the following prompt template to generate LLM responses for each sample in our QA RAG dataset. Context documents, separated by line breaks, along with the question are slotted in for each generation sample.

Use the following pieces of context to answer the question.
{documents}
Question: {question}

B Cost Calculations

Costs are estimated assuming an average throughput of 10 queries per second (qps), with an average RAG query length of 4000 tokens, and NVIDIA L4 GPU deployment hardware. When estimating LLM cost for >1 qps we assume concurrency is implemented to process multiple queries in parallel.

Luna Costs  Empirically, we find that each L4 GPU can serve up to 4 qps. At the time of writing, the monthly cost of running a g6.2xlarge GPU instance on AWS cloud is $700 (https://aws.amazon.com/ec2/pricing/on-demand/). Thus, we estimate the total monthly cost for 10 qps throughput as:

$700 * 10/4 = $1750    (6)

OpenAI Costs  At the time of writing, querying GPT-3.5-turbo through the OpenAI API costs $0.50 / 1M input tokens and $1.50 / 1M output tokens (https://openai.com/api/pricing/). In our test set, we observe an average output length from GPT-3.5 of 200 tokens. Using an average input length of 4000 tokens, the cost of a single query is roughly:

(4k * $0.5 + 200 * $1.5) / 1M = $0.0023    (7)

Using 2,592,000 seconds/month, the monthly cost of serving 10 qps with GPT-3.5 is:

10 qps * 2,592,000 * $0.0023 = $59,616    (8)

With the ChainPoll ensemble, we request 3 outputs per query, bringing the cost of a single query up to:

(4k * $0.5 + 3 * 200 * $1.5) / 1M = $0.0029    (9)

and the total monthly cost for 10 qps to:

10 qps * 2,592,000 * $0.0029 = $75,168    (10)

RAGAS Costs  RAGAS makes 2 OpenAI API calls per input RAG example. The first query extracts a list of claims from the response. The second requests the LLM to evaluate the faithfulness of each extracted claim to the RAG context. We estimate that the output length of the first query is roughly equal to the length of the RAG response, and the output length of the second query is roughly 3x the length of the response, since it includes the original claims followed by a faithfulness score and an explanation. Factoring in the overhead token length of each prompt, we calculate the cost per query to be:

Query1 = $380 / 1M    (11)
Query2 = $2730 / 1M    (12)

Then, the monthly cost of serving 10 qps is:

10 qps * 2,592,000 * ($380 + $2730) / 1M = $79,937    (13)

Trulens Costs  Trulens makes 1 OpenAI API call per sentence in the response. For this calculation, we estimate 3 sentences per response, which aligns with our observations on the QA RAG dataset. Each query returns the original sentence, a groundedness score (1-10), and an explanation. Here we assume that the token length of the explanation is roughly equal to the token length of the input sentence. The cost of a single query is roughly:

(4k * $0.5 + 2 * 75 * $1.5) / 1M = $0.0022    (14)

Using 2,592,000 seconds/month, the monthly cost of serving 10 qps with Trulens is:

10 qps * 2,592,000 * 3 * $0.0022 = $173,016    (15)
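For readers who want to rerun the back-of-the-envelope numbers, here is a small helper reproducing the GPT-3.5 and ChainPoll estimates above (equations 7-10); the prices are the per-token rates quoted in this appendix, not live API pricing.

```python
# Reproduce the Appendix B estimates for a single GPT-3.5-turbo judge (Eqs. 7-8)
# and the 3-output ChainPoll ensemble (Eqs. 9-10), using the rates quoted above.
SECONDS_PER_MONTH = 2_592_000
QPS = 10
INPUT_TOKENS = 4_000        # average RAG query length
OUTPUT_TOKENS = 200         # average GPT-3.5 output length observed in the test set
INPUT_PRICE = 0.50 / 1e6    # $ per input token
OUTPUT_PRICE = 1.50 / 1e6   # $ per output token

def monthly_cost(n_sampled_outputs: int) -> float:
    # Per the formulas above, the input is charged once per query and the
    # output is charged once per sampled completion.
    per_query = INPUT_TOKENS * INPUT_PRICE + n_sampled_outputs * OUTPUT_TOKENS * OUTPUT_PRICE
    return QPS * SECONDS_PER_MONTH * per_query

print(f"GPT-3.5 judge:      ${monthly_cost(1):,.0f} / month")   # ~$59,616
print(f"ChainPoll ensemble: ${monthly_cost(3):,.0f} / month")   # ~$75,168
```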

C Latency Optimizations

We optimize Luna and its deployment architecture to process up to 16k input tokens in under one second on an NVIDIA L4 GPU. Table 5 details the latency reductions and how they were achieved.

Optimization                               s/16k
baseline                                   3.27
TensorRT backend                           2.09
efficient pre- and post-processing code    1.79
512 max model length                       0.98
BLS                                        0.92

Table 5: Impact of latency optimizations on Luna inference speed. Reporting inference speed in seconds for processing 16k input tokens.

D Latency Comparison

We empirically estimate the latency of Luna and each baseline model. Luna latency is discussed in Appendix C. For LLM models that query the OpenAI API, we calculate the average latency per query after querying the API multiple times with an input of 4000 tokens, split between 3800 tokens for the context, 25 tokens for the question, and 75 tokens for the response.

Model           s/4k   %change
Luna            0.23   -
GPT-3.5         2.5    -91%
ChainPoll n=3   3.0    -93%
Trulens         3.4    -93%
RAGAS           5.4    -96%

Table 6: Model latency (in seconds), comparing Luna to LLM baselines. We also report the % difference between Luna and the LLM-based models.
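A rough sketch of how one might time a single evaluator call, as in the Appendix D measurement, assuming an OpenAI-compatible client; the prompt and model name are placeholders, and the numbers in Table 6 come from the authors' own measurements.

```python
# Time one hallucination-check call against an OpenAI-compatible endpoint,
# roughly mirroring the per-query latency measurement in Appendix D.
# The prompt builder and model name are placeholders, not the authors' exact setup.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def time_one_call(prompt: str, model: str = "gpt-3.5-turbo") -> float:
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start

# Example usage (build_prompt is a hypothetical helper that assembles the
# ~3800-token context, 25-token question, and 75-token response into one prompt):
# latencies = [time_one_call(build_prompt(context, question, response)) for _ in range(20)]
# print(sum(latencies) / len(latencies))
```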

First question:

Can you give me a very clear explanation of the core assertions, implications, and mechanics elucidated in this paper?


ASSISTANT

Certainly. Here is a clear explanation of the core assertions, implications, and mechanics described in this paper:

Core Assertions:

  1. Current hallucination detection techniques for retrieval-augmented generation (RAG) systems fail to simultaneously deliver accuracy, low latency, and low cost.

  2. Luna, a lightweight DeBERTa-large encoder fine-tuned for hallucination detection, outperforms larger language models and commercial evaluation frameworks on this task.

  3. Luna generalizes well across multiple industry verticals and out-of-domain data.

Implications:

  1. Luna enables more reliable and cost-effective deployment of RAG systems in industry applications.

  2. The approach used to develop Luna could potentially be applied to create other lightweight, specialized models for LLM evaluation tasks.

  3. Luna's performance suggests that smaller, fine-tuned models can be competitive with larger language models for specific tasks like hallucination detection.

Mechanics:

  1. Model Architecture: Luna is based on a 440M parameter DeBERTa-large encoder.

  2. Training Data: Luna is fine-tuned on a carefully curated real-world RAG dataset covering multiple domains.

  3. Task Framing: The model is trained to identify supported tokens in the response, given a query and retrieved context.

  4. Long Context Handling: Luna uses a novel approach to handle long RAG contexts (up to 16k tokens) by making span-level predictions and aggregating them.

  5. Optimization: The model is optimized for deployment using techniques like ONNX tracing, TensorRT backend, and efficient pre/post-processing.

  6. Evaluation: Luna is benchmarked against GPT-3.5, commercial frameworks like RAGAS and Trulens, and other baselines on multiple datasets.

The paper presents Luna as a solution to the challenge of detecting hallucinations in RAG systems that balances accuracy, speed, and cost considerations for practical industry deployment.


USER

Second question:

Can you explain the value of this in basic terms? Like you're talking to a CEO. So what? What's the bottom line here?


ASSISTANT

Certainly. Here's a simple explanation of the value proposition, as if speaking to a CEO:

Bottom line:

Luna is a game-changer for businesses using AI-powered chatbots or search systems. It solves a critical problem - catching AI errors quickly and cheaply - which directly impacts customer trust and company liability.

Here's why it matters:

  1. Cost Savings: Luna cuts detection costs by 97% compared to current methods. This means significant savings on your AI operations.

  2. Speed: It's 91% faster than existing solutions. This speed allows you to catch and correct AI mistakes before customers see them, protecting your brand reputation.

  3. Accuracy: Despite being smaller and cheaper to run, Luna outperforms larger, more expensive AI models in detecting errors.

  4. Versatility: It works well across different industries and topics, so you don't need multiple specialized systems.

  5. Scalability: Luna's efficiency makes it feasible to implement widespread error checking in your AI systems without breaking the bank.

In essence, Luna offers a way to make your AI interactions more reliable and trustworthy, while simultaneously reducing operational costs and risks. It's a tool that can help maintain customer confidence in your AI-driven services while keeping your bottom line healthy.


USER

Third question:

Can you give me an analogy or metaphor that will help explain this to a broad audience?


ASSISTANT

Certainly. Here's an analogy that could help explain Luna to a broad audience:

Think of Luna like a highly efficient proofreader for AI conversations.

Imagine you're running a massive international call center. Your agents are incredibly knowledgeable and can answer questions on almost any topic, but occasionally they might mix up facts or give out incorrect information. To prevent this, you hire proofreaders to listen in on calls and catch any mistakes before they reach the customer.

Now, traditional methods of catching AI mistakes are like hiring teams of expert proofreaders for each call. They're highly skilled, but they're also expensive and work relatively slowly. Sometimes, they might catch a mistake after the customer has already heard it.

Luna, on the other hand, is like having a brilliant, lightning-fast proofreader who specializes in catching factual errors. This proofreader:

  1. Works incredibly quickly, able to check conversations in near real-time.
  2. Costs much less than the team of expert proofreaders.
  3. Is just as accurate (if not more so) at catching mistakes.
  4. Can work across multiple departments and topics without needing specialized training.

With Luna, it's as if every AI conversation has this super-efficient proofreader checking it instantly, allowing you to correct any mistakes before the customer even notices. This ensures that your AI 'call center' provides accurate information, maintains customer trust, and operates cost-effectively.

This analogy helps illustrate how Luna improves the reliability of AI systems while also addressing the practical concerns of speed and cost in a business environment.


USER

Fourth question:

Can you provide a specific example that best illustrates the key points elucidated in this paper?


ASSISTANT

The paper doesn't provide a specific example of Luna in action, but I can construct a hypothetical example based on the information provided that illustrates the key points:

Imagine a large financial services company using an AI-powered chatbot to handle customer inquiries. A customer asks:

"What's the current interest rate for a 30-year fixed mortgage?"

The AI system retrieves some context about current mortgage rates and generates a response:

"The current interest rate for a 30-year fixed mortgage is 3.5%. This rate has been stable for the past 6 months and is expected to remain low due to the Federal Reserve's monetary policy."

Now, let's see how Luna would evaluate this response:

  1. Speed: Luna can process this input (question, context, and response) in milliseconds, allowing for real-time checking before the response is sent to the customer.

  2. Accuracy: Luna analyzes the response token by token, checking if each piece of information is supported by the retrieved context. It might flag the "3.5%" as accurate if it matches the context, but could identify the statement about the rate being stable for 6 months as a potential hallucination if this information isn't present in the retrieved context.

  3. Cost-effectiveness: While larger models like GPT-3.5 might take seconds and cost more per query to perform this check, Luna does it almost instantly at a fraction of the cost.

  4. Long context handling: If the retrieved context is very long (e.g., containing detailed historical rate data), Luna can efficiently process it using its novel chunking approach.

  5. Cross-domain applicability: Although this example is in finance, Luna would perform similarly well for queries in other domains like healthcare or tech support, without needing domain-specific training.

This example illustrates how Luna can quickly and accurately detect potential hallucinations in AI-generated responses, allowing the system to either modify the response or flag it for human review before it reaches the customer. This helps maintain the accuracy and trustworthiness of the AI system while operating efficiently at scale.
