# Evaluation of LangChain Retrievers Using RAGAS Metrics and LangSmith Traces
## Dataset Evaluated

CSV-based reviews of the four movies in the John Wick franchise, used to explore the different retrieval strategies. The reviews were obtained from IMDB and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).
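A minimal loading sketch for these reviews, assuming one CSV per film; the `data/john_wick_*.csv` paths below are hypothetical, so substitute the actual filenames from the data repository:

```python
from langchain_community.document_loaders import CSVLoader

# Hypothetical paths -- replace with the actual CSVs from the AIM Data Repository.
csv_files = [f"data/john_wick_{i}.csv" for i in range(1, 5)]

docs = []
for path in csv_files:
    # Each CSV row (one review) becomes a Document; metadata records the source file.
    docs.extend(CSVLoader(file_path=path).load())

print(f"Loaded {len(docs)} review documents")
```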
## Retrievers Evaluated

The following retrieval strategies were compared (a construction sketch follows the list):

- Naive Retriever
- BM25 Retriever
- Multi-Query Retriever
- Contextual Compression Retriever
- Ensemble Retriever
- Parent Document Retriever
- Semantic Chunking
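Below is a minimal construction sketch for these seven retrievers in LangChain, assuming the review documents are loaded as `docs` (as in the loading sketch above) and that OpenAI models are available; the model names, `k` values, and chunk sizes are illustrative, and exact import paths vary across LangChain versions.

```python
from langchain_community.retrievers import BM25Retriever
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import (
    ContextualCompressionRetriever,
    EnsembleRetriever,
    ParentDocumentRetriever,
)
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.storage import InMemoryStore

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)           # illustrative model choice
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # illustrative model choice

# `docs` = the John Wick review Documents from the loading sketch above.

# 1. Naive retriever: plain similarity search over a dense vector store.
vectorstore = InMemoryVectorStore.from_documents(docs, embeddings)
naive_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# 2. BM25: lexical (keyword) retrieval, no embeddings required.
bm25_retriever = BM25Retriever.from_documents(docs)

# 3. Multi-Query: the LLM rewrites the question into several variants and merges results.
multi_query_retriever = MultiQueryRetriever.from_llm(retriever=naive_retriever, llm=llm)

# 4. Contextual Compression: an LLM extractor trims retrieved chunks to the relevant parts.
compression_retriever = ContextualCompressionRetriever(
    base_compressor=LLMChainExtractor.from_llm(llm),
    base_retriever=naive_retriever,
)

# 5. Parent Document: search over small child chunks, return the larger parent documents.
parent_retriever = ParentDocumentRetriever(
    vectorstore=InMemoryVectorStore(embeddings),
    docstore=InMemoryStore(),
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=200),
)
parent_retriever.add_documents(docs)

# 6. Semantic chunking: split on embedding-similarity boundaries, then retrieve as usual.
semantic_chunks = SemanticChunker(embeddings).split_documents(docs)
semantic_retriever = InMemoryVectorStore.from_documents(
    semantic_chunks, embeddings
).as_retriever(search_kwargs={"k": 10})

# 7. Ensemble: rank-fusion of the lexical and dense retrievers.
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, naive_retriever], weights=[0.5, 0.5]
)
```

Each retriever can then be dropped into the same RAG chain so that only the retrieval step changes between runs.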
## RAGAS Metrics Evaluation

Key findings (an evaluation sketch follows the recommendations below):

- Ensemble achieves perfect context_recall (1.0000) and the highest faithfulness (0.9007)
- Naive leads in factual_correctness (0.5950), just ahead of Contextual Compression (0.5908)
- Parent Document has the highest answer_relevancy (0.9651), significantly outperforming the others
- Multi-Query has the best context_entity_recall (0.5740) and noise_sensitivity (0.3871)
- Parent Document has the lowest context_recall (0.5014) despite its highest answer_relevancy
- The noise_sensitivity value is missing for the Ensemble retriever
Retriever Recommendations:

- For maximum context inclusion: Ensemble (perfect recall)
- For factual answers: Naive or Contextual Compression
- For highly relevant answers: Parent Document
- Most balanced overall: Multi-Query (strong in most categories)
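The scores behind these findings come from a RAGAS evaluation run. Here is a minimal sketch of such a run, assuming the RAGAS 0.2-style API and that each retriever's RAG chain has already produced answers and retrieved contexts for the test questions; the sample row contents and model names are illustrative only.

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import EvaluationDataset, evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    ContextEntityRecall,
    FactualCorrectness,
    Faithfulness,
    LLMContextRecall,
    NoiseSensitivity,
    ResponseRelevancy,
)

# One row per test question: the question, the retriever's contexts,
# the generated answer, and a reference (ground-truth) answer.
rows = [
    {
        "user_input": "How do reviewers describe John Wick's fighting style?",  # illustrative
        "retrieved_contexts": ["...review chunks returned by the retriever..."],
        "response": "...answer generated by the RAG chain...",
        "reference": "...ground-truth answer...",
    },
    # ...one row per evaluation question...
]
dataset = EvaluationDataset.from_list(rows)

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

result = evaluate(
    dataset=dataset,
    metrics=[
        LLMContextRecall(),      # context_recall
        Faithfulness(),          # faithfulness
        FactualCorrectness(),    # factual_correctness
        ResponseRelevancy(),     # answer_relevancy
        ContextEntityRecall(),   # context_entity_recall
        NoiseSensitivity(),      # noise_sensitivity
    ],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)
print(result)            # per-metric averages for this retriever
df = result.to_pandas()  # per-question scores, useful when comparing retrievers
```

Running this once per retriever and collecting the averages yields the comparison summarized above.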
## LangSmith Traces

This table, obtained from the LangSmith traces, shows the performance metrics for the different retrieval methods, including:

- Total number of runs
- Total tokens processed
- Median tokens per run
- P50 latency (median response time)
- P99 latency (99th-percentile response time)
| Retriever | Total Runs | Total Tokens | Median Tokens | P50 Latency | P99 Latency |
|-----------|------------|--------------|---------------|-------------|-------------|
| Ensemble | 12 | 71,271 | 6,029 | 5.45 sec | 8.49 sec |
| Parent Document | 12 | 9,919 | 684 | 2.48 sec | 4.78 sec |
| Naive Retrieval | 12 | 45,197 | 3,855 | 3.01 sec | 5.00 sec |
| BM25 | 12 | 19,458 | 1,636 | 2.78 sec | 6.28 sec |
| Multi-Query Retrieval | 12 | 63,345 | 5,400 | 5.68 sec | 7.86 sec |
| Contextual Compression | 12 | 15,137 | 1,275 | 2.90 sec | 4.28 sec |
| Semantic Retrieval | 12 | 39,358 | 3,189 | 3.77 sec | 5.82 sec |
**Key Takeaways:**

- **Total tokens influence latency:** There is a general trend that as the total number of tokens processed by a retriever increases, both the median (P50) and 99th-percentile (P99) latencies increase as well. For example, Ensemble and Multi-Query Retrieval, which have the highest total token counts (71,271 and 63,345 respectively), also exhibit some of the highest P50 latencies (5.45 sec and 5.68 sec) and P99 latencies (8.49 sec and 7.86 sec).
- **Retrieval method impacts efficiency:** Even with similar total token counts, different retrieval methods can show different latency profiles. For instance, Naive Retrieval (45,197 total tokens) has a P50 latency of 3.01 sec and a P99 latency of 5.00 sec, while Semantic Retrieval (39,358 total tokens) has a P50 latency of 3.77 sec and a P99 latency of 5.82 sec. This suggests that the underlying retrieval algorithm plays a significant role in determining latency.
- **Parent Document is efficient:** The Parent Document retriever stands out for its efficiency, processing the fewest total tokens (9,919) while achieving the lowest P50 latency (2.48 sec) and P99 latency (4.78 sec), making it both quicker and lighter on data than the other retrievers in this dataset.
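The run counts, token totals, and latency percentiles above can be pulled programmatically from LangSmith. The following is a minimal sketch, assuming each retriever's chain was traced to its own LangSmith project; the project names below are hypothetical, and the P99 figure here is a simple index-based approximation.

```python
import statistics
from langsmith import Client

client = Client()  # reads the LangSmith API key from the environment

# Hypothetical project names -- one tracing project per retriever chain.
projects = ["ensemble", "parent-document", "naive", "bm25",
            "multi-query", "contextual-compression", "semantic"]

for project in projects:
    # Root runs only, so each invocation of the RAG chain counts once.
    runs = list(client.list_runs(project_name=project, is_root=True))
    if not runs:
        continue

    tokens = [run.total_tokens or 0 for run in runs]
    latencies = sorted(
        (run.end_time - run.start_time).total_seconds()
        for run in runs
        if run.end_time is not None
    )

    p50 = statistics.median(latencies)
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]

    print(f"{project}: runs={len(runs)} total_tokens={sum(tokens)} "
          f"median_tokens={statistics.median(tokens):.0f} "
          f"p50={p50:.2f}s p99={p99:.2f}s")
```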
## Comprehensive Retriever Analysis: RAGAS Metrics & LangSmith Traces

Based on the RAGAS metrics and LangSmith traces, here's a summary of each retriever's strengths and weaknesses:
### Ensemble Retriever

- **Strengths:** Perfect context recall (1.0), highest faithfulness (0.9007), excellent overall performance
- **Weaknesses:** Missing noise-sensitivity data, potentially higher computational cost
- **Best for:** Critical applications where comprehensive context retrieval and factual accuracy are paramount
### Multi-Query Retriever

- **Strengths:** Excellent context recall (0.95), strong faithfulness (0.8978), best entity recall (0.5740)
- **Weaknesses:** Higher noise sensitivity (0.3871), moderate factual correctness
- **Best for:** Complex queries requiring multiple perspectives, especially when entity recognition is important
### Naive Retriever

- **Strengths:** Leading in factual correctness (0.5950), good balance of metrics
- **Weaknesses:** Not exceptional in any single area besides factual correctness
- **Best for:** General-purpose applications where balanced performance is needed
### Semantic Retriever

- **Strengths:** Strong context recall (0.8556), good faithfulness (0.8692)
- **Weaknesses:** Moderate factual correctness, higher noise sensitivity
- **Best for:** Queries requiring semantic understanding rather than keyword matching
### Contextual Compression Retriever

- **Strengths:** Strong factual correctness (0.5908), efficient context filtering
- **Weaknesses:** Lower answer relevancy (0.8014) compared to others
- **Best for:** Applications where precision is more important than recall
### BM25 Retriever

- **Strengths:** Good factual correctness (0.5550), lower noise sensitivity than some alternatives
- **Weaknesses:** Lower context recall (0.7625) than vector-based methods
- **Best for:** Keyword-heavy queries, especially when computational efficiency matters
### Parent Document Retriever

- **Strengths:** Highest answer relevancy (0.9651), good factual correctness (0.5867)
- **Weaknesses:** Lowest context recall (0.5014)
- **Best for:** Applications where highly relevant answers matter more than comprehensive context
### Key Insights

- **Trade-offs are evident:** No single retriever excels in all metrics, highlighting the importance of choosing based on specific needs.
- **Ensemble shows the power of combination:** By combining multiple retrievers, the Ensemble approach achieves the best overall performance, particularly in context recall and faithfulness.
- **Relevancy vs. recall:** The Parent Document Retriever demonstrates that high answer relevancy can be achieved even with lower context recall.
- **Computational considerations:** While not directly measured in RAGAS, the LangSmith traces suggest varying computational demands across retrievers.
### Recommendations

- **For maximum accuracy:** Use the Ensemble Retriever when computational resources allow
- **For balanced performance:** Multi-Query or Naive Retrievers offer good all-around capabilities
- **For efficiency with good results:** BM25 provides solid performance with lower computational needs
- **For highly relevant answers:** Use the Parent Document Retriever when answer quality matters more than comprehensive context