# Evaluation of LangChain Retrievers Using RAGAS Metrics and LangSmith
## Dataset Evaluated
CSV-based reviews of the four movies in the John Wick franchise, used to explore the different retrieval strategies.
The reviews were obtained from IMDB and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).
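For reference, a minimal sketch of how review CSVs like these can be loaded into LangChain documents. The file names below are placeholders, not the repository's actual paths:

```python
from langchain_community.document_loaders import CSVLoader

# Hypothetical file names -- one CSV of IMDB reviews per John Wick movie.
csv_paths = [f"data/john_wick_{i}.csv" for i in range(1, 5)]

documents = []
for path in csv_paths:
    # Each CSV row (one review) becomes a separate Document.
    documents.extend(CSVLoader(file_path=path).load())

print(f"Loaded {len(documents)} review documents")
```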
## Retrievers Evaluated
- Naive Retriever
- BM25
- Multi-Query Retriever
- Contextual Compression Retriever
- Ensemble Retriever
- Parent Document Retriever
- Semantic Chunking (Semantic Retriever)
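As a reference point for the two baselines in this list, here is a minimal sketch of the naive (dense similarity) and BM25 (sparse keyword) retrievers over the review documents loaded above. OpenAI embeddings and an in-memory FAISS index are assumptions; the gist does not state its exact embedding model or vector store:

```python
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Naive retriever: plain dense similarity search over the review documents.
vectorstore = FAISS.from_documents(documents, OpenAIEmbeddings())
naive_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# BM25 retriever: classic sparse keyword scoring, no embeddings required.
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

docs = naive_retriever.invoke("What do reviewers think of the action scenes?")
```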
## RAGAS Metrics Evaluation
**Key Findings:**
- Ensemble achieves perfect context_recall (1.0000) and has the highest faithfulness (0.9007)
- Naive leads in factual_correctness (0.5950), just ahead of Contextual Compression (0.5908)
- Parent Document has the highest answer_relevancy (0.9651), significantly outperforming others
- Multi-Query has the best context_entity_recall (0.5740) and noise_sensitivity (0.3871)
- Parent Document has the lowest context_recall (0.5014) but highest answer_relevancy
- Missing value for noise_sensitivity in the Ensemble retriever
**Retriever Recommendations:**
- For maximum context inclusion: Ensemble (perfect recall)
- For factual answers: Naive or Contextual Compression
- For highly relevant answers: Parent Document
- Most balanced overall: Multi-Query (strong in most categories)
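For context, a minimal sketch of how the six metrics reported above can be computed with RAGAS from per-retriever question/context/answer records. This assumes the ragas 0.2-style API, GPT-4o-mini as the evaluator model, and OpenAI embeddings; the gist does not state which versions or models were actually used:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import EvaluationDataset, evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    ContextEntityRecall,
    FactualCorrectness,
    Faithfulness,
    LLMContextRecall,
    NoiseSensitivity,
    ResponseRelevancy,
)

# One record per test question: the question, the contexts the retriever
# returned, the generated answer, and a reference ("ground truth") answer.
records = [
    {
        "user_input": "How does John Wick get pulled back into the assassin world?",
        "retrieved_contexts": ["<review chunk 1>", "<review chunk 2>"],
        "response": "<generated answer>",
        "reference": "<ground-truth answer>",
    },
    # ... one entry per evaluation question
]

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

result = evaluate(
    dataset=EvaluationDataset.from_list(records),
    metrics=[
        LLMContextRecall(),    # context_recall
        Faithfulness(),        # faithfulness
        FactualCorrectness(),  # factual_correctness
        ResponseRelevancy(),   # answer_relevancy
        ContextEntityRecall(), # context_entity_recall
        NoiseSensitivity(),    # noise_sensitivity
    ],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)
print(result)
```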
## LangSmith Traces
The table below, assembled from the LangSmith traces, shows performance metrics for the different retrieval methods, including:
- Total number of runs
- Total tokens processed
- Median tokens per run
- P50 latency (median response time)
- P99 latency (99th percentile response time)
| Retriever | Total Runs | Total Tokens | Median Tokens | P50 Latency | P99 Latency |
|-----------|------------|--------------|---------------|-------------|-------------|
| Ensemble | 12 | 71,271 | 6,029 | 5.45 sec | 8.49 sec |
| Parent Document | 12 | 9,919 | 684 | 2.48 sec | 4.78 sec |
| Naive Retrieval | 12 | 45,197 | 3,855 | 3.01 sec | 5.00 sec |
| BM25 | 12 | 19,458 | 1,636 | 2.78 sec | 6.28 sec |
| Multi-Query Retrieval | 12 | 63,345 | 5,400 | 5.68 sec | 7.86 sec |
| Contextual Compression | 12 | 15,137 | 1,275 | 2.90 sec | 4.28 sec |
| Semantic Retrieval | 12 | 39,358 | 3,189 | 3.77 sec | 5.82 sec |
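Aggregates like these can be pulled with the LangSmith SDK. A minimal sketch, assuming one tracing project per retriever (the project name below is a placeholder, not the gist's actual project name):

```python
import numpy as np
from langsmith import Client

client = Client()

# Hypothetical project name -- substitute the tracing project for each retriever.
runs = list(client.list_runs(project_name="john-wick-ensemble", is_root=True))

tokens = [run.total_tokens or 0 for run in runs]
latencies = [
    (run.end_time - run.start_time).total_seconds()
    for run in runs
    if run.end_time is not None
]

print("Total runs:   ", len(runs))
print("Total tokens: ", sum(tokens))
print("Median tokens:", int(np.median(tokens)))
print("P50 latency:  ", round(float(np.percentile(latencies, 50)), 2), "sec")
print("P99 latency:  ", round(float(np.percentile(latencies, 99)), 2), "sec")
```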
**Key Takeaways:**
- **💡 Total Tokens Influence Latency:** There is a general trend that as the total number of tokens processed by a retriever increases, both the median (P50) and 99th-percentile (P99) latencies tend to increase as well.
For example, the Ensemble and Multi-Query Retrieval methods, which have the highest total token counts (71,271 and 63,345 respectively), also exhibit some of the higher P50 latencies (5.45 sec and 5.68 sec) and P99 latencies (8.49 sec and 7.86 sec).
- **💡 Retrieval Method Impacts Efficiency:** Even with similar total token counts, different retrieval methods can show different latency profiles.
For instance, Naive Retrieval (45,197 total tokens) has a P50 latency of 3.01 sec and a P99 latency of 5.00 sec, while Semantic Retrieval (39,358 total tokens) has a P50 latency of 3.77 sec and a P99 latency of 5.82 sec. This suggests that the underlying retrieval algorithm plays a significant role in determining latency.
- **💡 Parent Document Method Is Efficient:** The Parent Document retriever stands out for its efficiency, processing the fewest total tokens (9,919) and achieving the lowest P50 latency (2.48 sec) and P99 latency (4.78 sec).
This indicates that the method is quicker and processes less data than the others in the dataset.
## Comprehensive Retriever Analysis: RAGAS Metrics & LangSmith Traces
Based on the RAGAS metrics and LangSmith traces, here's a summary of each retriever's strengths and weaknesses:
### Ensemble Retriever
- 👍🏻 Strengths: Perfect context recall (1.0), highest faithfulness (0.9007), excellent overall performance
- 👎🏻 Weaknesses: Missing noise sensitivity data, potentially higher computational cost
- 👌🏻 Best for: Critical applications where comprehensive context retrieval and factual accuracy are paramount
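A minimal construction sketch, combining the BM25 and naive retrievers sketched earlier with equal weights (the member retrievers and weights are assumptions; the gist does not state its exact ensemble configuration):

```python
from langchain.retrievers import EnsembleRetriever

# Fuses rankings from the sparse (BM25) and dense (naive) retrievers.
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, naive_retriever],
    weights=[0.5, 0.5],  # assumed equal weighting
)

docs = ensemble_retriever.invoke("Is John Wick 4 a fitting end to the franchise?")
```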
### Multi-Query Retriever
- 👍🏻 Strengths: Excellent context recall (0.95), strong faithfulness (0.8978), best entity recall (0.5740)
- 👎🏻 Weaknesses: Higher noise sensitivity (0.3871), moderate factual correctness
- 👌🏻 Best for: Complex queries requiring multiple perspectives, especially when entity recognition is important
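A minimal construction sketch, assuming GPT-4o-mini as the query-rewriting model (the gist does not state which LLM it used):

```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

# The LLM rewrites the user question into several variants; documents retrieved
# for all variants are merged and de-duplicated.
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever,
    llm=ChatOpenAI(model="gpt-4o-mini"),
)

docs = multi_query_retriever.invoke("How does the fight choreography evolve across the films?")
```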
### Naive Retriever
- 👍🏻 Strengths: Leading in factual correctness (0.5950), good balance of metrics
- 👎🏻 Weaknesses: Not exceptional in any single area besides factual correctness
- 👌🏻 Best for: General-purpose applications where balanced performance is needed
### Semantic Retriever
- 👍🏻 Strengths: Strong context recall (0.8556), good faithfulness (0.8692)
- 👎🏻 Weaknesses: Moderate factual correctness, higher noise sensitivity
- 👌🏻 Best for: Queries requiring semantic understanding rather than keyword matching
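A minimal sketch of semantic chunking with LangChain's experimental SemanticChunker, splitting on embedding-similarity breakpoints before indexing (the breakpoint settings and FAISS store are assumptions):

```python
from langchain_community.vectorstores import FAISS
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Split where consecutive sentences become semantically dissimilar,
# instead of at fixed character counts.
semantic_splitter = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")
semantic_chunks = semantic_splitter.split_documents(documents)

semantic_retriever = FAISS.from_documents(semantic_chunks, embeddings).as_retriever(
    search_kwargs={"k": 10}
)
```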
### Contextual Compression Retriever
- 👍🏻 Strengths: Strong factual correctness (0.5908), efficient context filtering
- 👎🏻 Weaknesses: Lower answer relevancy (0.8014) compared to others
- 👌🏻 Best for: Applications where precision is more important than recall
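A minimal sketch using an LLM-based extractor as the compressor; a reranker such as Cohere Rerank is another common choice, and the gist does not state which compressor it used:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# The compressor trims each retrieved document down to the passages that are
# actually relevant to the query before they reach the answer-generating LLM.
compressor = LLMChainExtractor.from_llm(ChatOpenAI(model="gpt-4o-mini"))

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=naive_retriever,
)
```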
### BM25 Retriever
- 👍🏻 Strengths: Good factual correctness (0.5550), lower noise sensitivity than some alternatives
- 👎🏻 Weaknesses: Lower context recall (0.7625) than vector-based methods
- 👌🏻 Best for: Keyword-heavy queries, especially when computational efficiency matters
### Parent Document Retriever
- 👍🏻 Strengths: Highest answer relevancy (0.9651), good factual correctness (0.5867)
- 👎🏻 Weaknesses: Lowest context recall (0.5014)
- 👌🏻 Best for: Applications where highly relevant answers matter more than comprehensive context
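A minimal construction sketch: small child chunks are embedded for search, but the larger parent chunks are what get returned to the LLM. The splitter sizes and in-memory stores are assumptions, not the gist's actual settings:

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_document_retriever = ParentDocumentRetriever(
    vectorstore=InMemoryVectorStore(embedding=OpenAIEmbeddings()),  # indexes the small child chunks
    docstore=InMemoryStore(),                                       # stores the larger parent chunks
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=200),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=1500),
)
parent_document_retriever.add_documents(documents)
```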
### Key Insights
- 💡 Trade-offs are evident: No single retriever excels in all metrics, highlighting the importance of choosing based on specific needs.
- 💡 Ensemble shows the power of combination: By combining multiple retrievers, the Ensemble approach achieves the best overall performance, particularly in context recall and faithfulness.
- 💡 Relevancy vs. recall: The Parent Document Retriever demonstrates that high answer relevancy can be achieved even with lower context recall.
- 💡 Computational considerations: While not directly measured by RAGAS, the LangSmith traces suggest varying computational demands across retrievers.
### Recommendations
- ✅ For maximum accuracy: Use the Ensemble Retriever when computational resources allow
- ✅ For balanced performance: Multi-Query or Naive Retrievers offer good all-around capabilities
- ✅ For efficiency with good results: BM25 provides solid performance with lower computational needs
- ✅ For highly relevant answers: Use the Parent Document Retriever when answer quality matters more than comprehensive context