Listed here are some key points relevant to the task of text summarization by large language models and their evaluation as per the HELM benchmark.
Text summarization is formulated as an unstructured sequence-to-sequence problem, where a document is the input and the LM is tasked with generating a summary resembling the reference summary.
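This sequence-to-sequence framing is typically realized as an in-context prompt: demonstration (article, summary) pairs followed by the article to summarize. A minimal sketch, where the template and delimiters are illustrative assumptions rather than HELM's exact prompt format:

```python
def build_summarization_prompt(examples, document):
    """Build a few-shot summarization prompt from (article, summary) pairs.

    `examples` holds the in-context demonstrations; `document` is the
    article whose summary the LM is asked to generate. The "Article:" /
    "Summary:" template is illustrative, not HELM's actual one.
    """
    parts = []
    for article, summary in examples:
        parts.append(f"Article: {article}\nSummary: {summary}\n")
    # End with a bare "Summary:" so the model completes it.
    parts.append(f"Article: {document}\nSummary:")
    return "\n".join(parts)
```

With an empty `examples` list the same template yields a zero-shot prompt, which is how the zero-shot models below were queried.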
- ROUGE-2 correlated with model accuracy; in particular, a strong correlation with model size was found.
- The relationship between model quality and degree of abstraction was highly variable.
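For reference, ROUGE-2 measures bigram overlap between a candidate summary and the reference. A self-contained sketch of the F1 variant, using plain whitespace tokenization (standard toolkits additionally apply stemming):

```python
from collections import Counter

def rouge2_f1(candidate: str, reference: str) -> float:
    """ROUGE-2 F1: harmonic mean of bigram precision and recall.

    Tokenization is lowercased whitespace splitting; no stemming,
    unlike common implementations such as the rouge_score package.
    """
    def bigrams(text):
        toks = text.lower().split()
        return Counter(zip(toks, toks[1:]))

    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because the score rewards n-gram overlap with the reference, it tends to favor extractive, reference-like outputs, which is relevant to the metric findings below.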
The authors evaluated 100 examples each from CNN/DailyMail and XSum for six few-shot models: Anthropic-LM v4-s3 (52B), Cohere xlarge v20220609 (52.4B), OPT (175B), GPT-3 davinci v1 (175B), InstructGPT davinci v2 (175B), and GLM (130B).
Additionally, four zero-shot models were evaluated: GPT-3 davinci v1 (175B), GPT-3 curie v1 (6.7B), InstructGPT davinci v2 (175B), and InstructGPT davinci v2 zero-shot (175B).
Finally, two models finetuned on these datasets were also evaluated: BRIO and Pegasus. In total, 13 sets of summaries were evaluated per dataset (6 few-shot, 4 zero-shot, 2 finetuned, plus the reference summaries).
Three quality criteria were evaluated: faithfulness, relevance, and coherence. A summary is faithful if "all the information expressed by the summary can be inferred from the article"; relevant if it "includes only important information from the source document"; and coherent if it "organizes the relevant information into a well-structured summary". Faithfulness was rated on a binary scale, while relevance and coherence were rated on a 1-5 Likert scale. Each summary was annotated by three workers. The findings from the human evaluation are as follows.
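The three annotators' judgments per summary must be aggregated before comparison. One plausible scheme (an assumption here, not necessarily HELM's exact aggregation) is majority vote for the binary faithfulness label and the mean for the Likert scores:

```python
from statistics import mean

def aggregate_annotations(faithful_votes, relevance_scores, coherence_scores):
    """Combine per-annotator judgments for one summary.

    faithful_votes: list of bools (binary faithfulness labels).
    relevance_scores / coherence_scores: lists of 1-5 Likert ratings.
    Majority vote for faithfulness, mean for the Likert criteria;
    this aggregation scheme is illustrative.
    """
    return {
        "faithful": sum(faithful_votes) > len(faithful_votes) / 2,
        "relevance": mean(relevance_scores),
        "coherence": mean(coherence_scores),
    }
```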
- Reference summaries are of low quality; on XSum in particular, they score worse on faithfulness than all models.
- Reference summaries outperform zero-shot models only on the CNN/DailyMail dataset.
- For faithfulness: zero-shot models > few-shot models > finetuned models.
- Instruction tuning is crucial for improving accuracy.
- Human evaluations are anti-correlated with automated evaluations: ROUGE-2 scores favor finetuned models, whereas human judgments prefer few-shot or zero-shot language models.
- Automated faithfulness measures are not reliable for evaluating few-shot or zero-shot models.
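The anti-correlation claim above can be checked by computing a rank correlation between system-level metric scores and human ratings. A sketch of Spearman's rho from first principles (the system scores one would feed in are hypothetical; a negative rho indicates anti-correlation):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation between two score lists.

    Assumes no tied values, so ranks are a simple permutation;
    real toolkits (e.g. SciPy) also handle ties.
    """
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Applied to per-system ROUGE-2 scores versus mean human relevance ratings, a rho near -1 would reflect the finding that the metric and the annotators rank systems in nearly opposite order.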
TL;DR: CNN/DailyMail and XSum are insufficient datasets for studying true progress in neural text summarization. We need higher-quality evaluation data and its adoption as mainstream, along with further studies of the quality of summarization metrics.