SOTA in Summarization according to the HELM benchmark

Listed here are some key points relevant to the task of text summarization by large language models and its evaluation, as reported in the HELM benchmark.

Problem setting

Text summarization is formulated as an unstructured sequence-to-sequence problem, where a document is the input and the LM is tasked with generating a summary resembling the reference summary.
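
To make this formulation concrete, here is a minimal sketch of how a summarization prompt for a language model might be assembled. The function name and template are illustrative assumptions, not the exact prompt format used in HELM.

```python
def build_summarization_prompt(article, examples=None, max_examples=5):
    """Assemble a zero-shot or few-shot summarization prompt.

    `examples` is an optional list of (article, summary) pairs used as
    in-context demonstrations. The template is an assumption for
    illustration, not the exact prompt used by the HELM authors.
    """
    parts = []
    for demo_article, demo_summary in (examples or [])[:max_examples]:
        parts.append(f"Article: {demo_article}\nSummary: {demo_summary}\n")
    # The model is expected to continue the prompt with the summary.
    parts.append(f"Article: {article}\nSummary:")
    return "\n".join(parts)


# Zero-shot usage: no demonstrations, the LM completes the summary itself.
prompt = build_summarization_prompt("Text of the document to summarize ...")
```

The same function covers both settings: zero-shot (no demonstrations) and few-shot (demonstrations prepended before the target article).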

Automatic Evaluation

  • ROUGE-2 correlated with more accurate models; in particular, a strong correlation with model size was found (a minimal ROUGE-2 computation is sketched below).
  • The relationship between model quality and the degree of abstraction was highly variable.
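
For reference, ROUGE-2 can be computed with the `rouge_score` package; the snippet below is a minimal sketch with made-up reference and candidate strings.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# ROUGE-2 measures bigram overlap between a candidate summary and a reference.
scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

reference = "The council approved the new housing plan on Tuesday."
candidate = "On Tuesday the council approved a new plan for housing."

scores = scorer.score(reference, candidate)  # score(target, prediction)
print(scores["rouge2"].precision, scores["rouge2"].recall, scores["rouge2"].fmeasure)
```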

Human Evaluation

The authors evaluated 100 examples from CNN/Dailymail and XSum for six models: Anthropic-LM v4-s3 (52B), Cohere xlarge v20220609 (52.4B), OPT (175B), GPT-3 davinci v1 (175B), InstructGPT davinci v2 (175B), and GLM (130B).

Additionally, four zero-shot models were evaluated: GPT-3 davinci v1 (175B), GPT-3 curie v1 (6.7B), InstructGPT davinci v2 (175B), and InstructGPT davinci v2 zero-shot (175B).

Finally, two models finetuned on the datasets were also evaluated: BRIO and Pegasus. In total, 13 sets of summaries were evaluated for each dataset (6 few-shot, 4 zero-shot, 2 finetuned, and the reference summaries).

Three quality criteria were evaluated: faithfulness, relevance, and coherence. A summary is considered to be faithful if "all the information expressed by the summary can be inferred from the article". A summary is considered relevant if it "includes only important information from the source document", and coherent if it "organizes the relevant information into a well-structured summary". Faithfulness was evaluated on a binary scale, while relevance and coherence were evaluated on a 1–5 Likert scale. Each summary was annotated by 3 workers (a minimal sketch of aggregating such annotations follows the findings below). Here are the findings from the human evaluation.

  • Reference summaries are of low quality. Especially for XSum, they are worse in terms of faithfulness than the outputs of all evaluated models.
  • Reference summaries outperform zero-shot models only on the CNN/Dailymail dataset.
  • For faithfulness: zero-shot models > few-shot models > finetuned models.
  • Instruction tuning is crucial for improving accuracy.
  • Human evaluations are anti-correlated with automated evaluations: ROUGE-2 scores favor fine-tuned models, whereas human judgments prefer few-shot or zero-shot language models.
  • Automated faithfulness measures are not reliable for evaluating few-shot or zero-shot models.
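
To make the annotation protocol concrete, here is a minimal sketch of how the three workers' judgments for one summary could be aggregated: majority vote for the binary faithfulness label and the mean rating for the Likert-scale criteria. This aggregation scheme is an assumption for illustration, not necessarily the one used by the HELM authors.

```python
from statistics import mean

def aggregate_annotations(annotations):
    """Aggregate per-summary judgments from multiple workers.

    `annotations` is a list of dicts such as
    {"faithful": True, "relevance": 4, "coherence": 5}.
    """
    # Majority vote for the binary faithfulness judgment.
    faithful_votes = sum(a["faithful"] for a in annotations)
    return {
        "faithful": faithful_votes > len(annotations) / 2,
        # Mean rating for the 1-5 Likert-scale criteria.
        "relevance": mean(a["relevance"] for a in annotations),
        "coherence": mean(a["coherence"] for a in annotations),
    }


# Example: three workers rate one summary.
print(aggregate_annotations([
    {"faithful": True, "relevance": 4, "coherence": 5},
    {"faithful": True, "relevance": 3, "coherence": 4},
    {"faithful": False, "relevance": 4, "coherence": 4},
]))
```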

TL;DR: CNN/Dailymail and XSum are insufficient datasets for studying true progress in neural text summarization. We need higher-quality evaluation data and its adoption as mainstream, along with further studies on the quality of summarization metrics.
