@danyaljj
Created September 12, 2021 22:04
show_unpublished_scores: true
datasets:
  blind_labels: danielk/genie_labels
evaluator:
  image: jbragg/genie-evaluator
  input_path: /preds/
  predictions_filename: predictions.json
  label_path: /labels/
  output_path: /results
  arguments:
    - python
    - /app/evaluator.py
    - '--gold_label_file'
    - /labels/wmt21_gold_outputs.json
    - '--prediction_file'
    - /preds/predictions.json
    - '--output'
    - /results/metrics.json
  # Submissions are ranked by the first metric.
  metrics:
    - key: human_mean
      display_name: "Human"
      supplemental: true
    - key: human_mean_ci_upper
      display_name: "Human (+err)"
      supplemental: true
    - key: human_mean_ci_lower
      display_name: "Human (-err)"
      supplemental: true
    - key: bert_score
      display_name: "BERTScore"
    - key: rouge
      display_name: "ROUGE"
    - key: meteor
      display_name: "METEOR"
    - key: sacrebleu
      display_name: "SacreBLEU"
    - key: bleurt
      display_name: "BLEURT"
metadata:
  tag_ids:
    - ai2
    - new
  logo: /assets/images/leaderboard/genie_mt/logo.svg
  short_name: GENIE - Machine Translation (WMT21)
  long_name: GENeratIve Evaluation (GENIE) - Machine Translation (WMT21)
  description:
    GENIE is an evaluation leaderboard for text generation across a diverse set of tasks.
    It contains a suite of seven existing datasets (direct-answer question answering, translation, common-sense reasoning, paraphrasing, etc.).
    The goal of this leaderboard is to provide evaluation for generative tasks and to encourage the AI community
    to think about more difficult challenges.
    This is the Machine Translation leaderboard component of GENIE.
  example: |
    For example, see the [Huggingface dataset explorer](https://huggingface.co/nlp/viewer/?dataset=wmt21&config=cs-en).
    The above link is for the cs-en config, but you should download the de-en config instead.
    You may not use any reference translations from the test split.
  getting_the_data: |
    Download the German-English portion of WMT'21, for example from HuggingFace:
    ```
    import datasets
    data = datasets.load_dataset('wmt21', 'de-en', version='1.0.1')
    ```
  scoring: |
    We measure performance in multiple ways. First, we have common automatic metrics for text
    generation tasks. In addition to the automatic metrics, we compute human scores
    for the predictions using crowdsourced annotations.
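    As a sanity check before submitting, you can reproduce one of the automatic metrics locally.
    Below is a minimal sketch using the `sacrebleu` Python package (an illustrative assumption;
    the leaderboard's evaluator may configure the metric differently):
    ```
    import sacrebleu

    hypotheses = ["The cat sat on the mat."]           # your system outputs
    references = [["The cat is sitting on the mat."]]  # one list per reference stream
    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(bleu.score)
    ```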
  predictions_format: |
    Your model should produce a `predictions.json` file that maps each id in the test split to your generated output.
    Here's an example of the file:
    ```
    {
      "[ID1]": "[OUTPUT1]",
      "[ID2]": "[OUTPUT2]",
      ...
    }
    ```
    If you imported your data using
    ```
    data = datasets.load_dataset('wmt21', 'de-en', version='1.0.1')
    ```
    then the test split is available as `data['test']`.
    You may not use any reference translations from the test split.
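    A minimal end-to-end sketch (the `id` and `translation` field names are assumptions about
    the dataset schema, and `translate` is a placeholder for your model's inference call):
    ```
    import json
    import datasets

    def translate(text):
        # Placeholder: substitute your model's inference here.
        return text

    data = datasets.load_dataset('wmt21', 'de-en', version='1.0.1')
    predictions = {
        ex['id']: translate(ex['translation']['de'])  # assumed field names
        for ex in data['test']
    }
    with open('predictions.json', 'w', encoding='utf-8') as f:
        json.dump(predictions, f, ensure_ascii=False, indent=2)
    ```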
    For further details, check [GENIE's website](https://genie.apps.allenai.org/submitting).
  example_models:
  team:
    name: AI2
    description: |
      The human evaluation on leaderboards is a joint project across different teams at the [Allen Institute for AI](https://allenai.org).
  purpose: |
    Part of AI2's mission is to measure the capabilities of state-of-the-art AI systems.
    This leaderboard collects evaluations of current AI systems on various commonsense/reasoning tasks that measure both the knowledge
    that these systems possess and their ability to reason with and use that knowledge in context.
  show_compact_table_button: true
  metric_precision: 2
disable_publish_speed_bump: true
metrics_table:
  columns:
    - name: Human
      description: "Human evaluation score computed by GENIE."
      renderer: error
      metric_keys: ["human_mean", "human_mean_ci_upper", "human_mean_ci_lower"]
    - name: BERTScore
      renderer: simple
      metric_keys:
        - bert_score
    - name: ROUGE
      renderer: simple
      metric_keys:
        - rouge
    - name: METEOR
      renderer: simple
      metric_keys:
        - meteor
    - name: SacreBLEU
      renderer: simple
      metric_keys:
        - sacrebleu
    - name: BLEURT
      renderer: simple
      metric_keys:
        - bleurt
@jungokasai
Commented Sep 13, 2021
show_unpublished_scores: true
datasets:
  blind_labels: danielk/genie_labels
evaluator:
  image: jbragg/genie-evaluator
  input_path: /preds/
  predictions_filename: predictions.json
  label_path: /labels/
  output_path: /results
  arguments:
    - python
    - /app/evaluator.py
    - '--gold_label_file'
    - /labels/wmt21_gold_outputs.json
    - '--prediction_file'
    - /preds/predictions.json
    - '--output'
    - /results/metrics.json
  # Submissions are ranked by the first metric.
  metrics:
    - key: human_mean
      display_name: "Human"
      supplemental: true
    - key: human_mean_ci_upper
      display_name: "Human (+err)"
      supplemental: true
    - key: human_mean_ci_lower
      display_name: "Human (-err)"
      supplemental: true
    - key: bert_score
      display_name: "BERTScore"
    - key: rouge
      display_name: "ROUGE"
    - key: meteor
      display_name: "METEOR"
    - key: sacrebleu
      display_name: "SacreBLEU"
    - key: bleurt
      display_name: "BLEURT"
metadata:
  tag_ids:
    - ai2
    - new
  logo: /assets/images/leaderboard/genie_mt/logo.svg
  short_name: GENIE - Machine Translation (WMT21)
  long_name: GENeratIve Evaluation (GENIE) - Machine Translation (WMT21)
  description:
    GENIE is an evaluation leaderboard for text generation across a diverse set of tasks.
    It contains a suite of seven existing datasets (direct-answer question answering, translation, common-sense reasoning, paraphrasing, etc.).
    The goal of this leaderboard is to provide evaluation for generative tasks and to encourage the AI community
    to think about more difficult challenges.

    This is the Machine Translation leaderboard component of GENIE.
  example: |
    You may not use any reference translations from the test split.
  getting_the_data: |
    You can download the German-English portion of WMT'21 from the [official website](http://statmt.org/wmt21/translation-task.html).
    To ease the process, however, we provide raw and preprocessed data as well as several transformer baselines:

    - [wmt2021-de-en_bpe32k.tar.gz](https://arkdata.cs.washington.edu/GENIE/wmt2021-de-en/data/wmt2021-de-en_bpe32k.tar.gz). We applied [Moses](https://github.com/moses-smt/mosesdecoder) tokenization and fastBPE with 32K BPE operations that are learned from the training data.
    - [wmt2021-de-en_bpe32k_data-bin.tar.gz](https://arkdata.cs.washington.edu/GENIE/wmt2021-de-en/data/wmt2021-de-en_bpe32k_data-bin.tar.gz). Data binarized with [Fairseq-preprocess](https://github.com/pytorch/fairseq). If you use [fairseq](https://github.com/pytorch/fairseq), you only need this data.
    - [wmt2021-de-en.tar.gz](https://arkdata.cs.washington.edu/GENIE/wmt2021-de-en/data/wmt2021-de-en.tar.gz). Raw training data, created by concatenating all WMT 2021 DE-EN datasets. The dev data are a concatenation of newstest2019deen and newstest2020deen. Since newstest2019, the de-en data consist entirely of originally German text, to mitigate the translationese effect in evaluation ([Barrault et al., 2019](https://aclanthology.org/W19-5301/)).
    - [GENIE-large.de-en_6-6.tar.gz](https://arkdata.cs.washington.edu/GENIE/wmt2021-de-en/models/GENIE-large.de-en_6-6.tar.gz). A transformer large model with a 6-layer encoder and decoder (trained for 7 epochs).
    - [GENIE-base.de-en_6-6.tar.gz](https://arkdata.cs.washington.edu/GENIE/wmt2021-de-en/models/GENIE-base.de-en_6-6.tar.gz). A transformer base model with a 6-layer encoder and decoder (trained for 7 epochs).
    - [GENIE-base.de-en_3-3.tar.gz](https://arkdata.cs.washington.edu/GENIE/wmt2021-de-en/models/GENIE-base.de-en_3-3.tar.gz). A transformer base model with a 3-layer encoder and decoder (trained for 7 epochs).
    - [GENIE-base.de-en_1-1.tar.gz](https://arkdata.cs.washington.edu/GENIE/wmt2021-de-en/models/GENIE-base.de-en_1-1.tar.gz). A transformer base model with a 1-layer encoder and decoder (trained for 7 epochs).
    More details are available in [our repository](https://github.com/jungokasai/GENIE_wmt2021-de-en); a minimal loading sketch follows.
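    The sketch below loads one of the released checkpoints through fairseq's Python API. The
    directory and file names are assumptions about the archive contents; adjust them to what
    you actually unpack:
    ```
    from fairseq.models.transformer import TransformerModel

    # Load an unpacked checkpoint together with the binarized data and BPE codes.
    de2en = TransformerModel.from_pretrained(
        'GENIE-large.de-en_6-6',                            # assumed model directory
        checkpoint_file='checkpoint_best.pt',               # assumed checkpoint name
        data_name_or_path='wmt2021-de-en_bpe32k_data-bin',  # assumed data-bin directory
        bpe='fastbpe',
        bpe_codes='wmt2021-de-en_bpe32k/codes',             # assumed BPE codes path
        tokenizer='moses',
    )
    print(de2en.translate('Guten Morgen!'))
    ```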
        
  scoring: |
    We measure performance in multiple ways. First, we have common automatic metrics for text
    generation tasks. In addition to the automatic metrics, we compute human scores
    for the predictions using crowdsourced annotations.
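    To sanity-check your outputs against one of the automatic metrics locally, here is a
    minimal sketch with the `bert-score` package (an illustrative assumption; the evaluator's
    exact configuration may differ):
    ```
    from bert_score import score

    cands = ["The weather is nice today."]
    refs = ["Today the weather is pleasant."]
    P, R, F1 = score(cands, refs, lang="en")  # per-sentence precision/recall/F1 tensors
    print(F1.mean().item())
    ```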

  predictions_format: |
    Your model should produce a `predictions.json` file that maps each id in the test split to your generated output. We provide a [script](https://github.com/jungokasai/GENIE_wmt2021-de-en#genie-submission) that converts a plain-text output file to a JSON file keyed by the WMT 2021 test ID numbers.
    Here's an example of the file:
    ```
    {
      "[ID1]": "[OUTPUT1]",
      "[ID2]": "[OUTPUT2]",
      ...
    }
    ```

    For further details, check [GENIE's website](https://genie.apps.allenai.org/submitting). 
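    An illustrative conversion sketch (the linked script is the authoritative version; this
    assumes one output per line in `outputs.txt` and a parallel `test_ids.txt` with one test ID per line):
    ```
    import json

    # Pair each test ID with the corresponding output line, in order.
    with open('test_ids.txt') as f_ids, open('outputs.txt') as f_out:
        predictions = {i.strip(): o.strip() for i, o in zip(f_ids, f_out)}

    with open('predictions.json', 'w', encoding='utf-8') as f:
        json.dump(predictions, f, ensure_ascii=False, indent=2)
    ```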

  team:
    name: AI2
    description: |
      The human evaluation on leaderboards is a joint project across different teams at the [Allen Institute for AI](https://allenai.org).

  purpose: |
    Part of AI2's mission is to measure the capabilities of state-of-the-art AI systems.
    This leaderboard collects evaluations of current AI systems on various commonsense/reasoning tasks that measure both the knowledge
    that these systems possess and their ability to reason with and use that knowledge in context.
  show_compact_table_button: true
  metric_precision: 2
disable_publish_speed_bump: true
metrics_table:
  columns:
    - name: Human
      description: "Human evaluation score computed by GENIE."
      renderer: error
      metric_keys: ["human_mean", "human_mean_ci_upper", "human_mean_ci_lower"]
    - name: BERTScore
      renderer: simple
      metric_keys:
        - bert_score
    - name: ROUGE
      renderer: simple
      metric_keys:
        - rouge
    - name: METEOR
      renderer: simple
      metric_keys:
        - meteor
    - name: SacreBLEU
      renderer: simple
      metric_keys:
        - sacrebleu
    - name: BLEURT
      renderer: simple
      metric_keys:
        - bleurt
