@codelion
Created April 13, 2024 21:35
| Model         | ARC  | HellaSwag | MMLU                       | TruthfulQA | Winogrande | GSM8K |
|---------------|------|-----------|----------------------------|------------|------------|-------|
| mera-mix-4x7B | 65.7 | 84.73     | Error: File does not exist | 51.03      | 79.48      | 66.34 |
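
These are per-task scores in the style of EleutherAI's lm-evaluation-harness across the six Open LLM Leaderboard tasks. As a rough guide to reproducing them, here is a minimal sketch using the harness's Python API; the harness version, the few-shot counts, and the Hugging Face repo id are assumptions (the gist does not state them), not confirmed settings for this run:

```python
# Sketch only: assumes lm-evaluation-harness v0.4 (pip install lm-eval)
# and the Open LLM Leaderboard's customary few-shot settings.
import lm_eval

# Few-shot counts per task (leaderboard convention; not confirmed by this run).
FEWSHOT = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa": 0,
    "winogrande": 5,
    "gsm8k": 5,
}

results = {}
for task, shots in FEWSHOT.items():
    out = lm_eval.simple_evaluate(
        model="hf",
        # Hypothetical repo id; substitute the actual Hugging Face path.
        model_args="pretrained=meraGPT/mera-mix-4x7B",
        tasks=[task],
        num_fewshot=shots,
        batch_size="auto",
    )
    results[task] = out["results"][task]

print(results)
```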

ARC

| Task          | Version | Metric        | Value         | Stderr |
|---------------|---------|---------------|---------------|--------|
| arc_challenge | 1       | acc,none      | 0.62          | 0.01   |
|               |         | acc_norm,none | 0.66          | 0.01   |
|               |         | alias         | arc_challenge |        |

Average: 65.7%

HellaSwag

| Task      | Version | Metric        | Value     | Stderr |
|-----------|---------|---------------|-----------|--------|
| hellaswag | 1       | acc,none      | 0.66      | 0      |
|           |         | acc_norm,none | 0.85      | 0      |
|           |         | alias         | hellaswag |        |

Average: 84.73%

MMLU

Average: not available (Error: File does not exist)

TruthfulQA

| Task           | Version | Metric           | Value            | Stderr |
|----------------|---------|------------------|------------------|--------|
| truthfulqa     | N/A     | bleu_max,none    | 30.01            | 0.82   |
|                |         | rouge2_acc,none  | 0.42             | 0.02   |
|                |         | bleu_diff,none   | 2.98             | 0.94   |
|                |         | rouge2_max,none  | 42.78            | 1.02   |
|                |         | rougeL_max,none  | 53.62            | 0.87   |
|                |         | rougeL_diff,none | 4.03             | 1.21   |
|                |         | acc,none         | 0.43             | 0.01   |
|                |         | rouge1_max,none  | 56.89            | 0.84   |
|                |         | bleu_acc,none    | 0.47             | 0.02   |
|                |         | rouge2_diff,none | 3.64             | 1.33   |
|                |         | rougeL_acc,none  | 0.46             | 0.02   |
|                |         | rouge1_diff,none | 4.62             | 1.19   |
|                |         | rouge1_acc,none  | 0.47             | 0.02   |
|                |         | alias            | truthfulqa       |        |
| truthfulqa_gen | 3       | bleu_max,none    | 30.01            | 0.82   |
|                |         | bleu_acc,none    | 0.47             | 0.02   |
|                |         | bleu_diff,none   | 2.98             | 0.94   |
|                |         | rouge1_max,none  | 56.89            | 0.84   |
|                |         | rouge1_acc,none  | 0.47             | 0.02   |
|                |         | rouge1_diff,none | 4.62             | 1.19   |
|                |         | rouge2_max,none  | 42.78            | 1.02   |
|                |         | rouge2_acc,none  | 0.42             | 0.02   |
|                |         | rouge2_diff,none | 3.64             | 1.33   |
|                |         | rougeL_max,none  | 53.62            | 0.87   |
|                |         | rougeL_acc,none  | 0.46             | 0.02   |
|                |         | rougeL_diff,none | 4.03             | 1.21   |
|                |         | alias            | - truthfulqa_gen |        |
| truthfulqa_mc1 | 2       | acc,none         | 0.35             | 0.02   |
|                |         | alias            | - truthfulqa_mc1 |        |
| truthfulqa_mc2 | 2       | acc,none         | 0.51             | 0.02   |
|                |         | alias            | - truthfulqa_mc2 |        |

Average: 51.03%

Winogrande

| Task       | Version | Metric   | Value      | Stderr |
|------------|---------|----------|------------|--------|
| winogrande | 1       | acc,none | 0.79       | 0.01   |
|            |         | alias    | winogrande |        |

Average: 79.48%

GSM8K

| Task  | Version | Metric                       | Value | Stderr |
|-------|---------|------------------------------|-------|--------|
| gsm8k | 3       | exact_match,strict-match     | 0.66  | 0.01   |
|       |         | exact_match,flexible-extract | 0.62  | 0.01   |
|       |         | alias                        | gsm8k |        |

Average: 66.34%

Average score: not available due to errors (the MMLU results file was missing)
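
With MMLU missing, the six-task leaderboard mean cannot be formed, but a mean over the five tasks that did complete can still be computed from the per-task averages above. A quick sketch:

```python
# Per-task averages from this run; MMLU omitted because its results
# file was missing ("File does not exist").
scores = {
    "ARC": 65.70,
    "HellaSwag": 84.73,
    "TruthfulQA": 51.03,
    "Winogrande": 79.48,
    "GSM8K": 66.34,
}
partial_mean = sum(scores.values()) / len(scores)
print(f"Mean over {len(scores)} of 6 tasks: {partial_mean:.2f}%")  # 69.46%
```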

Elapsed time: 07:16:25
