@mlabonne
Created April 15, 2024 21:53
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| WizardLM-2-7B | 35.76 | 68.56 | 56.46 | 38.24 | 49.76 |

AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| agieval_aqua_rat | 0 | acc | 21.65 | ± 2.59 |
| | | acc_norm | 20.87 | ± 2.55 |
| agieval_logiqa_en | 0 | acc | 30.41 | ± 1.80 |
| | | acc_norm | 31.34 | ± 1.82 |
| agieval_lsat_ar | 0 | acc | 21.30 | ± 2.71 |
| | | acc_norm | 22.17 | ± 2.75 |
| agieval_lsat_lr | 0 | acc | 32.55 | ± 2.08 |
| | | acc_norm | 35.49 | ± 2.12 |
| agieval_lsat_rc | 0 | acc | 48.33 | ± 3.05 |
| | | acc_norm | 47.58 | ± 3.05 |
| agieval_sat_en | 0 | acc | 61.17 | ± 3.40 |
| | | acc_norm | 62.14 | ± 3.39 |
| agieval_sat_en_without_passage | 0 | acc | 40.29 | ± 3.43 |
| | | acc_norm | 37.86 | ± 3.39 |
| agieval_sat_math | 0 | acc | 28.64 | ± 3.05 |
| | | acc_norm | 28.64 | ± 3.05 |

Average: 35.76%

GPT4All

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| arc_challenge | 0 | acc | 50.17 | ± 1.46 |
| | | acc_norm | 51.19 | ± 1.46 |
| arc_easy | 0 | acc | 77.02 | ± 0.86 |
| | | acc_norm | 70.96 | ± 0.93 |
| boolq | 1 | acc | 84.71 | ± 0.63 |
| hellaswag | 0 | acc | 63.43 | ± 0.48 |
| | | acc_norm | 81.07 | ± 0.39 |
| openbookqa | 0 | acc | 32.60 | ± 2.10 |
| | | acc_norm | 42.60 | ± 2.21 |
| piqa | 0 | acc | 78.78 | ± 0.95 |
| | | acc_norm | 78.45 | ± 0.96 |
| winogrande | 0 | acc | 70.96 | ± 1.28 |

Average: 68.56%

TruthfulQA

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| truthfulqa_mc | 1 | mc1 | 38.19 | ± 1.70 |
| | | mc2 | 56.46 | ± 1.58 |

Average: 56.46%

Bigbench

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 54.74 | ± 3.62 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 62.87 | ± 2.52 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 34.11 | ± 2.96 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 17.83 | ± 2.02 |
| | | exact_str_match | 8.91 | ± 1.51 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 27.40 | ± 2.00 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 16.86 | ± 1.42 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 45.00 | ± 2.88 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 40.60 | ± 2.20 |
| bigbench_navigate | 0 | multiple_choice_grade | 50.10 | ± 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 50.70 | ± 1.12 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 40.18 | ± 2.32 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 16.93 | ± 1.19 |
| bigbench_snarks | 0 | multiple_choice_grade | 53.04 | ± 3.72 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 58.42 | ± 1.57 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 36.20 | ± 1.52 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 22.88 | ± 1.19 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 15.43 | ± 0.86 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 45.00 | ± 2.88 |

Average: 38.24%

Average score: 49.76%

Elapsed time: 02:23:11
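
The reported averages can be reproduced from the tables above with a simple mean over one metric per task: acc_norm where it is reported, otherwise acc, mc2 for TruthfulQA, and multiple_choice_grade for Bigbench; the overall score is the mean of the four suite averages. A minimal sketch under that assumption (the helper name `suite_average` is illustrative, not part of the evaluation script):

```python
def suite_average(task_scores: dict) -> float:
    """Mean of one preferred metric per task, in percent."""
    return sum(task_scores.values()) / len(task_scores)

# AGIEval: acc_norm per subtask (values taken from the table above).
agieval = suite_average({
    "agieval_aqua_rat": 20.87, "agieval_logiqa_en": 31.34,
    "agieval_lsat_ar": 22.17, "agieval_lsat_lr": 35.49,
    "agieval_lsat_rc": 47.58, "agieval_sat_en": 62.14,
    "agieval_sat_en_without_passage": 37.86, "agieval_sat_math": 28.64,
})  # ~35.76

# GPT4All: acc_norm where reported, acc for boolq and winogrande.
gpt4all = suite_average({
    "arc_challenge": 51.19, "arc_easy": 70.96, "boolq": 84.71,
    "hellaswag": 81.07, "openbookqa": 42.60, "piqa": 78.45,
    "winogrande": 70.96,
})  # ~68.56

truthfulqa = 56.46  # mc2
bigbench = 38.24    # mean of the 18 multiple_choice_grade values above

overall = (agieval + gpt4all + truthfulqa + bigbench) / 4
print(f"AGIEval {agieval:.2f} | GPT4All {gpt4all:.2f} | Average {overall:.2f}")
# AGIEval 35.76 | GPT4All 68.56 | Average 49.76
```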
