@tosh
Created April 5, 2024 10:37
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.2 | 38.5 | 71.64 | 66.82 | 42.29 | 54.81 |

AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 23.62 | ± 2.67 |
| | | acc_norm | 22.05 | ± 2.61 |
| agieval_logiqa_en | 0 | acc | 36.10 | ± 1.88 |
| | | acc_norm | 36.56 | ± 1.89 |
| agieval_lsat_ar | 0 | acc | 21.30 | ± 2.71 |
| | | acc_norm | 19.13 | ± 2.60 |
| agieval_lsat_lr | 0 | acc | 38.24 | ± 2.15 |
| | | acc_norm | 38.04 | ± 2.15 |
| agieval_lsat_rc | 0 | acc | 52.79 | ± 3.05 |
| | | acc_norm | 49.81 | ± 3.05 |
| agieval_sat_en | 0 | acc | 68.93 | ± 3.23 |
| | | acc_norm | 67.96 | ± 3.26 |
| agieval_sat_en_without_passage | 0 | acc | 43.20 | ± 3.46 |
| | | acc_norm | 40.78 | ± 3.43 |
| agieval_sat_math | 0 | acc | 35.91 | ± 3.24 |
| | | acc_norm | 33.64 | ± 3.19 |

Average: 38.5%

GPT4All

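| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| arc_challenge | 0 | acc | 54.61 | ± 1.45 |
| | | acc_norm | 55.97 | ± 1.45 |
| arc_easy | 0 | acc | 81.44 | ± 0.80 |
| | | acc_norm | 76.77 | ± 0.87 |
| boolq | 1 | acc | 85.26 | ± 0.62 |
| hellaswag | 0 | acc | 66.07 | ± 0.47 |
| | | acc_norm | 83.66 | ± 0.37 |
| openbookqa | 0 | acc | 35.40 | ± 2.14 |
| | | acc_norm | 45.20 | ± 2.23 |
| piqa | 0 | acc | 80.41 | ± 0.93 |
| | | acc_norm | 80.58 | ± 0.92 |
| winogrande | 0 | acc | 74.03 | ± 1.23 |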

Average: 71.64%
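
The source does not say how this suite average is computed. As a sketch under an assumed rule (take acc_norm where a task reports it, otherwise fall back to acc, then take the unweighted mean), the GPT4All figure can be reproduced from the rows above. The function name `suite_average` is hypothetical and only used for illustration.

```python
# GPT4All rows copied from the table above: (task, acc, acc_norm).
# acc_norm is None where the table reports only acc.
rows = [
    ("arc_challenge", 54.61, 55.97),
    ("arc_easy", 81.44, 76.77),
    ("boolq", 85.26, None),
    ("hellaswag", 66.07, 83.66),
    ("openbookqa", 35.40, 45.20),
    ("piqa", 80.41, 80.58),
    ("winogrande", 74.03, None),
]


def suite_average(rows):
    """Unweighted mean, preferring acc_norm and falling back to acc.

    ASSUMPTION: this selection rule is inferred from the numbers,
    not stated anywhere in the source.
    """
    picked = [acc if acc_norm is None else acc_norm for _, acc, acc_norm in rows]
    return sum(picked) / len(picked)


gpt4all_average = round(suite_average(rows), 2)
print(gpt4all_average)
```

Under the same assumption, averaging only the acc_norm rows of the AGIEval table gives 38.50, which matches the reported 38.5%.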

TruthfulQA

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 52.39 | ± 1.75 |
| | | mc2 | 66.82 | ± 1.52 |

Average: 66.82%

Bigbench

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 54.21 | ± 3.62 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 66.12 | ± 2.47 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 40.70 | ± 3.06 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 21.17 | ± 2.16 |
| | | exact_str_match | 9.47 | ± 1.55 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 29.80 | ± 2.05 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 20.57 | ± 1.53 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 45.33 | ± 2.88 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 34.20 | ± 2.12 |
| bigbench_navigate | 0 | multiple_choice_grade | 41.90 | ± 1.56 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 60.55 | ± 1.09 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 54.46 | ± 2.36 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 35.17 | ± 1.51 |
| bigbench_snarks | 0 | multiple_choice_grade | 69.06 | ± 3.45 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 65.62 | ± 1.51 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 36.90 | ± 1.53 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 22.40 | ± 1.18 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.66 | ± 0.91 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 45.33 | ± 2.88 |

Average: 42.29%

Average score: 54.81%
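
The overall score is consistent with an unweighted mean of the four suite averages reported above. A minimal check, assuming that is how it was derived (the source does not state the formula, and the variable names here are illustrative):

```python
# Suite averages as reported in this document.
suite_averages = {
    "AGIEval": 38.5,
    "GPT4All": 71.64,
    "TruthfulQA": 66.82,
    "Bigbench": 42.29,
}

# ASSUMPTION: each suite counts equally, regardless of how many tasks it contains.
overall = sum(suite_averages.values()) / len(suite_averages)
overall_text = f"{overall:.2f}"
print(overall_text)
```

Because each suite is weighted equally, the 18 Bigbench tasks and the single TruthfulQA task contribute the same share to the final figure.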

Elapsed time: 02:15:29
