@shamanez
Created February 28, 2024 06:01

|Model         |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|--------------|------:|------:|---------:|-------:|------:|
|gemma-7b-slerp|  23.86|  36.55|     46.22|   29.94|  34.14|
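
The Average column is consistent with an unweighted mean of the four benchmark scores. A minimal sketch verifying the arithmetic (values copied from the table above; the dict is illustrative only):

```python
# Overall average as the unweighted mean of the four benchmark
# averages reported in this gist.
scores = {"AGIEval": 23.86, "GPT4All": 36.55, "TruthfulQA": 46.22, "Bigbench": 29.94}
print(f"{sum(scores.values()) / len(scores):.2f}")  # -> 34.14
```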

AGIEval

|Task                          |Version|Metric  |Value|   |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |22.83|±  |  2.64|
|                              |       |acc_norm|23.23|±  |  2.65|
|agieval_logiqa_en             |      0|acc     |21.97|±  |  1.62|
|                              |       |acc_norm|26.42|±  |  1.73|
|agieval_lsat_ar               |      0|acc     |24.78|±  |  2.85|
|                              |       |acc_norm|26.96|±  |  2.93|
|agieval_lsat_lr               |      0|acc     |16.86|±  |  1.66|
|                              |       |acc_norm|20.00|±  |  1.77|
|agieval_lsat_rc               |      0|acc     |24.16|±  |  2.61|
|                              |       |acc_norm|21.93|±  |  2.53|
|agieval_sat_en                |      0|acc     |24.76|±  |  3.01|
|                              |       |acc_norm|29.13|±  |  3.17|
|agieval_sat_en_without_passage|      0|acc     |20.39|±  |  2.81|
|                              |       |acc_norm|21.36|±  |  2.86|
|agieval_sat_math              |      0|acc     |22.73|±  |  2.83|
|                              |       |acc_norm|21.82|±  |  2.79|

Average: 23.86%
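
Each per-benchmark average appears to be the mean of the task scores above, taking acc_norm where it is reported and acc otherwise; for AGIEval this reproduces the figure exactly. A sketch, assuming that convention:

```python
# AGIEval average as the mean of the eight acc_norm values above.
acc_norm = [23.23, 26.42, 26.96, 20.00, 21.93, 29.13, 21.36, 21.82]
print(f"{sum(acc_norm) / len(acc_norm):.2f}%")  # -> 23.86%
```

Applied to the tables below, the same rule matches the GPT4All and Bigbench averages up to rounding of the displayed values, and the TruthfulQA average is simply its mc2 score.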

GPT4All

|Task         |Version|Metric  |Value|   |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge|      0|acc     |20.22|±  |  1.17|
|             |       |acc_norm|23.55|±  |  1.24|
|arc_easy     |      0|acc     |29.55|±  |  0.94|
|             |       |acc_norm|30.30|±  |  0.94|
|boolq        |      1|acc     |40.06|±  |  0.86|
|hellaswag    |      0|acc     |27.00|±  |  0.44|
|             |       |acc_norm|28.70|±  |  0.45|
|openbookqa   |      0|acc     |18.20|±  |  1.73|
|             |       |acc_norm|30.80|±  |  2.07|
|piqa         |      0|acc     |55.22|±  |  1.16|
|             |       |acc_norm|53.86|±  |  1.16|
|winogrande   |      0|acc     |48.54|±  |  1.40|

Average: 36.55%

TruthfulQA

|Task         |Version|Metric|Value|   |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc|      1|mc1   |25.34|±  |  1.52|
|             |       |mc2   |46.22|±  |  1.67|

Average: 46.22%

Bigbench

|Task                                            |Version|Metric               |Value|   |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                       |      0|multiple_choice_grade|47.89|±  |  3.63|
|bigbench_date_understanding                     |      0|multiple_choice_grade|23.31|±  |  2.20|
|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|37.60|±  |  3.02|
|bigbench_geometric_shapes                       |      0|multiple_choice_grade|10.03|±  |  1.59|
|                                                |       |exact_str_match      | 0.00|±  |  0.00|
|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|24.20|±  |  1.92|
|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|17.43|±  |  1.43|
|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|35.00|±  |  2.76|
|bigbench_movie_recommendation                   |      0|multiple_choice_grade|26.00|±  |  1.96|
|bigbench_navigate                               |      0|multiple_choice_grade|50.40|±  |  1.58|
|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|23.25|±  |  0.94|
|bigbench_ruin_names                             |      0|multiple_choice_grade|37.50|±  |  2.29|
|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|12.22|±  |  1.04|
|bigbench_snarks                                 |      0|multiple_choice_grade|54.70|±  |  3.71|
|bigbench_sports_understanding                   |      0|multiple_choice_grade|49.70|±  |  1.59|
|bigbench_temporal_sequences                     |      0|multiple_choice_grade|20.30|±  |  1.27|
|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|20.16|±  |  1.14|
|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|14.17|±  |  0.83|
|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|35.00|±  |  2.76|

Average: 29.94%

Average score: 34.14%

Elapsed time: 04:11:34
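
The Task/Version/Metric/Value/Stderr layout matches the output of EleutherAI's lm-evaluation-harness. A sketch of how a comparable run might be launched from Python (assumed API surface of the older harness; the model path is hypothetical, and whether your harness version or fork exposes these task names varies):

```python
# Sketch: driving lm-evaluation-harness from Python. "hf-causal" and
# simple_evaluate() follow the older harness API; adjust for your version.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                       # assumed model-type string
    model_args="pretrained=gemma-7b-slerp",  # hypothetical model id/path
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc"],
)
print(results["results"])  # per-task metrics like the tables above
```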
