@mlabonne
Created March 28, 2024 16:09
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|-------|--------:|--------:|-----------:|---------:|--------:|
| Cerebrum-1.0-7b | 35.25 | 71.93 | 46.99 | 37.43 | 47.90 |
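
As a quick arithmetic check (not part of the harness output), the headline Average is consistent with the unweighted mean of the four suite scores; the averaging rule itself is an assumption, the numbers are copied from the table:

```python
# Sanity check: the overall score appears to be the plain mean of the
# four suite averages reported in the summary table above.
suite_scores = {
    "AGIEval": 35.25,
    "GPT4All": 71.93,
    "TruthfulQA": 46.99,
    "Bigbench": 37.43,
}

average = sum(suite_scores.values()) / len(suite_scores)
print(f"Average: {average:.2f}")  # -> Average: 47.90
```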

### AGIEval

| Task | Version | Metric | Value | Stderr |
|------|--------:|--------|------:|-------:|
| agieval_aqua_rat | 0 | acc | 18.50 | ± 2.44 |
| | | acc_norm | 21.26 | ± 2.57 |
| agieval_logiqa_en | 0 | acc | 30.57 | ± 1.81 |
| | | acc_norm | 32.87 | ± 1.84 |
| agieval_lsat_ar | 0 | acc | 20.00 | ± 2.64 |
| | | acc_norm | 21.30 | ± 2.71 |
| agieval_lsat_lr | 0 | acc | 39.41 | ± 2.17 |
| | | acc_norm | 37.84 | ± 2.15 |
| agieval_lsat_rc | 0 | acc | 49.44 | ± 3.05 |
| | | acc_norm | 41.64 | ± 3.01 |
| agieval_sat_en | 0 | acc | 66.99 | ± 3.28 |
| | | acc_norm | 59.22 | ± 3.43 |
| agieval_sat_en_without_passage | 0 | acc | 44.17 | ± 3.47 |
| | | acc_norm | 37.86 | ± 3.39 |
| agieval_sat_math | 0 | acc | 37.27 | ± 3.27 |
| | | acc_norm | 30.00 | ± 3.10 |

Average: 35.25%
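
This suite average is consistent with the unweighted mean of the eight acc_norm values above (the choice of acc_norm over acc is an inference from the numbers, not stated in the harness output):

```python
# Arithmetic check: mean of the per-task acc_norm values from the table above.
acc_norm = [21.26, 32.87, 21.30, 37.84, 41.64, 59.22, 37.86, 30.00]
print(f"Average: {sum(acc_norm) / len(acc_norm):.2f}")  # -> Average: 35.25
```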

### GPT4All

| Task | Version | Metric | Value | Stderr |
|------|--------:|--------|------:|-------:|
| arc_challenge | 0 | acc | 52.90 | ± 1.46 |
| | | acc_norm | 55.20 | ± 1.45 |
| arc_easy | 0 | acc | 82.07 | ± 0.79 |
| | | acc_norm | 80.68 | ± 0.81 |
| boolq | 1 | acc | 84.04 | ± 0.64 |
| hellaswag | 0 | acc | 62.61 | ± 0.48 |
| | | acc_norm | 82.24 | ± 0.38 |
| openbookqa | 0 | acc | 33.00 | ± 2.10 |
| | | acc_norm | 43.80 | ± 2.22 |
| piqa | 0 | acc | 81.18 | ± 0.91 |
| | | acc_norm | 82.43 | ± 0.89 |
| winogrande | 0 | acc | 75.14 | ± 1.21 |

Average: 71.93%
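
Note that boolq and winogrande report only acc. The suite average is consistent with taking acc_norm where available and falling back to acc otherwise; that fallback rule is an assumption inferred from the arithmetic, not something the harness states:

```python
# Arithmetic check: mean over tasks, using acc_norm where reported
# and acc for boolq and winogrande (which have no acc_norm above).
scores = {
    "arc_challenge": 55.20,  # acc_norm
    "arc_easy": 80.68,       # acc_norm
    "boolq": 84.04,          # acc (no acc_norm reported)
    "hellaswag": 82.24,      # acc_norm
    "openbookqa": 43.80,     # acc_norm
    "piqa": 82.43,           # acc_norm
    "winogrande": 75.14,     # acc (no acc_norm reported)
}
print(f"Average: {sum(scores.values()) / len(scores):.2f}")  # -> Average: 71.93
```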

### TruthfulQA

| Task | Version | Metric | Value | Stderr |
|------|--------:|--------|------:|-------:|
| truthfulqa_mc | 1 | mc1 | 32.56 | ± 1.64 |
| | | mc2 | 46.99 | ± 1.47 |

Average: 46.99%
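
The suite score carried into the summary table is the mc2 value (46.99); mc1 does not enter the average.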

### Bigbench

| Task | Version | Metric | Value | Stderr |
|------|--------:|--------|------:|-------:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 54.74 | ± 3.62 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 67.75 | ± 2.44 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 40.70 | ± 3.06 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 20.06 | ± 2.12 |
| | | exact_str_match | 2.23 | ± 0.78 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 24.40 | ± 1.92 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 17.00 | ± 1.42 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 36.00 | ± 2.78 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 32.80 | ± 2.10 |
| bigbench_navigate | 0 | multiple_choice_grade | 50.40 | ± 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 42.00 | ± 1.10 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 35.04 | ± 2.26 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 16.03 | ± 1.16 |
| bigbench_snarks | 0 | multiple_choice_grade | 54.14 | ± 3.71 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 52.94 | ± 1.59 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 55.00 | ± 1.57 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 22.88 | ± 1.19 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 15.83 | ± 0.87 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 36.00 | ± 2.78 |

Average: 37.43%
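
This average is consistent with the unweighted mean of the 18 multiple_choice_grade values; the exact_str_match metric reported for bigbench_geometric_shapes does not enter the roll-up.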

Average score: 47.90%

Elapsed time: 02:12:07
