
@mlabonne
Created January 24, 2024 16:56
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| Darewin-7B-v2 | 37.67 | 73.16 | 49.5 | 41.01 | 50.33 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---|
| agieval_aqua_rat | 0 | acc | 21.26 | ± 2.57 |
| | | acc_norm | 20.08 | ± 2.52 |
| agieval_logiqa_en | 0 | acc | 32.41 | ± 1.84 |
| | | acc_norm | 33.64 | ± 1.85 |
| agieval_lsat_ar | 0 | acc | 18.70 | ± 2.58 |
| | | acc_norm | 19.13 | ± 2.60 |
| agieval_lsat_lr | 0 | acc | 42.94 | ± 2.19 |
| | | acc_norm | 43.33 | ± 2.20 |
| agieval_lsat_rc | 0 | acc | 46.47 | ± 3.05 |
| | | acc_norm | 46.84 | ± 3.05 |
| agieval_sat_en | 0 | acc | 53.88 | ± 3.48 |
| | | acc_norm | 53.40 | ± 3.48 |
| agieval_sat_en_without_passage | 0 | acc | 49.51 | ± 3.49 |
| | | acc_norm | 49.03 | ± 3.49 |
| agieval_sat_math | 0 | acc | 39.09 | ± 3.30 |
| | | acc_norm | 35.91 | ± 3.24 |

Average: 37.67%
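The suite average appears to be the unweighted mean of one metric per task — for AGIEval, the `acc_norm` column. That is an inference from the numbers above, not something the harness output states; a minimal check:

```python
# acc_norm values copied from the AGIEval table above
acc_norm = [20.08, 33.64, 19.13, 43.33, 46.84, 53.40, 49.03, 35.91]

# unweighted mean across the eight AGIEval subtasks
average = sum(acc_norm) / len(acc_norm)
print(f"{average:.2f}")  # → 37.67
```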

### GPT4All

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---|
| arc_challenge | 0 | acc | 57.94 | ± 1.44 |
| | | acc_norm | 61.77 | ± 1.42 |
| arc_easy | 0 | acc | 83.88 | ± 0.75 |
| | | acc_norm | 81.65 | ± 0.79 |
| boolq | 1 | acc | 84.59 | ± 0.63 |
| hellaswag | 0 | acc | 62.47 | ± 0.48 |
| | | acc_norm | 81.59 | ± 0.39 |
| openbookqa | 0 | acc | 34.80 | ± 2.13 |
| | | acc_norm | 46.20 | ± 2.23 |
| piqa | 0 | acc | 81.72 | ± 0.90 |
| | | acc_norm | 83.03 | ± 0.88 |
| winogrande | 0 | acc | 73.32 | ± 1.24 |

Average: 73.16%

### TruthfulQA

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---|
| truthfulqa_mc | 1 | mc1 | 34.76 | ± 1.67 |
| | | mc2 | 49.50 | ± 1.52 |

Average: 49.5%

### Bigbench

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 58.42 | ± 3.59 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 64.50 | ± 2.49 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 34.88 | ± 2.97 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 34.26 | ± 2.51 |
| | | exact_str_match | 24.79 | ± 2.28 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 24.40 | ± 1.92 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 17.29 | ± 1.43 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 42.67 | ± 2.86 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 36.20 | ± 2.15 |
| bigbench_navigate | 0 | multiple_choice_grade | 50.10 | ± 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 59.35 | ± 1.10 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 37.50 | ± 2.29 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 28.16 | ± 1.42 |
| bigbench_snarks | 0 | multiple_choice_grade | 60.77 | ± 3.64 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 66.02 | ± 1.51 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 43.70 | ± 1.57 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 21.84 | ± 1.17 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 15.37 | ± 0.86 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 42.67 | ± 2.86 |

Average: 41.01%

Average score: 50.33%
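The overall score appears to be the unweighted mean of the four suite averages — again an inference from the numbers, not something stated in the run output:

```python
# per-suite averages copied from the summary table above
suite_scores = {"AGIEval": 37.67, "GPT4All": 73.16,
                "TruthfulQA": 49.5, "Bigbench": 41.01}

# unweighted mean across the four suites
overall = sum(suite_scores.values()) / len(suite_scores)
print(overall)  # 50.335, consistent with the 50.33 reported above
```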

Elapsed time: 02:31:28
