@mlabonne
Created January 11, 2024 03:38
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| phi-2-instruct | 25.8 | 67.93 | 44.82 | 36.88 | 43.86 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 16.14 | ± 2.31 |
| | | acc_norm | 18.90 | ± 2.46 |
| agieval_logiqa_en | 0 | acc | 26.88 | ± 1.74 |
| | | acc_norm | 30.88 | ± 1.81 |
| agieval_lsat_ar | 0 | acc | 16.52 | ± 2.45 |
| | | acc_norm | 18.26 | ± 2.55 |
| agieval_lsat_lr | 0 | acc | 29.41 | ± 2.02 |
| | | acc_norm | 27.45 | ± 1.98 |
| agieval_lsat_rc | 0 | acc | 28.25 | ± 2.75 |
| | | acc_norm | 23.42 | ± 2.59 |
| agieval_sat_en | 0 | acc | 39.81 | ± 3.42 |
| | | acc_norm | 33.01 | ± 3.28 |
| agieval_sat_en_without_passage | 0 | acc | 34.95 | ± 3.33 |
| | | acc_norm | 28.16 | ± 3.14 |
| agieval_sat_math | 0 | acc | 25.45 | ± 2.94 |
| | | acc_norm | 26.36 | ± 2.98 |

Average: 25.8%
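For reference, the reported 25.8% matches the unweighted mean of the eight `acc_norm` scores in the table (not the plain `acc` scores, which average to about 27.2). A minimal check, with the values copied from the table:

```python
# AGIEval acc_norm scores copied from the table above (percent).
agieval_acc_norm = [18.90, 30.88, 18.26, 27.45, 23.42, 33.01, 28.16, 26.36]

# Unweighted mean across the eight subtasks; rounds to the reported 25.8.
average = sum(agieval_acc_norm) / len(agieval_acc_norm)
print(f"{average:.1f}")
```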

### GPT4All

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| arc_challenge | 0 | acc | 48.98 | ± 1.46 |
| | | acc_norm | 49.66 | ± 1.46 |
| arc_easy | 0 | acc | 79.12 | ± 0.83 |
| | | acc_norm | 71.80 | ± 0.92 |
| boolq | 1 | acc | 83.46 | ± 0.65 |
| hellaswag | 0 | acc | 52.70 | ± 0.50 |
| | | acc_norm | 71.10 | ± 0.45 |
| openbookqa | 0 | acc | 33.60 | ± 2.11 |
| | | acc_norm | 43.80 | ± 2.22 |
| piqa | 0 | acc | 79.71 | ± 0.94 |
| | | acc_norm | 78.94 | ± 0.95 |
| winogrande | 0 | acc | 76.72 | ± 1.19 |

Average: 67.93%

### TruthfulQA

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 31.58 | ± 1.63 |
| | | mc2 | 44.82 | ± 1.51 |

Average: 44.82%

### Bigbench

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 57.37 | ± 3.60 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 58.27 | ± 2.57 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 40.70 | ± 3.06 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 7.52 | ± 1.39 |
| | | exact_str_match | 0.00 | ± 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 23.80 | ± 1.91 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 17.86 | ± 1.45 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 43.33 | ± 2.87 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 33.20 | ± 2.11 |
| bigbench_navigate | 0 | multiple_choice_grade | 49.40 | ± 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 61.15 | ± 1.09 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 27.23 | ± 2.11 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 24.95 | ± 1.37 |
| bigbench_snarks | 0 | multiple_choice_grade | 72.38 | ± 3.33 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 49.70 | ± 1.59 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 21.50 | ± 1.30 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 18.96 | ± 1.11 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 13.26 | ± 0.81 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 43.33 | ± 2.87 |

Average: 36.88%

Average score: 43.86%
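The final score is the unweighted mean of the four benchmark averages. A quick sketch to verify the arithmetic, using the figures reported above:

```python
# Per-benchmark averages from the tables above (percent).
scores = {"AGIEval": 25.8, "GPT4All": 67.93, "TruthfulQA": 44.82, "Bigbench": 36.88}

# Overall score: unweighted mean of the four benchmark averages.
overall = sum(scores.values()) / len(scores)
print(f"{overall:.2f}")  # 43.86
```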

Elapsed time: 02:02:47
