@tosh
Created April 5, 2024 22:43
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| pandafish-3-7B-32k | 40.85 | 73.57 | 56.3 | 42.17 | 53.22 |

AGIEval

| Task | Version | Metric | Value (%) | Stderr |
|---|---:|---|---:|---|
| agieval_aqua_rat | 0 | acc | 20.47 | ± 2.54 |
| | | acc_norm | 20.87 | ± 2.55 |
| agieval_logiqa_en | 0 | acc | 34.10 | ± 1.86 |
| | | acc_norm | 36.41 | ± 1.89 |
| agieval_lsat_ar | 0 | acc | 23.04 | ± 2.78 |
| | | acc_norm | 23.91 | ± 2.82 |
| agieval_lsat_lr | 0 | acc | 39.02 | ± 2.16 |
| | | acc_norm | 40.78 | ± 2.18 |
| agieval_lsat_rc | 0 | acc | 55.76 | ± 3.03 |
| | | acc_norm | 53.90 | ± 3.04 |
| agieval_sat_en | 0 | acc | 73.79 | ± 3.07 |
| | | acc_norm | 71.36 | ± 3.16 |
| agieval_sat_en_without_passage | 0 | acc | 46.12 | ± 3.48 |
| | | acc_norm | 43.69 | ± 3.46 |
| agieval_sat_math | 0 | acc | 40.91 | ± 3.32 |
| | | acc_norm | 35.91 | ± 3.24 |

Average: 40.85%
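The gist does not say how this average is derived. The arithmetic below is a minimal sketch, assuming the figure is the unweighted mean of the eight `acc_norm` values in the AGIEval table; that assumption is inferred from the numbers, and the variable names are illustrative only.

```python
# acc_norm values copied from the AGIEval table above.
agieval_acc_norm = {
    "agieval_aqua_rat": 20.87,
    "agieval_logiqa_en": 36.41,
    "agieval_lsat_ar": 23.91,
    "agieval_lsat_lr": 40.78,
    "agieval_lsat_rc": 53.90,
    "agieval_sat_en": 71.36,
    "agieval_sat_en_without_passage": 43.69,
    "agieval_sat_math": 35.91,
}

# Assumption: every task carries equal weight, regardless of its size.
agieval_average = sum(agieval_acc_norm.values()) / len(agieval_acc_norm)

print(f"AGIEval average: {agieval_average:.2f}%")
```

Under that assumption the result rounds to the reported 40.85%.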

GPT4All

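| Task | Version | Metric | Value (%) | Stderr |
|---|---:|---|---:|---|
| arc_challenge | 0 | acc | 55.63 | ± 1.45 |
| | | acc_norm | 58.28 | ± 1.44 |
| arc_easy | 0 | acc | 84.22 | ± 0.75 |
| | | acc_norm | 82.15 | ± 0.79 |
| boolq | 1 | acc | 86.33 | ± 0.60 |
| hellaswag | 0 | acc | 64.22 | ± 0.48 |
| | | acc_norm | 82.78 | ± 0.38 |
| openbookqa | 0 | acc | 35.60 | ± 2.14 |
| | | acc_norm | 47.00 | ± 2.23 |
| piqa | 0 | acc | 81.99 | ± 0.90 |
| | | acc_norm | 83.24 | ± 0.87 |
| winogrande | 0 | acc | 75.22 | ± 1.21 |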

Average: 73.57%
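Two GPT4All tasks (boolq and winogrande) report only `acc`, so the average cannot be a plain mean of `acc_norm`. The sketch below assumes each task contributes `acc_norm` when it is reported and `acc` otherwise; this rule is inferred from the numbers rather than stated in the gist, and the `headline` helper is hypothetical.

```python
# Metrics copied from the GPT4All table above.
gpt4all = {
    "arc_challenge": {"acc": 55.63, "acc_norm": 58.28},
    "arc_easy": {"acc": 84.22, "acc_norm": 82.15},
    "boolq": {"acc": 86.33},
    "hellaswag": {"acc": 64.22, "acc_norm": 82.78},
    "openbookqa": {"acc": 35.60, "acc_norm": 47.00},
    "piqa": {"acc": 81.99, "acc_norm": 83.24},
    "winogrande": {"acc": 75.22},
}


def headline(metrics: dict) -> float:
    """Assumed rule: prefer acc_norm, fall back to acc when it is missing."""
    return metrics["acc_norm"] if "acc_norm" in metrics else metrics["acc"]


gpt4all_average = sum(headline(m) for m in gpt4all.values()) / len(gpt4all)

print(f"GPT4All average: {gpt4all_average:.2f}%")
```

With this fallback rule the mean over the seven tasks rounds to the reported 73.57%.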

TruthfulQA

| Task | Version | Metric | Value (%) | Stderr |
|---|---:|---|---:|---|
| truthfulqa_mc | 1 | mc1 | 39.53 | ± 1.71 |
| | | mc2 | 56.30 | ± 1.53 |

Average: 56.3%

Bigbench

| Task | Version | Metric | Value (%) | Stderr |
|---|---:|---|---:|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 55.26 | ± 3.62 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 68.29 | ± 2.43 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 47.29 | ± 3.11 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 20.06 | ± 2.12 |
| | | exact_str_match | 1.67 | ± 0.68 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 26.80 | ± 1.98 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 19.86 | ± 1.51 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 45.33 | ± 2.88 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 32.20 | ± 2.09 |
| bigbench_navigate | 0 | multiple_choice_grade | 51.00 | ± 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 64.75 | ± 1.07 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 48.88 | ± 2.36 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 25.45 | ± 1.38 |
| bigbench_snarks | 0 | multiple_choice_grade | 67.96 | ± 3.48 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 62.17 | ± 1.55 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 38.00 | ± 1.54 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 23.12 | ± 1.19 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.26 | ± 0.90 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 45.33 | ± 2.88 |

Average: 42.17%
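The Bigbench figure appears to count only `multiple_choice_grade`, leaving out the `exact_str_match` row reported for bigbench_geometric_shapes. The sketch below checks that reading; the exclusion is an assumption inferred from the numbers, not something the gist states.

```python
# (task suffix, metric, value) rows copied from the Bigbench table above.
bigbench_rows = [
    ("causal_judgement", "multiple_choice_grade", 55.26),
    ("date_understanding", "multiple_choice_grade", 68.29),
    ("disambiguation_qa", "multiple_choice_grade", 47.29),
    ("geometric_shapes", "multiple_choice_grade", 20.06),
    ("geometric_shapes", "exact_str_match", 1.67),
    ("logical_deduction_five_objects", "multiple_choice_grade", 26.80),
    ("logical_deduction_seven_objects", "multiple_choice_grade", 19.86),
    ("logical_deduction_three_objects", "multiple_choice_grade", 45.33),
    ("movie_recommendation", "multiple_choice_grade", 32.20),
    ("navigate", "multiple_choice_grade", 51.00),
    ("reasoning_about_colored_objects", "multiple_choice_grade", 64.75),
    ("ruin_names", "multiple_choice_grade", 48.88),
    ("salient_translation_error_detection", "multiple_choice_grade", 25.45),
    ("snarks", "multiple_choice_grade", 67.96),
    ("sports_understanding", "multiple_choice_grade", 62.17),
    ("temporal_sequences", "multiple_choice_grade", 38.00),
    ("tracking_shuffled_objects_five_objects", "multiple_choice_grade", 23.12),
    ("tracking_shuffled_objects_seven_objects", "multiple_choice_grade", 17.26),
    ("tracking_shuffled_objects_three_objects", "multiple_choice_grade", 45.33),
]

# Assumption: only multiple_choice_grade rows enter the average.
mc_grades = [value for _, metric, value in bigbench_rows
             if metric == "multiple_choice_grade"]
bigbench_average = sum(mc_grades) / len(mc_grades)

print(f"Bigbench average: {bigbench_average:.2f}%")
```

Averaging the 18 `multiple_choice_grade` values this way rounds to the reported 42.17%; including the `exact_str_match` row would not.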

Average score: 53.22%
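The overall score matches the unweighted mean of the four suite averages reported in this gist. A quick check, with illustrative variable names:

```python
# Suite averages as reported in this gist.
suite_averages = {
    "AGIEval": 40.85,
    "GPT4All": 73.57,
    "TruthfulQA": 56.30,
    "Bigbench": 42.17,
}

# Each suite counts equally, whatever its number of tasks.
overall_average = sum(suite_averages.values()) / len(suite_averages)

print(f"Average score: {overall_average:.2f}%")
```

Because each suite counts equally, the single TruthfulQA task weighs as much as all 18 Bigbench tasks combined.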

Elapsed time: 03:15:23
