| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| una-cybertron-7b-v2-bf16 | 43.29 | 74.98 | 65.32 | 47.45 | 57.76 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| agieval_aqua_rat | 0 | acc | 23.23 | ± 2.65 |
| | | acc_norm | 22.44 | ± 2.62 |
| agieval_logiqa_en | 0 | acc | 36.25 | ± 1.89 |
| | | acc_norm | 36.87 | ± 1.89 |
| agieval_lsat_ar | 0 | acc | 23.48 | ± 2.80 |
| | | acc_norm | 23.91 | ± 2.82 |
| agieval_lsat_lr | 0 | acc | 46.27 | ± 2.21 |
| | | acc_norm | 46.08 | ± 2.21 |
| agieval_lsat_rc | 0 | acc | 58.36 | ± 3.01 |
| | | acc_norm | 57.99 | ± 3.01 |
| agieval_sat_en | 0 | acc | 77.67 | ± 2.91 |
| | | acc_norm | 76.21 | ± 2.97 |
| agieval_sat_en_without_passage | 0 | acc | 45.15 | ± 3.48 |
| | | acc_norm | 44.66 | ± 3.47 |
| agieval_sat_math | 0 | acc | 40.91 | ± 3.32 |
| | | acc_norm | 38.18 | ± 3.28 |

Average: 43.29%
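The exact aggregation used by the evaluation harness isn't shown in this report, but the reported 43.29% matches a simple mean of the per-task `acc_norm` values above. A minimal sketch, assuming that aggregation:

```python
# Per-task acc_norm values from the AGIEval table above.
acc_norm = [22.44, 36.87, 23.91, 46.08, 57.99, 76.21, 44.66, 38.18]

# Unweighted mean across tasks (assumed aggregation method).
average = sum(acc_norm) / len(acc_norm)
print(round(average, 2))  # 43.29
```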

### GPT4All

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| arc_challenge | 0 | acc | 61.35 | ± 1.42 |
| | | acc_norm | 63.23 | ± 1.41 |
| arc_easy | 0 | acc | 84.64 | ± 0.74 |
| | | acc_norm | 80.93 | ± 0.81 |
| boolq | 1 | acc | 86.15 | ± 0.60 |
| hellaswag | 0 | acc | 66.75 | ± 0.47 |
| | | acc_norm | 83.58 | ± 0.37 |
| openbookqa | 0 | acc | 39.00 | ± 2.18 |
| | | acc_norm | 48.60 | ± 2.24 |
| piqa | 0 | acc | 81.72 | ± 0.90 |
| | | acc_norm | 83.46 | ± 0.87 |
| winogrande | 0 | acc | 78.93 | ± 1.15 |

Average: 74.98%

### TruthfulQA

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| truthfulqa_mc | 1 | mc1 | 49.57 | ± 1.75 |
| | | mc2 | 65.32 | ± 1.50 |

Average: 65.32%

### Bigbench

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 60.53 | ± 3.56 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 62.06 | ± 2.53 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 39.15 | ± 3.04 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 32.31 | ± 2.47 |
| | | exact_str_match | 16.71 | ± 1.97 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 34.40 | ± 2.13 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.00 | ± 1.59 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 54.67 | ± 2.88 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 40.60 | ± 2.20 |
| bigbench_navigate | 0 | multiple_choice_grade | 53.90 | ± 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 68.10 | ± 1.04 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 50.22 | ± 2.36 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 34.57 | ± 1.51 |
| bigbench_snarks | 0 | multiple_choice_grade | 76.24 | ± 3.17 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 73.73 | ± 1.40 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 57.40 | ± 1.56 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 21.20 | ± 1.16 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.37 | ± 0.91 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 54.67 | ± 2.88 |

Average: 47.45%

Average score: 57.76%

Elapsed time: 02:33:09
