Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
---|---|---|---|---|---|
stablelm-zephyr-3b | 34.04 | 62.07 | 46.46 | 35.11 | 44.42 |

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 21.26 | ± | 2.57 |
| | | acc_norm | 19.29 | ± | 2.48 |
| agieval_logiqa_en | 0 | acc | 29.65 | ± | 1.79 |
| | | acc_norm | 31.95 | ± | 1.83 |
| agieval_lsat_ar | 0 | acc | 20.43 | ± | 2.66 |
| | | acc_norm | 20.87 | ± | 2.69 |
| agieval_lsat_lr | 0 | acc | 32.16 | ± | 2.07 |
| | | acc_norm | 32.94 | ± | 2.08 |
| agieval_lsat_rc | 0 | acc | 43.12 | ± | 3.03 |
| | | acc_norm | 42.01 | ± | 3.01 |
| agieval_sat_en | 0 | acc | 62.14 | ± | 3.39 |
| | | acc_norm | 61.65 | ± | 3.40 |
| agieval_sat_en_without_passage | 0 | acc | 33.98 | ± | 3.31 |
| | | acc_norm | 34.95 | ± | 3.33 |
| agieval_sat_math | 0 | acc | 32.27 | ± | 3.16 |
| | | acc_norm | 28.64 | ± | 3.05 |

Average: 34.04%
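
The 34.04% figure is consistent with the unweighted mean of the eight `acc_norm` values in the table. The sketch below reproduces that arithmetic; the aggregation rule is inferred from the numbers rather than stated in the results, and the names `agieval_acc_norm` and `mean_score` are hypothetical.

```python
# acc_norm values (percent) copied from the AGIEval table.
agieval_acc_norm = {
    "agieval_aqua_rat": 19.29,
    "agieval_logiqa_en": 31.95,
    "agieval_lsat_ar": 20.87,
    "agieval_lsat_lr": 32.94,
    "agieval_lsat_rc": 42.01,
    "agieval_sat_en": 61.65,
    "agieval_sat_en_without_passage": 34.95,
    "agieval_sat_math": 28.64,
}


def mean_score(scores):
    """Unweighted mean of a task -> score mapping, rounded to two decimals."""
    return round(sum(scores.values()) / len(scores), 2)


agieval_average = mean_score(agieval_acc_norm)
print(f"Average: {agieval_average:.2f}%")
```

Each task counts equally here regardless of how many questions it contains, which is an assumption that happens to match the reported value.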

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| arc_challenge | 0 | acc | 40.36 | ± | 1.43 |
| | | acc_norm | 42.24 | ± | 1.44 |
| arc_easy | 0 | acc | 68.60 | ± | 0.95 |
| | | acc_norm | 61.32 | ± | 1.00 |
| boolq | 1 | acc | 82.32 | ± | 0.67 |
| hellaswag | 0 | acc | 54.83 | ± | 0.50 |
| | | acc_norm | 71.13 | ± | 0.45 |
| openbookqa | 0 | acc | 28.80 | ± | 2.03 |
| | | acc_norm | 36.60 | ± | 2.16 |
| piqa | 0 | acc | 75.79 | ± | 1.00 |
| | | acc_norm | 76.61 | ± | 0.99 |
| winogrande | 0 | acc | 64.25 | ± | 1.35 |

Average: 62.07%
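
Two tasks in this suite (boolq and winogrande) report only `acc`. The 62.07% figure is consistent with averaging `acc_norm` where it exists and falling back to `acc` otherwise. The following sketch shows that rule under this assumption; the rule is inferred from the numbers, and `gpt4all`, `preferred_metric`, and `gpt4all_average` are hypothetical names.

```python
# Scores (percent) copied from the GPT4All table.
gpt4all = {
    "arc_challenge": {"acc": 40.36, "acc_norm": 42.24},
    "arc_easy": {"acc": 68.60, "acc_norm": 61.32},
    "boolq": {"acc": 82.32},
    "hellaswag": {"acc": 54.83, "acc_norm": 71.13},
    "openbookqa": {"acc": 28.80, "acc_norm": 36.60},
    "piqa": {"acc": 75.79, "acc_norm": 76.61},
    "winogrande": {"acc": 64.25},
}


def preferred_metric(metrics):
    """Return acc_norm when the task reports it, otherwise plain acc."""
    if "acc_norm" in metrics:
        return metrics["acc_norm"]
    return metrics["acc"]


selected = [preferred_metric(m) for m in gpt4all.values()]
gpt4all_average = round(sum(selected) / len(selected), 2)
print(f"Average: {gpt4all_average:.2f}%")
```

Averaging plain `acc` for every task would give a different number, so the choice of metric matters when comparing against other reports.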

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 31.21 | ± | 1.62 |
| | | mc2 | 46.46 | ± | 1.63 |

Average: 46.46%
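
The TruthfulQA average equals the `mc2` value. To get a feel for how much uncertainty the `± Stderr` column implies, one common reading is a normal-approximation 95% interval of value ± 1.96 × stderr. That interpretation is an assumption added here, not something the results state, and `confidence_interval` is a hypothetical helper.

```python
def confidence_interval(value, stderr, z=1.96):
    """Normal-approximation interval: value +/- z * stderr, rounded to 2 decimals."""
    half_width = z * stderr
    return (round(value - half_width, 2), round(value + half_width, 2))


# mc2 score and standard error (percent) copied from the TruthfulQA table.
mc2_low, mc2_high = confidence_interval(46.46, 1.63)
print(f"mc2 95% interval: {mc2_low:.2f}% to {mc2_high:.2f}%")
```

Under this approximation, differences of a point or two between models on this task would fall within the noise.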

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 57.89 | ± | 3.59 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 43.36 | ± | 2.58 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 32.17 | ± | 2.91 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 16.71 | ± | 1.97 |
| | | exact_str_match | 0.00 | ± | 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 24.80 | ± | 1.93 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 19.00 | ± | 1.48 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 40.00 | ± | 2.83 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 26.20 | ± | 1.97 |
| bigbench_navigate | 0 | multiple_choice_grade | 60.20 | ± | 1.55 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 43.75 | ± | 1.11 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 39.96 | ± | 2.32 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 21.74 | ± | 1.31 |
| bigbench_snarks | 0 | multiple_choice_grade | 63.54 | ± | 3.59 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 51.01 | ± | 1.59 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 18.50 | ± | 1.23 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 19.76 | ± | 1.13 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 13.43 | ± | 0.82 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 40.00 | ± | 2.83 |

Average: 35.11%

Average score: 44.42%

Elapsed time: 02:21:36
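
The overall score matches the unweighted mean of the four suite averages reported above. A minimal sketch of that check, assuming each suite is weighted equally (the variable names are hypothetical):

```python
# Suite averages (percent) copied from the summary row for stablelm-zephyr-3b.
suite_averages = {
    "AGIEval": 34.04,
    "GPT4All": 62.07,
    "TruthfulQA": 46.46,
    "Bigbench": 35.11,
}

# Unweighted mean across suites, rounded to two decimals.
overall_average = round(sum(suite_averages.values()) / len(suite_averages), 2)
print(f"Average score: {overall_average:.2f}%")
```

Because the mean is taken over suites rather than over individual tasks, the single TruthfulQA task carries the same weight as the eighteen Bigbench tasks combined.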