CultriX-Github/StableBeluga-7B-Nous.md

## StableBeluga-7B-Nous.md

      
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| StableBeluga-7B | 35.36 | 69.37 | 50.09 | 35.77 | 47.65 |


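### AGIEval

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| agieval_aqua_rat | 0 | acc | 25.20 | ± | 2.73 |
| | | acc_norm | 25.59 | ± | 2.74 |
| agieval_logiqa_en | 0 | acc | 32.41 | ± | 1.84 |
| | | acc_norm | 32.87 | ± | 1.84 |
| agieval_lsat_ar | 0 | acc | 19.57 | ± | 2.62 |
| | | acc_norm | 17.83 | ± | 2.53 |
| agieval_lsat_lr | 0 | acc | 33.53 | ± | 2.09 |
| | | acc_norm | 36.08 | ± | 2.13 |
| agieval_lsat_rc | 0 | acc | 43.12 | ± | 3.03 |
| | | acc_norm | 43.12 | ± | 3.03 |
| agieval_sat_en | 0 | acc | 64.56 | ± | 3.34 |
| | | acc_norm | 62.62 | ± | 3.38 |
| agieval_sat_en_without_passage | 0 | acc | 39.81 | ± | 3.42 |
| | | acc_norm | 38.83 | ± | 3.40 |
| agieval_sat_math | 0 | acc | 25.45 | ± | 2.94 |
| | | acc_norm | 25.91 | ± | 2.96 |

Average: 35.36%
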
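### GPT4All

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| arc_challenge | 0 | acc | 49.40 | ± | 1.46 |
| | | acc_norm | 52.30 | ± | 1.46 |
| arc_easy | 0 | acc | 80.22 | ± | 0.82 |
| | | acc_norm | 78.24 | ± | 0.85 |
| boolq | 1 | acc | 82.69 | ± | 0.66 |
| hellaswag | 0 | acc | 58.32 | ± | 0.49 |
| | | acc_norm | 77.13 | ± | 0.42 |
| openbookqa | 0 | acc | 34.80 | ± | 2.13 |
| | | acc_norm | 43.80 | ± | 2.22 |
| piqa | 0 | acc | 78.78 | ± | 0.95 |
| | | acc_norm | 80.03 | ± | 0.93 |
| winogrande | 0 | acc | 71.43 | ± | 1.27 |

Average: 69.37%
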
### TruthfulQA

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| truthfulqa_mc | 1 | mc1 | 34.76 | ± | 1.67 |
| | | mc2 | 50.09 | ± | 1.54 |

Average: 50.09%

### Bigbench

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 58.95 | ± | 3.58 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 64.77 | ± | 2.49 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 34.88 | ± | 2.97 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 0.00 | ± | 0.00 |
| | | exact_str_match | 0.00 | ± | 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 25.60 | ± | 1.95 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 17.57 | ± | 1.44 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 42.00 | ± | 2.85 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 29.40 | ± | 2.04 |
| bigbench_navigate | 0 | multiple_choice_grade | 50.00 | ± | 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 54.60 | ± | 1.11 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 29.91 | ± | 2.17 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 29.66 | ± | 1.45 |
| bigbench_snarks | 0 | multiple_choice_grade | 61.88 | ± | 3.62 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 49.70 | ± | 1.59 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 18.20 | ± | 1.22 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 19.52 | ± | 1.12 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 15.26 | ± | 0.86 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 42.00 | ± | 2.85 |

Average: 35.77%

Average score: 47.65%

Elapsed time: 01:22:36
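
The report does not say how each suite average is derived. The sketch below is one reading that appears to reproduce the reported figures: it assumes the AGIEval and GPT4All averages use `acc_norm` where a task reports it and fall back to `acc` otherwise, that TruthfulQA uses `mc2`, and that Bigbench uses `multiple_choice_grade`. The variable names are illustrative and not taken from the evaluation tooling.

```python
from statistics import mean

# Values copied from the tables above.
# ASSUMPTION: acc_norm is preferred when present, acc otherwise.
agieval_acc_norm = [25.59, 32.87, 17.83, 36.08, 43.12, 62.62, 38.83, 25.91]

gpt4all = {
    "arc_challenge": {"acc": 49.40, "acc_norm": 52.30},
    "arc_easy": {"acc": 80.22, "acc_norm": 78.24},
    "boolq": {"acc": 82.69},
    "hellaswag": {"acc": 58.32, "acc_norm": 77.13},
    "openbookqa": {"acc": 34.80, "acc_norm": 43.80},
    "piqa": {"acc": 78.78, "acc_norm": 80.03},
    "winogrande": {"acc": 71.43},
}

# ASSUMPTION: the TruthfulQA score is the mc2 value alone.
truthfulqa_mc2 = 50.09

# ASSUMPTION: only multiple_choice_grade counts for Bigbench
# (the exact_str_match row for geometric_shapes is ignored).
bigbench_mc_grade = [
    58.95, 64.77, 34.88, 0.00, 25.60, 17.57, 42.00, 29.40, 50.00,
    54.60, 29.91, 29.66, 61.88, 49.70, 18.20, 19.52, 15.26, 42.00,
]


def preferred_score(metrics):
    """Return acc_norm when the task reports it, otherwise acc."""
    return metrics.get("acc_norm", metrics["acc"])


averages = {
    "AGIEval": mean(agieval_acc_norm),
    "GPT4All": mean(preferred_score(m) for m in gpt4all.values()),
    "TruthfulQA": truthfulqa_mc2,
    "Bigbench": mean(bigbench_mc_grade),
}
# The overall score is the unweighted mean of the four suite averages.
overall = mean(list(averages.values()))

for suite, value in averages.items():
    print(f"{suite}: {value:.2f}")
print(f"Average score: {overall:.2f}")
```

Under these assumptions the script prints 35.36, 69.37, 50.09, and 35.77 for the four suites and 47.65 overall, matching the summary table. Each suite carries equal weight in the overall score regardless of how many tasks it contains.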