
@mlabonne
Created March 1, 2024 15:25
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|-------|--------:|--------:|-----------:|---------:|--------:|
| SauerkrautLM-Gemma-7b | 20.75 | 39.29 | 46.2 | 28.88 | 33.78 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|------|--------:|--------|------:|-------:|
| agieval_aqua_rat | 0 | acc | 13.39 | ± 2.14 |
| | | acc_norm | 14.96 | ± 2.24 |
| agieval_logiqa_en | 0 | acc | 23.20 | ± 1.66 |
| | | acc_norm | 27.96 | ± 1.76 |
| agieval_lsat_ar | 0 | acc | 15.65 | ± 2.40 |
| | | acc_norm | 16.52 | ± 2.45 |
| agieval_lsat_lr | 0 | acc | 14.71 | ± 1.57 |
| | | acc_norm | 20.20 | ± 1.78 |
| agieval_lsat_rc | 0 | acc | 19.70 | ± 2.43 |
| | | acc_norm | 21.19 | ± 2.50 |
| agieval_sat_en | 0 | acc | 23.30 | ± 2.95 |
| | | acc_norm | 18.93 | ± 2.74 |
| agieval_sat_en_without_passage | 0 | acc | 21.36 | ± 2.86 |
| | | acc_norm | 19.90 | ± 2.79 |
| agieval_sat_math | 0 | acc | 28.18 | ± 3.04 |
| | | acc_norm | 26.36 | ± 2.98 |

Average: 20.75%

### GPT4All

| Task | Version | Metric | Value | Stderr |
|------|--------:|--------|------:|-------:|
| arc_challenge | 0 | acc | 20.56 | ± 1.18 |
| | | acc_norm | 24.06 | ± 1.25 |
| arc_easy | 0 | acc | 26.64 | ± 0.91 |
| | | acc_norm | 29.08 | ± 0.93 |
| boolq | 1 | acc | 61.96 | ± 0.85 |
| hellaswag | 0 | acc | 26.16 | ± 0.44 |
| | | acc_norm | 27.88 | ± 0.45 |
| openbookqa | 0 | acc | 15.20 | ± 1.61 |
| | | acc_norm | 28.20 | ± 2.01 |
| piqa | 0 | acc | 55.06 | ± 1.16 |
| | | acc_norm | 52.77 | ± 1.16 |
| winogrande | 0 | acc | 51.07 | ± 1.40 |

Average: 39.29%

### TruthfulQA

| Task | Version | Metric | Value | Stderr |
|------|--------:|--------|------:|-------:|
| truthfulqa_mc | 1 | mc1 | 22.28 | ± 1.46 |
| | | mc2 | 46.20 | ± 1.67 |

Average: 46.2%

### Bigbench

| Task | Version | Metric | Value | Stderr |
|------|--------:|--------|------:|-------:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 52.11 | ± 3.63 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 18.16 | ± 2.01 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 33.72 | ± 2.95 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 4.46 | ± 1.09 |
| | | exact_str_match | 0.00 | ± 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 21.40 | ± 1.84 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 15.00 | ± 1.35 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 37.33 | ± 2.80 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 27.20 | ± 1.99 |
| bigbench_navigate | 0 | multiple_choice_grade | 50.80 | ± 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 19.05 | ± 0.88 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 34.38 | ± 2.25 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 16.83 | ± 1.18 |
| bigbench_snarks | 0 | multiple_choice_grade | 42.54 | ± 3.69 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 49.70 | ± 1.59 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 25.00 | ± 1.37 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 20.08 | ± 1.13 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 14.69 | ± 0.85 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 37.33 | ± 2.80 |

Average: 28.88%

Average score: 33.78%
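The reported averages can be cross-checked from the per-task scores above: each suite average appears to be the unweighted mean of one metric per task (acc_norm where reported, plain acc otherwise, and mc2 for TruthfulQA), and the final score is the mean of the four suite averages. A minimal sketch of that arithmetic (the variable names are mine, not part of any evaluation harness):

```python
# Reproduce the suite and overall averages from the per-task scores above.
# One metric per task: acc_norm where available, acc otherwise, mc2 for TruthfulQA.
agieval = [14.96, 27.96, 16.52, 20.20, 21.19, 18.93, 19.90, 26.36]   # acc_norm
gpt4all = [24.06, 29.08, 61.96, 27.88, 28.20, 52.77, 51.07]          # acc for boolq/winogrande
truthfulqa = [46.20]                                                 # mc2
bigbench = [52.11, 18.16, 33.72, 4.46, 21.40, 15.00, 37.33, 27.20,   # multiple_choice_grade
            50.80, 19.05, 34.38, 16.83, 42.54, 49.70, 25.00, 20.08,
            14.69, 37.33]

def mean(xs):
    return sum(xs) / len(xs)

# Unweighted mean per suite, then the mean of the four suite averages.
suite_avgs = [mean(s) for s in (agieval, gpt4all, truthfulqa, bigbench)]
print([round(a, 2) for a in suite_avgs])  # suite averages, matching the summary table
print(round(mean(suite_avgs), 2))         # overall score
```

The recovered values agree with the table to two decimal places (20.75, 39.29, 46.20, 28.88, and 33.78 overall), which supports the unweighted-mean interpretation.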

Elapsed time: 04:34:07
