mlabonne/Einstein-v4-7B-Nous.md

## Einstein-v4-7B-Nous.md

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| Einstein-v4-7B | 37.83 | 67.52 | 55.56 | 38.78 | 49.92 |


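### AGIEval

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| agieval_aqua_rat | 0 | acc | 23.62 | ± | 2.67 |
| | | acc_norm | 22.83 | ± | 2.64 |
| agieval_logiqa_en | 0 | acc | 37.33 | ± | 1.90 |
| | | acc_norm | 37.79 | ± | 1.90 |
| agieval_lsat_ar | 0 | acc | 22.17 | ± | 2.75 |
| | | acc_norm | 20.00 | ± | 2.64 |
| agieval_lsat_lr | 0 | acc | 42.35 | ± | 2.19 |
| | | acc_norm | 41.37 | ± | 2.18 |
| agieval_lsat_rc | 0 | acc | 55.76 | ± | 3.03 |
| | | acc_norm | 49.44 | ± | 3.05 |
| agieval_sat_en | 0 | acc | 66.02 | ± | 3.31 |
| | | acc_norm | 66.99 | ± | 3.28 |
| agieval_sat_en_without_passage | 0 | acc | 39.81 | ± | 3.42 |
| | | acc_norm | 37.86 | ± | 3.39 |
| agieval_sat_math | 0 | acc | 30.91 | ± | 3.12 |
| | | acc_norm | 26.36 | ± | 2.98 |

Average: 37.83%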
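
The AGIEval average appears to be the unweighted mean of the eight `acc_norm` values. That rule is inferred from the numbers, not stated in this report, so treat the following as a minimal sketch under that assumption. The helper name `suite_average` is hypothetical.

```python
# acc_norm values copied from the AGIEval results in this report.
agieval_acc_norm = {
    "agieval_aqua_rat": 22.83,
    "agieval_logiqa_en": 37.79,
    "agieval_lsat_ar": 20.00,
    "agieval_lsat_lr": 41.37,
    "agieval_lsat_rc": 49.44,
    "agieval_sat_en": 66.99,
    "agieval_sat_en_without_passage": 37.86,
    "agieval_sat_math": 26.36,
}


def suite_average(scores):
    """Unweighted mean of per-task scores (assumed aggregation rule)."""
    return sum(scores.values()) / len(scores)


agieval_average = suite_average(agieval_acc_norm)
print(f"AGIEval average: {agieval_average:.2f}%")  # prints "AGIEval average: 37.83%"
```

Averaging the `acc` column instead gives about 39.75, so the reported figure is consistent only with `acc_norm`.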
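### GPT4All

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| arc_challenge | 0 | acc | 51.79 | ± | 1.46 |
| | | acc_norm | 54.18 | ± | 1.46 |
| arc_easy | 0 | acc | 78.87 | ± | 0.84 |
| | | acc_norm | 75.42 | ± | 0.88 |
| boolq | 1 | acc | 84.28 | ± | 0.64 |
| hellaswag | 0 | acc | 58.36 | ± | 0.49 |
| | | acc_norm | 75.89 | ± | 0.43 |
| openbookqa | 0 | acc | 26.20 | ± | 1.97 |
| | | acc_norm | 37.20 | ± | 2.16 |
| piqa | 0 | acc | 77.75 | ± | 0.97 |
| | | acc_norm | 78.84 | ± | 0.95 |
| winogrande | 0 | acc | 66.85 | ± | 1.32 |

Average: 67.52%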
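
Two GPT4All tasks (`boolq`, `winogrande`) report only `acc`. The 67.52% figure is reproduced by taking `acc_norm` where it exists and falling back to `acc` otherwise. This fallback rule is an assumption inferred from the numbers, and `headline_metric` is a hypothetical helper name.

```python
# Per-task metrics copied from the GPT4All results in this report.
gpt4all_scores = {
    "arc_challenge": {"acc": 51.79, "acc_norm": 54.18},
    "arc_easy": {"acc": 78.87, "acc_norm": 75.42},
    "boolq": {"acc": 84.28},
    "hellaswag": {"acc": 58.36, "acc_norm": 75.89},
    "openbookqa": {"acc": 26.20, "acc_norm": 37.20},
    "piqa": {"acc": 77.75, "acc_norm": 78.84},
    "winogrande": {"acc": 66.85},
}


def headline_metric(metrics):
    """Prefer acc_norm; fall back to acc when acc_norm is not reported."""
    return metrics.get("acc_norm", metrics["acc"])


headline = [headline_metric(m) for m in gpt4all_scores.values()]
gpt4all_average = sum(headline) / len(headline)
print(f"GPT4All average: {gpt4all_average:.2f}%")  # prints "GPT4All average: 67.52%"
```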
### TruthfulQA

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| truthfulqa_mc | 1 | mc1 | 37.70 | ± | 1.70 |
| | | mc2 | 55.56 | ± | 1.54 |

Average: 55.56%
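
The TruthfulQA average equals the `mc2` value. To read the Stderr column as an uncertainty range, one common approach is a normal-approximation 95% interval, value ± 1.96 × stderr. The report itself does not state intervals, so the sketch below is an illustrative assumption, and `confidence_interval` is a hypothetical helper.

```python
def confidence_interval(value, stderr, z=1.96):
    """Normal-approximation interval: value +/- z * stderr (z=1.96 for ~95%)."""
    margin = z * stderr
    return value - margin, value + margin


# mc2 value and standard error copied from the TruthfulQA results in this report.
mc2_low, mc2_high = confidence_interval(55.56, 1.54)
print(f"mc2 95% interval: {mc2_low:.2f} to {mc2_high:.2f}")
```

The same helper applies to any row in the other tables, since every row reports a value and a standard error on the same percentage scale.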
### Bigbench

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 58.42 | ± | 3.59 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 60.98 | ± | 2.54 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 37.60 | ± | 3.02 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 15.32 | ± | 1.90 |
| | | exact_str_match | 0.00 | ± | 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 29.20 | ± | 2.04 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 20.29 | ± | 1.52 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 47.33 | ± | 2.89 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 32.20 | ± | 2.09 |
| bigbench_navigate | 0 | multiple_choice_grade | 50.00 | ± | 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 56.20 | ± | 1.11 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 32.14 | ± | 2.21 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 13.93 | ± | 1.10 |
| bigbench_snarks | 0 | multiple_choice_grade | 56.35 | ± | 3.70 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 66.63 | ± | 1.50 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 36.50 | ± | 1.52 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 21.76 | ± | 1.17 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 15.77 | ± | 0.87 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 47.33 | ± | 2.89 |

Average: 38.78%
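
The Bigbench average is consistent with the mean of the 18 `multiple_choice_grade` values, leaving out the `exact_str_match` row for `bigbench_geometric_shapes`. That exclusion is inferred from the arithmetic rather than stated in this report; a short sketch under that assumption:

```python
# multiple_choice_grade values copied from the Bigbench results in this report,
# in table order. The exact_str_match row (0.00) is deliberately left out.
bigbench_mc_grade = [
    58.42, 60.98, 37.60, 15.32, 29.20, 20.29,
    47.33, 32.20, 50.00, 56.20, 32.14, 13.93,
    56.35, 66.63, 36.50, 21.76, 15.77, 47.33,
]

bigbench_average = sum(bigbench_mc_grade) / len(bigbench_mc_grade)

# For comparison: what the mean would be if exact_str_match (0.00) were counted.
with_exact_match = sum(bigbench_mc_grade + [0.00]) / (len(bigbench_mc_grade) + 1)

print(f"mean of multiple_choice_grade: {bigbench_average:.3f}")
print(f"mean including exact_str_match: {with_exact_match:.3f}")
```

The first mean is 38.775, which matches the reported 38.78% after rounding; including the zero row would pull it down to roughly 36.7.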
Average score: 49.92%
Elapsed time: 02:26:41
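
The overall score of 49.92% is the unweighted mean of the four suite averages, so each suite counts equally regardless of how many tasks it contains. A quick check in Python:

```python
# Suite averages copied from the summary in this report.
suite_averages = {
    "AGIEval": 37.83,
    "GPT4All": 67.52,
    "TruthfulQA": 55.56,
    "Bigbench": 38.78,
}

overall_average = sum(suite_averages.values()) / len(suite_averages)
print(f"Average score: {overall_average:.2f}%")  # prints "Average score: 49.92%"
```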