This file has been truncated.
```json
[
  {
    "dataset": "helpful_base",
    "instruction": "What are the names of some famous actors that started their careers on Broadway?",
    "output": "1. Meryl Streep\n2. Angela Lansbury\n3. Audra McDonald\n4. Bernadette Peters\n5. Idina Menzel\n6. Patti LuPone\n7. Hugh Jackman\n8. James Earl Jones\n9. Liza Minnelli\n10. Nathan Lane",
    "generator": "dolphin-2.2.1-mistral-7b"
  },
  {
    "dataset": "helpful_base",
    "instruction": "How did US states get their names?",
```
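Each record in the (truncated) JSON above pairs a prompt with one model's answer. A minimal sketch of loading records in that shape and grouping outputs by the model that produced them; the inline sample data below is copied from the snippet, and any real run would load the full file instead:

```python
import json
from collections import defaultdict

# Sample records in the same shape as the snippet above (abbreviated output).
records = json.loads("""
[
  {"dataset": "helpful_base",
   "instruction": "What are the names of some famous actors that started their careers on Broadway?",
   "output": "1. Meryl Streep\\n2. Angela Lansbury",
   "generator": "dolphin-2.2.1-mistral-7b"}
]
""")

# Group outputs by the model ("generator") that produced them.
by_generator = defaultdict(list)
for rec in records:
    by_generator[rec["generator"]].append(rec["output"])

print(sorted(by_generator))
```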
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| neuronovo-7B-v0.2 | 44.95 | 76.49 | 71.57 | 47.48 | 60.12 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 25.98 | ± 2.76 |
| | | acc_norm | 25.59 | ± 2.74 |
| agieval_logiqa_en | 0 | acc | 37.48 | ± 1.90 |
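The Average column in each summary table appears to be the plain mean of the four benchmark scores; a quick check against the neuronovo-7B-v0.2 row above:

```python
# Mean of the four benchmark scores for neuronovo-7B-v0.2.
scores = {"AGIEval": 44.95, "GPT4All": 76.49, "TruthfulQA": 71.57, "Bigbench": 47.48}
average = round(sum(scores.values()) / len(scores), 2)
print(average)  # 60.12, matching the Average column
```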
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| distilabeled-Marcoro14-7B-slerp | 45.38 | 76.48 | 65.68 | 48.18 | 58.93 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 27.56 | ± 2.81 |
| | | acc_norm | 25.98 | ± 2.76 |
| agieval_logiqa_en | 0 | acc | 39.17 | ± 1.91 |
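The Stderr column is consistent with lm-evaluation-harness reporting the sample standard error of a mean of 0/1 scores, i.e. sqrt(p·(1−p)/(n−1)). A sketch checking the agieval_aqua_rat row above, under two assumptions not stated in the tables: the AQuA-RAT test split has 254 questions, and 70 correct answers out of 254 is what produced acc = 27.56:

```python
import math

n = 254       # assumed size of the AQuA-RAT test split
correct = 70  # assumed: 70/254 reproduces acc = 27.56

p = correct / n
acc = round(100 * p, 2)
# Sample standard error of the mean of 0/1 scores: sqrt(p*(1-p)/(n-1)).
stderr = round(100 * math.sqrt(p * (1 - p) / (n - 1)), 2)
print(acc, stderr)  # 27.56 2.81, matching the table row
```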
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| openchat-3.5-1210 | 42.62 | 72.84 | 53.21 | 43.88 | 53.14 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 22.44 | ± 2.62 |
| | | acc_norm | 24.41 | ± 2.70 |
| agieval_logiqa_en | 0 | acc | 41.17 | ± 1.93 |
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| MistralTrix-v1 | 44.98 | 76.62 | 71.44 | 47.17 | 60.05 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 25.59 | ± 2.74 |
| | | acc_norm | 24.80 | ± 2.72 |
| agieval_logiqa_en | 0 | acc | 37.48 | ± 1.90 |
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.2 | 38.5 | 71.64 | 66.82 | 42.29 | 54.81 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 23.62 | ± 2.67 |
| | | acc_norm | 22.05 | ± 2.61 |
| agieval_logiqa_en | 0 | acc | 36.10 | ± 1.88 |
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| dolphin-2.2.1-mistral-7b | 38.64 | 72.24 | 54.09 | 39.22 | 51.05 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 23.23 | ± 2.65 |
| | | acc_norm | 21.26 | ± 2.57 |
| agieval_logiqa_en | 0 | acc | 35.48 | ± 1.88 |
```
2024-01-09T14:51:49.894270414Z     return fn(*args, **kwargs)
2024-01-09T14:51:49.894273580Z   File "/lm-evaluation-harness/lm_eval/evaluator.py", line 69, in simple_evaluate
2024-01-09T14:51:49.894279732Z     lm = lm_eval.models.get_model(model).create_from_arg_string(
2024-01-09T14:51:49.894283779Z   File "/lm-evaluation-harness/lm_eval/base.py", line 115, in create_from_arg_string
2024-01-09T14:51:49.894316350Z     return cls(**args, **args2)
2024-01-09T14:51:49.894323294Z   File "/lm-evaluation-harness/lm_eval/models/gpt2.py", line 67, in __init__
2024-01-09T14:51:49.894355253Z     self.tokenizer = transformers.AutoTokenizer.from_pretrained(
2024-01-09T14:51:49.894361435Z   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 787, in from_pretrained
2024-01-09T14:51:49.894470349Z     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
2024-01-09T14:51:49.894475349Z   File "/usr/local/lib/python3.10/dist-packages/transformer
```
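The truncated traceback above fails while lm-evaluation-harness is still constructing the model: `create_from_arg_string` turns a comma-separated `key=value` argument string into constructor kwargs, and the constructor then calls `AutoTokenizer.from_pretrained`. A minimal sketch of that parsing step (this is an illustration, not the harness's actual code, and the argument string below is a hypothetical example):

```python
def parse_arg_string(arg_string):
    """Split 'k1=v1,k2=v2' into a dict of kwargs, mimicking the
    arg-string parsing the harness does before building the model."""
    args = {}
    for pair in arg_string.split(","):
        if not pair:
            continue
        key, value = pair.split("=", 1)
        args[key.strip()] = value.strip()
    return args

# Hypothetical model argument string, as one might pass to the harness.
args = parse_arg_string("pretrained=neuronovo/neuronovo-7B-v0.2,dtype=float16")
print(args)
```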
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| zephyr-7b-alpha | 38 | 72.24 | 56.06 | 40.57 | 51.72 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 20.47 | ± 2.54 |
| | | acc_norm | 19.69 | ± 2.50 |
| agieval_logiqa_en | 0 | acc | 31.49 | ± 1.82 |
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| zephyr-7b-beta | 37.33 | 71.83 | 55.1 | 39.7 | 50.99 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 21.26 | ± 2.57 |
| | | acc_norm | 20.47 | ± 2.54 |
| agieval_logiqa_en | 0 | acc | 33.33 | ± 1.85 |