cedrickchee/eval_llama3_coding.md

## eval_llama3_coding.md

      
Evaluating Llama 3 on Code Tasks


To compare Meta Llama 3's coding performance against existing models, we used the HumanEval coding benchmark, which tests a model's ability to complete code from a docstring.
We ran the benchmark on 137 publicly available large language models (LLMs).
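To make the task format concrete, here is a minimal sketch of how a HumanEval-style problem can be scored. The prompt, completion, and tests below are invented for illustration and are not taken from the benchmark, and the helper names `passes` and `pass_at_1` are hypothetical. It assumes one sample per problem; a real harness would also sandbox the generated code rather than call `exec` directly.

```python
# Hypothetical HumanEval-style task: a signature plus docstring is the prompt.
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# Stand-in for a model's completion (the function body only).
COMPLETION = "    return a + b\n"

# Unit tests that decide whether the completion counts as correct.
TEST = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Run prompt + completion against the tests; True if every assert holds."""
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code; only acceptable here because
        # the inputs are hard-coded toy strings.
        exec(prompt + completion + "\n" + test, namespace)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False


def pass_at_1(results: list[bool]) -> float:
    """Percentage of problems solved with a single sample per problem."""
    return 100.0 * sum(results) / len(results)


results = [
    passes(PROMPT, COMPLETION, TEST, "add"),
    passes(PROMPT, "    return a - b\n", TEST, "add"),  # deliberately wrong
]
print(pass_at_1(results))
```

With one correct and one deliberately wrong completion, half of the toy problems pass.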


| Model | Accuracy¹ |
| --- | --- |
| openai/gpt-4-vision-preview | 60 |
| openai/gpt-4-turbo | 60 |
| openai/gpt-4-0314 | 60 |
| openai/gpt-4-32k-0314 | 59 |
| google/gemini-pro-1.5 | 59 |
| openai/gpt-4-turbo-preview | 58 |
| openai/gpt-4 | 58 |
| openai/gpt-3.5-turbo-0125 | 58 |
| openai/gpt-3.5-turbo | 58 |
| meta-llama/llama-3-70b-instruct | 56 |
| anthropic/claude-3-sonnet:beta | 56 |
| openai/gpt-3.5-turbo-0301 | 54 |
| anthropic/claude-3-sonnet | 54 |
| meta-llama/llama-3-70b-instruct:nitro | 53 |
| openai/gpt-4-32k | 52 |
| anthropic/claude-3-opus:beta | 52 |
| anthropic/claude-3-opus | 52 |
| phind/phind-codellama-34b | 51 |
| nousresearch/nous-capybara-34b | 51 |
| openai/gpt-4-1106-preview | 50 |
| openai/gpt-3.5-turbo-1106 | 50 |
| mistralai/mistral-medium | 50 |
| microsoft/wizardlm-2-8x22b:nitro | 50 |
| microsoft/wizardlm-2-8x22b | 50 |
| meta-llama/llama-2-70b-chat:nitro | 50 |
| google/palm-2-chat-bison | 50 |
| cohere/command-r-plus | 50 |
| anthropic/claude-3-haiku | 50 |
| anthropic/claude-2.1:beta | 50 |
| anthropic/claude-2.1 | 50 |
| meta-llama/llama-2-7b-chat | 50 |
| 01-ai/yi-34b-chat | 49 |
| perplexity/sonar-medium-chat | 49 |
| perplexity/pplx-70b-chat | 48 |
| mistralai/mistral-7b-instruct:nitro | 48 |
| google/gemma-7b-it:free | 48 |
| anthropic/claude-2.0 | 48 |
| mistralai/mixtral-8x7b-instruct | 47 |
| anthropic/claude-instant-1.1 | 47 |
| sao10k/fimbulvetr-11b-v2 | 46 |
| openchat/openchat-7b | 46 |
| openai/gpt-3.5-turbo-16k | 46 |
| nousresearch/nous-hermes-mistral | 46 |
| mistralai/mistral-7b-instruct:free | 46 |
| mistralai/mistral-7b-instruct | 46 |
| cohere/command-r | 46 |
| teknium/openhermes-2.5-mistral-7b | 45 |
| teknium/openhermes-2-mistral-7b | 45 |
| perplexity/pplx-7b-chat | 45 |
| mistralai/mixtral-8x7b-instruct:nitro | 45 |
| google/gemma-7b-it | 45 |
| meta-llama/codellama-34b-instruct | 44 |
| google/palm-2-codechat-bison-32k | 44 |
| google/palm-2-codechat-bison | 44 |
| google/gemini-pro-vision | 44 |
| cognitivecomputations/dolphin-mixtral | 44 |
| perplexity/sonar-small-chat | 43 |
| nousresearch/nous-hermes-yi-34b | 43 |
| nousresearch/nous-hermes-2-mixtral | 43 |
| lizpreciatior/lzlv-70b-fp16-hf | 43 |
| jondurbin/airoboros-l2-7b | 43 |
| google/gemini-pro | 43 |
| anthropic/claude-3-haiku:beta | 43 |
| anthropic/claude-2.0:beta | 43 |
| sophosympatheia/midnight-rose-70b | 42 |
| openai/gpt-3.5-turbo-0613 | 42 |
| mistralai/mixtral-8x22b-instruct | 42 |
| mistralai/mistral-small | 42 |
| meta-llama/llama-2-13b-chat | 42 |
| google/gemma-7b-it:nitro | 42 |
| anthropic/claude-instant-1.2 | 42 |
| anthropic/claude-1.2 | 42 |
| togethercomputer/stripedhyena-no | 41 |
| databricks/dbrx-instruct | 41 |
| rwkv/rwkv-5-world-3b | 40 |
| openrouter/cinematika-7b | 40 |
| nousresearch/nous-hermes-2-mixtral | 40 |
| nousresearch/nous-capybara-7b | 40 |
| mistralai/mistral-large | 40 |
| huggingfaceh4/zephyr-orpo-141b-a | 40 |
| google/palm-2-chat-bison-32k | 40 |
| meta-llama/llama-3-8b-instruct | 39 |
| codellama/codellama-70b-instruct | 39 |
| xwinlm/xwin-lm-70b | 38 |
| perplexity/sonar-medium-online | 38 |
| meta-llama/llama-8b-instruct:nitro | 37 |
| anthropic/claude-1 | 36 |
| perplexity/pplx-7b-online | 35 |
| openrouter/cinematika-7b:free | 35 |
| gryphe/mythomax-l2-13b:nitro | 35 |
| gryphe/mythomax-l2-13b | 35 |
| recursal/eagle-7b | 34 |
| perplexity/sonar-small-online | 34 |
| huggingfaceh4/zephyr-7b-beta | 34 |
| 01-ai/yi-6b | 34 |
| perplexity/pplx-70b-online | 33 |
| open-orca/mistral-7b-openorca | 33 |
| nousresearch/nous-hermes-llama2 | 33 |
| mistralai/mixtral-8x22b | 33 |
| gryphe/mythomax-l2-13b:extended | 33 |
| alpindale/goliath-120b | 33 |
| mistralai/mistral-tiny | 32 |
| microsoft/wizardlm-2-7b | 32 |
| cohere/command | 32 |
| austism/chronos-hermes-13b | 32 |
| undi95/toppy-m-7b:free | 31 |
| undi95/toppy-m-7b | 31 |
| openchat/openchat-7b:free | 31 |
| pygmalionai/mythalion-13b | 30 |
| nousresearch/nous-capybara-7b:free | 30 |
| huggingfaceh4/zephyr-7b-beta:free | 30 |
| undi95/toppy-m-7b:nitro | 29 |
| undi95/remm-slerp-l2-13b | 29 |
| mistralai/mixtral-8x7b | 29 |
| anthropic/claude-instant-1.0 | 28 |
| recursal/rwkv-5-3b-ai-town | 27 |
| undi95/remm-slerp-l2-13b:extended | 26 |
| koboldai/psyfighter-13b-2 | 26 |
| 01-ai/yi-34b | 26 |
| neversleep/noromaid-mixtral-8x7b-instruct | 25 |
| togethercomputer/stripedhyena-hessian-7b | 24 |
| openai/gpt-3.5-turbo-instruct | 24 |
| neversleep/noromaid-20b | 24 |
| gryphe/mythomist-7b | 22 |
| meta-llama/llama-8b-instructextension | 20 |
| mancer/weaver | 20 |
| intel/neural-chat-7b | 20 |
| gryphe/mythomist-7b:free | 20 |
| fireworks/firellava-13b | 16 |
| lynn/soliloquy-13 | 14 |
| nousresearch/nous-hermes-2-vision | 0 |
| jondurbin/bagel-34b | 0 |
| jebcarter/psyfighter-13b | 0 |
| haotian-liu/llava-13b | 0 |
| anthropic/claude-instant-1:beta | 0 |
| anthropic/claude-instant-1 | 0 |
| anthropic/claude-2:beta | 0 |
| anthropic/claude-2 | 0 |


The benchmark showed that Llama-3-70b-Instruct (56) outperformed the open-source, code-specific LLMs Phind-CodeLlama-34b (51) and CodeLlama-70b-Instruct (39), as well as Claude-3-Opus (52).

Footnotes


1. HumanEval (pass@1) accuracy; higher is better.
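
For readers unfamiliar with the pass@1 metric, the following is a sketch of the pass@k estimator commonly used with HumanEval. The write-up does not say how many samples were drawn per problem or whether this estimator was used, so treat this as background under those assumptions rather than a description of the method behind the table. The function name `pass_at_k` and its parameters are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed the unit tests
    k: number of attempts the metric allows
    """
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a success.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With a single sample per problem, pass@1 is simply pass or fail.
print(pass_at_k(n=1, c=1, k=1))
print(pass_at_k(n=1, c=0, k=1))
```

Averaging this value over all problems gives the benchmark score. With k = 1 the formula reduces to c / n, the fraction of samples that pass.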