Last active September 21, 2023 18:36
Record the inference speed of some LLMs

LLM Speed on V100

It is good to see LLM speed reports online for both CPU and GPU. To give a more comprehensive picture, this document records some LLM inference measurements on a V100 16GB using text-generation-webui.

Test Setup

We test the following prompts:

  1. from dataset SQuAD

How many student news papers are found at Notre Dame?

  2. from dataset MMLU

In which one of the following circumstances will the prevalence of a disease in the population increase, all else being constant? A. If the incidence rate of the disease falls. B. If survival time with the disease increases. C. If recovery of the disease is faster. D. If the population in which the disease is measured increases.

  3. from dataset CNN/DailyMail

summarize "LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how he'll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. "I'll definitely have some sort of party," he said in an interview. "Hopefully none of you will be reading about it." Radcliffe's earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. "People are always looking to say 'kid star goes off the rails,'" he told reporters last month. "But I try very hard not to go that way because it would be too easy for them." His latest outing as the boy wizard in "Harry Potter and the Order of the Phoenix" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films. Watch I-Reporter give her review of Potter's latest » . There is life beyond Potter, however. 
The Londoner has filmed a TV movie called "My Boy Jack," about author Rudyard Kipling and his son, due for release later this year. He will also appear in "December Boys," an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer's "Equus." Meanwhile, he is braced for even closer media scrutiny now that he's legally an adult: "I just think I'm going to be more sort of fair game," he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed."

Test Results

For each model, the tool's output is listed in the same order as the prompts above:

Output generated in 3.05 seconds (12.80 tokens/s, 39 tokens, context 74, seed 1582072731)
Output generated in 5.10 seconds (17.46 tokens/s, 89 tokens, context 130, seed 877217908)
Output generated in 1.09 seconds (8.24 tokens/s, 9 tokens, context 682, seed 755819891)

openlm-research/open_llama_3b_v2 (8bit, meaningless outputs)
Output generated in 26.47 seconds (5.14 tokens/s, 136 tokens, context 74, seed 823036846)
Output generated in 9.24 seconds (5.20 tokens/s, 48 tokens, context 130, seed 1794183469)
Output generated in 36.82 seconds (5.41 tokens/s, 199 tokens, context 682, seed 1156697004)

Output generated in 2.27 seconds (18.97 tokens/s, 43 tokens, context 73, seed 479534543)
Output generated in 4.19 seconds (23.41 tokens/s, 98 tokens, context 129, seed 863019224)
Output generated in 5.05 seconds (23.16 tokens/s, 117 tokens, context 635, seed 374463409)

mosaicml/mpt-7b-instruct (8bit)
error: inf/nan

Output generated in 5.86 seconds (15.69 tokens/s, 92 tokens, context 77, seed 1614263612)
Output generated in 1.61 seconds (11.17 tokens/s, 18 tokens, context 136, seed 772611540)
Output generated in 11.87 seconds (15.08 tokens/s, 179 tokens, context 724, seed 1343180256)

lmsys/vicuna-7b-v1.5 (8bit)
Output generated in 23.49 seconds (3.96 tokens/s, 93 tokens, context 77, seed 2018045883)
Output generated in 17.97 seconds (3.95 tokens/s, 71 tokens, context 136, seed 2062659643)
Output generated in 50.42 seconds (3.95 tokens/s, 199 tokens, context 724, seed 1818982281)

Output generated in 1.96 seconds (11.72 tokens/s, 23 tokens, context 73, seed 1900398047)
Output generated in 1.89 seconds (11.66 tokens/s, 22 tokens, context 129, seed 970413165)
Output generated in 14.11 seconds (14.10 tokens/s, 199 tokens, context 635, seed 284941147)

sgugger/rwkv-7b-pile (8bit, meaningless outputs)
Output generated in 3.15 seconds (2.86 tokens/s, 9 tokens, context 73, seed 1293804291)
Output generated in 61.73 seconds (3.22 tokens/s, 199 tokens, context 129, seed 103279655)
Output generated in 61.77 seconds (3.22 tokens/s, 199 tokens, context 635, seed 70913884)

Output generated in 2.75 seconds (7.28 tokens/s, 20 tokens, context 72, seed 2060612524)
Output generated in 3.09 seconds (9.06 tokens/s, 28 tokens, context 128, seed 1706926550)
Output generated in 2.11 seconds (4.27 tokens/s, 9 tokens, context 668, seed 910221591)

tiiuae/falcon-7b (8bit)
error: inf/nan

Output generated in 2.35 seconds (12.75 tokens/s, 30 tokens, context 77, seed 535249575)
Output generated in 1.68 seconds (11.33 tokens/s, 19 tokens, context 136, seed 1937169918)
Output generated in 1.80 seconds (9.99 tokens/s, 18 tokens, context 724, seed 737398567)

NousResearch/Nous-Hermes-llama-2-7b (8bit)
Output generated in 9.24 seconds (3.57 tokens/s, 33 tokens, context 77, seed 1021487283)
Output generated in 4.46 seconds (3.36 tokens/s, 15 tokens, context 136, seed 1851128740)
Output generated in 17.95 seconds (3.62 tokens/s, 65 tokens, context 724, seed 1707074847)

Output generated in 6.08 seconds (3.78 tokens/s, 23 tokens, context 77, seed 291226512)
Output generated in 5.34 seconds (4.12 tokens/s, 22 tokens, context 136, seed 390536894)
Output generated in 3.46 seconds (3.76 tokens/s, 13 tokens, context 724, seed 335641766)
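
The log lines above share one fixed format, so they can be summarized with a short script. The following is a minimal sketch, not part of the original measurements: the names `LINE_RE`, `parse_line`, and `summarize` are hypothetical, and the regular expression assumes the line format exactly as it appears in this document. It reports both the mean of the per-run rates and an aggregate rate (total tokens divided by total seconds), which can differ when runs have very different lengths.

```python
import re
from statistics import mean

# Assumed format: the summary lines exactly as printed in this document.
LINE_RE = re.compile(
    r"Output generated in (?P<seconds>[\d.]+) seconds "
    r"\((?P<rate>[\d.]+) tokens/s, (?P<tokens>\d+) tokens, "
    r"context (?P<context>\d+), seed (?P<seed>\d+)\)"
)


def parse_line(line):
    """Return the numeric fields of one log line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "seconds": float(m["seconds"]),
        "rate": float(m["rate"]),
        "tokens": int(m["tokens"]),
        "context": int(m["context"]),
        "seed": int(m["seed"]),
    }


def summarize(lines):
    """Summarize a group of log lines; returns None if no line matches."""
    rows = [r for r in map(parse_line, lines) if r is not None]
    if not rows:
        return None
    total_tokens = sum(r["tokens"] for r in rows)
    total_seconds = sum(r["seconds"] for r in rows)
    return {
        "runs": len(rows),
        # Unweighted mean of the reported per-run rates.
        "mean_rate": mean(r["rate"] for r in rows),
        # Total generated tokens over total wall-clock time.
        "aggregate_rate": total_tokens / total_seconds,
    }


if __name__ == "__main__":
    # The first group of three lines from the results above.
    sample = [
        "Output generated in 3.05 seconds (12.80 tokens/s, 39 tokens, context 74, seed 1582072731)",
        "Output generated in 5.10 seconds (17.46 tokens/s, 89 tokens, context 130, seed 877217908)",
        "Output generated in 1.09 seconds (8.24 tokens/s, 9 tokens, context 682, seed 755819891)",
    ]
    print(summarize(sample))
```

Because the reported seconds are rounded to two decimals, recomputing tokens divided by seconds may not reproduce the printed tokens/s figure exactly.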

Here are some measurements on an RTX 3060:

Output generated in 8.30 seconds (19.15 tokens/s, 159 tokens, context 74, seed 1800275435)
Output generated in 6.26 seconds (20.12 tokens/s, 126 tokens, context 130, seed 154831876)
Output generated in 3.80 seconds (16.31 tokens/s, 62 tokens, context 682, seed 872867114)

It is found that GPTQ models can run even faster on the V100 using exllama. The gap between the numbers in this report and those in the GitHub repo may come from the performance difference between the V100 and the 4090: the 4090 is generally 2-3 times faster than the V100 at FP16 and FP32 on a single GPU.

exllama-web-1  | Prompt processed in 0.64 seconds, 73 new tokens, 114.67 tokens/second:
exllama-web-1  | Chatbort: I don't know, can you tell me more about it?
exllama-web-1  | Response generated in 0.28 seconds, 20 tokens, 72.53 tokens/second:
exllama-web-1  | Prompt processed in 0.06 seconds, 132 new tokens, 2048.85 tokens/second:
exllama-web-1  | Chatbort: Hmm, that's an interesting question. I think it would be A if the incidence rate of the disease falls.
exllama-web-1  | Response generated in 0.47 seconds, 32 tokens, 68.28 tokens/second:
exllama-web-1  | Prompt processed in 0.17 seconds, 720 new tokens, 4209.09 tokens/second:
exllama-web-1  | Chatbort: Hey there! What's up?
exllama-web-1  | Response generated in 0.18 seconds, 14 tokens, 79.74 tokens/second:
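
The exllama log reports prompt processing and response generation separately, while the text-generation-webui lines give a single time per request. To compare the two, one option is to fold both exllama phases into one end-to-end rate. The sketch below does that under stated assumptions: the function name `end_to_end` and both regular expressions are hypothetical, they match only the log format shown above, and whether the webui figure covers exactly the same work is an assumption rather than something this document establishes.

```python
import re

# Assumed formats: the two exllama log lines exactly as shown above.
PROMPT_RE = re.compile(r"Prompt processed in ([\d.]+) seconds, (\d+) new tokens")
RESPONSE_RE = re.compile(r"Response generated in ([\d.]+) seconds, (\d+) tokens")


def end_to_end(log_lines):
    """Pair each 'Prompt processed' line with the next 'Response generated' line."""
    results = []
    prompt = None  # (seconds, tokens) of the most recent unpaired prompt line
    for line in log_lines:
        m = PROMPT_RE.search(line)
        if m:
            prompt = (float(m.group(1)), int(m.group(2)))
            continue
        m = RESPONSE_RE.search(line)
        if m and prompt is not None:
            gen_seconds, gen_tokens = float(m.group(1)), int(m.group(2))
            total = prompt[0] + gen_seconds
            results.append({
                "prompt_tokens": prompt[1],
                "generated_tokens": gen_tokens,
                "total_seconds": total,
                # Generated tokens over prompt time plus generation time.
                "end_to_end_rate": gen_tokens / total if total > 0 else float("inf"),
            })
            prompt = None
    return results


if __name__ == "__main__":
    # The first request from the log above.
    log = [
        "exllama-web-1  | Prompt processed in 0.64 seconds, 73 new tokens, 114.67 tokens/second:",
        "exllama-web-1  | Chatbort: I don't know, can you tell me more about it?",
        "exllama-web-1  | Response generated in 0.28 seconds, 20 tokens, 72.53 tokens/second:",
    ]
    for row in end_to_end(log):
        print(row)
```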
