# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build it
make clean
LLAMA_METAL=1 make

# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
wget "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}"

# Run
echo "Prompt: " \
  && read PROMPT \
  && ./main \
       --threads 8 \
       --n-gpu-layers 1 \
       --model ${MODEL} \
       --color \
       --ctx-size 2048 \
       --temp 0.7 \
       --repeat_penalty 1.1 \
       --n-predict -1 \
       --prompt "[INST] ${PROMPT} [/INST]"
I'm getting an error; by any chance, do you know a solution to this?
ggml_metal_graph_compute: command buffer 0 failed with status 5
GGML_ASSERT: ggml-metal.m:991: false
zsh: abort ./main -ins -f ./prompts/alpaca.txt -t 8 -ngl 1 -m --color -c 2048 --temp 0
@tmm1 Can you please check this update and let me know if it looks right to you?
https://gpus.llm-utils.org/llama-2-prompt-template/
Also, can you tell whether </s> should be added when there is only a single user message? I seem to get better results when it's added, but the examples you linked don't include </s> when there's only one prompt.
Nice work!
It can be used by simply calling bash examples/chat-13B.sh at the last step.
Also, is there a way to download the 70B and 70B-chat models? Thanks!
@BaliDataMan, it looks like something is wrong with Metal at the moment on these models. You can remove '-ngl 1' to run on the CPU (which is also surprisingly fast).
@yeasy, the 70B model is waiting for ggerganov/llama.cpp#2276 to be merged. Right now that model can't be converted due to a new custom layer that isn't present in the smaller models.
@TortoiseHam Thanks for mentioning it. For me, on an M2 chip with 8 GB RAM, the latency is around 6-8 seconds per word on average. Is it similar for you?
I ran the command above in the terminal and it works, but the chat only happens once and then stops, dropping back to the terminal.
Is it possible:
- to have a proper, continuing chat
- to save out the chat
- to start a new conversation without restarting
I also noticed that if I close my terminal, I have to start over and download the model again. I end up with multiple copies of the same model in my directory.
It is installed in the llama.cpp folder. You can run it from that folder, for example:
./main -m llama-2-13b-chat.ggmlv3.q4_0.bin --help
or
./main -m llama-2-13b-chat.ggmlv3.q4_0.bin -p "tell me a joke"
For the re-download issue, you only need to wget once and save the file somewhere, say <PATH_TO_BIN>; after that you can pass -m <PATH_TO_BIN> to load the model you already downloaded.
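One way to make the gist safe to re-run is to guard the wget with a file-existence check. This is a sketch, not part of the original gist: the filename is the one the gist uses, but the guard itself is my addition.

```shell
# Sketch: only download the model if it is not already on disk.
# The guard around wget is an addition, not part of the original gist.
MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
if [ ! -f "${MODEL}" ]; then
  wget "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}"
fi
```

With this in place, closing the terminal and re-running the script no longer creates duplicate copies of the model file.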
@BaliDataMan, I'm using an M2 Max with 64 GB RAM. On the 13B model with 8-bit quantization (rather than the 4-bit one in the link above), the code reports 8 tokens per second, so that seems like quite a significant performance difference. My model is 13.79 GB on disk, but that still fits easily within my RAM limit. How big is the 4-bit model? Maybe if it doesn't fit into RAM you're getting slowed down by paging memory to disk?
Cheers for the simple single-line -help and -p "prompt here" examples. I tested -i hoping to get an interactive chat, but it just keeps talking and then prints blank lines.
Still wondering how to run a "chat" mode session and save the conversation. Will check this page again later.
Thank you for this fantastic note! I tried running Llama-2-13B-chat locally on my M1 MacBook, and it works perfectly.
Here are my notes on the process, hoping it will be helpful to someone.
@TortoiseHam try a lower -c, such as -c 1024. In my case I got the same error with -c 4096 (but -c 2048 works).
@junhochoi, that didn't work for me unfortunately. It turns out that using the 4-bit quantized model rather than 8-bit quantization does work with the GPU, though. I was hoping to use 8-bit since it does much less harm to model perplexity.
I see. I was using q4_0.
@gengwg Does this look right for text-generation-webui on a MacBook Air 2020 M1?
python3 server.py --listen --trust-remote-code --cpu-memory 8 --gpu-memory 8 --extensions openai --loader llamacpp --model TheBloke_Llama-2-13B-chat-GGML --notebook
I asked it where Atlanta is, and it's very, very slow. At 290 seconds, it has responded with this so far:
Question: Where is Atlanta?
Factual answer: Atlanta is located in the state of Georgia, United States.
That is just great! Can you point me to more info on how to train that model further, e.g. feed it extra data like my ebook collection so I can discuss the material described there? Is there a simple command to pass a PDF or txt file? I'd like to discuss avant-garde cinema and some media-art theory :)
@enzyme69 try with -i -ins instead of -p
Cheers, adding -i indeed makes it generate words non-stop! I'll check around for some kind of nice chat UI for Llama. I tried GPT4All and simply loaded the model, but the responses seem weird.
How many tokens/sec are you all getting and what's your Mac CPU and Ram?
macOS M2, 32 GB. Not sure how "tokens" actually work. Does the rate vary depending on the prompt?
Maybe try the following command instead:
./server -m llama-2-13b-chat.ggmlv3.q4_0.bin --ctx-size 2048 --threads 10 --n-gpu-layers 1
and then go to localhost:8080
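Beyond the browser UI, the server can also be called programmatically. A hedged sketch follows, assuming the /completion JSON endpoint that llama.cpp's server exposed around this time; the field names (prompt, n_predict) are from memory, so double-check them against your build's server docs.

```shell
# Query the local llama.cpp server; endpoint and field names are assumptions.
curl http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "[INST] tell me a joke [/INST]", "n_predict": 128}'
```

This keeps the model loaded between requests, which avoids the re-loading cost of running ./main once per prompt.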
Thanks @zhedasuiyuan this chat mode is what I was looking for.
Could I humbly suggest expanding all of the command-line args to their full versions in your script? Much easier to grok as a newcomer! Thanks for posting this.
echo "Prompt: " \
&& read PROMPT \
&& ./main \
--threads 8 \
--n-gpu-layers 1 \
--model ${MODEL} \
--color \
--ctx-size 2048 \
--temp 0.7 \
--repeat_penalty 1.1 \
--n-predict -1 \
--prompt "[INST] ${PROMPT} [/INST]"
And for those curious, ./main --help is... helpful!
@reustle Good idea! Updated, thanks
Running a MacBook Pro M2 with 32 GB and wishing to ask about entities in news articles. From the following page:
I am using the following lines in this gist's script:
export MODEL=llama-2-13b.ggmlv3.q4_0.bin
wget "https://huggingface.co/TheBloke/Llama-2-13B-GGML/resolve/main/${MODEL}"
Is this the right way to use the 13B non-chat model? It seems to work but hallucinates quite a lot.
https://github.com/ggerganov/llama.cpp/blob/master/examples/llama2.sh is a good example script added recently.
This is great.
Any suggestions to serve this as an API endpoint locally and then use it with a chat-ui ?
I don't think llama.cpp is using the GPU.
I ran the step:
LLAMA_METAL=1 make
then
make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Tell me something about Barack Obama" -n 512 -ngl 1
but Activity Monitor shows only the CPU being used.
This is the response when I run LLAMA_METAL=1 make again:
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
make: Nothing to be done for `default'.
Please advise.
@AmoghM Try make clean && LLAMA_METAL=1 make and then run ./main ... again.
@adrienbrault Thanks, that worked!
Yes, "TheBloke" published them on Hugging Face: https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML
I recommend not downloading via the browser; use JDownloader or something like it instead. Even command-line tools are better when it comes to downloading files of that size.
Here it's using wget from the command line, not the Hugging Face browser UI. So it's all good, right, or did I miss your point?
Everyone, my need is to generate embeddings with llama2.
examples/embedding/embedding.cpp enforces the 2048-token limit:
if (params.n_ctx > 2048) {
    fprintf(stderr, "%s: warning: model might not support context sizes greater than 2048 tokens (%d specified);"
            "expect poor results\n", __func__, params.n_ctx);
}
But llama2 has a 4096 context length. On building, we get an embedding binary just like the main binary, and more, so I was not sure whether we need to edit that 2048 to 4096.
Any help is really appreciated. Thanks.
Getting the following error loading model:
main: build = 1154 (3358c38)
main: seed = 1693681287
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from llama-2-13b-chat.ggmlv3.q4_0.bin
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'llama-2-13b-chat.ggmlv3.q4_0.bin'
Does anyone know how to fix this?
Same issue :(
Same issue here also! Did something change? I'm a noob, so no idea what a magic number is.
There's a similar error reported in the Python bindings for llama.cpp. It sounds like we need to wait for the new model format to be available. In the meantime, a temporary workaround is to check out an older release of llama.cpp, for example:
git checkout 1aa18ef
which is the release from Jul 25.
Then run the build again.
Thanks for the above.
I was running into an error:
error loading model: failed to open --color: No such file or directory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '--color'
main: error: unable to load model
Deleted everything and then ran:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git reset --hard 1aa18ef
Then ran the rest of gist and it worked again.
Yeah, the latest llama.cpp is no longer compatible with GGML models. The new model format, GGUF, was merged recently. As far as llama.cpp is concerned, GGML is now dead.
https://huggingface.co/TheBloke/vicuna-13B-v1.5-16K-GGML/discussions/6#64e5ba63a9a5eabaa6fd4a04
Replacing the GGML model with a GGUF model
https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q8_0.gguf
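To fetch that file from the command line, a hedged sketch using the same wget pattern as the gist; the resolve/main URL form is my assumption based on the blob link above, so verify it before relying on it.

```shell
# Fetch the GGUF replacement model; the resolve URL pattern is assumed
# from the blob link above, mirroring the gist's original wget command.
MODEL=llama-2-7b-chat.Q8_0.gguf
wget "https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/${MODEL}"
```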
You can check if it works:
PROMPT> ./main -m models/llama-2-7b-chat.Q8_0.gguf --random-prompt
snip lots of info
response to the prompt
After years of hard work and dedication, a high school teacher in Texas has been recognized for her outstanding contributions to education.
Ms. Rodriguez, a mathematics teacher at...
Does anybody know how to adjust the prompt input to include multiple lines of input before submitting the prompt?
Seems to have worked once, but now it continues to fail. Any ideas why, @smart-patrol?
Prompt:
How large is the sun?
main: build = 904 (1aa18ef)
main: seed = 1700587479
error loading model: failed to open --color: No such file or directory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '--color'
main: error: unable to load model
same issue
@TikkunCreation nice write-up. Technically the multi-turn prompt is wrong, because you need to add the EOS/BOS tokens.
See ggerganov/llama.cpp#2262 (comment) and oobabooga/text-generation-webui@8ec225f
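To make the EOS/BOS point concrete, here is a hedged sketch of a two-turn prompt string under that template; the turn contents are invented, and the linked llama.cpp discussion is the authoritative reference for the exact format.

```shell
# Hedged sketch of a two-turn Llama-2 chat prompt: the completed exchange
# ends with </s> (EOS) and each [INST] block starts with <s> (BOS).
# The question/answer text below is made up for illustration.
FIRST_USER="Where is Atlanta?"
FIRST_ANSWER="Atlanta is in Georgia, United States."
SECOND_USER="How large is it?"
PROMPT="<s>[INST] ${FIRST_USER} [/INST] ${FIRST_ANSWER} </s><s>[INST] ${SECOND_USER} [/INST]"
printf '%s\n' "$PROMPT"
```

Note that the single-turn invocation in the gist omits the closing </s>, which only terminates a *completed* exchange; the final, unanswered [INST] block is left open for the model to complete.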