@adrienbrault
Last active April 22, 2024 08:47
Run Llama-2-13B-chat locally on your M1/M2 Mac with GPU inference. Uses 10GB RAM. UPDATE: see https://twitter.com/simonw/status/1691495807319674880?s=20
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build it
make clean
LLAMA_METAL=1 make
# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
wget "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}"
# Run
echo "Prompt: " \
&& read PROMPT \
&& ./main \
--threads 8 \
--n-gpu-layers 1 \
--model "${MODEL}" \
--color \
--ctx-size 2048 \
--temp 0.7 \
--repeat_penalty 1.1 \
--n-predict -1 \
--prompt "[INST] ${PROMPT} [/INST]"
@adrienbrault

@AmoghM Try make clean && LLAMA_METAL=1 make and then run ./main ... again

@AmoghM

AmoghM commented Jul 28, 2023

@AmoghM Try make clean && LLAMA_METAL=1 make and then run ./main ... again

@adrienbrault Thanks, that worked!


@BoKa33

BoKa33 commented Aug 18, 2023

Nice work!

And it can be used by simply running bash examples/chat-13B.sh as the last step.

Besides, is there a way to download the 70B model and 70B-chat model? Thanks!

Yes, "The Bloke" published them on hugging face: https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML

I recomment not downloading via browser, use j downloader or anything like this instead. Maybe even commandline tools are better when it comes to downloading files of that size.
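For example (a sketch; I'm assuming the 70B chat file follows the same naming pattern as the 13B one), wget's -c flag resumes an interrupted download:

export MODEL=llama-2-70b-chat.ggmlv3.q4_0.bin
wget -c "https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/resolve/main/${MODEL}"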

@sujantkumarkv

I recommend not downloading via browser; use JDownloader or something like it instead. Command-line tools may be even better for files of that size.

Here it's using wget on the command line, not the Hugging Face browser UI. So it's all good, right? Or did I not get your point?

@sujantkumarkv

Everyone, my need is to generate embeddings with Llama 2.
examples/embedding/embedding.cpp warns about a 2048-token limit:

if (params.n_ctx > 2048) {
    fprintf(stderr, "%s: warning: model might not support context sizes greater than 2048 tokens (%d specified);"
            "expect poor results\n", __func__, params.n_ctx);
}

But Llama 2 has a 4096-token context length. On building, we get an embedding binary alongside the main binary, so I wasn't sure whether that check needs to be edited to 4096.

Any help is really appreciated, thanks.
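The warning is only a warning, and the context size is a runtime parameter, so rather than editing the source you can likely pass it on the command line. A minimal sketch, assuming the embedding example accepts the same -c/--ctx-size flag as main:

./embedding -m llama-2-13b-chat.ggmlv3.q4_0.bin -c 4096 -p "Some text to embed"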

@danielabar

Getting the following error loading model:

main: build = 1154 (3358c38)
main: seed  = 1693681287
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from llama-2-13b-chat.ggmlv3.q4_0.bin

llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'llama-2-13b-chat.ggmlv3.q4_0.bin'

Does anyone know how to fix this?

@brobles82

Same issue :(

@cfmbrand

cfmbrand commented Sep 3, 2023

Getting the following error loading model:

main: build = 1154 (3358c38)
main: seed  = 1693681287
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from llama-2-13b-chat.ggmlv3.q4_0.bin

llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'llama-2-13b-chat.ggmlv3.q4_0.bin'

Does anyone know how to fix this?

Same issue here too! Did something change? I'm a noob, so I have no idea what a magic number is.

@danielabar

There's a similar error reported in the Python bindings for llama.cpp. It sounds like we need to wait for models in the new format to become available. In the meantime, a temporary workaround is to check out an older release of llama.cpp, for example:

git checkout 1aa18ef

That commit corresponds to this release from Jul 25.

Then run the build again.
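Combined with the Metal build steps from the gist, that is roughly:

git checkout 1aa18ef
make clean
LLAMA_METAL=1 make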

@smart-patrol

smart-patrol commented Sep 15, 2023

Thanks for the above.

I was running into an error:

error loading model: failed to open --color: No such file or directory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '--color'
main: error: unable to load model

Deleted everything and then ran:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git reset --hard 1aa18ef

Then ran the rest of gist and it worked again.

@neoneye

neoneye commented Sep 23, 2023

Yeah, the latest llama.cpp is no longer compatible with GGML models. The new model format, GGUF, was merged recently. As far as llama.cpp is concerned, GGML is now dead.

https://huggingface.co/TheBloke/vicuna-13B-v1.5-16K-GGML/discussions/6#64e5ba63a9a5eabaa6fd4a04

Replacing the GGML model with a GGUF model
https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q8_0.gguf
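To fetch it from the command line, swap blob for resolve in the URL above (a sketch):

wget "https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q8_0.gguf"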

You can check if it works:

PROMPT> ./main -m models/llama-2-7b-chat.Q8_0.gguf --random-prompt
snip lots of info
response to the prompt
After years of hard work and dedication, a high school teacher in Texas has been recognized for her outstanding contributions to education.
Ms. Rodriguez, a mathematics teacher at...

@data-octo


While having a simple chat, I got a segmentation fault. What happened, and how can I prevent it?

How is the chat UI implemented? Thanks!

@ap247

ap247 commented Nov 7, 2023

Does anybody know how to adjust the prompt input to include multiple lines of input before submitting the prompt?

@therumham

therumham commented Nov 21, 2023

Thanks for the above.

I was running into an error:

error loading model: failed to open --color: No such file or directory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '--color'
main: error: unable to load model

Deleted everything and then ran:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git reset --hard 1aa18ef

Then ran the rest of gist and it worked again.

It seems to have worked once but now keeps failing. Any ideas why, @smart-patrol?

Prompt: 
How large is the sun?
main: build = 904 (1aa18ef)
main: seed  = 1700587479
error loading model: failed to open --color: No such file or directory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '--color'
main: error: unable to load model

@bhadreshvk

same issue
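For anyone still hitting this: a likely cause is that ${MODEL} is unset in the current shell session, so the empty word disappears and --model swallows --color as its argument, which is exactly what the error text shows. A sketch of the fix is to re-export the model path (quoted) before running ./main:

export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
./main --threads 8 --n-gpu-layers 1 --model "${MODEL}" --color --ctx-size 2048 --prompt "[INST] Hi [/INST]"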
