llama.cpp with Mistral using NVIDIA GPUs and CUDA

Step 1 Download and Build llama.cpp

Prerequisites: the CUDA toolkit (check with nvcc --version) and an NVIDIA GPU with its driver installed (check with nvidia-smi)

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1

See also https://stackoverflow.com/a/78165019/429476 regarding g++ errors

Step 2 Download the desired GGUF model

Note: What is GGUF? https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/

From https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF

Select the model you want to run https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF#provided-files

I am using a medium-sized model.

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ------------ | ---- | ---- | ---------------- | -------- |
| mistral-7b-instruct-v0.2.Q4_K_M.gguf | Q4_K_M | 4 | 4.37 GB | 6.87 GB | medium, balanced quality - recommended |
GIT_LFS_SKIP_SMUDGE=1 git clone  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
cd Mistral-7B-Instruct-v0.2-GGUF
ls
# select and pull whichever model you are interested in
git lfs pull --include mistral-7b-instruct-v0.2.Q4_K_M.gguf
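Alternatively, if you prefer not to use git-lfs, a single GGUF file can be fetched with the huggingface_hub Python package. A minimal sketch, assuming huggingface_hub is installed; the local_dir value is just an illustration:

# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# download only the Q4_K_M file instead of cloning the whole repo
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    local_dir="Mistral-7B-Instruct-v0.2-GGUF",  # assumed target directory
)
print(model_path)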

Step 3 Run

ssd/llama.cpp$ ./main -m ~/coding/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8 -ngl 10000  --multiline-input

Note that I am using the --multiline-input option. The -ngl flag controls how many layers are offloaded to the GPU; a large value such as 10000 offloads all of them.

That's it. Note that ./main is using about 4.5 GB of GPU memory (4560 MiB in the nvidia-smi output below).

 nvidia-smi
Fri Mar 15 14:26:56 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   52C    P8              13W /  80W |   4932MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      5901      G   /usr/lib/xorg/Xorg                          210MiB |
|    0   N/A  N/A      6070      G   /usr/bin/gnome-shell                        124MiB |
|    0   N/A  N/A     15150      G   ...rker,SpareRendererForSitePerProcess       24MiB |
|    0   N/A  N/A     39497      C   ./main                                     4560MiB |
+---------------------------------------------------------------------------------------+

Running it as a Server

./server -m /home/alex/coding/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf  --ctx_size 2048 -n -1 -b 256  -t 8 -ngl 10000  --host localhost --port 8080

The server can be accessed at

http://localhost:8080/
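To sanity-check the server outside the web UI, you can also POST to its /completion endpoint from Python. A minimal sketch with the requests package; the prompt and n_predict values are just examples:

# pip install requests
import requests

# ask the llama.cpp server for a short completion
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "What is a GGUF file?", "n_predict": 128},
)
print(resp.json()["content"])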

Thanks to this post: https://www.xzh.me/2023/09/serving-llama-2-7b-using-llamacpp-with.html

llama.cpp with AutoGen

The above server binding is not OpenAI API compatible. Since AutoGen needs an OpenAI-compatible endpoint, we will install the Python binding for llama.cpp (llama-cpp-python), whose server is OpenAI compatible:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python # for cuBLAS (BLAS via CUDA)

I have multiple gcc versions installed and had to pin the compiler to get past a DSO error:
CC=gcc-12 CXX=g++-12 CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

Install the other dependencies it asks for via pip (pydantic, starlette, etc.)

Run the server

python3 -m llama_cpp.server --model /home/alex/coding/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf   --host localhost --port 8080
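
Before pointing AutoGen at it, you can verify the OpenAI-compatible endpoint with the openai Python client. A minimal sketch; the model name is only informational here, since the server serves whatever file was passed with --model:

# pip install openai
from openai import OpenAI

# the local llama-cpp-python server exposes the OpenAI API under /v1
client = OpenAI(base_url="http://localhost:8080/v1", api_key="NULL")

resp = client.chat.completions.create(
    model="mistral-instruct-v0.2",  # arbitrary label for the local server
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp.choices[0].message.content)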

Testing with AutoGen

Install AutoGen (pip install pyautogen)

Write a small AutoGen script:

import autogen

config_list = [
    {
        "model":"mistral-instruct-v0.2",
        "base_url": "http://localhost:8080/v1",
        "api_key":"NULL"
    },
]


llm_config = {
    "cache_seed": 442,  # seed for caching and reproducibility
    "config_list": config_list,  # a list of OpenAI API configurations
    "temperature": 0,  # temperature for sampling
    #"timeout": 600,
 }
# create an AssistantAgent named "assistant" via OpenAI
assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config=llm_config,
)

# create a UserProxyAgent instance named "user_proxy"
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="TERMINATE",
    max_consecutive_auto_reply=10,
    is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
    code_execution_config={
        "work_dir": "web",
        "use_docker": False,
    },
    llm_config=llm_config,
    system_message="""Reply TERMINATE if the task has been solved at full satisfaction.
    Otherwise, reply CONTINUE, or the reason why the task is not solved yet.""",
)

# the assistant receives a message from the user_proxy, which contains the task description
user_proxy.initiate_chat(
    assistant,
    message="Check the web for latest financial news and summarise the chance of US Fed raising interest rates this month",
)
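
Save the script as, say, autogen_mistral.py (any file name will do) and run it with python3 while the llama_cpp.server process from the previous step is still listening on port 8080.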


Output is generated, but AutoGen is not using tools such as a web browser; it is relying on the LLM's training data instead. This needs further investigation.

assistant (to user_proxy):

 According to various financial news outlets, including The Wall Street Journal, Reuters, and Bloomberg, there is a strong likelihood that the US Federal Reserve (Fed) will raise interest rates by 0.25 percentage points at its two-day policy meeting ending on March 16, 2023. This would mark the first rate hike since 2018.

The consensus among economists and market analysts is that the Fed will increase rates in response to rising inflation, which reached 6.0% in February, well above the central bank's 2% target. Additionally, the labor market remains tight, with the unemployment rate at 3.6%, near a 50-year low.

However, some analysts have suggested that the Fed might hold off on a rate hike if inflation data for March comes in weaker than expected. Nevertheless, most experts believe that the Fed will begin its rate hike cycle this month to keep inflation in check and maintain its credibility.

It's important to note that the Fed's decision will depend on various factors, including economic data releases, geopolitical developments, and market conditions. Therefore, while the odds of a rate hike are high, they are not 100%, and the final decision will be announced on March 16.

--------------------------------------------------------------------------------

>>>>>>>> USING AUTO REPLY...
please check the lat
user_proxy (to assistant):

 In summary, based on current information from reputable financial news sources, there is a high probability that the US Federal Reserve will raise interest rates by 0.25 percentage points at its upcoming policy meeting on March 16, 2023. This decision is largely due to rising inflation and a tight labor market. However, some analysts suggest that a weaker-than-expected inflation reading in March could cause the Fed to hold off on a rate hike. Ultimately, the final decision will depend on various economic and geopolitical factors.