Mistral w/ vLLM. Run on an RTX 3090

Testing out mistralai/Mistral-7B-Instruct-v0.2 through vLLM and documenting the very basics of making an API request. Runs through Docker. Requires the NVIDIA Container Toolkit.

# https://docs.mistral.ai/self-deployment/vllm/
export HF_TOKEN=<Huggingface Token>
docker run --gpus all \
-e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
ghcr.io/mistralai/mistral-src/vllm:latest \
--host 0.0.0.0 \
--model mistralai/Mistral-7B-Instruct-v0.2
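
The model weights take a while to download and load, so it can help to confirm the server is actually ready before sending a chat request. A minimal sketch, assuming vLLM's OpenAI-compatible server exposes the usual /v1/models endpoint on the same port (the retry loop and timing are just illustrative):

import time
import requests

# Poll the OpenAI-compatible models endpoint until the server responds.
for attempt in range(30):
    try:
        r = requests.get("http://localhost:8000/v1/models", timeout=5)
        if r.ok:
            # Should list mistralai/Mistral-7B-Instruct-v0.2 once loading finishes.
            print(r.json())
            break
    except requests.ConnectionError:
        pass
    time.sleep(10)
else:
    raise RuntimeError("Server never became ready")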
from pathlib import Path
import requests
lyrics = Path("lyrics.txt").read_text()
message = f"""Interpret the core message of these lyrics:
{lyrics}
"""
data = {"messages" : [{"role" : "user", "content" : message}], "model": "mistralai/Mistral-7B-Instruct-v0.2"}
# Note: Neither the vLLM docs nor the Mistral docs mention that you need the `/v1` in the URL.
# You'll see a lot of {"detail": "Not Found"} responses without it.
r = requests.post("http://localhost:8000/v1/chat/completions", json=data).json()
print(r['choices'][0]['message']['content'])
# The core message of these lyrics appears to be about the speaker's experience of being in a relationship with someone who has hurt or confused them, and their struggle to decide whether to continue investing their emotions in the relationship or to let go and move on. The speaker expresses their desire for the other person to be open and expressive in their feelings, as they have been trying to be more forgiving and patient. However, they also acknowledge that they have been getting better at letting go of things and not getting too attached, particularly if the other person is not reciprocating their feelings or behavior is inconsistent. The speaker ultimately expresses their reluctance to leave the relationship but also their determination to protect themselves from unnecessary pain. Overall, the lyrics suggest a complex emotional landscape of love, hurt, and ambivalence.
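
The same request also works through the OpenAI Python client, since the vLLM server speaks the OpenAI chat completions API. A sketch, assuming the `openai` package (v1+) is installed; the API key is required by the client but isn't checked by this local server:

from pathlib import Path
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
# The api_key value is arbitrary here (assumption: no key enforcement on this server).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

lyrics = Path("lyrics.txt").read_text()
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {"role": "user", "content": f"Interpret the core message of these lyrics:\n{lyrics}"}
    ],
)
print(response.choices[0].message.content)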
INFO: 172.17.0.1:45074 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 01-06 20:33:24 async_llm_engine.py:379] Received request cmpl-0493039beb574618be780e6235452a46: prompt: '<s>[INST] Interpret the core message of these lyrics.\n\nLyrics:\nGot so hung up\nOn something you said\nI should’ve guessed \nthat you would mess \nwith my head\n\nYou got up \nand I stayed in bed\nI was about to say something\nSaid nothing instead \n\nGetting good at letting things go\nBut you’re somebody I want to know\nSo if you love me than let it show\nCuz I’ve been getting good at letting things go\n\nI’ve got a feeling\nYou could prove me wrong\nA feeling that I haven’t felt in so long\nI can be patient\nI can play along\nForgive as fast I forget you, \nSo don’t make me have to move on\n\nGetting good at letting things go\nBut you’re somebody I want to know\nSo if you love me than let it show\nCuz I’ve been getting good at letting things go\n\nDon’t want to leave so I’m letting you know\nThat I’ve been getting good at letting things go done [/INST]', sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], ignore_eos=False, max_tokens=32522, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: [1, 1, 733, 16289, 28793, 4287, 5520, 272, 6421, 2928, 302, 1167, 22583, 28723, 13, 13, 28758, 19591, 28747, 13, 28777, 322, 579, 7342, 582, 13, 2486, 1545, 368, 773, 13, 28737, 1023, 28809, 333, 26415, 28705, 13, 6087, 368, 682, 4687, 28705, 13, 3415, 586, 1335, 13, 13, 1976, 1433, 582, 28705, 13, 391, 315, 10452, 297, 2855, 13, 28737, 403, 684, 298, 1315, 1545, 13, 28735, 3439, 2511, 3519, 28705, 13, 13, 1458, 1157, 1179, 438, 12815, 1722, 576, 13, 2438, 368, 28809, 267, 12421, 315, 947, 298, 873, 13, 5142, 513, 368, 2016, 528, 821, 1346, 378, 1347, 13, 28743, 3533, 315, 28809, 333, 750, 2719, 1179, 438, 12815, 1722, 576, 13, 13, 28737, 28809, 333, 1433, 264, 4622, 13, 1976, 829, 7674, 528, 3544, 13, 28741, 4622, 369, 315, 6253, 28809, 28707, 2770, 297, 579, 1043, 13, 28737, 541, 347, 7749, 13, 28737, 541, 1156, 2267, 13, 28765, 1909, 495, 390, 4102, 315, 7120, 368, 28725, 28705, 13, 5142, 949, 28809, 28707, 1038, 528, 506, 298, 2318, 356, 13, 13, 1458, 1157, 1179, 438, 12815, 1722, 576, 13, 2438, 368, 28809, 267, 12421, 315, 947, 298, 873, 13, 5142, 513, 368, 2016, 528, 821, 1346, 378, 1347, 13, 28743, 3533, 315, 28809, 333, 750, 2719, 1179, 438, 12815, 1722, 576, 13, 13, 6017, 28809, 28707, 947, 298, 3530, 579, 315, 28809, 28719, 12815, 368, 873, 13, 3840, 315, 28809, 333, 750, 2719, 1179, 438, 12815, 1722, 576, 2203, 733, 28748, 16289, 28793].
INFO 01-06 20:33:24 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.1%, CPU KV cache usage: 0.0%
INFO 01-06 20:33:27 async_llm_engine.py:111] Finished request cmpl-0493039beb574618be780e6235452a46.
INFO: 172.17.0.1:55126 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Lyrics:
Got so hung up
On something you said
I should’ve guessed
that you would mess
with my head
You got up
and I stayed in bed
I was about to say something
Said nothing instead
Getting good at letting things go
But you’re somebody I want to know
So if you love me than let it show
Cuz I’ve been getting good at letting things go
I’ve got a feeling
You could prove me wrong
A feeling that I haven’t felt in so long
I can be patient
I can play along
Forgive as fast I forget you,
So don’t make me have to move on
Getting good at letting things go
But you’re somebody I want to know
So if you love me than let it show
Cuz I’ve been getting good at letting things go
Don’t want to leave so I’m letting you know
That I’ve been getting good at letting things go done
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:09:00.0 On | N/A |
| 0% 27C P8 24W / 420W | 17696MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1423 G /usr/lib/xorg/Xorg 161MiB |
| 0 N/A N/A 1689 G /usr/bin/gnome-shell 46MiB |
| 0 N/A N/A 12701 G /usr/lib/firefox/firefox 0MiB |
| 0 N/A N/A 16374 C python3 17306MiB |
+---------------------------------------------------------------------------------------+

Logs from longer request:

INFO 01-06 21:17:01 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.2%, CPU KV cache usage: 0.0%
INFO 01-06 21:17:06 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.4%, CPU KV cache usage: 0.0%
INFO 01-06 21:17:11 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 9.5%, CPU KV cache usage: 0.0%
INFO 01-06 21:17:16 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 10.5%, CPU KV cache usage: 0.0%
INFO 01-06 21:17:21 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.7%, CPU KV cache usage: 0.0%
INFO 01-06 21:17:26 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 48.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 12.7%, CPU KV cache usage: 0.0%
INFO 01-06 21:17:29 async_llm_engine.py:111] Finished request cmpl-fb97369e399f40e1a77058aff71b829e.
