Notes on self-hosting Generative AI LLM models with Ollama

Ollama Notes

Ollama is an open-source tool that allows self-hosting Large Language Models (LLMs). Normally these LLMs are not feasible to deploy on consumer hardware; however, Ollama optimizes the models - removing unused layers, rounding weights to lower precision (quantization) to reduce model size, etc.

Note: Some details about the Ollama service are Linux-specific, but most things are the same on all platforms.

On Linux, Ollama can be installed using the command curl -fsSL https://ollama.com/install.sh | sh. Verify a successful install by running ollama --version.

Ollama Server

The above Linux install command also starts the Ollama service in the background using systemd, which will automatically restart ollama if it crashes or the system reboots. Since systemd runs as root, the Ollama service it starts is also owned by root. If you don't want this, you can stop the Ollama service using sudo systemctl disable ollama --now and instead start the Ollama service in a terminal in user-space using ollama serve.
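The service can be managed with the usual systemctl commands (standard systemd usage, nothing Ollama-specific):

$ sudo systemctl status ollama           # check whether the Ollama service is running
$ sudo systemctl restart ollama          # restart the Ollama service
$ sudo systemctl enable ollama --now     # re-enable & start the service if it was disabled earlier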

By default, the Ollama server listens on port 11434. Various settings (eg. host & port) of the Ollama server can be modified by setting environment variables. For example, when Ollama is deployed on a multi-GPU server, setting the NCCL_P2P_LEVEL=NV environment variable may boost performance (it speeds up inter-GPU communication by bypassing the CPU).
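For example, the listen address / port can be changed via the OLLAMA_HOST environment variable (documented in Ollama's FAQ). A rough sketch: when running ollama serve in user-space, set it inline; for the systemd service, add it in an override file via systemctl edit and restart the service:

$ OLLAMA_HOST=0.0.0.0:11434 ollama serve       # user-space: listen on all interfaces
$ sudo systemctl edit ollama                   # systemd: opens an override file - add the two lines below in it
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
$ sudo systemctl daemon-reload && sudo systemctl restart ollama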

When Ollama is running as a systemd service (the default), its logs can be viewed with the journalctl command:

$ sudo journalctl -u ollama --boot         # Logs of Ollama service since boot, interactive (like less and man commands)
$ sudo journalctl -u ollama --boot > ollama_log.txt      # save logs of Ollama service to text file
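To watch the logs live while sending requests, the standard follow flag of journalctl works too:

$ sudo journalctl -u ollama -f             # follow Ollama service logs live (Ctrl+C to stop)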

Note: Ollama server must be running for all further Ollama commands (eg. pull, run, chat api, etc.) to work.
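A quick way to check that the server is up is to hit its root endpoint, which normally replies with a short plain-text status message:

$ curl http://localhost:11434
Ollama is running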

Downloading LLM models in Ollama

See all available models & their info (eg. no. of model parameters) here. Models like mistral, llama3, etc. can be run like this (put the actual model name in place of MODEL):

$ ollama run MODEL
>>> 

This first downloads the model, and then opens a chat REPL where you can chat with it. You can also download a model without immediately opening the chat REPL by running ollama pull MODEL.

You can list all downloaded models using ollama ls, and delete a downloaded model using ollama rm MODEL. Memory / GPU usage of currently loaded models can be seen using ollama ps:

$ ollama ps
NAME            ID              SIZE    PROCESSOR       UNTIL
llama3:latest   365c0bd3c000    5.4 GB  100% GPU        2 minutes from now

Run ollama --help to see all Ollama CLI commands.

Note: When Ollama runs as a root-owned systemd service (the default), downloaded models are stored at /usr/share/ollama/.ollama/models/. On the other hand, models are stored at ~/.ollama/models/ when running ollama serve in user-space.

Ollama Context Size / Token Limit

By default, the context size in Ollama is 2048 tokens - see this for how to change the context size. Each model also has its own maximum context size - for example, Llama 3 models support at most 8192 tokens. Model accuracy generally degrades for longer prompts / contexts, so make sure to test and find the maximum number of tokens for which the model you're using responds accurately to your prompt.
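One way to change the default (a sketch - the custom model name llama3-8k is just an illustrative choice): bake the context size into a custom model using a Modelfile with PARAMETER num_ctx, or pass num_ctx per-request in the API options shown later.

$ cat Modelfile
FROM llama3
PARAMETER num_ctx 8192
$ ollama create llama3-8k -f Modelfile     # new local model with an 8192-token context window
$ ollama run llama3-8k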

Note: The number of Llama 3 tokens in a prompt can be checked at this website.

Chatting with LLM models in Ollama

The examples below show a few different ways of chatting with a model via the Ollama server. A max time of 20 seconds has been specified in each of them, to prevent indefinite hanging if the Ollama server doesn't respond for some reason (this usually happens when inference on a model is done for the first time, since the model takes a few minutes to load).

In the options dict, temperature 0 has been set for reproducible model output. Note that temperature 0 reduces non-determinism but doesn't eliminate it entirely - so it's still possible to sometimes get 2 different outputs from an LLM for the same prompt. See all options here - eg. num_ctx (context size), etc.

  • Direct network request to /api/chat (streaming is true by default):
$ curl http://localhost:11434/api/chat -H "Content-Type: application/json" --max-time 20 -d '{
  "model": "MODEL",
  "messages": [
    { "role": "user", "content": "PROMPT" }
  ]
}'
{"model":"MODEL","created_at":"2024-05-15T07:26:52.58290265Z","message":{"role":"assistant","content":" reason"},"done":false}
{"model":"MODEL","created_at":"2024-05-15T07:26:52.621422361Z","message":{"role":"assistant","content":" the"},"done":false}
...
  • Direct network request to /api/chat (with streaming false):
$ curl http://localhost:11434/api/chat -H "Content-Type: application/json" --max-time 20 -d '{
  "model": "MODEL",
  "messages": [
    { "role": "user", "content": "PROMPT" }
  ],
  "options": {
    "seed": 123,
    "temperature": 0      
  },
  "stream": false
}'
{"model":"MODEL","created_at":"2024-05-27T10:58:54.341293172Z","message":{"role":"assistant","content":"MODEL_PROMPT_RESPONSE"},"done_reason":"stop","done":true,"total_duration":41088402065,"load_duration":31246191223,"prompt_eval_count":15,"prompt_eval_duration":239390000,"eval_count":405,"eval_duration":9580648000}
  • Chat using the ollama Python library:
import ollama

# host is the base server URL (not the /api/chat path); timeout in seconds
client = ollama.Client(host='http://localhost:11434', timeout=20)
response = client.chat(
  model='MODEL',
  messages=[
     { 'role': 'user', 'content': 'PROMPT' }
  ],
  options={
    "seed": 123,
    "temperature": 0
  },
  # stream=True
)
print(response['message']['content'])

NOTE: Here, streaming is False by default (unlike directly calling Ollama chat api).

  • Streaming using Python requests:
import json
import requests

api_url = 'http://localhost:11434/api/chat'
payload = {
  "model": "MODEL",
  "messages": [
    { "role": "user", "content": "PROMPT" }
  ],
  "options": {
    "seed": 123,
    "temperature": 0
  },
  "stream": True       # streaming is the default for the chat api anyway
}
with requests.post(api_url, json=payload, timeout=20, stream=True) as resp:
    for line in resp.iter_lines():
        if not line:                  # skip any blank keep-alive lines
            continue
        chunk = json.loads(line)
        if 'error' in chunk:
            raise Exception(f'Got error from Ollama chat api: {chunk["error"]}')
        if chunk["done"]:             # final chunk: no more content, only stats
            break
        print(chunk['message']['content'], end='', flush=True)
TODO: Try out these interesting Ollama options (a sketch using a couple of them follows the list below): https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-chat-completion

  • format=json
  • keep_alive (how long model is kept in memory, default is 5m = 5 minutes)
  • system
  • template
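As a hedged sketch of the first two (field names as per the linked API docs; MODEL / PROMPT are placeholders as before): format "json" constrains the response to valid JSON (it helps to also ask for JSON in the prompt), and keep_alive controls how long the model stays loaded in memory after the request.

$ curl http://localhost:11434/api/chat -H "Content-Type: application/json" --max-time 20 -d '{
  "model": "MODEL",
  "messages": [
    { "role": "user", "content": "PROMPT - respond in JSON" }
  ],
  "format": "json",
  "keep_alive": "10m",
  "stream": false
}'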
