
Multiple Ollama Containers on a single host (with multiple GPUs)

I don't want model RELOAD

  • I have a large machine with 2 GPUs and a considerable amount of RAM.
  • I was trying to use Ollama to serve llava and mistral, but it would reload the models every time I switched between them.
  • So this is the solution that appears to be working: multiple containers, each serving a different model on a different port.

Ollama model working dir:

  • I already have many models downloaded on my machine, so I mount the host Ollama working dir into the containers.
  • On Linux (at least on my Linux machine): /usr/share/ollama/.ollama
  • In the container: /root/.ollama
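
If you want to confirm the host directory actually holds your models before mounting it, a quick check like this works (the path is from my setup; on other installs it may be ~/.ollama instead):

    ls /usr/share/ollama/.ollama/models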

The Docker Run Command:

  • docker run -e CUDA_VISIBLE_DEVICES=0 -d --gpus=all -v /usr/share/ollama/.ollama:/root/.ollama -p 11434:11434 --name ollama00 ollama/ollama
  • -e is the flag to set an environment variable in the container.
  • CUDA_VISIBLE_DEVICES=0 locks this container down to the first GPU.
  • --gpus=all exposes all GPUs to the container, but it is still limited by CUDA_VISIBLE_DEVICES=0.
  • -v mounts a volume as HOST:CONTAINER. For me, /usr/share/ollama/.ollama is where I already have a ton of models downloaded, and I don't want to download them again inside the container.
  • -p maps the port as HOST:CONTAINER. It is important to change the host port number if you have multiple containers serving Ollama.
  • --name give it a good name; in my case ollama00, just so I know it's the one on GPU 0.
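
Once the container is up, a quick request confirms it is serving (this assumes you already have mistral in your model directory; swap in whatever model you have):

    curl http://localhost:11434/api/generate -d '{
      "model": "mistral",
      "prompt": "Say hello in five words.",
      "stream": false
    }'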

And my second container:

  • docker run -e CUDA_VISIBLE_DEVICES=1 -d --gpus=all -v /usr/share/ollama/.ollama:/root/.ollama -p 11435:11434 --name ollama01 ollama/ollama
  • Change CUDA_VISIBLE_DEVICES to 1 for my second GPU.
  • Give it a different host port number (HOST:CONTAINER).
  • Give it a different name, in this case ollama01 for my second GPU.
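
With both containers running, docker ps should show each one bound to its own host port, ollama00 on 11434 and ollama01 on 11435:

    docker ps --format 'table {{.Names}}\t{{.Ports}}'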

Bonus CPU-only container (it is definitely not as fast, but it works, and I am up to three models that can be called without 'reload'):

  • docker run -d -v /usr/share/ollama/.ollama:/root/.ollama -p 11436:11434 --name ollama03 ollama/ollama
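
With three endpoints up, a small wrapper can route each request to the container already holding that model, so nothing ever reloads. The model-to-port mapping below is just an illustration (I haven't said which model lives on which GPU); adjust models, names, and ports to your setup:

    #!/usr/bin/env bash
    # ask MODEL PROMPT - send the prompt to whichever container serves that model.
    ask() {
      local model="$1" prompt="$2" port
      case "$model" in
        mistral) port=11434 ;;   # ollama00, GPU 0 (assumed mapping)
        llava)   port=11435 ;;   # ollama01, GPU 1 (assumed mapping)
        *)       port=11436 ;;   # ollama03, CPU-only fallback
      esac
      curl -s "http://localhost:${port}/api/generate" \
        -d "{\"model\": \"${model}\", \"prompt\": \"${prompt}\", \"stream\": false}"
    }

    ask mistral "Why is the sky blue?"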
@RockAfeller2013:
What are your computer specs? How do you run two GPUs, and does the OS recognise them?

@jrknox1977 (Author):
@RockAfeller2013
It's an i9 with 128G of RAM and two 3090s, on a gaming motherboard that supports 2 GPUs.
Ubuntu 22 is the OS I am running.
