
Running GitHub Copilot VSCode extension against local Code Llama model


Tested on NVIDIA RTX 4090, but these instructions also cover AMD and Mac in case you wanna try those.
This guide assumes you are running Linux (I ran this on Ubuntu).

Before you get excited:
These instructions currently require you to have genuine GitHub Copilot access.
So, this is currently only useful if you "have Copilot, but want to try an open model instead".
I believe there will be a way to skip authentication and run against a local model, but this will take some digging from the community.
Perhaps the VSCode GitHub Copilot extension's client code (~/.vscode-insiders/extensions/github.copilot-1.111.414/dist/extension.js) can be modified to skip authentication when connecting to a local, non-GitHub API server.
You could also check out Continue.dev or HF Code Autocomplete.

If you do have genuine GitHub Copilot access already, then read on! Let's replace Copilot with an open model.

Serve Code Llama locally via ialacol

We will install & run ialacol, a Python program that serves an OpenAI-compatible API webserver.
Afterwards, we will configure the GitHub Copilot VSCode extension to talk to our local API instead of to GitHub/OpenAI.

Ialacol serves an LLM locally via ctransformers + GGML.

Create + activate a new virtual environment

This is to avoid interfering with your current Python environment (other Python scripts on your computer might not appreciate it if you update a bunch of packages they were relying on).

Follow the instructions for venv, or conda, or neither (if you don't care what happens to other Python scripts on your computer).

Using venv

Create environment:

python -m venv venv

Activate environment:

. ./venv/bin/activate

(First-time) update environment's pip:

pip install --upgrade pip
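
Optional sanity check: confirm the environment is active; the python on your PATH should now resolve inside ./venv:

# should print a path inside ./venv
which python
python --version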

Using conda

Download conda.

Skip this step if you already have conda.

Install conda:

Skip this step if you already have conda.

Assuming you're using a bash shell:

# Linux installs Anaconda via this shell script. Mac installs by running a .pkg installer.
bash Anaconda-latest-Linux-x86_64.sh
# this step probably works on both Linux and Mac.
eval "$(~/anaconda3/bin/conda shell.bash hook)"
conda config --set auto_activate_base false
conda init

Create environment:

conda create -n p311-ialacol python=3.11

Activate environment:

conda activate p311-ialacol

Get the ialacol source code

git clone https://github.com/chenhunghan/ialacol.git
cd ialacol

Install dependencies

Ensure that you have first activated your Python environment, as above, and are inside the ialacol directory.

pip install -r requirements.txt

Install platform-specific dependencies

We will install ctransformers with hardware acceleration for our platform.

NVIDIA/CUDA:

pip install ctransformers[cuda]

AMD/ROCm:

CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers

Mac/Metal:

CT_METAL=1 pip install ctransformers --no-binary ctransformers
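
Whichever platform you installed for, you can optionally check that ctransformers imports cleanly (this only verifies the package installed; it doesn't prove GPU acceleration is active):

python -c "import ctransformers; print('ctransformers imported OK')"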

Run ialacol

The server will listen on http://localhost:8001 and will serve TheBloke/CodeLlama-7B-Python-GGUF's Q5_K_M distribution (5-bit quantization).

I am using GPU_LAYERS=48 to put the whole model on-GPU. TheBloke/CodeLlama-7B-Python-GGUF has 32 layers (as do other 7B Llama models); larger models such as CodeLlama-34B-Python have 48 layers, so GPU_LAYERS=48 is enough for either.
If you find you don't have enough VRAM: reduce GPU_LAYERS.

LOGGING_LEVEL=DEBUG DEFAULT_MODEL_HG_REPO_ID=TheBloke/CodeLlama-7B-Python-GGUF DEFAULT_MODEL_FILE=codellama-7b-python.Q5_K_M.gguf GPU_LAYERS=48 uvicorn main:app --host 127.0.0.1 --port 8001
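
Once it's up, you can sanity-check the API from another terminal. A minimal sketch using curl, assuming ialacol's standard OpenAI-compatible routes (the /v1/models route is also used later in this guide for troubleshooting):

# list the model(s) being served
curl http://localhost:8001/v1/models

# request a short completion; the model name should match DEFAULT_MODEL_FILE
curl http://localhost:8001/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "codellama-7b-python.Q5_K_M.gguf", "prompt": "def fibonacci(n):", "max_tokens": 32}'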

Note: you will want to come back here later! Once you have Copilot working against this small model, you can terminate this and relaunch it with a bigger model such as CodeLlama-34B-Python (DEFAULT_MODEL_HG_REPO_ID=TheBloke/CodeLlama-34B-Python-GGUF DEFAULT_MODEL_FILE=codellama-34b-python.Q4_K_M.gguf).

Did it launch correctly? If so: that's good, but we're not done yet!
If we connect the VSCode GitHub Copilot extension to this thing: it will spam it with requests, and it might segfault. We need a queue.

Note: if you're not worried about the queue just yet, you can skip the "Set up text-inference-batcher" section and go straight to the "Set up GitHub Copilot VSCode extension" section. Just be sure to configure the VSCode extension to connect to ialacol directly on port 8001, instead of to text-inference-batcher on port 8000.

Set up text-inference-batcher

text-inference-batcher is a load-balancer / queue, which will protect our ialacol webserver from the frequent requests that the VSCode Copilot extension sends.

text-inference-batcher requires Node 20 or higher.
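
If you already have Node installed, check whether it's new enough before reinstalling:

# text-inference-batcher needs v20+
node --version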

Installing NodeJS on Linux:
To install NodeJS on Ubuntu/Debian, see nodesource/distributions instructions. Here's a copy of what it says, for your convenience:

# Download and import the Nodesource GPG key
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | sudo gpg --dearmor -o /etc/apt/keyrings/nodesource.gpg

# Create deb repository
NODE_MAJOR=20
echo "deb [signed-by=/etc/apt/keyrings/nodesource.gpg] https://deb.nodesource.com/node_$NODE_MAJOR.x nodistro main" | sudo tee /etc/apt/sources.list.d/nodesource.list

# Update repository sources and install
sudo apt-get update
sudo apt-get install nodejs -y

Get the text-inference-batcher source code

git clone https://github.com/ialacol/text-inference-batcher.git
cd text-inference-batcher

Install dependencies

npm i

Run text-inference-batcher

This will run text-inference-batcher on http://localhost:8000.
text-inference-batcher will forward requests to http://localhost:8001, where our ialacol webserver is running.

UPSTREAMS="http://localhost:8001" npm start
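
If you want a quick check that the batcher can reach ialacol, you can try the same models route through it (this assumes text-inference-batcher passes /v1/models through to its upstreams; if it 404s, just watch the two terminals instead):

curl http://localhost:8000/v1/models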

Set up GitHub Copilot VSCode extension

Ensure the GitHub Copilot VSCode extension is installed:

[screenshot]

Connect GitHub Copilot VSCode extension to your GitHub account

Follow the modal dialogues to connect the GitHub Copilot VSCode extension to your GitHub account.

[screenshot]

Once you are logged in, open the command palette (Ctrl+Shift+P) and run the "Reload window" command:

[screenshot]

Once the window reloads, you may see "GitHub Copilot could not connect to the server" + "No access to GitHub Copilot found". If you see this, I believe that's game over. Yes, even though we don't actually want to connect to GitHub. So far I have found that if you want the extension to finish setup and start directing traffic to your configured local LLM, you first need to authenticate with genuine Copilot. Internet people: tell me if you find a way to get past this step without having Copilot access! Bonus points if it can be done without connecting a GitHub account.

[screenshot]

Configure Copilot VSCode extension to contact your ialacol webserver

Open your VSCode settings:

[screenshot]

Add a github.copilot.advanced config, telling the extension to connect to http://127.0.0.1:8000, where our text-inference-batcher queue is listening:

    "github.copilot.advanced": {
        "debug.overrideEngine": "codellama-7b-python.Q5_K_M.gguf",
        "debug.testOverrideProxyUrl": "http://127.0.0.1:8000",
        "debug.overrideProxyUrl": "http://127.0.0.1:8000"
    },

Note: if you have any problems with connectivity, and nothing seems to be receiving traffic, you could change this to port 8001 to connect to ialacol directly, without going through our text-inference-batcher queue.

Now reload, to make the GitHub Copilot VSCode extension redo its setup process using the new config:

[screenshot]

You should see your Copilot in the corner:

[screenshot]

Trying it out

Open some code file, and start typing!

[screenshot]

Troubleshooting

Copilot isn't doing anything

Check the extension's Output

Errors and successes are both logged under Output:

[screenshot]

Check that ialacol is receiving traffic

When healthy, receiving traffic, and streaming tokens, the ialacol terminal looks like this:

[screenshots]

Check that text-inference-batcher is receiving traffic

When healthy and receiving traffic, text-inference-batcher looks like this:

[screenshot]

Check whether the VSCode extension is responsive

Click the Copilot button to check it's responsive.
If it's configured incorrectly, clicking will do nothing. But if it's configured correctly, it'll offer up a "disable" dialogue:

[screenshot]

Changing models

Check in your VSCode settings that "debug.overrideEngine": "codellama-7b-python.Q5_K_M.gguf" matches the DEFAULT_MODEL_FILE environment variable with which you launched ialacol (e.g. DEFAULT_MODEL_FILE=codellama-7b-python.Q5_K_M.gguf).
Visit http://localhost:8001/v1/models to check which model is currently being served by ialacol.
If you relaunch ialacol with a different model configured (see the example after this list):

  • You may need to restart text-inference-batcher (I'm not sure) to rediscover the available upstream models
  • You will need to change VSCode settings.json debug.overrideEngine to match your new DEFAULT_MODEL_FILE
  • You may need to Reload Window in VSCode (I'm not sure) to set up the extension again
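
For example, to switch to the 34B model mentioned earlier, relaunch ialacol with the repo/file swapped, then confirm what it's serving:

# relaunch ialacol with the bigger model
LOGGING_LEVEL=DEBUG DEFAULT_MODEL_HG_REPO_ID=TheBloke/CodeLlama-34B-Python-GGUF DEFAULT_MODEL_FILE=codellama-34b-python.Q4_K_M.gguf GPU_LAYERS=48 uvicorn main:app --host 127.0.0.1 --port 8001

# confirm which model ialacol is now serving
curl http://localhost:8001/v1/models

Then follow the rest of the list above: update debug.overrideEngine to the new filename and reload the window.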

VSCode Remote Development

Check which VSCode you configured (local or remote). You may see a clue in the filepath (I see from "Application Support" in the filepath that it is my local Mac I have configured, not my remote Linux machine, which serves the model):
[screenshot]

For my own personal setup: I have so far only managed to reconfigure Copilot in my local VSCode workspaces. I am not sure how to reconfigure it for remote VSCode workspaces.

Check which endpoint you have told the Copilot extension to connect to. "localhost" (or 127.0.0.1) here means your local machine's localhost, not the remote machine's localhost. So make sure the ports your remote processes are listening on are forwarded to your local machine (specifically port 8000, where text-inference-batcher listens):
[screenshot]
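
A minimal sketch using plain SSH local port forwarding (VSCode Remote's built-in port-forwarding UI works too; user@remote-host is a placeholder for your own machine):

# expose the remote machine's port 8000 (text-inference-batcher) as localhost:8000 on your local machine
ssh -L 8000:localhost:8000 user@remote-host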

Acknowledgements

Thanks Henry Chen for developing ialacol + text-inference-batcher and explaining how to set them up.
Thanks Parth Thakkar for reverse-engineering Copilot, to discover the protocol and how it's configured.

And ggml contributors, and developers of Copilot and the Copilot extension, and the ctransformers team, and TheBloke, and Meta / the Code Llama team…
