How to run StarCoderBase 1B SFT on a MacBook Pro with Apple Silicon

These are notes on how I managed to get the StarCoderBase-1B-SFT model converted into a quantized version so that it can run locally on my MBP M1 Pro and be queried through an OpenAI API-compatible server. StarCoderBase is a model trained/tuned for programming tasks. The 1B-parameter SFT model I am using in this article is a version of the model that has had supervised fine-tuning applied to it. I am just going to call it "StarCoder" in the rest of this article for the sake of simplicity. The number of parameters a model has affects resource usage, so a smaller version of the model is more practical for localhost use. On my machine this model uses around 3GB of RAM.

Full disclaimer: I did not make anything new here. I am just describing how I glued everything together. Full credit goes to the engineers who made all the various tools and graciously provided their work as open source. I believe that democratizing access to ML models is extremely important, especially now that models tend to be large and require computation power that only corporations can provide. Being able to run model inference on a personal machine ensures that some of this power is available to individuals.

Setup

I am running macOS Ventura (13.5) at the time of writing. I will assume that if you are reading this, you know how to use Homebrew and git. I will also assume that you know your way around your shell of choice.

Python

You will need to get Python installed. I used Miniconda to install Python and create an environment for the projects used here. After installing the tool and configuring your shell, you can create and activate an environment:

conda create -n python3-11 python=3.11
conda activate python3-11

If this is done correctly, python -V should output Python 3.11.4 or similar.

Golang

We will need to compile some C++ and Go code, so install these tools:

brew install cmake go

Create quantized model

We will first build ggml so that we can create a quantized version of the StarCoder model.

Clone the repo somewhere. My clone's HEAD was at 79ea7a4. The instructions I am giving here are altered from the repo's examples/starcoder quick start.
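
If you do not already have a clone, something like this should work (the URL is my assumption of the upstream ggerganov/ggml repository; checking out my exact commit is optional):

git clone https://github.com/ggerganov/ggml.git
git -C ggml checkout 79ea7a4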

cd ggml

# If you opened a new shell, make sure we are using the python env we created
conda activate python3-11

pip install -r requirements.txt

# This is going to download the 1B model from Hugging Face and convert the
# model file to a new format.
# This step created a file named "ggml-model.bin" on my machine
python examples/starcoder/convert-hf-to-ggml.py abacaj/starcoderbase-1b-sft

# You don't have to move or rename the model file, just doing this to align
# with the quick start instructions.
mkdir -p models/abacaj
mv ggml-model.bin models/abacaj/starcoderbase-1b-sft-ggml.bin

# Same build instructions from quick start
mkdir build && cd build
cmake .. && make -j4 starcoder starcoder-quantize

# Quantize the model. There is a trailing 2 at the end of the command.
# The "2" sets the quantization mode to q4_0. We use this mode so that we can
# use it with Metal later on.
./bin/starcoder-quantize ../models/abacaj/starcoderbase-1b-sft-ggml.bin ../models/abacaj/starcoderbase-1b-sft-ggml-q4_0.bin 2
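
# Optional sanity check: the q4_0 file should be substantially smaller than
# the converted ggml-model.bin it was created from.
ls -lh ../models/abacaj/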

At the end of this, you should have a quantized model. Once you have it, you can load it with the starcoder binary and start playing around with inference. For example:

./bin/starcoder -m ../models/abacaj/starcoderbase-1b-sft-ggml-q4_0.bin -p "def fizzbuzz(" --top_k 40 --top_p 0.95 --temp 0.2


main: seed = 1694432933
starcoder_model_load: loading model from '../models/abacaj/starcoderbase-1b-sft-ggml-q4_0.bin'
starcoder_model_load: n_vocab = 49153
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 2048
starcoder_model_load: n_head  = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype   = 2002
starcoder_model_load: qntvr   = 2
starcoder_model_load: ggml ctx size = 3894.60 MB
starcoder_model_load: memory size =  3072.00 MB, n_mem = 196608
starcoder_model_load: model size  =   822.46 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.

main: temp           = 0.200
main: top_k          = 40
main: top_p          = 0.950
main: repeat_last_n  = 64
main: repeat_penalty = 1.000
main: prompt: 'def fizzbuzz('
main: number of tokens in prompt = 5
main: token[0] =    589, def
main: token[1] =   8058,  fi
main: token[2] =   4669, zz
main: token[3] =  49075, buzz
main: token[4] =     26, (


def fizzbuzz(n):
    if n < 1:
        return "Invalid input"
    elif n < 10:
        return "1"
    elif n < 100:
        return "2"
    elif n < 1000:
        return "3"
    elif n < 10000:
        return "5"
    else:
        return "7"

for n in range(1, 10000):
    print(fizzbuzz(n))<|endoftext|>

main: mem per token =   335808 bytes
main:     load time =   404.87 ms
main:   sample time =   118.89 ms
main:  predict time =  1484.62 ms / 13.37 ms per token
main:    total time =  2072.40 ms

Cool. This will totally pass a tech interview.

Serve StarCoder through a REST API

Start by cloning the LocalAI repo. My clone's HEAD was at cc74fc9. We are going to build this project instead of using a Docker image.
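
For reference, this is roughly how I got the source (the URL assumes the project now lives at github.com/mudler/LocalAI; it previously lived under the go-skynet organization, so adjust if yours differs):

git clone https://github.com/mudler/LocalAI.git
git -C LocalAI checkout cc74fc9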

cd LocalAI

make BUILD_TYPE=metal build

mkdir -p models

# Move or copy the quantized model file created in the previous section into
# the models directory. The YAML config attached at the end of this gist
# expects it to be named abacaj-starcoderbase-1b-sft-ggml-q4_0.bin.
mv $THE_STARCODERBASE_Q4_0_FILE ./models/abacaj-starcoderbase-1b-sft-ggml-q4_0.bin

# The contents of these files are attached at the end of this gist. Please
# fill in the files before starting the server.
touch ./models/starcoderbase-1b.yaml
touch ./models/starcoderbase-1b-chat.tmpl
touch ./models/starcoderbase-1b-completion.tmpl

# This will start a server on port 8080
./local-ai --debug

# In a new terminal window you can try to use the completions API.
# Don't expect an instant response.
curl http://localhost:8080/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "starcoderbase-1b",
  "prompt": "function incrementByTwo(num) {"
}'
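
# To confirm the model was picked up, you can list the models the server
# knows about (standard OpenAI-style endpoint; the response shape may vary
# between LocalAI versions).
curl http://localhost:8080/v1/models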

# You can also try to chat with it.
curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "starcoderbase-1b",
  "messages": [{"role": "user", "content": "Write a Node.js handler for CORS"}]
}'
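
# Sampling parameters can also be supplied per request using the usual OpenAI
# request fields. As far as I can tell these take precedence over the defaults
# in the YAML config, but I have not verified every combination.
curl http://localhost:8080/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "starcoderbase-1b",
  "prompt": "def quicksort(arr):",
  "temperature": 0.1,
  "max_tokens": 256
}'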

That's it. I hope this got you somewhere.

starcoderbase-1b-chat.tmpl:

[Instructions]:
{{.Input}}
[Response]:

starcoderbase-1b.yaml:

name: starcoderbase-1b
parameters:
  model: abacaj-starcoderbase-1b-sft-ggml-q4_0.bin
  temperature: 0.2
  frequency_penalty: 1
  repeat_penalty: 2
  top_k: 10
  top_p: 0.95
  max_tokens: 512
backend: starcoder
template:
  chat: starcoderbase-1b-chat
  completion: starcoderbase-1b-completion
gpu_layers: 1
f16: true