These are notes on how I compiled the StarCoderBase-1B-SFT model into a quantized version that runs locally on my MBP M1 Pro and can be queried through an OpenAI API-compatible server. StarCoderBase is a model trained/tuned for programming tasks. The 1B-parameter SFT model I am using in this article is a version of the model that has had supervised fine-tuning applied to it. I am just going to call this "StarCoder" in the rest of this article for the sake of simplicity. The number of parameters a model has directly affects resource usage, so a smaller version of the model is more practical for localhost use. On my machine this model uses around 3GB of RAM.
Full disclaimer: I did not make anything new here. I am just describing how I glued everything together. Full credit goes to the engineers who made all the various tools and graciously provided their work through open source. I believe that democratizing access to ML models is extremely important, especially now that models tend to be large and require computing power that only corporations can provide. Being able to run model inference on a personal machine ensures that some of this power is available to individuals.
I am running macOS Ventura (13.5) at the time of writing. I will assume that if you are reading this, you know how to use Homebrew and Git, and that you know your way around your shell of choice.
You will need to get Python installed. I used Miniconda to install Python and create an environment for the projects that were used. After installing the tool and configuring your shell, you can:
conda create -n python3-11 python=3.11
conda activate python3-11
If this is done correctly, python -V should output Python 3.11.4 or similar.
We will need to compile some C++ and Go code, so install these tools:
brew install cmake go
We will first build ggml so that we can create a quantized version of the StarCoder model.
Clone the ggml repo somewhere. My clone's HEAD was at 79ea7a4. The instructions I am giving here are adapted from the repo's examples/starcoder quick start.
cd ggml
# If you opened a new shell, make sure we are using the python env we created
conda activate python3-11
pip install -r requirements.txt
# This is going to download the 1B model from Hugging Face and convert the
# model file to a new format.
# This step created a file named "ggml-model.bin" on my machine
python examples/starcoder/convert-hf-to-ggml.py abacaj/starcoderbase-1b-sft
# You don't have to move or rename the model file, just doing this to align
# with the quick start instructions.
mkdir -p models/abacaj
mv ggml-model.bin models/abacaj/starcoderbase-1b-sft-ggml.bin
# Same build instructions from quick start
mkdir build && cd build
cmake .. && make -j4 starcoder starcoder-quantize
# Quantize the model. There is a trailing 2 at the end of the command.
# The "2" sets the quantization mode to q4_0. We use this mode so that we can
# use it with Metal later on.
./bin/starcoder-quantize ../models/abacaj/starcoderbase-1b-sft-ggml.bin ../models/abacaj/starcoderbase-1b-sft-ggml-q4_0.bin 2
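To get a feel for what q4_0 does: ggml groups weights into small blocks (32 values in q4_0), stores a single scale per block, and reduces each weight to a 4-bit integer. Here is a rough conceptual sketch in Python. This is not the actual ggml implementation (which packs two quants per byte and uses fp16 scales); it only illustrates the idea.

```python
# Conceptual sketch of q4_0-style block quantization (NOT the real ggml code).
# Each block of 32 weights shares one scale; weights become 4-bit ints in [-8, 7].

def quantize_block(block):
    """Quantize a block of floats to (scale, list of 4-bit ints)."""
    amax = max(block, key=abs)               # value with the largest magnitude
    scale = amax / -8 if amax != 0 else 1.0  # map the extreme value to -8
    quants = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate floats from the shared scale and 4-bit ints."""
    return [q * scale for q in quants]

weights = [0.03 * i - 0.4 for i in range(32)]   # toy data
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {max_err:.4f}")
```

The round-trip error per weight is bounded by half the block's scale, which is why 4-bit quantization works well enough for inference while shrinking the model file to roughly a quarter of its fp16 size.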
At the end of this, you should have a quantized model. Once you have it, you can load it with the starcoder binary and start playing around with inference. For example:
./bin/starcoder -m ../models/abacaj/starcoderbase-1b-sft-ggml-q4_0.bin -p "def fizzbuzz(" --top_k 40 --top_p 0.95 --temp 0.2
main: seed = 1694432933
starcoder_model_load: loading model from '../models/abacaj/starcoderbase-1b-sft-ggml-q4_0.bin'
starcoder_model_load: n_vocab = 49153
starcoder_model_load: n_ctx = 8192
starcoder_model_load: n_embd = 2048
starcoder_model_load: n_head = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype = 2002
starcoder_model_load: qntvr = 2
starcoder_model_load: ggml ctx size = 3894.60 MB
starcoder_model_load: memory size = 3072.00 MB, n_mem = 196608
starcoder_model_load: model size = 822.46 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: temp = 0.200
main: top_k = 40
main: top_p = 0.950
main: repeat_last_n = 64
main: repeat_penalty = 1.000
main: prompt: 'def fizzbuzz('
main: number of tokens in prompt = 5
main: token[0] = 589, def
main: token[1] = 8058, fi
main: token[2] = 4669, zz
main: token[3] = 49075, buzz
main: token[4] = 26, (
def fizzbuzz(n):
if n < 1:
return "Invalid input"
elif n < 10:
return "1"
elif n < 100:
return "2"
elif n < 1000:
return "3"
elif n < 10000:
return "5"
else:
return "7"
for n in range(1, 10000):
print(fizzbuzz(n))<|endoftext|>
main: mem per token = 335808 bytes
main: load time = 404.87 ms
main: sample time = 118.89 ms
main: predict time = 1484.62 ms / 13.37 ms per token
main: total time = 2072.40 ms
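The --temp, --top_k, and --top_p flags control how the next token is picked: temperature rescales the logits (lower values sharpen the distribution), top-k keeps only the k most likely tokens, and top-p (nucleus sampling) further trims that set to the smallest group whose probabilities sum to at least p. A toy sketch of the idea, using a tiny hand-made logit list rather than StarCoder's real vocabulary:

```python
import math
import random

def sample_token(logits, top_k=40, top_p=0.95, temp=0.2):
    """Pick a token id from logits via temperature, top-k, then top-p filtering."""
    # Temperature: lower temp sharpens the distribution toward the top tokens.
    scaled = [l / temp for l in logits]
    # Softmax over the scaled logits (subtract max for numerical stability).
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]
    # Top-k: keep only the k most probable tokens.
    probs.sort(key=lambda t: t[1], reverse=True)
    probs = probs[:top_k]
    # Top-p: keep the smallest prefix whose cumulative mass reaches p.
    kept, mass = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize and sample from the survivors.
    norm = sum(p for _, p in kept)
    r = random.random() * norm
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]

demo_logits = [2.0, 1.0, 0.5, 0.1, -1.0]
print(sample_token(demo_logits, top_k=3, top_p=0.9, temp=0.5))
```

With temp 0.2 as in the command above, the distribution is sharp enough that the model almost always picks one of its top few candidates, which is why the output looks deterministic-ish for code prompts.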
Cool. This will totally pass a tech interview.
Start by cloning the LocalAI repo. My clone's HEAD was at cc74fc9. We are going to build this project instead of using a Docker image.
cd LocalAI
make BUILD_TYPE=metal build
mkdir -p models
# Move or copy the quantized model file created from the previous section into
# the models directory.
mv $THE_STARCODERBASE_Q4_0_FILE ./models/
# The contents of these files are attached to this gist. Please fill in the
# files before starting the server.
touch ./models/starcoderbase-1b.yaml
touch ./models/starcoderbase-1b-chat.tmpl
touch ./models/starcoderbase-1b-completion.tmpl
# This will start a server on port 8080
./local-ai --debug
# In a new terminal window you can try to use the completions API.
# Don't expect an instant response.
curl http://localhost:8080/v1/completions -H 'Content-Type: application/json' -d '{
"model": "starcoderbase-1b",
"prompt": "function incrementByTwo(num) {"
}'
# You can also try to chat with it.
curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "starcoderbase-1b",
"messages": [{"role": "user", "content": "Write a Node.js handler for CORS"}]
}'
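Both endpoints return OpenAI-style JSON, but they nest the generated text differently. A small sketch of pulling the text out of each response shape; the JSON bodies below are hand-written stand-ins, not real server output (real responses carry more fields such as id, created, and usage):

```python
import json

# Hand-written stand-in for a /v1/completions response body.
completion_body = json.loads("""
{"choices": [{"index": 0, "text": "\\n  return num + 2;\\n}"}]}
""")

# Hand-written stand-in for a /v1/chat/completions response body.
chat_body = json.loads("""
{"choices": [{"message": {"role": "assistant", "content": "Here is a handler..."}}]}
""")

# /v1/completions puts the generation in choices[0].text ...
completion_text = completion_body["choices"][0]["text"]
# ... while /v1/chat/completions nests it under choices[0].message.content.
chat_text = chat_body["choices"][0]["message"]["content"]

print(completion_text)
print(chat_text)
```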
That's it. I hope this got you somewhere.