These are notes on how I compiled the StarCoderBase-1B-SFT model into a quantized version that runs locally on my MBP M1 Pro and can be queried through an OpenAI API-compatible server. StarCoderBase is a model trained/tuned for programming tasks. The 1B-parameter SFT model I am using in this article is a version of the model that has had supervised fine-tuning applied to it. I am just going to call this "StarCoder" in the rest of this article for the sake of simplicity. The number of parameters a model has directly affects resource usage, so a smaller version of the model is more practical for localhost use. On my machine this model uses around 3GB of RAM.
Full disclaimer: I did not make anything new here. I am just describing how I glued everything together. Full credit goes to the engineers who made all the various tools and graciously provided their work through open source. I believe that democratizing access to ML models is extremely important, especially now that models tend to be large and require computing power that only corporations can provide. Being able to run model inference on a personal machine ensures that some of this power is available to individuals.
I am running macOS Ventura (13.5) at the time of writing. I will assume that if you are reading this, you know how to use Homebrew and Git, and that you know your way around your shell of choice.
You will need to get Python installed. I used Miniconda to install Python and create an environment for the projects that were used. After installing the tool and configuring your shell, you can:
conda create -n python3-11 python=3.11
conda activate python3-11
If this is done correctly, python -V should output Python 3.11.4 or similar.
We will need to compile some C++ and Go code, so install these tools:
brew install cmake go
We will first build ggml so that we can create a quantized version of the StarCoder model.
Clone the ggml repo somewhere. My clone's HEAD was at 79ea7a4. The instructions I am giving here are adapted from the repo's examples/starcoder quick start.
cd ggml
# If you opened a new shell, make sure we are using the python env we created
conda activate python3-11
pip install -r requirements.txt
# This is going to download the 1B model from Hugging Face and convert the
# model file to a new format.
# This step created a file named "ggml-model.bin" on my machine
python examples/starcoder/convert-hf-to-ggml.py abacaj/starcoderbase-1b-sft
# You don't have to move or rename the model file, just doing this to align
# with the quick start instructions.
mkdir -p models/abacaj
mv ggml-model.bin models/abacaj/starcoderbase-1b-sft-ggml.bin
# Same build instructions from quick start
mkdir build && cd build
cmake .. && make -j4 starcoder starcoder-quantize
# Quantize the model. There is a trailing 2 at the end of the command.
# The "2" sets the quantization mode to q4_0. We use this mode so that we can
# use it with Metal later on.
./bin/starcoder-quantize ../models/abacaj/starcoderbase-1b-sft-ggml.bin ../models/abacaj/starcoderbase-1b-sft-ggml-q4_0.bin 2
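To get a feel for what q4_0 does: ggml groups weights into small blocks (32 values in q4_0), stores a single scale per block, and reduces each weight to a 4-bit integer. Here is a rough conceptual sketch in Python. This is not the actual ggml implementation (which packs two quants per byte and uses fp16 scales); it only illustrates the idea.

```python
# Conceptual sketch of q4_0-style block quantization (NOT the real ggml code).
# Each block of 32 weights shares one scale; weights become 4-bit ints in [-8, 7].

def quantize_block(block):
    """Quantize a block of floats to (scale, list of 4-bit ints)."""
    amax = max(block, key=abs)               # value with the largest magnitude
    scale = amax / -8 if amax != 0 else 1.0  # map the extreme value to -8
    quants = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate floats from the shared scale and 4-bit ints."""
    return [q * scale for q in quants]

weights = [0.03 * i - 0.4 for i in range(32)]   # toy data
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {max_err:.4f}")
```

The round-trip error per weight is bounded by half the block's scale, which is why 4-bit quantization works well enough for inference while shrinking the model file to roughly a quarter of its fp16 size.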
At the end of this, you should have a quantized model. Once you have it, you can load it with the starcoder binary and start playing around with inference. For example:
./bin/starcoder -m ../models/abacaj/starcoderbase-1b-sft-ggml-q4_0.bin -p "def fizzbuzz(" --top_k 40 --top_p 0.95 --temp 0.2
main: seed = 1694432933
starcoder_model_load: loading model from '../models/abacaj/starcoderbase-1b-sft-ggml-q4_0.bin'
starcoder_model_load: n_vocab = 49153
starcoder_model_load: n_ctx = 8192
starcoder_model_load: n_embd = 2048
starcoder_model_load: n_head = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype = 2002
starcoder_model_load: qntvr = 2
starcoder_model_load: ggml ctx size = 3894.60 MB
starcoder_model_load: memory size = 3072.00 MB, n_mem = 196608
starcoder_model_load: model size = 822.46 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: temp = 0.200
main: top_k = 40
main: top_p = 0.950
main: repeat_last_n = 64
main: repeat_penalty = 1.000
main: prompt: 'def fizzbuzz('
main: number of tokens in prompt = 5
main: token[0] = 589, def
main: token[1] = 8058, fi
main: token[2] = 4669, zz
main: token[3] = 49075, buzz
main: token[4] = 26, (
def fizzbuzz(n):
if n < 1:
return "Invalid input"
elif n < 10:
return "1"
elif n < 100:
return "2"
elif n < 1000:
return "3"
elif n < 10000:
return "5"
else:
return "7"
for n in range(1, 10000):
print(fizzbuzz(n))<|endoftext|>
main: mem per token = 335808 bytes
main: load time = 404.87 ms
main: sample time = 118.89 ms
main: predict time = 1484.62 ms / 13.37 ms per token
main: total time = 2072.40 ms
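The --temp, --top_k, and --top_p flags control how the next token is picked: temperature rescales the logits (lower values sharpen the distribution), top-k keeps only the k most likely tokens, and top-p (nucleus sampling) further trims that set to the smallest group whose probabilities sum to at least p. A toy sketch of the idea, using a tiny hand-made logit list rather than StarCoder's real vocabulary:

```python
import math
import random

def sample_token(logits, top_k=40, top_p=0.95, temp=0.2):
    """Pick a token id from logits via temperature, top-k, then top-p filtering."""
    # Temperature: lower temp sharpens the distribution toward the top tokens.
    scaled = [l / temp for l in logits]
    # Softmax over the scaled logits (subtract max for numerical stability).
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]
    # Top-k: keep only the k most probable tokens.
    probs.sort(key=lambda t: t[1], reverse=True)
    probs = probs[:top_k]
    # Top-p: keep the smallest prefix whose cumulative mass reaches p.
    kept, mass = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize and sample from the survivors.
    norm = sum(p for _, p in kept)
    r = random.random() * norm
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]

demo_logits = [2.0, 1.0, 0.5, 0.1, -1.0]
print(sample_token(demo_logits, top_k=3, top_p=0.9, temp=0.5))
```

With temp 0.2 as in the command above, the distribution is sharp enough that the model almost always picks one of its top few candidates, which is why the output looks deterministic-ish for code prompts.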
Cool. This will totally pass a tech interview.
Start by cloning the LocalAI repo. My clone's HEAD was at cc74fc9. We are going to build this project instead of using a Docker image.
cd LocalAI
make BUILD_TYPE=metal build
mkdir -p models
# Move or copy the quantized model file created from the previous section into
# the models directory.
mv $THE_STARCODERBASE_Q4_0_FILE ./models/
# The contents of these files are attached to this gist. Please fill in the
# files before starting the server.
touch ./models/starcoderbase-1b.yaml
touch ./models/starcoderbase-1b-chat.tmpl
touch ./models/starcoderbase-1b-completion.tmpl
# This will start a server on port 8080
./local-ai --debug
# In a new terminal window you can try to use the completions API.
# Don't expect an instant response.
curl http://localhost:8080/v1/completions -H 'Content-Type: application/json' -d '{
"model": "starcoderbase-1b",
"prompt": "function incrementByTwo(num) {"
}'
# You can also try to chat with it.
curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "starcoderbase-1b",
"messages": [{"role": "user", "content": "Write a Node.js handler for CORS"}]
}'
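Both endpoints return OpenAI-style JSON, but they nest the generated text differently. A small sketch of pulling the text out of each response shape; the JSON bodies below are hand-written stand-ins, not real server output (real responses carry more fields such as id, created, and usage):

```python
import json

# Hand-written stand-in for a /v1/completions response body.
completion_body = json.loads("""
{"choices": [{"index": 0, "text": "\\n  return num + 2;\\n}"}]}
""")

# Hand-written stand-in for a /v1/chat/completions response body.
chat_body = json.loads("""
{"choices": [{"message": {"role": "assistant", "content": "Here is a handler..."}}]}
""")

# /v1/completions puts the generation in choices[0].text ...
completion_text = completion_body["choices"][0]["text"]
# ... while /v1/chat/completions nests it under choices[0].message.content.
chat_text = chat_body["choices"][0]["message"]["content"]

print(completion_text)
print(chat_text)
```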
That's it. I hope this got you somewhere.