Converting HuggingFace Models to GGUF/GGML

Downloading a HuggingFace model
Running llama.cpp convert.py on the HuggingFace model
(Optionally) Uploading the model back to HuggingFace
Downloading a HuggingFace model

There are various ways to download models, but in my experience the huggingface_hub library has been the most reliable. The git clone method occasionally results in OOM errors for large models.

Install the huggingface_hub library:

pip install huggingface_hub

Create a Python script named download.py with the following content:

from huggingface_hub import snapshot_download

model_id = "lmsys/vicuna-13b-v1.5"
# Download the full repo snapshot into ./vicuna-hf, copying real files
# instead of symlinking into the HuggingFace cache.
snapshot_download(repo_id=model_id, local_dir="vicuna-hf",
                  local_dir_use_symlinks=False, revision="main")

Run the Python script:

python download.py

You should now have the model downloaded to a directory called vicuna-hf. Verify by running:

ls -lash vicuna-hf
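If you only need the weights, tokenizer, and config files, snapshot_download also accepts an allow_patterns filter. A minimal sketch; the patterns below are illustrative and depend on how the repo stores its weights:

from huggingface_hub import snapshot_download

# Fetch only weight shards, tokenizer files, and configs; skip the rest.
snapshot_download(repo_id="lmsys/vicuna-13b-v1.5", local_dir="vicuna-hf",
                  local_dir_use_symlinks=False, revision="main",
                  allow_patterns=["*.bin", "*.safetensors", "*.json", "*.model"])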
Converting the model

Now it's time to convert the downloaded HuggingFace model to a GGUF model. llama.cpp comes with a converter script that does this.

Get the script by cloning the llama.cpp repo:

git clone https://github.com/ggerganov/llama.cpp.git

Install the required Python libraries:

pip install -r llama.cpp/requirements.txt

Verify the script is there and review the available options:

python llama.cpp/convert.py -h

Convert the HF model to a GGUF model:

python llama.cpp/convert.py vicuna-hf --outfile vicuna-13b-v1.5.gguf --outtype q8_0

(For models with a BPE vocabulary, you may also need to pass --vocab-type bpe --pad-vocab.)

In this case we're also quantizing the model to 8 bit by setting --outtype q8_0. Quantizing improves inference speed and reduces file size, but it can degrade quality. You can use --outtype f16 (16 bit) or --outtype f32 (32 bit) to preserve the original quality.
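For example, a full-precision 16-bit conversion of the same model differs only in the output type and file name (the name below is just a suggestion):

python llama.cpp/convert.py vicuna-hf --outfile vicuna-13b-v1.5-f16.gguf --outtype f16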
Verify the GGUF model was created:

ls -lash vicuna-13b-v1.5.gguf
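As a quick sanity check, you can load the GGUF file with llama.cpp itself. A minimal sketch, assuming you build the repo with make (which places the main binary in the repo root):

make -C llama.cpp
./llama.cpp/main -m vicuna-13b-v1.5.gguf -p "Hello, my name is" -n 32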
Pushing the GGUF model to HuggingFace

You can optionally push the GGUF model back to HuggingFace.

Create a Python script named upload.py with the following content:

from huggingface_hub import HfApi

api = HfApi()
model_id = "substratusai/vicuna-13b-v1.5-gguf"
# Create the target repo if it doesn't exist, then upload the GGUF file.
api.create_repo(model_id, exist_ok=True, repo_type="model")
api.upload_file(
    path_or_fileobj="vicuna-13b-v1.5.gguf",
    path_in_repo="vicuna-13b-v1.5.gguf",
    repo_id=model_id,
)

Get a HuggingFace token with write permission from https://huggingface.co/settings/tokens

Set your HuggingFace token:

export HUGGING_FACE_HUB_TOKEN=<paste-your-own-token>
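Alternatively, huggingface_hub ships a CLI that lets you log in interactively instead of exporting the token:

huggingface-cli login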
Run the upload.py script:

python upload.py
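If you end up with several GGUF variants (for example q8_0 and f16), upload_folder can push them all in one call. A minimal sketch reusing the repo above; the glob pattern is illustrative:

from huggingface_hub import HfApi

api = HfApi()
# Upload every *.gguf file in the current directory to the repo.
api.upload_folder(
    folder_path=".",
    repo_id="substratusai/vicuna-13b-v1.5-gguf",
    repo_type="model",
    allow_patterns=["*.gguf"],
)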