@SinclairCoder
Forked from padeoe/README_hfd.md
Created November 11, 2023 12:02
Command-line Tool for Easy Downloading of Huggingface Models

🤗Huggingface Model Downloader

Update: The previous version had a bug that could leave files incomplete when resuming an interrupted download. Please update to the latest version.

The official huggingface-cli lacks multi-threaded download support, and hf_transfer has inadequate error handling. This command-line tool instead uses wget or aria2 to download the LFS files and git clone for everything else.
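The split between LFS files and the rest of the repository can be sketched as follows. The script lists LFS-tracked paths with git lfs ls-files and takes the third column of each line. The function name lfs_paths and the sample listing are illustrative assumptions, not part of the tool.

```shell
# Extract the path column from `git lfs ls-files`-style output read on stdin.
# Each line has the form "<short oid> <marker> <path>". Like the script, this
# takes the third whitespace-separated field, so a path containing spaces
# would be cut short.
lfs_paths() {
  awk '{print $3}'
}

# Hypothetical listing for illustration (these are not real object IDs).
sample_listing='a1b2c3d4e5 - flax_model.msgpack
f6e5d4c3b2 - model.safetensors
0123456789 - onnx/decoder_model.onnx'

printf '%s\n' "$sample_listing" | lfs_paths
```

Inside a real checkout, `git lfs ls-files | lfs_paths` would take the place of the sample listing.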

Features

  • ⏯️ Resume from breakpoint: Press Ctrl+C at any time, then re-run the same command to continue where the download stopped.
  • 🚀 Multi-threaded Download: Utilize multiple threads to speed up the download process.
  • 🚫 File Exclusion: Use --exclude to skip specific files, which saves time for models published in duplicate formats (e.g., .bin and .safetensors).
  • 🔐 Auth Support: For gated models that require Huggingface login, use --hf_username and --hf_token to authenticate.
  • 🪞 Mirror Site Support: Set the HF_ENDPOINT environment variable to the mirror's base URL.
  • 🌍 Proxy Support: Set the HTTPS_PROXY environment variable.
  • 📦 Simple: A single Bash script with nothing to install beyond the tools it calls (git, git-lfs, and wget or aria2c).
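As a minimal sketch of how the mirror setting takes effect: the script falls back to https://huggingface.co when HF_ENDPOINT is unset and builds each file URL as endpoint, repository ID, resolve/main, and file path. The helper name build_url and the mirror host below are assumptions made for illustration.

```shell
# Build a download URL the way the script does:
# <endpoint>/<repo id>/resolve/main/<file path>
build_url() {
  local endpoint="${HF_ENDPOINT:-https://huggingface.co}"
  local repo_id="$1" file="$2"
  printf '%s/%s/resolve/main/%s\n' "$endpoint" "$repo_id" "$file"
}

# Default endpoint (HF_ENDPOINT unset inside the subshell).
( unset HF_ENDPOINT; build_url bigscience/bloom-560m model.safetensors )

# Hypothetical mirror host; substitute a mirror you actually trust.
( HF_ENDPOINT="https://mirror.example.com"; build_url bigscience/bloom-560m model.safetensors )
```

In practice the variable is exported before calling the tool, for example `HF_ENDPOINT=https://mirror.example.com ./hfd.sh bigscience/bloom-560m`. HTTPS_PROXY is set the same way and is read by the download tools themselves.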

Usage

First, download hfd.sh or clone this repo, then grant the script execute permission:

chmod a+x hfd.sh

Usage Instructions:

$ ./hfd.sh -h
Usage:
  hfd <model_id> [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool wget|aria2c] [-x threads] [--dataset]

Description:
  Downloads a model or dataset from Hugging Face using the provided model ID.

Parameters:
  model_id        The Hugging Face model ID in the format 'repo/model_name'.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  exclude_pattern The pattern to match against filenames for exclusion.
  --hf_username   (Optional) Hugging Face username for authentication.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be wget (default) or aria2c.
  -x              (Optional) Number of download threads for aria2c.
  --dataset       (Optional) Flag to indicate downloading a dataset.

Example:
  hfd bigscience/bloom-560m --exclude safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken --tool aria2c -x 8
  hfd lavita/medical-qa-shared-task-v1-toy --dataset

Download a model:

./hfd.sh bigscience/bloom-560m

Download a model that requires login:

Get a Hugging Face token from https://huggingface.co/settings/tokens, then run:

./hfd.sh meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME --hf_token YOUR_HF_TOKEN
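As a rough sketch of what happens to the token: for the LFS files, the script passes it to the download tool as an Authorization: Bearer header. The helper name auth_option is hypothetical, and YOUR_HF_TOKEN is a placeholder.

```shell
# Print the extra header option for wget/aria2c when a token is supplied,
# and nothing when the token is empty (public repositories need no header).
auth_option() {
  local token="$1"
  if [ -n "$token" ]; then
    printf '%s\n' "--header=Authorization: Bearer $token"
  fi
}

auth_option "YOUR_HF_TOKEN"   # placeholder, not a real token
auth_option ""                # prints nothing
```

The username is only needed for the git clone step, where the script embeds both values in the repository URL.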

Download a model and exclude certain files (e.g., .safetensors):

./hfd.sh bigscience/bloom-560m --exclude safetensors

Download with aria2c and multiple threads:

./hfd.sh bigscience/bloom-560m --tool aria2c -x 4
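To make the thread option concrete: the script forwards the -x value to aria2c as both the per-server connection limit (-x) and the split count (-s), with a 1M minimum split size (-k) and resume enabled (-c). The sketch below only prints the resulting command instead of running it; aria2c_command is an illustrative name.

```shell
# Print the aria2c command line for one file without executing it.
# Arguments: URL, destination path, number of threads (defaults to 1).
aria2c_command() {
  local url="$1" file="$2" threads="${3:-1}"
  printf 'aria2c -x %s -s %s -k 1M -c "%s" -d "%s" -o "%s"\n' \
    "$threads" "$threads" "$url" "$(dirname "$file")" "$(basename "$file")"
}

aria2c_command \
  "https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx" \
  "onnx/decoder_model.onnx" 4
```

Note that aria2c accepts at most 16 connections per server, so larger -x values are rejected by aria2c itself.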

Output: Before downloading, the script prints the command for each LFS file; excluded files appear commented out:

$ ./hfd.sh bigscience/bloom-560m --exclude safetensors
...
Start Downloading lfs files, bash script:

wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...
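The commented-out line comes from the exclusion check, which can be sketched like this. The match is a plain substring test on the file path, as in the script. The function name plan_line is an assumption for illustration, and the printed commands are simplified to match the listing above.

```shell
# Print one planned download line. A file whose path contains the
# pattern is printed as a comment and would be skipped.
plan_line() {
  local url="$1" file="$2" pattern="$3"
  if [ -n "$pattern" ]; then
    case "$file" in
      *"$pattern"*) printf '# wget -c %s\n' "$url"; return 0 ;;
    esac
  fi
  printf 'wget -c %s\n' "$url"
}

base="https://huggingface.co/bigscience/bloom-560m/resolve/main"
for f in flax_model.msgpack model.safetensors onnx/decoder_model.onnx; do
  plan_line "$base/$f" "$f" "safetensors"
done
```

Because the pattern is matched literally, a glob such as `*.bin` would not behave as a wildcard; pass a plain fragment like `.bin` or `safetensors` instead.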

Create an Alias for Convenience

For easier access, you can create an alias for the script:

alias hfd="$PWD/hfd.sh"

hfd.sh

#!/bin/bash
# Tell the user how to resume when the download is interrupted with Ctrl+C.
trap 'printf "\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n"; exit 1' INT

display_help() {
    cat << EOF
Usage:
  hfd <model_id> [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool wget|aria2c] [-x threads] [--dataset]

Description:
  Downloads a model or dataset from Hugging Face using the provided model ID.

Parameters:
  model_id        The Hugging Face model ID in the format 'repo/model_name'.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  exclude_pattern The pattern to match against filenames for exclusion.
  --hf_username   (Optional) Hugging Face username for authentication.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be wget (default) or aria2c.
  -x              (Optional) Number of download threads for aria2c.
  --dataset       (Optional) Flag to indicate downloading a dataset.

Example:
  hfd bigscience/bloom-560m --exclude safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken --tool aria2c -x 8
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
    exit 1
}

MODEL_ID=$1
shift

# Default values
TOOL="wget"
THREADS=1
HF_ENDPOINT=${HF_ENDPOINT:-"https://huggingface.co"}

# Parse the optional arguments that follow the model ID.
while [[ $# -gt 0 ]]; do
    case $1 in
        --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        *) shift ;;
    esac
done

# Check if aria2c is installed
if [[ "$TOOL" == "aria2c" ]]; then
    if ! command -v aria2c &>/dev/null; then
        echo "aria2c is not installed. Installing it..."
        sudo apt update && sudo apt install -y aria2 || { echo "Failed to install aria2c. Exiting."; exit 1; }
    fi
fi

[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help

# The local directory is the part of the ID after the first slash.
MODEL_DIR="${MODEL_ID#*/}"
if [[ "$DATASET" == 1 ]]; then
    MODEL_ID="datasets/$MODEL_ID"
fi
echo "$MODEL_DIR"

if [ -d "$MODEL_DIR/.git" ]; then
    printf "%s exists, Skip Clone.\n" "$MODEL_DIR"
    cd "$MODEL_DIR" && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "Git pull failed.\n"; exit 1; }
else
    REPO_URL="$HF_ENDPOINT/$MODEL_ID"
    echo "$REPO_URL"
    # Probe the repository without prompting, to detect whether it needs credentials.
    OUTPUT=$(GIT_TERMINAL_PROMPT=0 git ls-remote "$REPO_URL" 2>&1)
    GIT_EXIT_CODE=$?
    if [[ $OUTPUT == *"could not read Username"* ]]; then
        if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
            printf "The repository requires authentication, but --hf_username and --hf_token were not passed.\nPlease get a token from https://huggingface.co/settings/tokens.\nExiting.\n"
            echo "$OUTPUT"
            exit 1
        fi
        REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
    elif [ $GIT_EXIT_CODE -ne 0 ]; then
        echo "$OUTPUT"; exit 1
    fi
    # Clone without fetching LFS content; LFS files arrive as small pointer files.
    GIT_LFS_SKIP_SMUDGE=1 git clone "$REPO_URL" && cd "$MODEL_DIR" || { printf "Git clone failed.\n"; exit 1; }
    # Empty the pointer files so that a resumable (-c) download starts from byte 0.
    for file in $(git lfs ls-files | awk '{print $3}'); do
        truncate -s 0 "$file"
    done
fi

printf "\nStart Downloading lfs files, bash script:\n"
files=$(git lfs ls-files | awk '{print $3}')
declare -a urls

# First pass: print the command for every file; excluded files are printed as comments.
for file in $files; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    [[ -n "$EXCLUDE_PATTERN" && $file == *"$EXCLUDE_PATTERN"* ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done

# Send the token only when one was given. Building the option once avoids the
# "A && B || C" pattern, which would retry a failed authenticated download without the token.
auth_header=()
[[ -n "$HF_TOKEN" ]] && auth_header=(--header="Authorization: Bearer ${HF_TOKEN}")

# Second pass: run the downloads and stop at the first failure.
for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        wget "${auth_header[@]}" -c "$url" -O "$file"
    else
        aria2c "${auth_header[@]}" -x "$THREADS" -s "$THREADS" -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    status=$?
    if [[ $status -eq 0 ]]; then
        printf "Downloaded %s successfully.\n" "$url"
    else
        printf "Failed to download %s.\n" "$url"
        exit 1
    fi
done

printf "Download completed successfully.\n"