CLI tool for downloading Huggingface models and datasets with aria2/wget + git. Forked from padeoe/README_hfd.md.

🤗 Huggingface Model Downloader

Considering that the official huggingface-cli lacks multi-threaded download support and that hf_transfer has limited error handling, this command-line tool uses wget/aria2 to download LFS files and git clone to fetch everything else.

Features

  • ⏯️ Resumable downloads: you can interrupt the download with Ctrl+C and re-run the command at any time to resume.
  • 🚀 Multi-threaded downloads: uses multiple threads to speed up the download.
  • 🚫 File filtering: use --exclude or --include to skip or select files, saving time by avoiding duplicate formats of the same model (e.g. .bin and .safetensors).
  • 🔐 Authentication support: for private models that require a Huggingface login, authenticate with --hf_username and --hf_token.
  • 🪞 Mirror support: use a mirror site by setting the HF_ENDPOINT environment variable (see the example after this list).
  • 🌍 Proxy support: use a proxy by setting the HTTPS_PROXY environment variable (see the example after this list).
  • 📦 Simple: no dependencies and no installation required.
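
A minimal sketch of combining both environment variables before invoking the script; the endpoint matches the script's default and the proxy address is a placeholder, substitute your own:

export HF_ENDPOINT="https://hf-mirror.com"
export HTTPS_PROXY="http://127.0.0.1:7890"
./hfd.sh bigscience/bloom-560m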

Usage

First, download hfd.sh or clone this repository, then make the script executable.

chmod a+x hfd.sh
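
If you do not have the file locally yet, one way to fetch it is to clone the Gist, since a Gist is itself a git repository (the clone URL below is inferred from the Gist ID and may need adjusting):

git clone https://gist.github.com/5d2a6d5cd2f3d75e64f471bf09718e33.git hfd
cd hfd
chmod a+x hfd.sh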

Usage instructions:

$ ./hfd.sh -h
Usage:
  hfd <model_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool wget|aria2c] [-x threads] [--dataset] [--dir download_dir]

Description:
  Downloads a model or dataset from Hugging Face using the provided model ID.

Parameters:
  model_id        The Hugging Face model ID in the format 'repo/model_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  exclude_pattern The pattern to match against filenames for exclusion.
  --hf_username   (Optional) Hugging Face username for authentication.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be wget (default) or aria2c.
  -x              (Optional) Number of download threads for aria2c.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --dir           (Optional) Directory to download the model/dataset to.

Example:
  hfd bigscience/bloom-560m --exclude safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken --tool aria2c -x 8
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
  hfd bigscience/bloom-560m --dir /path/to/download/dir

Download a model:

./hfd.sh bigscience/bloom-560m

Download a model that requires login

Get a Huggingface token from https://huggingface.co/settings/tokens, then run:

hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME --hf_token YOUR_HF_TOKEN
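
The same credentials combine with the other flags; a sketch for a gated or private dataset (the repository name below is a placeholder):

hfd someuser/private-dataset --dataset --hf_username YOUR_HF_USERNAME --hf_token YOUR_HF_TOKEN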

Download a model and exclude certain files (e.g. .safetensors):

./hfd.sh bigscience/bloom-560m --exclude safetensors

Download with aria2c and multiple threads:

./hfd.sh bigscience/bloom-560m --tool aria2c -x 4
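
The tool and filtering options compose freely; for instance, to fetch only the .safetensors weights with 8 aria2c connections (a sketch using the flags documented above):

./hfd.sh bigscience/bloom-560m --include safetensors --tool aria2c -x 8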

Output: during the download, the file URLs are printed:

$ ./hfd.sh bigscience/bloom-560m --exclude safetensors
...
Start Downloading lfs files, bash script:

wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...
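
Files skipped via --include/--exclude are printed as commented-out commands (the # lines above). Each printed line is itself a complete command, so a single file can be fetched or retried manually by re-running it, e.g.:

wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack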

Create an alias for convenience

For easier use, you can create an alias for the script:

alias hfd="$PWD/hfd.sh"
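
The alias above only lives for the current shell session; to keep it across sessions, append it to your shell profile, e.g. for bash (adjust the path to wherever hfd.sh is stored):

echo 'alias hfd="/path/to/hfd.sh"' >> ~/.bashrc
source ~/.bashrc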
hfd.sh

#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT
display_help() {
    cat << EOF
Usage:
  hfd <model_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool wget|aria2c] [-x threads] [--dataset] [--dir download_dir]
Description:
  Downloads a model or dataset from Hugging Face using the provided model ID.
Parameters:
  model_id        The Hugging Face model ID in the format 'repo/model_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  exclude_pattern The pattern to match against filenames for exclusion.
  --hf_username   (Optional) Hugging Face username for authentication.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be wget (default) or aria2c.
  -x              (Optional) Number of download threads for aria2c.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --dir           (Optional) Directory to download the model/dataset to.
Example:
  hfd bigscience/bloom-560m --exclude safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken --tool aria2c -x 8
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
  hfd bigscience/bloom-560m --dir /path/to/download/dir
EOF
    exit 1
}
MODEL_ID=$1
shift
# Default values
TOOL="wget"
THREADS=1
HF_ENDPOINT=${HF_ENDPOINT:-"https://hf-mirror.com"}
DOWNLOAD_DIR="." # Default to current directory
while [[ $# -gt 0 ]]; do
    case $1 in
        --include) INCLUDE_PATTERN="$2"; shift 2 ;;
        --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --dir) DOWNLOAD_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done
# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v "$1" &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}
[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs
[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help
MODEL_DIR="${MODEL_ID#*/}"
if [[ "$DATASET" == 1 ]]; then
MODEL_ID="datasets/$MODEL_ID"
fi
# Ensure the directory exists and change to it
mkdir -p "$DOWNLOAD_DIR/"
cd "$DOWNLOAD_DIR/"
echo "Downloading to $DOWNLOAD_DIR/$MODEL_DIR"
if [ -d "$MODEL_DIR/.git" ]; then
printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$MODEL_DIR"
cd "$MODEL_DIR" && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "Git pull failed.\n"; exit 1; }
else
REPO_URL="$HF_ENDPOINT/$MODEL_ID"
GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
echo "Test GIT_REFS_URL: $GIT_REFS_URL"
response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
if [ "$response" == "401" ] || [ "$response" == "403" ]; then
if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
exit 1
fi
REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
elif [ "$response" != "200" ]; then
echo -e "${RED}Unexpected HTTP Status Code: $response.\nExiting.\n${NC}"; exit 1
fi
echo "git clone $REPO_URL"
GIT_LFS_SKIP_SMUDGE=1 git clone "$REPO_URL" && cd "$MODEL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }
for file in $(git lfs ls-files | awk '{print $3}'); do
truncate -s 0 "$file"
done
fi
printf "\nStart Downloading lfs files, bash script:\n"
files=$(git lfs ls-files | awk '{print $3}')
declare -a urls
for file in $files; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    # Files filtered out by --include/--exclude are echoed as commented-out commands and skipped.
    [[ -n "$INCLUDE_PATTERN" && $file != *"$INCLUDE_PATTERN"* ]] && printf "# %s\n" "$download_cmd" && continue
    [[ -n "$EXCLUDE_PATTERN" && $file == *"$EXCLUDE_PATTERN"* ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done
for url_file in "${urls[@]}"; do
IFS='|' read -r url file <<< "$url_file"
file_dir=$(dirname "$file")
if [[ "$TOOL" == "wget" ]]; then
[[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
else
[[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
fi
[[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done
printf "${GREEN}Download completed successfully.\n${NC}"