@padeoe
Last active April 29, 2024 07:37
CLI tool for downloading Hugging Face models and datasets with aria2/wget + git

🤗Huggingface Model Downloader

Since the official huggingface-cli lacks multi-threaded download support and hf_transfer's error handling is inadequate, this command-line tool uses wget or aria2 for LFS files and git clone for everything else.
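In outline, it does just two things (a simplified sketch; the full script below adds authentication, mirror endpoints, include/exclude patterns, and resume handling):

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/bigscience/bloom-560m
cd bloom-560m
aria2c -x 4 -s 4 -k 1M -c "https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack"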

Features

  • ⏯️ Resume from breakpoint: interrupt with Ctrl+C at any time and re-run the same command to pick up where the download left off.
  • 🚀 Multi-threaded Download: utilize multiple threads to speed up the download process.
  • 🚫 File Exclusion: use --exclude or --include to skip or select files, saving time on models that ship duplicate formats (e.g., *.bin and *.safetensors).
  • 🔐 Auth Support: for gated models that require a Huggingface login, use --hf_username and --hf_token to authenticate.
  • 🪞 Mirror Site Support: set up with the HF_ENDPOINT environment variable (see the example after this list).
  • 🌍 Proxy Support: set up with the HTTPS_PROXY environment variable (see the example after this list).
  • 📦 Simple: depends only on git and aria2c/wget.
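For example, to use a mirror endpoint and/or a proxy (hf-mirror.com is the mirror mentioned in the comments below; the proxy address is a placeholder for your own):

export HF_ENDPOINT="https://hf-mirror.com"
export HTTPS_PROXY="http://127.0.0.1:7890"  # placeholder: your proxy address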

Usage

First, download hfd.sh or clone this repo, then grant the script execution permission:

chmod a+x hfd.sh

You can create an alias for convenience:

alias hfd="$PWD/hfd.sh"

Usage Instructions:

$ ./hfd.sh -h
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id         The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames; supports wildcards, e.g., '--exclude *.safetensors', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset

Download a model:

hfd bigscience/bloom-560m

Download a model that requires login:

Get a Hugging Face token from https://huggingface.co/settings/tokens, then:

hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN

Download a model and exclude certain files (e.g., .safetensors):

hfd bigscience/bloom-560m --exclude *.safetensors

Download with aria2c and multiple threads (the default: aria2c with 4 threads; use -x to change the count):

hfd bigscience/bloom-560m

Output: During the download, the file URLs will be displayed:

$ hfd bigscience/bloom-560m --tool wget --exclude *.safetensors
...
Start Downloading lfs files, bash script:

wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...
#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT

display_help() {
    cat << EOF
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id         The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames; supports wildcards, e.g., '--exclude *.safetensors', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
    exit 1
}

MODEL_ID=$1
shift

# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://huggingface.co"}

while [[ $# -gt 0 ]]; do
    case $1 in
        --include) INCLUDE_PATTERN="$2"; shift 2 ;;
        --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done

# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v $1 &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}

[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs

[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help

if [[ -z "$LOCAL_DIR" ]]; then
    LOCAL_DIR="${MODEL_ID#*/}"
fi

if [[ "$DATASET" == 1 ]]; then
    MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"

if [ -d "$LOCAL_DIR/.git" ]; then
    printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
    cd "$LOCAL_DIR" && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
    REPO_URL="$HF_ENDPOINT/$MODEL_ID"
    GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
    echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
    response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
    if [ "$response" == "401" ] || [ "$response" == "403" ]; then
        if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
            printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
            exit 1
        fi
        REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
    elif [ "$response" != "200" ]; then
        printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
        printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
        curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
    fi
    echo "git clone $REPO_URL $LOCAL_DIR"

    # Clone without smudging, so LFS files stay as lightweight pointers
    GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }

    # Zero out the pointer files so the downloaders below write the real content
    for file in $(git lfs ls-files | awk '{print $3}'); do
        truncate -s 0 "$file"
    done
fi

printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | awk '{print $3}')
declare -a urls

for file in $files; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    # Glob-match the file path against the include/exclude patterns; skipped
    # files are printed commented-out so the user can still run them manually
    [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done

for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}"
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done

printf "${GREEN}Download completed successfully.\n${NC}"
@BlairSadewitz

Thank you SO MUCH for posting this. It does exactly what I want (git lfs clone with GIT_LFS_SKIP_SMUDGE set, then download large files with aria2c). I had some script fragments with commands I'd written that I was using, but you put it all together, actually parse the command-line arguments properly, etc. ;-)

I haven't been able to find any other tool that downloads from huggingface as quickly as aria2c does.

@padeoe
Author

padeoe commented Nov 8, 2023

Thank you SO MUCH for posting this. It does exactly what I want (git lfs clone with GIT_LFS_SKIP_SMUDGE set, then download large files with aria2c). I had some script fragments with commands I'd written that I was using, but you put it all together, actually parse the command-line arguments properly, etc. ;-)

I haven't been able to find any other tool that downloads from huggingface as quickly as aria2c does.

Thank you for your kind words! I'm delighted to hear that the script is working well for you.

@BlairSadewitz Have you checked out huggingface-cli and hf_transfer? Although I find them a bit unstable, they're worth considering for their official support.

Happy downloading! 😄

@yzlnew

yzlnew commented Nov 14, 2023

When I download using the header in the script, I keep getting Unrecognized URI or unsupported protocol: Bearer. What causes this?

@padeoe
Author

padeoe commented Nov 14, 2023

When I download using the header in the script, I keep getting Unrecognized URI or unsupported protocol: Bearer. What causes this?

Please share the command that reproduces the issue (hide your token).

@hi-zhenyu

Running ./hfd.sh bigscience/bloom-560m fails with the following error:

Downloading to ./bloom-560m
Test GIT_REFS_URL: https://huggingface.co/bigscience/bloom-560m/info/refs?service=git-upload-pack
Unexpected HTTP Status Code: 000.
Exiting.

What could be causing this? Thanks!

@padeoe
Author

padeoe commented Jan 30, 2024

Running ./hfd.sh bigscience/bloom-560m fails with the following error:

Downloading to ./bloom-560m
Test GIT_REFS_URL: https://huggingface.co/bigscience/bloom-560m/info/refs?service=git-upload-pack
Unexpected HTTP Status Code: 000.
Exiting.

What could be causing this? Thanks!

huggingface.co is blocked here, so set a mirror endpoint to fix it: export HF_ENDPOINT="https://hf-mirror.com"

@hao203

hao203 commented Feb 15, 2024

hfd.sh: command not found

@padeoe
Author

padeoe commented Feb 27, 2024

hfd.sh: command not found

Use ./hfd.sh or alias hfd="/path/to/hfd.sh" first.

@xuanmumy


I'm using a VPN; the Hugging Face site itself is reachable and the HTTPS_PROXY environment variable is set (via Clash), but the script still fails like this:
Test GIT_REFS_URL: https://huggingface.co/stablediffusionapi/anything-v5/info/refs?service=git-upload-pack
Unexpected HTTP Status Code: 000.
Exiting.

@xuanmumy

Adding export HF_ENDPOINT="https://hf-mirror.com" made it work.

@Kevin-siyuan

I get the error: git-lfs is not installed. Please install it first.
Why is that?

@relic-yuexi

I get the error: git-lfs is not installed. Please install it first. Why is that?

# Install and initialize git-lfs

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install

@ZhouJYu

ZhouJYu commented Mar 5, 2024

Each file starts downloading at around 100 MB/s, but the last few tens of MB slow to a crawl (under 100 kB/s) or stall completely. What's going on?

export HF_ENDPOINT="https://hf-mirror.com"
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download --resume-download Qwen/Qwen-VL-Chat --local-dir Qwen#Qwen-VL-Chat --local-dir-use-symlinks False

downloading https://hf-mirror.com/Qwen/Qwen-VL-Chat/resolve/f57cfbd358cb56b710d963669ad1bcfb44cdcdd8/pytorch_model-00001-of-00010.bin to /root/.cache/huggingface/hub/models--Qwen--Qwen-VL-Chat/blobs/d63e4b4238be3897d3b44d0f604422fc07dfceaf971ebde7adadd7be7a2a35bb.incomplete
pytorch_model-00001-of-00010.bin: 99%|█████████▉| 1.94G/1.96G [1:27:23<30:13, 11.6kB/s]

@A-raniy-day

I've already used --exclude *.safetensors, but it still downloads the safetensors files. What should I do?

@A-raniy-day

I've already used --exclude *.safetensors, but it still downloads the safetensors files. What should I do?

Solved: it was a wildcard issue. Changing it to --exclude .*.safetensors stops the safetensors files from being downloaded.

@A-raniy-day

I've already used --exclude *.safetensors, but it still downloads the safetensors files. What should I do?

Solved: it was a wildcard issue. Changing it to --exclude .*.safetensors stops the safetensors files from being downloaded.

Change line 126, [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue, to:

if [[ -n "$EXCLUDE_PATTERN" ]]; then
    shopt -s nullglob extglob
    if [[ "$file" =~ $EXCLUDE_PATTERN ]]; then
        printf "# Excluded by exclude pattern: %s\n" "$file"
        continue
    fi
fi

@padeoe
Author

padeoe commented Mar 24, 2024

I've already used --exclude *.safetensors, but it still downloads the safetensors files. What should I do?

I couldn't reproduce the problem you describe; please share your system environment details. Here are my test command and output. You can see that --exclude *.safetensors skips model.safetensors (the corresponding command is prefixed with a # comment):

$ hfd gpt2 --exclude *.safetensors --tool wget
Downloading to gpt2
Testing GIT_REFS_URL: https://hf-mirror.com/gpt2/info/refs?service=git-upload-pack
git clone https://hf-mirror.com/gpt2 gpt2
Cloning into 'gpt2'...
remote: Enumerating objects: 87, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 87 (delta 0), reused 0 (delta 0), pack-reused 84
Unpacking objects: 100% (87/87), done.

Start Downloading lfs files, bash script:
cd gpt2
wget -c "https://hf-mirror.com/gpt2/resolve/main/64-8bits.tflite" -O "64-8bits.tflite"
wget -c "https://hf-mirror.com/gpt2/resolve/main/64-fp16.tflite" -O "64-fp16.tflite"
wget -c "https://hf-mirror.com/gpt2/resolve/main/64.tflite" -O "64.tflite"
wget -c "https://hf-mirror.com/gpt2/resolve/main/flax_model.msgpack" -O "flax_model.msgpack"
# wget -c "https://hf-mirror.com/gpt2/resolve/main/model.safetensors" -O "model.safetensors"
wget -c "https://hf-mirror.com/gpt2/resolve/main/onnx/decoder_model.onnx" -O "onnx/decoder_model.onnx"
wget -c "https://hf-mirror.com/gpt2/resolve/main/onnx/decoder_model_merged.onnx" -O "onnx/decoder_model_merged.onnx"
wget -c "https://hf-mirror.com/gpt2/resolve/main/onnx/decoder_with_past_model.onnx" -O "onnx/decoder_with_past_model.onnx"
wget -c "https://hf-mirror.com/gpt2/resolve/main/pytorch_model.bin" -O "pytorch_model.bin"
wget -c "https://hf-mirror.com/gpt2/resolve/main/rust_model.ot" -O "rust_model.ot"
wget -c "https://hf-mirror.com/gpt2/resolve/main/tf_model.h5" -O "tf_model.h5"
Start downloading 64-8bits.tflite.

@XizhiMaLY

XizhiMaLY commented Mar 27, 2024

Could someone tell me what's causing this?

$ hfd baichuan-inc/Baichuan2-7B-Chat --tool aria2c -x 4
Downloading to Baichuan2-7B-Chat
Baichuan2-7B-Chat exists, Skip Clone.
Already up to date.

Start Downloading lfs files, bash script:
cd Baichuan2-7B-Chat
aria2c --console-log-level=error -x 4 -s 4 -k 1M -c "https://hf-mirror.com/baichuan-inc/Baichuan2-7B-Chat/resolve/main/pytorch_model.bin" -d "." -o "pytorch_model.bin"
aria2c --console-log-level=error -x 4 -s 4 -k 1M -c "https://hf-mirror.com/baichuan-inc/Baichuan2-7B-Chat/resolve/main/tokenizer.model" -d "." -o "tokenizer.model"
Start downloading pytorch_model.bin.
[#b6d4e0 0B/0B CN:1 DL:0B]                                                                                                                                                                                                                                      
03/27 14:04:34 [ERROR] CUID#7 - Download aborted. URI=https://hf-mirror.com/baichuan-inc/Baichuan2-7B-Chat/resolve/main/pytorch_model.bin
Exception: [AbstractCommand.cc:403] errorCode=16 URI=https://cdn-lfs.hf-mirror.com/repos/98/6a/986a6f31289a1d4a20a3150d2e5a3962a2bf5091e6ff16b2cbea71139ed1a7c4/594b540d36751fa3b199c609617deafc0329050937c0767fdfb2d38b72d8bec9?response-content-disposition=att
achment%3B+filename*%3DUTF-8%27%27pytorch_model.bin%3B+filename%3D%22pytorch_model.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1711778674&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMTc3O
DY3NH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy85OC82YS85ODZhNmYzMTI4OWExZDRhMjBhMzE1MGQyZTVhMzk2MmEyYmY1MDkxZTZmZjE2YjJjYmVhNzExMzllZDFhN2M0LzU5NGI1NDBkMzY3NTFmYTNiMTk5YzYwOTYxN2RlYWZjMDMyOTA1MDkzN2MwNzY3ZmRmYjJkMzhiNzJkOGJlYzk%7E
cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=zLlJ3J5w6SEzD-jxs0awPm9ucnqcd2CJAqjyOR-ST2NCBYx1437UWtcn8PMNzXgzmRmzWbLG9yIJD0cT7gjU-dRBFTE12Q42mcBJYQK6RxXIkT--2cWjlt2gJM3EhDSf-58qwi8WN2N9BOfCVQFHNcJepHhRvkQZkcPZMC
1hpp9lXpptcPyHEcrAoxOAM96HgRNCD31OuiFD6%7E0wbfd1RcQca-cV6DH14zS1rzvAiw-2vQfWjYZ0w6VVtSwDYuTrgD1R2uVPffzliuCcC8wQL9fk%7EybpT2s7a7zTmnz1MCeY1n3S-bJr8mH1H6gtYUd4aZoJlmAKPRNqgBxfevmwaA__&Key-Pair-Id=KVTP0A1DKRTAX
  -> [RequestGroup.cc:760] errorCode=16 Download aborted.
  -> [AbstractDiskWriter.cc:224] errNum=13 errorCode=16 Failed to open the file ./pytorch_model.bin, cause: Permission denied

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
b6d4e0|ERR |       0B/s|./pytorch_model.bin

Status Legend:
(ERR):error occurred.

aria2 will resume download if the transfer is restarted.
If there are any errors, then see the log file. See '-l' option in help/man page for details.
Failed to download https://hf-mirror.com/baichuan-inc/Baichuan2-7B-Chat/resolve/main/pytorch_model.bin.

@bigcash

bigcash commented Apr 3, 2024

When downloading a dataset with hfd, if the download is interrupted midway, can it resume from the breakpoint? I tried it, and it said the target directory already exists.

@padeoe
Author

padeoe commented Apr 3, 2024

When downloading a dataset with hfd, if the download is interrupted midway, can it resume from the breakpoint? I tried it, and it said the target directory already exists.

Resuming from the breakpoint is supported. What exactly does the "target directory already exists" log say?

@RileyRetzloff

This is amazing, thanks for sharing! I've already saved a ton of time using it.

I don't have a ton of shell scripting experience, but I wonder if it wouldn't be too difficult to implement some form of self-clean-up behavior at the end of the script to deal with the empty files/folders left over from cloning the repo.

Maybe something like...

  • traverse the repo directory tree bottom-up, getting the name of each file and...
  • if file_name !matches --include pattern || matches --exclude pattern
  • delete file_name
  • for each directory on the way to the root, delete the directory if it contains 0 files after being traversed
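A minimal sketch of that idea in bash (untested; it assumes it runs at the end of hfd.sh, inside $LOCAL_DIR, while INCLUDE_PATTERN/EXCLUDE_PATTERN are still set):

# Delete the 0-byte placeholders left behind for LFS files the patterns skipped,
# then prune any directories emptied as a result, leaving .git untouched.
for file in $(git lfs ls-files | awk '{print $3}'); do
    if [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] || \
       [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]]; then
        [[ ! -s "$file" ]] && rm -f "$file"  # only remove files that are still empty
    fi
done
find . -depth -type d -empty -not -path './.git*' -delete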

@LucienShui

#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT

display_help() {
    cat << EOF
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]    

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
    exit 1
}

MODEL_ID=$1
shift

# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://huggingface.co"}

while [[ $# -gt 0 ]]; do
    case $1 in
        --include) INCLUDE_PATTERN="$2"; shift 2 ;;
        --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done

# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v $1 &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}

# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
    if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
        git config --global --add safe.directory "${PWD}"
        printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}" 
    fi
}

[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs

[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help

if [[ -z "$LOCAL_DIR" ]]; then
    LOCAL_DIR="${MODEL_ID#*/}"
fi

if [[ "$DATASET" == 1 ]]; then
    MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"

if [ -d "$LOCAL_DIR/.git" ]; then
    printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
    cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
    REPO_URL="$HF_ENDPOINT/$MODEL_ID"
    GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
    echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
    response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
    if [ "$response" == "401" ] || [ "$response" == "403" ]; then
        if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
            printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
            exit 1
        fi
        REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
    elif [ "$response" != "200" ]; then
        printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
        printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
        curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
    fi
    echo "git clone $REPO_URL $LOCAL_DIR"

    GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }

    ensure_ownership

    for file in $(git lfs ls-files | awk '{print $3}'); do
        truncate -s 0 "$file"
    done
fi

printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | awk '{print $3}')
declare -a urls

for file in $files; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done

for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}" 
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done

printf "${GREEN}Download completed successfully.\n${NC}"

@mayaobuduyao

Downloading to /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/
Testing GIT_REFS_URL: https://hf-mirror.com/meta-llama/Llama-2-7b-chat-hf/info/refs?service=git-upload-pack
git clone https://zhengxiao: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@hf-mirror.com/meta-llama/Llama-2-7b-chat-hf /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/
fatal: destination path '/home/zhengxiao/dataroot/models/Llama2_7b_chat_hf' already exists and is not an empty directory.
Git clone failed.
It says the directory already exists. How do I fix this?

@LucienShui

Downloading to /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/ Testing GIT_REFS_URL: https://hf-mirror.com/meta-llama/Llama-2-7b-chat-hf/info/refs?service=git-upload-pack git clone https://zhengxiao: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@hf-mirror.com/meta-llama/Llama-2-7b-chat-hf /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/ fatal: destination path '/home/zhengxiao/dataroot/models/Llama2_7b_chat_hf' already exists and is not an empty directory. Git clone failed. It says the directory already exists. How do I fix this?

Answer from Qwen1.5-72B-Chat-q3:

If the directory already exists, you can go into it and check whether the model has already been downloaded. If the model is there, you don't need to download it again. If it isn't, or you need to update to the latest version, follow these steps:

  1. Delete the existing directory: if you want to re-download, remove the existing directory first. In a terminal, run:

    rm -rf /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/

    This deletes the Llama2_7b_chat_hf directory and all of its contents.

  2. Re-download: after deleting, run the download command again; it should recreate the directory and download the model.

If you're not sure whether the directory contains the model, check its contents first. If there are no model files inside, running the download command directly should continue the download. If the model files are there but need updating, you may need to consult the model's changelog or use a tool-specific update command, depending on the model and download tool you're using.

@cydiachen

Hello, I have an issue when downloading a large dataset like oscar-corpus/OSCAR-2301.
I only want to download a sub-folder, like its Chinese part. How should I run the command?
I tried --include resolve/main/en_meta and en_meta_*.zst, but neither worked.

@padeoe
Author

padeoe commented Apr 25, 2024

Hello, I have an issue when downloading a large dataset like oscar-corpus/OSCAR-2301. I only want to download a sub-folder, like its Chinese part. How should I run the command? I tried --include resolve/main/en_meta and en_meta_*.zst, but neither worked.

Try --include en_meta/*; it should match the full path of the target files.
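For the Chinese part, the analogous command would be something like this (hypothetical; it assumes the dataset keeps those files under a zh_meta/ folder):

hfd oscar-corpus/OSCAR-2301 --dataset --include zh_meta/*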

@divisionblur

Why isn't it downloading to the directory I specified with --local-dir?

@divisionblur

Found it. My mistake: I had written the path wrong.

@divisionblur

Huge thanks for this script; downloads are really fast.
