
@padeoe
Last active May 17, 2024 07:38
CLI tool for downloading Hugging Face models and datasets with aria2/wget + git

🤗Huggingface Model Downloader

Because the official huggingface-cli lacks multi-threaded download support and hf_transfer's error handling is inadequate, this command-line tool uses wget or aria2 for LFS files and git clone for the rest.

Features

  • ⏯️ Resume from breakpoint: press Ctrl+C at any time and re-run the command to continue from where it stopped.
  • 🚀 Multi-threaded Download: Utilize multiple threads to speed up the download process.
  • 🚫 File Exclusion: Use --exclude or --include to skip or select files, saving time for repositories that ship the same weights in multiple formats (e.g., *.bin and *.safetensors).
  • 🔐 Auth Support: For gated models that require Huggingface login, use --hf_username and --hf_token to authenticate.
  • 🪞 Mirror Site Support: Set up with HF_ENDPOINT environment variable.
  • 🌍 Proxy Support: Set up with HTTPS_PROXY environment variable.
  • 📦 Simple: Depends only on git and aria2c/wget.
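Both environment variables are read at script startup; the mirror endpoint in particular is resolved with a plain bash default expansion, so it can be set per invocation or exported in your shell profile. A minimal sketch of that lookup (hf-mirror.com is shown only as an illustrative mirror value):

```shell
# With HF_ENDPOINT unset, the script falls back to the official endpoint:
unset HF_ENDPOINT
endpoint=${HF_ENDPOINT:-"https://huggingface.co"}
echo "$endpoint"    # https://huggingface.co

# Setting it switches every URL the script builds:
HF_ENDPOINT="https://hf-mirror.com"
endpoint=${HF_ENDPOINT:-"https://huggingface.co"}
echo "$endpoint"    # https://hf-mirror.com
```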

Usage

First, download hfd.sh or clone this gist, then grant the script execution permission.

chmod a+x hfd.sh

You can create an alias for convenience:

alias hfd="$PWD/hfd.sh"

Usage Instructions:

$ ./hfd.sh -h
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern  The pattern to match against file paths; supports wildcards, e.g. '--exclude *.safetensors', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset

Download a model:

hfd bigscience/bloom-560m

Download a model that requires login:

Get a Hugging Face token from https://huggingface.co/settings/tokens, then:

hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN

Download a model and exclude certain files (e.g., .safetensors):

hfd bigscience/bloom-560m --exclude *.safetensors
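A note on how the exclude/include patterns are applied: the script compares each file path against the pattern with bash's `==` glob matching, which is why the pattern should usually be quoted so your shell does not expand it against local files first. A minimal sketch of the matching rule (the file names below are made up):

```shell
# The unquoted right-hand side of == is treated as a glob pattern:
file="model.safetensors"
pattern='*.safetensors'
[[ "$file" == $pattern ]] && echo "excluded"    # prints: excluded

# Matching is against the full path, so files in subdirectories
# need a path-shaped glob such as vae/*:
file="vae/diffusion_pytorch_model.bin"
[[ "$file" == vae/* ]] && echo "included"    # prints: included
```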

Download with aria2c using multiple threads:

hfd bigscience/bloom-560m --tool aria2c -x 8

Output: During the download, the file URLs will be displayed:

$ hfd bigscience/bloom-560m --tool wget --exclude *.safetensors
...
Start Downloading lfs files, bash script:

wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...
#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT

display_help() {
    cat << EOF
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id         The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern  The pattern to match against file paths; supports wildcards, e.g. '--exclude *.safetensors', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use: aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
    exit 1
}

MODEL_ID=$1
shift

# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://huggingface.co"}

while [[ $# -gt 0 ]]; do
    case $1 in
        --include) INCLUDE_PATTERN="$2"; shift 2 ;;
        --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done

# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v "$1" &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}

# Mark the current repo safe when using a shared file system like Samba or NFS
ensure_ownership() {
    if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
        git config --global --add safe.directory "${PWD}"
        printf "${YELLOW}Detected dubious ownership in repository; marked ${PWD} safe via git. Edit ~/.gitconfig if you want to reverse this.\n${NC}"
    fi
}

[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs

[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help

if [[ -z "$LOCAL_DIR" ]]; then
    LOCAL_DIR="${MODEL_ID#*/}"
fi

if [[ "$DATASET" == 1 ]]; then
    MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"

if [ -d "$LOCAL_DIR/.git" ]; then
    printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
    cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
    REPO_URL="$HF_ENDPOINT/$MODEL_ID"
    GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
    echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
    response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
    if [ "$response" == "401" ] || [ "$response" == "403" ]; then
        if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
            printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token are not passed. Please get a token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
            exit 1
        fi
        REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
    elif [ "$response" != "200" ]; then
        printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
        printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
        curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
    fi
    echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"
    GIT_LFS_SKIP_SMUDGE=1 git clone "$REPO_URL" "$LOCAL_DIR" && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }

    ensure_ownership

    # Replace the LFS pointer files with empty placeholders so the real
    # downloads below can resume into them.
    while IFS= read -r file; do
        truncate -s 0 "$file"
    done <<< "$(git lfs ls-files | cut -d ' ' -f 3-)"
fi

printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | cut -d ' ' -f 3-)
declare -a urls

while IFS= read -r file; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done <<< "$files"

for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}"
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done

printf "${GREEN}Download completed successfully.\n${NC}"
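One detail of the script above worth noting: LFS file paths are recovered from `git lfs ls-files` output, whose lines have the shape `<oid> <-|*> <path>`, using `cut -d ' ' -f 3-`, so paths that themselves contain spaces survive intact. A minimal sketch with a fabricated ls-files line (the oid and file name are made up):

```shell
# Sample line in the "<oid> <-|*> <path>" shape printed by git lfs ls-files:
line="0beec7b5ea3f * onnx/decoder model.onnx"
# Fields 3 onward reassemble the path, even with embedded spaces:
echo "$line" | cut -d ' ' -f 3-    # prints: onnx/decoder model.onnx
```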
@ZhouJYu

ZhouJYu commented Mar 5, 2024

Hey, each file starts downloading at up to ~100 MB/s, but the last few tens of MB become extremely slow (<100 KB/s) or even stall outright. What could be going on?

export HF_ENDPOINT="https://hf-mirror.com"
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download --resume-download Qwen/Qwen-VL-Chat --local-dir Qwen#Qwen-VL-Chat --local-dir-use-symlinks False
downloading https://hf-mirror.com/Qwen/Qwen-VL-Chat/resolve/f57cfbd358cb56b710d963669ad1bcfb44cdcdd8/pytorch_model-00001-of-00010.bin to /root/.cache/huggingface/hub/models--Qwen--Qwen-VL-Chat/blobs/d63e4b4238be3897d3b44d0f604422fc07dfceaf971ebde7adadd7be7a2a35bb.incomplete
pytorch_model-00001-of-00010.bin: 99%|█████████▉| 1.94G/1.96G [1:27:23<30:13, 11.6kB/s]

@A-raniy-day

Hi, I passed --exclude *.safetensors but the safetensors files still get downloaded. What should I do?

@A-raniy-day

> Hi, I passed --exclude *.safetensors but the safetensors files still get downloaded. What should I do?

Solved: it was a wildcard issue. Changing it to --exclude .*.safetensors skips the safetensors files.

@A-raniy-day

> Hi, I passed --exclude *.safetensors but the safetensors files still get downloaded. What should I do?
>
> Solved: it was a wildcard issue. Changing it to --exclude .*.safetensors skips the safetensors files.

Change line 126, [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue, to:

    if [[ -n "$EXCLUDE_PATTERN" ]]; then
        shopt -s nullglob extglob
        if [[ "$file" =~ $EXCLUDE_PATTERN ]]; then
            printf "# Excluded by exclude pattern: %s\n" "$file"
            continue
        fi
    fi

@padeoe
Author

padeoe commented Mar 24, 2024

> Hi, I passed --exclude *.safetensors but the safetensors files still get downloaded. What should I do?

I could not reproduce the problem you describe; please share your system environment details. Below are my test command and output. You can see that --exclude *.safetensors skips model.safetensors (the corresponding command is prefixed with a # comment):

$ hfd gpt2 --exclude *.safetensors --tool wget
Downloading to gpt2
Testing GIT_REFS_URL: https://hf-mirror.com/gpt2/info/refs?service=git-upload-pack
git clone https://hf-mirror.com/gpt2 gpt2
Cloning into 'gpt2'...
remote: Enumerating objects: 87, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 87 (delta 0), reused 0 (delta 0), pack-reused 84
Unpacking objects: 100% (87/87), done.

Start Downloading lfs files, bash script:
cd gpt2
wget -c "https://hf-mirror.com/gpt2/resolve/main/64-8bits.tflite" -O "64-8bits.tflite"
wget -c "https://hf-mirror.com/gpt2/resolve/main/64-fp16.tflite" -O "64-fp16.tflite"
wget -c "https://hf-mirror.com/gpt2/resolve/main/64.tflite" -O "64.tflite"
wget -c "https://hf-mirror.com/gpt2/resolve/main/flax_model.msgpack" -O "flax_model.msgpack"
# wget -c "https://hf-mirror.com/gpt2/resolve/main/model.safetensors" -O "model.safetensors"
wget -c "https://hf-mirror.com/gpt2/resolve/main/onnx/decoder_model.onnx" -O "onnx/decoder_model.onnx"
wget -c "https://hf-mirror.com/gpt2/resolve/main/onnx/decoder_model_merged.onnx" -O "onnx/decoder_model_merged.onnx"
wget -c "https://hf-mirror.com/gpt2/resolve/main/onnx/decoder_with_past_model.onnx" -O "onnx/decoder_with_past_model.onnx"
wget -c "https://hf-mirror.com/gpt2/resolve/main/pytorch_model.bin" -O "pytorch_model.bin"
wget -c "https://hf-mirror.com/gpt2/resolve/main/rust_model.ot" -O "rust_model.ot"
wget -c "https://hf-mirror.com/gpt2/resolve/main/tf_model.h5" -O "tf_model.h5"
Start downloading 64-8bits.tflite.

@XizhiMaLY

XizhiMaLY commented Mar 27, 2024

Could you help me figure out what is causing this?

$ hfd baichuan-inc/Baichuan2-7B-Chat --tool aria2c -x 4
Downloading to Baichuan2-7B-Chat
Baichuan2-7B-Chat exists, Skip Clone.
Already up to date.

Start Downloading lfs files, bash script:
cd Baichuan2-7B-Chat
aria2c --console-log-level=error -x 4 -s 4 -k 1M -c "https://hf-mirror.com/baichuan-inc/Baichuan2-7B-Chat/resolve/main/pytorch_model.bin" -d "." -o "pytorch_model.bin"
aria2c --console-log-level=error -x 4 -s 4 -k 1M -c "https://hf-mirror.com/baichuan-inc/Baichuan2-7B-Chat/resolve/main/tokenizer.model" -d "." -o "tokenizer.model"
Start downloading pytorch_model.bin.
[#b6d4e0 0B/0B CN:1 DL:0B]                                                                                                                                                                                                                                      
03/27 14:04:34 [ERROR] CUID#7 - Download aborted. URI=https://hf-mirror.com/baichuan-inc/Baichuan2-7B-Chat/resolve/main/pytorch_model.bin
Exception: [AbstractCommand.cc:403] errorCode=16 URI=https://cdn-lfs.hf-mirror.com/repos/98/6a/986a6f31289a1d4a20a3150d2e5a3962a2bf5091e6ff16b2cbea71139ed1a7c4/594b540d36751fa3b199c609617deafc0329050937c0767fdfb2d38b72d8bec9?response-content-disposition=att
achment%3B+filename*%3DUTF-8%27%27pytorch_model.bin%3B+filename%3D%22pytorch_model.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1711778674&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMTc3O
DY3NH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy85OC82YS85ODZhNmYzMTI4OWExZDRhMjBhMzE1MGQyZTVhMzk2MmEyYmY1MDkxZTZmZjE2YjJjYmVhNzExMzllZDFhN2M0LzU5NGI1NDBkMzY3NTFmYTNiMTk5YzYwOTYxN2RlYWZjMDMyOTA1MDkzN2MwNzY3ZmRmYjJkMzhiNzJkOGJlYzk%7E
cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=zLlJ3J5w6SEzD-jxs0awPm9ucnqcd2CJAqjyOR-ST2NCBYx1437UWtcn8PMNzXgzmRmzWbLG9yIJD0cT7gjU-dRBFTE12Q42mcBJYQK6RxXIkT--2cWjlt2gJM3EhDSf-58qwi8WN2N9BOfCVQFHNcJepHhRvkQZkcPZMC
1hpp9lXpptcPyHEcrAoxOAM96HgRNCD31OuiFD6%7E0wbfd1RcQca-cV6DH14zS1rzvAiw-2vQfWjYZ0w6VVtSwDYuTrgD1R2uVPffzliuCcC8wQL9fk%7EybpT2s7a7zTmnz1MCeY1n3S-bJr8mH1H6gtYUd4aZoJlmAKPRNqgBxfevmwaA__&Key-Pair-Id=KVTP0A1DKRTAX
  -> [RequestGroup.cc:760] errorCode=16 Download aborted.
  -> [AbstractDiskWriter.cc:224] errNum=13 errorCode=16 Failed to open the file ./pytorch_model.bin, cause: Permission denied

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
b6d4e0|ERR |       0B/s|./pytorch_model.bin

Status Legend:
(ERR):error occurred.

aria2 will resume download if the transfer is restarted.
If there are any errors, then see the log file. See '-l' option in help/man page for details.
Failed to download https://hf-mirror.com/baichuan-inc/Baichuan2-7B-Chat/resolve/main/pytorch_model.bin.

@bigcash

bigcash commented Apr 3, 2024

When downloading a dataset with hfd, if the transfer is interrupted partway, can it resume from where it left off? I tried it, and it told me the target directory already exists.

@padeoe
Author

padeoe commented Apr 3, 2024

> When downloading a dataset with hfd, if the transfer is interrupted partway, can it resume from where it left off? I tried it, and it told me the target directory already exists.

Resuming is supported. What exactly does the "target directory already exists" message look like?

@RileyRetzloff

This is amazing, thanks for sharing! I've already saved a ton of time using it.

I don't have a ton of shell scripting experience, but I wonder if it wouldn't be too difficult to implement some form of self-clean-up behavior at the end of the script to deal with the empty files/folders leftover from cloning the repo.

Maybe something like...

  • traverse the repo directory tree bottom-up, getting the name of each file and...
  • if file_name !matches --include pattern || matches --exclude pattern
  • delete file_name
  • for each directory on the way to the root, delete the directory if it contains 0 files after being traversed

@LucienShui

#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT

display_help() {
    cat << EOF
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]    

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
    exit 1
}

MODEL_ID=$1
shift

# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://huggingface.co"}

while [[ $# -gt 0 ]]; do
    case $1 in
        --include) INCLUDE_PATTERN="$2"; shift 2 ;;
        --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done

# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v $1 &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}

# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
    if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
        git config --global --add safe.directory "${PWD}"
        printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}" 
    fi
}

[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs

[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help

if [[ -z "$LOCAL_DIR" ]]; then
    LOCAL_DIR="${MODEL_ID#*/}"
fi

if [[ "$DATASET" == 1 ]]; then
    MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"

if [ -d "$LOCAL_DIR/.git" ]; then
    printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
    cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
    REPO_URL="$HF_ENDPOINT/$MODEL_ID"
    GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
    echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
    response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
    if [ "$response" == "401" ] || [ "$response" == "403" ]; then
        if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
            printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
            exit 1
        fi
        REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
    elif [ "$response" != "200" ]; then
        printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
        printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
        curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
    fi
    echo "git clone $REPO_URL $LOCAL_DIR"

    GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }

    ensure_ownership

    for file in $(git lfs ls-files | awk '{print $3}'); do
        truncate -s 0 "$file"
    done
fi

printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | awk '{print $3}')
declare -a urls

for file in $files; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done

for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}" 
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done

printf "${GREEN}Download completed successfully.\n${NC}"

@mayaobuduyao

Downloading to /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/
Testing GIT_REFS_URL: https://hf-mirror.com/meta-llama/Llama-2-7b-chat-hf/info/refs?service=git-upload-pack
git clone https://zhengxiao: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@hf-mirror.com/meta-llama/Llama-2-7b-chat-hf /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/
fatal: destination path '/home/zhengxiao/dataroot/models/Llama2_7b_chat_hf' already exists and is not an empty directory.
Git clone failed.
It says the directory already exists; how can I fix this?

@LucienShui

> Downloading to /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/ Testing GIT_REFS_URL: https://hf-mirror.com/meta-llama/Llama-2-7b-chat-hf/info/refs?service=git-upload-pack git clone https://zhengxiao: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@hf-mirror.com/meta-llama/Llama-2-7b-chat-hf /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/ fatal: destination path '/home/zhengxiao/dataroot/models/Llama2_7b_chat_hf' already exists and is not an empty directory. Git clone failed. It says the directory already exists; how can I fix this?

Answer from Qwen1.5-72B-Chat-q3:

If the directory already exists, you can go into it and check whether the model has already been downloaded. If it has, there is no need to download it again. If it has not, or you want to update to the latest version, follow these steps:

  1. Delete the existing directory: if you want to re-download, remove the existing directory first. In a terminal, run:

    rm -rf /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/

    This deletes the Llama2_7b_chat_hf directory and all of its contents.

  2. Re-download: after deleting it, run the download command again; it should recreate the directory and download the model.

If you are not sure whether the directory contains the model, check its contents first. If there are no model files, re-running the download command should simply continue the download. If model files are present but need updating, you may need to consult the model's update log or use a dedicated update command, depending on the specific model and download tool you use.

@cydiachen

Hello, I have an issue when downloading a large dataset like oscar-corpus/OSCAR-2301.
I only want to download a sub-folder, like its Chinese part. How would I execute the command?
I have tried --include resolve/main/en_meta or en_meta_*.zst, but it didn't work.

@padeoe
Author

padeoe commented Apr 25, 2024

> Hello, I have an issue when downloading a large dataset like oscar-corpus/OSCAR-2301. I only want to download a sub-folder, like its Chinese part. How would I execute the command? I have tried --include resolve/main/en_meta or en_meta_*.zst, but it didn't work.

Try --include 'en_meta/*'; the pattern matches against the full path of the target files.

@divisionblur

Hey, why isn't it downloading to the directory I specified with --local-dir?

@divisionblur

Found it: my mistake, I wrote the path wrong.

@divisionblur

Thank you so much for this script, downloads are really fast.

@XxxAtlantis

-> [SocketCore.cc:1015] errorCode=1 SSL/TLS handshake failure: not signed by known authorities or invalid' expired'

Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
cc2659|ERR | 0B/s|./pytorch_model-00001-of-00002.bin

Status Legend:
(ERR):error occurred.

aria2 will resume download if the transfer is restarted.
If there are any errors, then see the log file. See '-l' option in help/man page for details.

Why does this error occur?

@Tvrco

Tvrco commented May 9, 2024

Could you add a way to route a single run through a proxy temporarily?

@jieran233

Consider adding --file-allocation=none to aria2c? I think pre-allocation doubles the wear on SSDs.

@padeoe
Author

padeoe commented May 15, 2024

Some updates and related contributors.

@char-1ee

Do I need to install aria2c in advance, or am I using this incorrectly?

$ ./hfd.sh deepseek-ai/DeepSeek-V2-Chat --tool aria2c -x 4
aria2c is not installed. Please install it first.
