So you generated 50000 images for computing FID or whatever, and now you want to upload those samples to HF.
You try, but one of the file transfers fails, and you lose all your progress.
I mean it'd be nice if HF could just… fix this… like, put retries into huggingface-cli upload
instead of just discarding tens of gigabytes of progress… but we live in the world in which we live.
So let's make it easier: instead of 50k small files, let's upload 50 big files. Collate 'em into .tars.
I'm not sure this makes a valid WDS (WebDataset), but it's close; I think you would need to rename the files to 000000.img.png if you wanted that.
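If you do want WebDataset-style naming, a rename along these lines (an untested sketch, run from inside the samples directory) should do it:

for f in *.png; do mv "$f" "${f%.png}.img.png"; done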
Directory structure (as given by the tree command):
samples
├── 000000.png
├── 000001.png
├── …
└── 049999.png
Let's make a sibling directory, splits:
.
├── samples
│ ├── 000000.png
│ ├── 000001.png
│ ├── …
│ └── 049999.png
└── splits
Ensure you are cd'ed into the splits directory.
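Assuming you start in the directory that contains samples, that's just:

mkdir -p splits && cd splits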
We'll generate text files x00…x49, each listing the 1000 files that go in one chunk:
split -l 1000 --numeric-suffixes --suffix-length=2 <(find ../samples -type f -name '*.png' -printf '%P\n' | sort -V)
Now we have the following files:
.
├── samples
│ ├── 000000.png
│ ├── 000001.png
│ ├── …
│ └── 049999.png
└── splits
├── x00
├── x01
├── …
└── x49
Split files such as x00 have content like this (a list of files):
000000.png
000001.png
…
000999.png
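Before tarring, a quick sanity check that the splits cover all 50000 files with no duplicates (plain coreutils, nothing HF-specific):

cat x?? | wc -l
cat x?? | sort | uniq -d

The first should print 50000; the second should print nothing.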
Still in the splits directory, let's make a tar directory:
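mkdir tar

The structure now looks like this: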
.
├── samples
│ ├── 000000.png
│ ├── 000001.png
│ ├── …
│ └── 049999.png
└── splits
├── tar
├── x00
├── x01
├── …
└── x49
Now let's read the file listings in every x00…x49 split, and create a .tar chunk from each, naming each tar after the index of its first file:
for i in {0..49}; do tar -C ../samples/ -cvf "$(printf 'tar/%02d000.tar' $i)" --files-from "$(printf 'x%02d' $i)"; done
This gives us a folder of .tars:
.
├── samples
│ ├── 000000.png
│ ├── 000001.png
│ ├── …
│ └── 049999.png
└── splits
├── tar
│ ├── 00000.tar
│ ├── 01000.tar
│ ├── …
│ └── 49000.tar
├── x00
├── x01
├── …
    └── x49
Each such tar contains 1000 pngs:
tar -tvf tar/00000.tar
-rw-rw-r-- birch/birch 1174871 2024-02-02 23:51 000000.png
-rw-rw-r-- birch/birch 1415042 2024-02-02 23:51 000001.png
…
-rw-rw-r-- birch/birch 1488682 2024-02-02 23:57 000999.png
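To check every chunk rather than eyeballing just the first, a quick loop over the whole tar directory:

for t in tar/*.tar; do printf '%s\t%s\n' "$t" "$(tar -tf "$t" | wc -l)"; done

Each line should report 1000.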
cd into the tar directory, and upload all its files to a dataset on HF:
huggingface-cli upload --repo-type=dataset hfusername/my-cool-dataset . .
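And if a transfer still fails partway, retrying is now cheap: chunks that already made it through shouldn't need to be re-transferred. A minimal retry wrapper (a sketch, using the same placeholder repo name as above):

until huggingface-cli upload --repo-type=dataset hfusername/my-cool-dataset . .; do
  echo 'upload failed; retrying in 30s' >&2
  sleep 30
done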
If you don't care about chunking, here's another command I found useful for the following situation:

produces:

where cat.tar contains: