@Birch-san
Last active April 24, 2024 12:07
Chunking a folder of pngs into .tar files

Uploading a folder of many files to HF, by chunking it into .tars

So you generated 50000 images for computing FID or whatever, and now you want to upload those samples to HF.
You try, but one of the file transfers fails, and you lose all your progress.
I mean it'd be nice if HF could just… fix this… like, put retries into huggingface-cli upload instead of just discarding tens of gigabytes of progress… but we live in the world in which we live.

So let's make it easier. Instead of 50k small files, let's upload 50 big files. Collate 'em into .tars.

I'm not sure this makes a valid WebDataset (WDS), but it's close; I think you would need to rename the files to 000000.img.png if you wanted that.

Starting point

Directory structure (as given by tree command):

samples
├── 000000.png
├── 000001.png
├──      …
└── 049999.png

Let's make a sibling directory, splits (mkdir splits):

.
├── samples
│   ├── 000000.png
│   ├── 000001.png
│   ├──      …
│   └── 049999.png
└── splits

Create splits

Make sure you have cd'd into the splits directory.

We'll generate text files x00…x49 detailing the list of files we want in each chunk:

split -l 1000 --numeric-suffixes --suffix-length=2 <(find ../samples -type f -name '*.png' -printf '%P\n' | sort -V)
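
If you want a different chunk size, here's a rough sketch of the same pipeline wrapped in a function. The name make_splits and its two arguments are my own invention, not anything standard. It assumes GNU find, sort and split, and it puts the -type and -name tests before -printf so that only the PNG files get printed.

```shell
# Hypothetical helper: write x00, x01, … into the current directory,
# each listing up to $2 (default 1000) PNGs found under directory $1.
make_splits() (
  samples_dir=$1
  per_chunk=${2:-1000}
  # %P prints each path relative to $samples_dir; sort -V orders names naturally.
  find "$samples_dir" -type f -name '*.png' -printf '%P\n' \
    | sort -V \
    | split -l "$per_chunk" --numeric-suffixes --suffix-length=2 - x
)
```

From the splits directory you'd call it as make_splits ../samples 1000. With a two-digit suffix, split can only produce x00 to x99, so pick a chunk size that needs at most 100 chunks, or raise --suffix-length.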

Now we have the following files:

.
├── samples
│   ├── 000000.png
│   ├── 000001.png
│   ├──      …
│   └── 049999.png
└── splits
    ├── x00
    ├── x01
    ├──  …
    └── x49

Split files such as x00 have content like this (a list of files):

000000.png
000001.png
…
000999.png
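
Before tarring anything, it's cheap to check that the split files together name every sample exactly once. This is just a sketch: check_splits is a made-up name, and it assumes GNU find and that the split files match x[0-9][0-9].

```shell
# Hypothetical check: succeed only if the x?? lists in the current directory
# together name every PNG under directory $1 exactly once.
check_splits() (
  samples_dir=$1
  expected=$(find "$samples_dir" -type f -name '*.png' -printf '%P\n' | sort)
  listed=$(cat x[0-9][0-9] | sort)
  # A missing, extra, or duplicated entry makes the two sorted lists differ.
  [ "$expected" = "$listed" ]
)
```

Run it from the splits directory, e.g. check_splits ../samples && echo ok.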

tar the splits

Still in the splits directory, let's make a tar directory (mkdir tar):

.
├── samples
│   ├── 000000.png
│   ├── 000001.png
│   ├──      …
│   └── 049999.png
└── splits
    ├── tar
    ├── x00
    ├── x01
    ├──  …
    └── x49

Now let's read the file listings in every x00…x49 split, and create .tar chunks of said file listings:

for i in {0..49}; do tar -C ../samples/ -cvf "$(printf 'tar/%02d000.tar' "$i")" --files-from "$(printf 'x%02d' "$i")"; done
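
The loop above hardcodes 50 chunks. If you don't know the count up front, a variant can loop over whatever split files exist. This is a sketch under the same assumptions (GNU tar, split files named x00, x01, …); tar_splits is a hypothetical name.

```shell
# Hypothetical helper: run from the splits directory; $1 is the samples directory.
# Creates tar/NN000.tar from each list file xNN.
tar_splits() (
  samples_dir=$1
  mkdir -p tar
  for list in x[0-9][0-9]; do
    n=${list#x}   # "07" from "x07"
    # -C makes tar resolve the listed names relative to the samples directory.
    tar -C "$samples_dir" -cf "tar/${n}000.tar" --files-from "$list" || exit 1
  done
)
```

Usage would be tar_splits ../samples. The "000" in the archive name only matches the first sample's index when each chunk holds 1000 files; with another chunk size it's just a label.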

This gives us a folder of .tars:

.
├── samples
│   ├── 000000.png
│   ├── 000001.png
│   ├──      …
│   └── 049999.png
└── splits
    ├── tar
    │   ├── 00000.tar
    │   ├── 01000.tar
    │   ├──     …
    │   └── 49000.tar
    ├── x00
    ├── x01
    ├──  …
    └── x49

Each such tar contains 1000 pngs:

tar -tvf tar/00000.tar
-rw-rw-r-- birch/birch 1174871 2024-02-02 23:51 000000.png
-rw-rw-r-- birch/birch 1415042 2024-02-02 23:51 000001.png
…
-rw-rw-r-- birch/birch 1488682 2024-02-02 23:57 000999.png
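
Checking 50 archives by eye gets old, so here's a small sketch that prints the member count of each one. count_tar_members is a name I made up; it only relies on tar -tf and wc.

```shell
# Hypothetical helper: print "<archive> <member count>" for each .tar in directory $1.
count_tar_members() (
  for t in "$1"/*.tar; do
    [ -f "$t" ] || continue   # the glob matched nothing
    # $(( … )) strips any padding that wc may add around the number.
    printf '%s %s\n' "$t" "$(($(tar -tf "$t" | wc -l)))"
  done
)
```

From the splits directory, count_tar_members tar | awk '$2 != 1000' should print nothing if every chunk holds 1000 files (the last chunk will be shorter if your total isn't a multiple of 1000).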

Uploading to HF

cd into the tar directory, and upload all its files to a dataset on HF:

huggingface-cli upload --repo-type=dataset hfusername/my-cool-dataset . .
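
If even the 50-file upload is flaky, one workaround is to upload the archives one at a time and retry each. The retry function below is a generic sketch of my own, not part of huggingface-cli. The commented usage assumes, as far as I can tell from the CLI docs, that huggingface-cli upload accepts a single file as its local path, with hfusername/my-cool-dataset being the same placeholder as above.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts. Succeeds as soon as one attempt succeeds.
retry() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  until "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      exit 1   # out of attempts; report failure to the caller
    fi
    i=$((i + 1))
    sleep "$delay"
  done
)

# Usage sketch (not run here): upload each .tar on its own, so a failure
# only costs the file in flight rather than the whole batch.
# for t in *.tar; do
#   retry 5 30 huggingface-cli upload --repo-type=dataset hfusername/my-cool-dataset "$t" "$t" || break
# done
```

The attempt count of 5 and the 30-second pause are arbitrary; tune them to taste.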

Birch-san commented Apr 24, 2024

if you don't care about chunking, here's another command I found useful for the following situation:

.
├── cat
│   ├── 001.png
│   ├── 002.png
│   ├──      …
│   └── 999.png
└── dog
    ├── 001.png
    ├── 002.png
    ├──      …
    └── 999.png

find . -mindepth 1 -maxdepth 1 -type d -exec tar -cvf {}.tar {} \;

produces:

.
├── cat.tar
├── cat
│   ├── 001.png
│   ├── 002.png
│   ├──      …
│   └── 999.png
├── dog.tar
└── dog
    ├── 001.png
    ├── 002.png
    ├──      …
    └── 999.png

where cat.tar contains:

.
└── cat
    ├── 001.png
    ├── 002.png
    ├──      …
    └── 999.png
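
For what it's worth, the same thing can be sketched with a plain glob instead of find, which keeps the "." directory out of the picture and stores members as cat/001.png rather than ./cat/001.png. tar_each_subdir is a hypothetical name, and note that a */ glob skips hidden directories.

```shell
# Hypothetical helper: inside directory $1, create NAME.tar next to each
# subdirectory NAME.
tar_each_subdir() (
  cd "$1" || exit 1
  for d in */; do
    [ -d "$d" ] || continue   # the glob matched nothing
    d=${d%/}                  # "cat/" -> "cat"
    tar -cf "$d.tar" "$d" || exit 1
  done
)
```

Call it as tar_each_subdir . from the directory holding cat and dog, then tar -tf cat.tar to see what went in.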
