@shcheklein
Created December 19, 2022 20:26
Download LAION metadata
#!/bin/bash
# A script to download LAION metadata in parallel
#
# Based on https://github.com/rom1504/img2dataset/blob/main/dataset_examples/laion5B.md
#
# - Use a dedicated EC2 instance with high network bandwidth to make it faster
# - It should take about an hour to download everything
#
# Usage example for normal dataset:
#
# ./download.sh laion2B-en 5114fd87-297e-42b0-9d11-50f1df323dfa-c000
# ./download.sh laion2B-multi fc82da14-99c9-4ff6-ab6a-ac853ac82819-c000
# ./download.sh laion1B-nolang d6a94da9-d368-4d5b-9ab7-3f6d3c7abdb3-c000
#
# It uploads the missing parts to s3://dvc-private/laion/metadata.
# Run the same command a few times in a row to make sure that everything is
# downloaded.
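set -u
PREFIX=$1
HASH=$2
for i in {00000..00127}; do
  FILE="part-$i-$HASH.snappy.parquet"
  # Skip parts that are already in the bucket. Any non-zero exit status is
  # treated as "missing", because the exit code for a missing object differs
  # between AWS CLI versions (255 in v1, 254 in v2).
  if ! aws s3api head-object \
      --bucket dvc-private \
      --key "laion/metadata/$PREFIX/$FILE" \
      --output json; then
    echo
    # Stream the part from Hugging Face straight into S3, in the background.
    wget "https://huggingface.co/datasets/laion/$PREFIX/resolve/main/$FILE" -O - |
      aws s3 cp - "s3://dvc-private/laion/metadata/$PREFIX/$FILE" &
  fi
done
# Wait for all background transfers to finish before the script exits.
wait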
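
The header comment suggests rerunning the script until everything is downloaded. A minimal sketch of how one might check what is still missing between runs, reusing the same bucket and key layout. The helper names `part_key`, `list_missing`, and `in_bucket` are hypothetical and not part of the original script:

```shell
#!/bin/bash
# Hypothetical helpers for checking which parts are still missing.

# Build the S3 key for one part, mirroring the layout used by the script.
part_key() {
  local prefix=$1 index=$2 hash=$3
  printf 'laion/metadata/%s/part-%s-%s.snappy.parquet\n' "$prefix" "$index" "$hash"
}

# Print the key of every part (00000..00127) for which the check command
# fails. The check command receives the key as its only argument.
list_missing() {
  local prefix=$1 hash=$2 check=$3 n i key
  for n in $(seq 0 127); do
    i=$(printf '%05d' "$n")
    key=$(part_key "$prefix" "$i" "$hash")
    "$check" "$key" || echo "$key"
  done
}

# Assumed check: succeeds when the object exists in the dvc-private bucket.
in_bucket() {
  aws s3api head-object --bucket dvc-private --key "$1" >/dev/null 2>&1
}

# Usage (requires AWS credentials):
#   list_missing laion2B-en 5114fd87-297e-42b0-9d11-50f1df323dfa-c000 in_bucket
```

An empty result means all 128 parts are present. Passing the check as a function name keeps the loop testable without network access.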