@theRealSuperMario
Created February 21, 2020 14:16
Download the unaligned CelebA ("in the wild") images as a .tgz archive, which is significantly faster to extract than the original .7z

The CelebA dataset is described here:

http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

  • img_celeba.tgz: contains the unaligned "in the wild" images
    • Originally, the dataset was released as a .7z archive split into 14 subfiles (.7z.001 ... .7z.014).
    • The problem is that unpacking .7z is not parallelized on Linux (https://unix.stackexchange.com/questions/210671/7-zip-slows-down-over-time-on-ubuntu-but-not-windows).
    • It is therefore significantly faster to
      1. download the files,
      2. extract them on Windows,
      3. compress them again in a different format,
      4. move the new archive to the server and decompress it there with a parallelized compression tool.
    • To save others this hassle, I host the .tgz-compressed file here.
    • Use the following commands to download and extract it:
    wget http://datasets.sandrobraun.de/celeba/img_celeba.tgz
    pigz -dc img_celeba.tgz | tar xf -
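
The extraction step above assumes `pigz` is installed. A minimal sketch of a fallback follows, assuming a POSIX shell and a GNU or BSD `tar`. The function name `extract_tgz` and its optional destination argument are hypothetical and not part of the original commands.

```shell
#!/bin/sh
# Hypothetical helper (not from the original gist):
# extract a .tgz archive into a destination directory (default: current directory).
# Uses pigz when it is installed, otherwise falls back to tar's built-in gzip support.
extract_tgz() {
    archive="$1"
    dest="${2:-.}"
    mkdir -p "$dest" || return 1
    if command -v pigz >/dev/null 2>&1; then
        # Same pipeline as above, with an explicit target directory.
        pigz -dc "$archive" | tar xf - -C "$dest"
    else
        # Plain gzip decompression through tar itself.
        tar xzf "$archive" -C "$dest"
    fi
}
```

After downloading the archive with `wget`, it could be called as `extract_tgz img_celeba.tgz`, or as `extract_tgz img_celeba.tgz /some/target/dir` to unpack somewhere else.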