@bwindsor
Last active February 7, 2020 09:29
Reproducibility of zip files (and Terraform)
#!/usr/bin/env python
"""
This provides reproducible zipping, i.e. the zip file doesn't change every time you recreate it unless the contents of a file within it have changed.
See https://unix.stackexchange.com/questions/14705/the-zip-formats-external-file-attribute since zip files otherwise differ because of file permissions.
"""
import os
import shlex
import shutil
import sys
from subprocess import check_output
from tempfile import TemporaryDirectory


def reproducible_zip(input_dir, zip_file):
    # Resolve to an absolute path: the zip command below runs from inside the temporary directory
    zip_file = os.path.abspath(zip_file)
    if os.path.exists(zip_file):
        raise FileExistsError("{0} already exists".format(zip_file))
    with TemporaryDirectory() as td:
        # Work on a copy so the source tree keeps its own timestamps and permissions
        # (dirs_exist_ok requires Python 3.8+)
        shutil.copytree(input_dir, td, dirs_exist_ok=True)
        rewrite_timestamps(td)
        # Sort the file list so entries are always added in the same order
        command = f"cd {shlex.quote(td)} && find . -print | sort | zip -X {shlex.quote(zip_file)} -@"
        check_output(command, shell=True)


def rewrite_timestamps(input_dir):
    for parent, folders, files in os.walk(input_dir):
        # Directories: fixed permissions and a fixed modified time (touch -t format is YYMMDDhhmm)
        quoted = shlex.quote(parent)
        check_output(f"chmod 777 {quoted} && touch -t 1701011215 {quoted}", shell=True)
        for filename in files:
            # Files: fixed permissions and a fixed modified time (seconds since the epoch)
            path = os.path.join(parent, filename)
            os.chmod(path, 0o666)
            os.utime(path, (1500000000, 1500000000))


def display_help():
    print("""
USAGE
    reproducible_zip.py INPUT_DIR OUTPUT_FILE
    e.g. reproducible_zip.py my_directory output.zip
""")


if __name__ == '__main__':
    if len(sys.argv) < 3:
        display_help()
        sys.exit(1)
    input_dir = sys.argv[1]
    output_file = sys.argv[2]
    reproducible_zip(input_dir, output_file)

When deploying with Terraform I wanted zip files to be reproducible. By that I mean that the hashes of the zip files are identical: every byte matches.

The contents of two zip files can be identical, but various things will cause their hashes to be different.
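As a small illustration of the point above, here is a sketch using Python's standard zipfile module rather than the zip command line tool. The helper names zip_bytes and sha256_hex are made up for this example. It builds the same one-file archive three times in memory and shows that changing only the stored modification time changes the hash.

```python
import hashlib
import io
import zipfile


def zip_bytes(date_time):
    """Build a one-file archive in memory with the given modification time."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        info = zipfile.ZipInfo("hello.txt", date_time=date_time)
        archive.writestr(info, "same contents")
    return buffer.getvalue()


def sha256_hex(data):
    """Return the SHA-256 digest of a bytes object as a hex string."""
    return hashlib.sha256(data).hexdigest()


first = zip_bytes((2017, 1, 1, 12, 15, 0))
second = zip_bytes((2017, 1, 1, 12, 16, 0))  # same contents, one minute later
repeat = zip_bytes((2017, 1, 1, 12, 15, 0))  # same contents, same metadata

print(sha256_hex(first) == sha256_hex(second))  # False: only the timestamp differs
print(sha256_hex(first) == sha256_hex(repeat))  # True: identical metadata, identical bytes
```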

These are the things which were required in order to make this happen. Each has more detail below. The attached Python function puts all this together to take a folder and turn it into a reproducible zip file.

  1. Files must be added to the zip archive in the same order
  2. Files must have identical last modified timestamps
  3. Files must have identical permissions

File order

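If the order is different, the hash is different. To solve this, you just have to sort the files before adding them to a zip file. To zip directory INPUT_DIR into zip file ZIP_FILE (use an absolute path for ZIP_FILE, since the command runs from inside INPUT_DIR):

cd "$INPUT_DIR"
find . -print | sort | zip -X "$ZIP_FILE" -@

The -X flag stops zip from saving extra file attributes (such as UID/GID and extended timestamps), and -@ makes it read the list of names from standard input.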
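For comparison, the same idea can be sketched in pure Python with the standard zipfile module. This is only an illustration, not what the attached script does: the function names are hypothetical, and unlike find . -print it adds files only, with no separate directory entries.

```python
import os
import tempfile
import zipfile


def sorted_archive_names(input_dir):
    """Return every file under input_dir as a relative path, in sorted order."""
    names = []
    for parent, _folders, files in os.walk(input_dir):
        for filename in files:
            rel = os.path.relpath(os.path.join(parent, filename), input_dir)
            names.append(rel.replace(os.sep, "/"))
    return sorted(names)


def zip_in_sorted_order(input_dir, zip_path):
    """Add files to the archive in sorted order so the entry order is stable."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for name in sorted_archive_names(input_dir):
            archive.write(os.path.join(input_dir, name), arcname=name)


# Small demonstration: files are created out of order but stored in sorted order
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "src")
    os.makedirs(os.path.join(source, "sub"))
    for rel in ("b.txt", "sub/c.txt", "a.txt"):
        with open(os.path.join(source, rel), "w") as handle:
            handle.write(rel)
    target = os.path.join(workdir, "out.zip")
    zip_in_sorted_order(source, target)
    with zipfile.ZipFile(target) as archive:
        listed = archive.namelist()

print(listed)  # ['a.txt', 'b.txt', 'sub/c.txt']
```

Sorting fixes the order only; timestamps and permissions still need the treatment described in the following sections.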

Modified times

Even if the hashes of all the individual file contents in the zip are identical, their metadata may not be. Assuming we do not care about preserving modification times and just want the zip to be reproducible, this is achievable by running the command

touch -t 1701011215 FILE_NAME

on every file in the source directory before adding them to the zip archive. Here it will set their modified time to 12:15 on 1st January 2017, interpreted in the local time zone.
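The same step can be sketched in Python with os.utime, which is also what the attached script uses for files. The function name pin_mtimes and the constant FIXED_MTIME are invented for this example, and any fixed value would do; 1500000000 is simply the epoch timestamp the script happens to use (14 July 2017, 02:40 UTC).

```python
import os
import tempfile

FIXED_MTIME = 1500000000  # arbitrary fixed time, in seconds since the epoch


def pin_mtimes(input_dir, timestamp=FIXED_MTIME):
    """Set the access and modified time of every directory and file under input_dir."""
    for parent, _folders, files in os.walk(input_dir):
        os.utime(parent, (timestamp, timestamp))
        for filename in files:
            os.utime(os.path.join(parent, filename), (timestamp, timestamp))


# Small demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "myfile.txt")
    with open(path, "w") as handle:
        handle.write("hello")
    pin_mtimes(workdir)
    pinned = int(os.stat(path).st_mtime)

print(pinned == FIXED_MTIME)  # True
```

Run this on a copy of the source directory if the original modification times matter.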

Permissions

Zip files also store "external file attributes", 4 bytes for each file, which include the file permissions. These four bytes are explained in this excellent Unix & Linux Stack Exchange answer: https://unix.stackexchange.com/questions/14705/the-zip-formats-external-file-attribute. Bits 5-16 are most likely to be different, but bits 24-32 may get affected on Windows. So additionally, before adding files and folders to the zip file, run

chmod 777 DIR_NAME

on every directory, and

chmod 666 FILE_NAME

on every file.
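To see where the permissions end up, here is a sketch using Python's zipfile module instead of the zip command. It assumes the common Unix convention that the upper two bytes of the external attributes hold the file mode; the helper add_with_mode is hypothetical. It writes an entry with mode 666 and reads the mode back from the archive.

```python
import io
import stat
import zipfile


def add_with_mode(archive, name, data, mode):
    """Write one entry with a fixed timestamp and an explicit Unix permission mode."""
    info = zipfile.ZipInfo(name, date_time=(2017, 1, 1, 12, 15, 0))
    info.create_system = 3  # 3 means Unix, so readers treat the upper bits as a Unix mode
    info.external_attr = (stat.S_IFREG | mode) << 16  # regular file plus permission bits
    archive.writestr(info, data)


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    add_with_mode(archive, "myfile.txt", "hello", 0o666)

# Read the archive back and recover the stored mode
with zipfile.ZipFile(buffer) as archive:
    stored = archive.getinfo("myfile.txt").external_attr >> 16

print(stat.filemode(stored))  # -rw-rw-rw-
```

Two source files with the same contents but different modes would give different values here, which is why the chmod step is needed before zipping.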

Terraform

The archive_file data source from the Terraform archive provider seems to give reproducible results, provided you use source blocks and not the source_file parameter. It looks like source_file just zips up that file, including its permissions and timestamps, whereas the output from source blocks seems to be independent of them. In code, use this:

data "archive_file" "zip_file" {
  type        = "zip"
  output_path = "archive.zip"

  source {
    content  = file("myfile.txt")
    filename = "myfile.txt"
  }
}

Not this:

data "archive_file" "init" {
  type        = "zip"
  source_file = "myfile.txt"
  output_path = "archive.zip"
}