@AlexAtkinson
Last active March 8, 2024 16:02
A quick DevOps primer on archive artifacting for the SDLC.

Artifacting

RELATED: Versioning.md

This primer on artifacting demonstrates how to package files as zip and tar.gz, leverage a .artifactignore file similar to .gitignore, and generate and use a checksum file.

Artifacting is the process of packaging a project for distribution and/or release, and is essential to the SDLC as it mitigates many risks in both producing and consuming software products. Aside from archives, there are binaries and other language-specific formats and frameworks with their own packaging methods, but those are outside the scope of this document. If you're interested, the serverless packaging mechanism is a good demonstration of some of the same concepts discussed here.

Generally artifacts should conform to a standard naming scheme such as: '<product>-<version>.<extension>' (the form used by the scripts below).

💡 See 12factor and O'Reilly:Beyond the Twelve-Factor App for more on maturing your SDLC.

🗒️ ProTip - Set the 'product' variable to '${PWD##*/}' when executing from the top level of a project, or use 'topdir=$(git rev-parse --show-toplevel); product=${topdir##*/}' when executing anywhere inside a git repo, to automatically set the name of the artifact.
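The ProTip above can be sketched as a small snippet; it assumes you are inside a git working tree and falls back to the current directory name otherwise:

```shell
# Derive the artifact's product name automatically.
# Inside a git repo, use the repository's top-level directory name;
# otherwise fall back to the current directory name.
topdir=$(git rev-parse --show-toplevel 2>/dev/null || pwd)
product=${topdir##*/}
echo "product: ${product}"
```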

Mitigated Risks & Benefits (Not all inclusive)

  • Inconsistent package content from one build to the next for languages with volatile software supply chains, such as JavaScript or Python. For example, a QA build might have a dependency of a dependency at 0.1.0, while a prod build may have it at 1.0.2, as this dependency hell is out of our control.
  • Inconsistent configuration data. See here for a great article on this.
  • The cost of multiple build-test cycles across each environment by establishing a build-once posture. Related to the first point, with the risk of builds diverging for each environment, each must be tested discretely to qualify them against quality standards.
  • Instrumenting the product with build and release pipelines early ensures early identification of related issues.
  • Storing credentials in the repository, by ensuring software ingests environment variables at process start.
  • Operational complexity (cost++) surrounding '-rc', or '-beta' style pre-release identifiers. (There are some exceptions to this.)
  • The potential for software to be modified post-release, via the inclusion of a file hash for verification.

💡 To reiterate the point on configuration data: Configuration & Dependency Ownership is a critical concern that many organizations struggle with. Operations Owns Configuration Data and 3rd Party Service Dependencies. I.e.: if the app depends on a 3rd party weather service, then Ops has to set up and manage the service.

Archive Artifacts

Archive type artifacts generally come in the form of a zip or tar.gz file. The following examples demonstrate how to accomplish this on Linux, using modern GNU utilities. (macOS native utilities are not guaranteed to function. Tip: run brew install zip unzip gnu-tar)

For the versioning aspect of this, use either semver or calver, but do use one of them. Don't make up something that nobody else will understand.

NOTE: Aligning to industry standards is important as many SDLC toolchain components expect to be able to validate/parse your version string.
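As a sketch, a simplified semver check in shell (this pattern covers MAJOR.MINOR.PATCH with optional pre-release and build metadata; the full grammar on semver.org is stricter about leading zeros and identifier rules):

```shell
# Simplified semver validation; not the full official grammar.
is_semver() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$'
}

is_semver '1.4.2-rc.1' && echo valid    # matches
is_semver 'v1.4' || echo invalid        # 'v' prefix and missing PATCH rejected
```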

NOTE: If both zip and tar.gz artifacts will be produced for a project, special attention must be given to the pattern matching used by each utility, as they are not the same. For example, to fully exclude the 'foo' directory from the archive, include 'foo/' and 'foo/**' for zip, and '**/foo' and '**/foo/*' for tar. WARN: Zip does not support '**/..' style patterns.
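The tar side of the note above can be demonstrated with a scratch directory (zip requires its own differing patterns, as described):

```shell
# Demonstrate the tar exclusion patterns from the note above.
# Run from an empty scratch directory.
mkdir -p src/foo src/keep
echo x > src/foo/a.txt
echo y > src/keep/b.txt

# '**/foo' excludes the directory itself; '**/foo/*' its contents.
tar -C src --exclude='**/foo' --exclude='**/foo/*' -czf demo.tgz .

tar -tzf demo.tgz   # lists ./keep/b.txt and nothing under foo/
```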

Zip

The following script packages the current directory contents ignoring anything specified in the '.artifactignore' file.

NOTE: Exclusion patterns are NOT like those used by .gitignore. For example, the pattern '**/foo' does not function as it does with tar or git. See here for more.

create-artifact-zip.sh

#!/usr/bin/env bash
# Usage: ./create-artifact-zip.sh <project-name> <version>
product=$1
version=$2
tmp_dir='/tmp'
exclude_file='.artifactignore'
artifact_name="${product}-${version}.zip"

# -r recurses into subdirectories; -x@ reads exclusion patterns from a file.
zip -r "${tmp_dir}/${artifact_name}" . -x@"${exclude_file}"

Usage:

./create-artifact-zip.sh <project-name> <version>

Tar.gz

The following script packages the current directory contents, ignoring anything specified in the '.artifactignore' file.

NOTE: Exclusion patterns are like those used by .gitignore, which you can reference here.

create-artifact-tar-gz.sh

#!/usr/bin/env bash
# Usage: ./create-artifact-tar-gz.sh <project-name> <version>
product=$1
version=$2
tmp_dir='/tmp'
exclude_file='.artifactignore'
artifact_name="${product}-${version}.tgz"

# -X reads exclusion patterns from a file; -z gzip, -c create, -v verbose, -f output file.
tar -X "${exclude_file}" -zcvf "${tmp_dir}/${artifact_name}" .

Usage:

./create-artifact-tar-gz.sh <project-name> <version>

.artifactignore

The contents of the .artifactignore file will generally be language and project specific. Have a look at the github/gitignore repo for a collection of .gitignore templates to get started.

As an example, the .artifactignore file may contain the following:

.artifactignore
.gitignore
**/README.md
**/logs
**/*.log
**/trace.*
**/*.env
.git/
.git/**
.github/
.github/**
node_modules

Checksum

A checksum is a cryptographic hash of a file's contents: if the file is changed, the checksum will no longer match.

Assuring consumers that the artifact they receive is complete and unmodified is a critical component of the artifacting strategy. Originally, providing an md5sum with a file was important because internet technologies could not always guarantee complete delivery of a file; it is even more important these days, as bad actors may attempt to modify a file in a malicious way.

NOTE: md5 is no longer recognized as secure. Use sha256.

Generation

Checksum files can be generated discretely for each file:

$ echo dog > pets.txt
$ sha256sum pets.txt > pets.txt.sha256

Or for multiple files:

$ echo dog > pets.txt
$ echo wrench > tools.txt
$ sha256sum pets.txt >> sha256sums.txt
$ sha256sum tools.txt >> sha256sums.txt

Validation

Files can be validated individually:

$ sha256sum -c pets.txt.sha256
pets.txt: OK
$ echo cat > pets.txt
$ sha256sum -c pets.txt.sha256
pets.txt: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match

Or multiple files can be validated at once:

$ sha256sum -c sha256sums.txt
tools.txt: OK
pets.txt: OK
$ echo cat > pets.txt
$ sha256sum -c sha256sums.txt
tools.txt: OK
pets.txt: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match

The sharp-eyed will have noticed that there is operationally no difference between generating and validating checksums for a single file or for multiple files. While some prefer a single file with all the checksums, some users or systems rely on being able to retrieve a file and its related checksum file discretely. Why not do both?
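Following that suggestion, a sketch that emits both forms for every artifact in the current directory (the '*.zip'/'*.tgz' globs are an assumption about your artifact names):

```shell
# Generate a per-file '.sha256' and a combined 'sha256sums.txt'
# for each artifact in the current directory.
: > sha256sums.txt
for f in *.zip *.tgz; do
  [ -e "$f" ] || continue           # skip unmatched globs
  sha256sum "$f" > "${f}.sha256"    # discrete checksum file
  sha256sum "$f" >> sha256sums.txt  # combined checksum file
done
```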

Implementation

Delivery Assurance

Simply generate a '.sha256', or 'sha256sums.txt' file (or both) with the artifact(s) and make it available. Clients can download and validate as needed.
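For instance, a consumer-side helper might look like the following (the function name is illustrative; sha256sum -c does the real work, and it assumes the artifact and its '.sha256' file sit in the current directory):

```shell
# Verify a downloaded artifact against its discrete checksum file.
verify_artifact() {
  sha256sum -c "${1}.sha256"
}

# e.g. after downloading myapp-1.2.3.tgz and myapp-1.2.3.tgz.sha256:
# verify_artifact myapp-1.2.3.tgz
```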

Bad Actor Mitigation

Relying on GitHub's Release feature ensures that the checksum files cannot be modified after release without there being a record. Additionally, the checksums can be included in the release notes.

Where artifacts are made available via another method, such as a website, distributing a record of the artifacts and their checksums via another channel, such as the product docs, is a good way to provide another layer of assurance to consumers.

BONUS: You can establish an asset healthcheck system by emitting artifact URLs and checksums into a database so that they can be validated intermittently. Depending on the type of product or service being run, this could be considered among the most basic of operational assurances.
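Such a healthcheck could be sketched as below; the flat 'assets.db' file with one '<url> <sha256>' record per line is an assumption standing in for a real database:

```shell
# Intermittent asset healthcheck: re-fetch each asset and compare
# its current checksum against the recorded one.
while read -r url sum; do
  actual=$(curl -fsSL "$url" | sha256sum | cut -d' ' -f1)
  if [ "$actual" = "$sum" ]; then
    echo "OK   $url"
  else
    echo "FAIL $url"
  fi
done < assets.db
```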
