@ato
Last active March 5, 2019 13:06
WARC conversion software fields draft

WARC conversion software fields (draft)

When converting content in an archive, it is useful for diagnostic purposes to record the versions of the major software components used and the important conversion options. Another common use case is identifying records that later need to be reconverted with newer software, either to improve conversion quality or to fix records misconverted because of a bug or an incorrect option.

WARC-Conversion-Software

The WARC-Conversion-Software field indicates the version of the software components used in the conversion of the record's content. The field value has the same format as an HTTP User-Agent field (see RFC 7231, section 5.5.3) and consists of a list of one or more product identifiers and zero or more comments.

WARC-Conversion-Software = product *( RWS ( product / comment ) )

product         = token [ "/" product-version ]
product-version = token
comment         = "(" *( ctext / quoted-pair / comment ) ")"

For example:

WARC-Conversion-Software: ImageMagick/6.9.9-38 (x86 linux)
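A reader of such records needs to split the field value back into its products and comments. The following is a minimal sketch of one way to do that in Python, not part of the draft itself. The function name `parse_conversion_software` and the tuple layout it returns are illustrative assumptions, the token character set is taken from the HTTP `tchar` rule, and whitespace handling is more lenient than the grammar strictly requires.

```python
import re

# Token characters as used by HTTP header fields (the "tchar" rule).
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
PRODUCT_RE = re.compile(rf"({TOKEN})(?:/({TOKEN}))?")


def parse_conversion_software(value):
    """Split a field value into ("product", name, version) and
    ("comment", text) tuples. Hypothetical helper, not defined by the draft."""
    items = []
    i, n = 0, len(value)
    while i < n:
        ch = value[i]
        if ch in " \t":
            i += 1  # skip whitespace between items
        elif ch == "(":
            # Scan to the matching ")" while honoring nested comments
            # and backslash-escaped characters (quoted-pair).
            depth, j = 0, i
            while j < n:
                if value[j] == "\\":
                    j += 1  # skip the escaped character
                elif value[j] == "(":
                    depth += 1
                elif value[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if j >= n:
                raise ValueError("unterminated comment")
            items.append(("comment", value[i + 1:j]))
            i = j + 1
        else:
            m = PRODUCT_RE.match(value, i)
            if m is None:
                raise ValueError(f"unexpected character at offset {i}")
            items.append(("product", m.group(1), m.group(2)))
            i = m.end()
    if not items or items[0][0] != "product":
        raise ValueError("field value must start with a product")
    return items
```

Applied to the example value above, the sketch yields one product tuple for `ImageMagick` with version `6.9.9-38`, followed by one comment tuple containing `x86 linux`. A product without a version is reported with `None` in the version position.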

Multiple product identifiers may be used to indicate the version of important subcomponents such as codec libraries used when encoding a video.

WARC-Conversion-Software: ffmpeg/4.0.3 libvpx/1.8.0 libopus/1.3

When product identifiers represent multiple steps in a processing pipeline, they should be listed in processing order; otherwise, they should be listed in decreasing order of significance for identifying the software. For example, a TIFF image decoded with an unknown version of libtiff and then re-encoded with libjpeg version 9c could be recorded as:

WARC-Conversion-Software: libtiff libjpeg/9c
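On the writing side, a conversion tool could assemble the field value from the components it used. The sketch below assumes the caller already supplies the components in the intended order (processing order for pipelines) and uses `None` for an unknown version. The name `format_conversion_software` is hypothetical and the token check again borrows the HTTP `tchar` character set.

```python
import re

# A product name or version must be a single token (no spaces or slashes).
_TOKEN_RE = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def format_conversion_software(components):
    """Build a field value from (name, version) pairs given in order.

    A version of None means the version is unknown and is omitted.
    Hypothetical helper, not defined by the draft.
    """
    if not components:
        raise ValueError("at least one product is required")
    parts = []
    for name, version in components:
        if name is None or not _TOKEN_RE.fullmatch(name):
            raise ValueError(f"not a valid product name: {name!r}")
        if version is not None and not _TOKEN_RE.fullmatch(version):
            raise ValueError(f"not a valid product version: {version!r}")
        parts.append(name if version is None else f"{name}/{version}")
    return " ".join(parts)
```

Calling `format_conversion_software([("libtiff", None), ("libjpeg", "9c")])` returns `libtiff libjpeg/9c`, matching the example above.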

Software components unimportant to the conversion process, such as other codecs that a video transcoder happens to support but did not use, should not be listed.

The WARC-Conversion-Software field may be used in ‘conversion’ type records and shall not be used for other record types.

WARC-Conversion-Options

The WARC-Conversion-Options field indicates the options used when converting the content. The format of the field value is specific to the conversion software used.

WARC-Conversion-Options = *TEXT

Some examples:

WARC-Conversion-Options: acodec=mp3 bitrate=64

WARC-Conversion-Options: {"lossless": true, "layers": 5}

By convention, when the conversion software is configured through command-line options, the full command line should be included, with the tokens {input} and {output} standing in for the input and output files respectively.

WARC-Conversion-Options: ffmpeg -y -i {input} -c:v vp9 -c:a libopus -speed 4 {output}
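A tool that launches the converter as a subprocess could derive this value from the argument list it actually ran. The following sketch assumes the input and output paths appear as whole arguments. The function name `conversion_options` is hypothetical, and shell-quoting the remaining arguments with `shlex.quote` is a design choice of this sketch, since the draft does not say how arguments containing spaces should be written.

```python
import shlex


def conversion_options(argv, input_path, output_path):
    """Render an argument list as a WARC-Conversion-Options value,
    replacing the real file paths with {input} and {output}.
    Hypothetical helper, not defined by the draft."""
    placeholders = {input_path: "{input}", output_path: "{output}"}
    words = []
    for arg in argv:
        if arg in placeholders:
            # Emit the token unquoted so it stays recognizable.
            words.append(placeholders[arg])
        else:
            # Quote anything a shell would otherwise split or interpret.
            words.append(shlex.quote(arg))
    return " ".join(words)


argv = ["ffmpeg", "-y", "-i", "/tmp/in.mkv", "-c:v", "vp9",
        "-c:a", "libopus", "-speed", "4", "/tmp/out.webm"]
options = conversion_options(argv, "/tmp/in.mkv", "/tmp/out.webm")
```

With the illustrative paths used here, `options` holds the same text as the example field value above.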

A conversion involving multiple steps may be indicated using a shell pipeline:

WARC-Conversion-Options: bzip2 -d | gzip -9

or as multiple sequential commands separated by semicolons:

WARC-Conversion-Options: ddjvu -format=tiff {input} tmp.tif; convert tmp.tif tmp.png;
                         pngcrush tmp.png {output}

If the conversion options cannot be represented in a short text form suitable for a header field, they may be recorded separately in one or more ‘metadata’ records. In such cases the WARC-Conversion-Options field may still include a short textual summary of only the most important options for diagnostic purposes.

The WARC-Conversion-Options field may be used in ‘conversion’ type records and shall not be used for other record types.
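The record-type restriction on both fields lends itself to a simple validation step. The sketch below assumes the record headers are available as a plain dictionary and that field names are matched case-insensitively. The function name `misplaced_conversion_fields` is hypothetical.

```python
# The two fields defined in this draft, both restricted to 'conversion' records.
CONVERSION_ONLY_FIELDS = ("WARC-Conversion-Software", "WARC-Conversion-Options")


def misplaced_conversion_fields(headers):
    """Return the conversion-only fields present on a record whose
    WARC-Type is not 'conversion' (empty list if the record is fine).
    Hypothetical helper, not defined by the draft."""
    # Assumption: field names compare case-insensitively.
    normalized = {name.lower(): value for name, value in headers.items()}
    if normalized.get("warc-type", "").strip() == "conversion":
        return []
    return [f for f in CONVERSION_ONLY_FIELDS if f.lower() in normalized]
```

A writer could run this check before emitting a record and refuse to write it when the returned list is non-empty.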
