thelink2012/0000-custom-headers.md

## 0000-custom-headers.md

      
Feature Name: custom-headers
Start Date: 11/11/2016

Knowledge of the SCM format is assumed.
Summary

This is a proposal for a standard format for custom header chunks in the SCM bytecode.
Motivation

Custom headers are needed for some features, and the community hasn't yet defined a standard format for those. In this scenario, the trend would be for each of those features to define a unique format. At some point, it'd be difficult to parse and maintain custom headers reliably.
Detailed design

Following Rockstar North's conventions, a header shall appear before any code segment and shall be preceded by a GOTO instruction to the next segment, followed by an alignment byte.
An alignment byte alone, however, is not enough: some special magic bytes following the GOTO are needed. For reasons explained below, those bytes are defined as FF 7F FE 00 00.
In a multifile (i.e. main.scm), custom header chunks should never appear before the usual multifile headers. Mission scripts and streamed scripts may have custom headers at their beginning.
The GOTO in the header branches into either code or the GOTO instruction of yet another custom header. If the branch target is not ahead of the current program counter, this is not a custom header.
Following the GOTO and the magic bytes, a custom header chunk begins. The structure of such a chunk must:

Start with a FourCC, to identify which custom header chunk this is.
Be 4-byte aligned.
Have little-endian byte ordering.

Additionally, it is recommended that pointers in the header be relative to the FourCC offset, so that the header can be copied with memcpy without any pointer fix-ups.
In summary:
(02 00 01)h + 32 bit int        Jump to next segment (may be a local jump)
(FF 7F FE 00 00)h               Custom headers magic number
(?? ?? ?? ??)h                  Signature of a custom header (FourCC).
[...]
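
The layout above can be recognised with a few byte comparisons. The following Python is only a sketch under stated assumptions, not a reference implementation: the names `read_custom_header` and `iter_custom_headers` are hypothetical, `code` is assumed to hold one script starting at its base offset, a negative GOTO target is assumed to be a local jump measured from that base, and the "current program counter" is taken to be the offset of the GOTO itself.

```python
import struct

GOTO_PREFIX = b"\x02\x00\x01"      # GOTO opcode (02 00) + 32-bit int marker (01)
MAGIC = b"\xFF\x7F\xFE\x00\x00"    # custom headers magic number
GOTO_SIZE = len(GOTO_PREFIX) + 4   # opcode, type byte and 32-bit target


def read_custom_header(code, offset=0):
    """Return (fourcc, chunk_offset, next_offset), or None when no
    custom header starts at `offset`."""
    chunk_offset = offset + GOTO_SIZE + len(MAGIC)
    if len(code) < chunk_offset + 4:
        return None  # not enough room for GOTO + magic + FourCC
    if code[offset:offset + 3] != GOTO_PREFIX:
        return None
    (target,) = struct.unpack_from("<i", code, offset + 3)
    # Assumption: a negative target is a local jump relative to the
    # start of `code`; a positive one is used as-is.
    next_offset = -target if target < 0 else target
    if next_offset <= offset:
        return None  # the branch is not ahead of us: not a custom header
    if code[offset + GOTO_SIZE:chunk_offset] != MAGIC:
        return None
    fourcc = code[chunk_offset:chunk_offset + 4]
    return fourcc, chunk_offset, next_offset


def iter_custom_headers(code, offset=0):
    """Yield every chained custom header, following each GOTO in turn."""
    while True:
        header = read_custom_header(code, offset)
        if header is None:
            return
        yield header
        offset = header[2]
```

Because each GOTO must branch forward, following the chain always terminates: it stops at the first jump target that does not itself look like a custom header, which is where ordinary code begins.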

FourCC naming convention

The FourCC naming convention is based on §3.3 of the PNG 1.2 Specification, with a few modifications highlighted below.
The FourCC is a four-byte sequence that identifies a custom header chunk; its bytes may or may not be ASCII characters.
Bit 5 of each byte is used to convey header properties:

Runtime ancillary bit: bit 5 of first byte

0 (uppercase) = critical, 1 (lowercase) = ancillary.
Chunks that are not strictly necessary to interpret the bytecode (by a runtime) should have the ancillary bit set.
Furthermore, a runtime encountering an unknown chunk in which the ancillary bit is 1 can safely ignore the chunk and proceed to run the bytecode.
An example of such an ancillary chunk would be something like the VAR segment of the Sanny Builder footer, which stores the names of the variables used in the source code.


Reserved bit: bit 5 of the second byte

Must be 0 (uppercase).


Reserved bit: bit 5 of the third byte

Must be 0 (uppercase).


Reserved bit: bit 5 of fourth byte

Must be 0 (uppercase).


It is worth noting that the property bits are an inherent part of the chunk name, and hence are fixed for any chunk type. Thus, BLOB and bLOB would be unrelated chunk type codes, not the same chunk with different properties.
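
As an illustration of the property bits, the sketch below tests bit 5 (mask 0x20) of each FourCC byte. The helper names are hypothetical, and treating a set reserved bit as an error is only one possible policy, since this proposal leaves that question open.

```python
PROPERTY_BIT = 0x20  # bit 5; in ASCII letters it separates lowercase from uppercase


def _check_length(fourcc):
    if len(fourcc) != 4:
        raise ValueError("a FourCC must be exactly four bytes long")


def is_ancillary(fourcc):
    """True when bit 5 of the first byte is set (lowercase first letter)."""
    _check_length(fourcc)
    return bool(fourcc[0] & PROPERTY_BIT)


def reserved_bits_clear(fourcc):
    """True when bit 5 of the second, third and fourth bytes is 0."""
    _check_length(fourcc)
    return all(byte & PROPERTY_BIT == 0 for byte in fourcc[1:])


def may_skip_unknown_chunk(fourcc):
    """A runtime may ignore a chunk it does not know only if it is ancillary."""
    return is_ancillary(fourcc)
```

Here `is_ancillary(b"bLOB")` is true and `is_ancillary(b"BLOB")` is false, while the two values still compare as different chunk types because the property bits are part of the name.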
Rationale for the magic number

The magic bytes need to satisfy a few requirements:

It shall not be ambiguous with SCM instructions.
It shall not be ambiguous with x86 instructions.
It should avoid looking like user data.

The last two points are meant to cope with custom data added to scripts by means of the HEX...END directive in Sanny Builder.
With that in mind, a single byte (after the GOTO) is not enough to make this distinction. And since the custom header data should be aligned on a 4-byte boundary, the next possible size is 5 bytes. Luckily, 5 bytes are enough:

For bytecode, it's known the last command id (32767) shall not be used, as it marks the end of the commands enumeration. Therefore, FF 7F follows the GOTO.
As FF followed by 7F is not a valid x86 opcode, this also discards assembly code. After the EIP increment, however, 7F is fetched, which is a jg rel8 x86 instruction, and the FF preceding it could just be an alignment byte. To discard the jg, an FE is used, causing a branch into itself, which is an illogical operation.
At this point the magic is defined as FF 7F FE ?? ??. This easily discards user data as well, since user data usually consists of either ASCII characters or null bytes.
The two remaining bytes should be set to 00.
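
The arithmetic behind these choices can be checked with a short Python sketch. It assumes the GOTO itself starts on a 4-byte boundary (7 bytes: a 2-byte opcode, a 1-byte type marker and a 32-bit target); all names below are hypothetical.

```python
import struct

MAGIC = bytes.fromhex("FF7FFE0000")
GOTO_SIZE = 2 + 1 + 4  # opcode + type byte + 32-bit target

# Byte counts after the GOTO that put the chunk back on a 4-byte boundary.
aligning_sizes = [n for n in range(1, 9) if (GOTO_SIZE + n) % 4 == 0]

# FF 7F read as a little-endian 16-bit SCM command id.
(command_id,) = struct.unpack_from("<H", MAGIC, 0)

# 7F FE read as x86 `jg rel8`: a 2-byte instruction whose signed 8-bit
# displacement is added to the address of the following instruction.
(rel8,) = struct.unpack_from("<b", MAGIC, 2)


def jg_target(jg_address):
    """Address reached when the jg at `jg_address` is taken."""
    return jg_address + 2 + rel8
```

Here `aligning_sizes` is `[1, 5]`, so 5 bytes is the next size once a single byte is ruled out. `command_id` is 32767, and `rel8` is -2, so `jg_target(addr)` equals `addr`: the jump lands on itself.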

Drawbacks

Why should we not do this?
Alternatives


Instead of magic numbers, a custom command preceding the header (or even performing the branch?) could be used.

This adds third-party dependencies to header making.
However, it is much more precise about whether the data following the branch is indeed a header.


Use the Sanny Builder Footer format.

The size of the bytecode must be known in order to parse this header.


Use a separate file for the header data.

Precise, but adds extra disk seeking.
Adds another unit of information, which is bad.


Unresolved questions


Should runtimes ignore bit 5 of the second, third and fourth bytes of the FourCC?