- Feature Name: custom-headers
- Start Date: 11/11/2016
Knowledge of the SCM format is assumed.
This is a proposal for a standard format for custom header chunks in the SCM bytecode.
Custom headers are needed for some features, and the community hasn't yet defined a standard format for those. In this scenario, the trend would be for each of those features to define a unique format. At some point, it'd be difficult to parse and maintain custom headers reliably.
Following Rockstar North conventions, a header shall appear before any code segment and shall be preceded by a `GOTO` instruction to the next segment, followed by an alignment byte.
An alignment byte alone, however, is not enough: some special magic bytes following the `GOTO` are needed. For reasons explained below, those bytes are defined as `FF 7F FE 00 00`.
In a multifile (i.e. `main.scm`), those custom header chunks should never appear before the usual multifile headers. Mission scripts and streamed scripts may have custom headers at their beginning.
The `GOTO` in the header branches into either code or yet another [custom] header `GOTO` instruction. If the branch target is not ahead of the current program counter, this is not a custom header.
Following the `GOTO` and those magic bytes, a custom header chunk begins. The structure of such a chunk must:
- Start with a FourCC, to identify which custom header chunk this is.
- Be 4-byte aligned.
- Have little-endian byte ordering.
Additionally, it is recommended that pointers in the header be relative to the FourCC offset, so that the header can be `memcpy`ed with no further problems.
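To illustrate why FourCC-relative pointers keep a chunk relocatable, here is a minimal sketch. The `nAME` chunk and its layout (a 32-bit offset followed by a NUL-terminated string) are hypothetical and exist only for this example; the proposal does not define any chunk contents.

```python
import struct

# Hypothetical chunk layout, for illustration only (not defined by this proposal):
#   +0  FourCC            b"nAME"
#   +4  uint32 (LE)       offset of a NUL-terminated string, relative to the FourCC
#   +8  string bytes, padded so the chunk size stays a multiple of 4


def build_chunk(text):
    """Build the hypothetical chunk with a FourCC-relative string pointer."""
    body = text + b"\x00"
    body += b"\x00" * (-len(body) % 4)  # pad to a 4-byte boundary
    return b"nAME" + struct.pack("<I", 8) + body


def read_text(buffer, chunk_offset):
    """Resolve the relative pointer against wherever the FourCC currently lives."""
    (relative,) = struct.unpack_from("<I", buffer, chunk_offset + 4)
    start = chunk_offset + relative
    end = buffer.index(b"\x00", start)
    return bytes(buffer[start:end])


chunk = build_chunk(b"hello")
moved = bytes(12) + chunk  # the same bytes copied to another position
```

Because the pointer is stored relative to the FourCC, `read_text(chunk, 0)` and `read_text(moved, 12)` resolve to the same string without patching any field after the copy.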
In summary:
(02 00 01)h + 32 bit int Jump to next segment (may be a local jump)
(FF 7F FE 00 00)h Custom headers magic number
(?? ?? ?? ??)h Signature of a custom header (FourCC).
[...]
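The layout above can be checked with a short parser. This is a sketch under stated assumptions, not a reference implementation: the function names are made up for this example, and it assumes that `data` starts at the base the jump operand is relative to and that a negative operand denotes a local jump whose absolute value is the offset.

```python
import struct

GOTO_PREFIX = bytes.fromhex("020001")  # GOTO opcode, then the 32-bit int type byte
MAGIC = bytes.fromhex("FF7FFE0000")    # custom headers magic number
# 3 + 4 + 5 = 12 bytes precede the FourCC, which keeps the chunk 4-byte aligned
# as long as the GOTO itself starts on a 4-byte boundary.
FOURCC_OFFSET = len(GOTO_PREFIX) + 4 + len(MAGIC)


def read_custom_header(data, offset):
    """Return (fourcc, chunk_start, target), or None if no custom header is at offset."""
    if len(data) < offset + FOURCC_OFFSET + 4:
        return None
    if data[offset:offset + 3] != GOTO_PREFIX:
        return None
    if data[offset + 7:offset + FOURCC_OFFSET] != MAGIC:
        return None
    (operand,) = struct.unpack_from("<i", data, offset + 3)
    # ASSUMPTION: negative operands are local jumps; data starts at the jump base.
    target = abs(operand)
    if target <= offset:  # the branch must point ahead of the current position
        return None
    chunk_start = offset + FOURCC_OFFSET
    return bytes(data[chunk_start:chunk_start + 4]), chunk_start, target


def iter_custom_headers(data, offset=0):
    """Follow the GOTO chain, yielding each custom header until code is reached."""
    while True:
        header = read_custom_header(data, offset)
        if header is None:
            return
        yield header
        offset = header[2]  # targets strictly increase, so the loop terminates
```

A caller would typically run `iter_custom_headers` from the first byte after the usual multifile headers (or from the start of a mission or streamed script) and treat the first offset that yields nothing as the beginning of code.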
The FourCC naming convention is based on §3.3 of the PNG 1.2 Specification, with a few modifications highlighted below.
The FourCC is a four-byte sequence that identifies a custom header chunk. The bytes may or may not be ASCII characters.
Bit 5 of each byte is used to convey header properties:
- Runtime ancillary bit: bit 5 of the first byte
  - 0 (uppercase) = critical, 1 (lowercase) = ancillary.
  - Chunks that are not strictly necessary to interpret the bytecode (by a runtime) should have the ancillary bit set.
  - Furthermore, a runtime encountering an unknown chunk in which the ancillary bit is 1 can safely ignore the chunk and proceed to run the bytecode.
  - An example of such an ancillary chunk would be something like the `VAR` segment of the Sanny Builder footer, which stores the names of the variables used in the source code.
- Reserved bit: bit 5 of the second byte
  - Must be 0 (uppercase).
- Reserved bit: bit 5 of the third byte
  - Must be 0 (uppercase).
- Reserved bit: bit 5 of the fourth byte
  - Must be 0 (uppercase).
It is worth noting that the property bits are an inherent part of the chunk name, and hence are fixed for any chunk type. Thus, `BLOB` and `bLOB` would be unrelated chunk type codes, not the same chunk with different properties.
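The property bits can be tested with a single mask. The helper names below are illustrative only and are not part of the proposal.

```python
PROPERTY_BIT = 0x20  # bit 5; for ASCII letters it separates upper and lower case


def is_ancillary(fourcc):
    """True if bit 5 of the first byte is set (lowercase first letter)."""
    return bool(fourcc[0] & PROPERTY_BIT)


def reserved_bits_clear(fourcc):
    """True if bit 5 of the second, third and fourth bytes is 0."""
    return all(byte & PROPERTY_BIT == 0 for byte in fourcc[1:4])


def can_skip_unknown_chunk(fourcc):
    """A runtime may ignore an unknown chunk only when it is ancillary."""
    return is_ancillary(fourcc)
```

With these helpers, `bLOB` is reported as ancillary and `BLOB` as critical, and since the two are distinct chunk type codes, a runtime that knows one must not assume anything about the other.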
The magic bytes need to meet a few requirements:
- They shall not be ambiguous with SCM instructions.
- They shall not be ambiguous with x86 instructions.
- They should avoid looking like user data.
The last two points are there to cope with custom data added to scripts by means of the `HEX...END` directive in Sanny Builder.
With that in mind, a single byte (after the `GOTO`) is not enough to make this distinction. And, since the custom header data should be aligned on a 4-byte boundary, a minimum of 5 bytes is left to discriminate the data. Luckily, 5 bytes are enough:
- For bytecode, it's known that the last command id (32767) shall not be used, as it marks the end of the commands enumeration. Therefore, `FF 7F` follows the `GOTO`.
- As `FF` followed by `7F` is not a valid x86 opcode, this also discards assembly code. After the EIP increment, however, `7F` is fetched, which is a `jg rel8` x86 instruction, and the `FF` preceding it could just be an alignment byte. To discard the `jg`, an `FE` is used, causing a branch into itself, which is an illogical operation.
- At this point the magic is defined as `FF 7F FE ?? ??`. This easily discards user data as well, which is usually either ASCII characters or null bytes.
- The bytes left should be set to `00`.
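The arithmetic behind each of these points can be verified directly. In the sketch below, `jg_address` is an arbitrary hypothetical address chosen only to show that the jump lands on itself.

```python
import struct

MAGIC = bytes.fromhex("FF7FFE0000")

# Read as a little-endian 16-bit command id, FF 7F is 0x7FFF, the unused last id.
command_id = struct.unpack_from("<H", MAGIC, 0)[0]

# Read as x86, FF is an opcode group selected by the reg field (bits 5..3) of
# the following ModRM byte; 7F selects /7, which has no instruction assigned.
modrm_reg = (MAGIC[1] >> 3) & 0b111

# Read as x86 starting one byte later, 7F FE is "jg rel8" with a signed
# displacement of -2, applied after the 2-byte instruction has been fetched.
rel8 = struct.unpack_from("<b", MAGIC, 2)[0]
jg_address = 0x1000  # hypothetical address of the 7F byte
jg_target = jg_address + 2 + rel8

# None of the first three bytes is a printable ASCII character or a null byte.
looks_like_user_data = any(b == 0 or 0x20 <= b <= 0x7E for b in MAGIC[:3])
```

Here `command_id` is 32767, `modrm_reg` is 7, `jg_target` equals `jg_address`, and `looks_like_user_data` is false.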
Why should we not do this?
- Instead of magic numbers, a custom command preceding the header (or even performing the branch?) could be used.
  - This adds third-party dependencies to header making.
  - But is much more precise on whether the data following the branch is indeed a header.
- Use the Sanny Builder Footer format.
  - The size of the bytecode must be known in order to parse this header.
- Use a separate file for the header data.
  - Precise, but adds extra disk seeking.
  - Adds another unit of information, which is bad.
- Should runtimes ignore bit 5 of the second, third and fourth bytes of the FourCC?