Kaitai Struct — serialization

Notes on individual features

  • fixed contents — mapped to a literal byte array (note: shouldn't it write magic1() instead? but then it would have to be checked again so that it doesn't violate consistency):

        public void _read() {
            this.magic1 = this._io.readBytes(6);
            if (!(Arrays.equals(magic1(), new byte[] { 80, 65, 67, 75, 45, 49 }))) {
                throw new KaitaiStream.ValidationNotEqualError(new byte[] { 80, 65, 67, 75, 45, 49 }, magic1(), _io(), "/seq/0");
            }
            // ...
        }
    
        public void _write() {
            this._io.writeBytes(new byte[] { 80, 65, 67, 75, 45, 49 });
            // ...
        }

    The problem with special contents handling is that it does not apply to valid/eq, even though the idea of valid when introduced in 0.9 was that contents would behave the same as the corresponding valid/eq.

    If we wanted to be really consistent in contents and valid handling, and also allow valid types other than eq (i.e. valid/{any-of,min,max,expr}), the general way would be to rely on the user to always set the correct value via the attribute setter, and to generate a validation expression in _check that throws an exception if the value is not valid.

    In the future, the compiler could be able to set some fields with valid/eq automatically, but probably not in the general case, because valid supports arbitrary expressions, and until you analyze all features of the expression language (and realize everything they can actually do), you have no idea whether there are edge cases that may go wrong. Anything that involves the expression language looks simple if you picture it only on simple examples — it's very tempting to draw general conclusions from these and completely forget about the "interesting" cases the expression language can create, where the idea of simplicity completely falls apart.

    But we can be confident that we can always handle at least literal values (which would probably cover most common cases in real formats anyway) — detecting if the entire expression AST consists of a single node with the literal value is trivial too. However, I wouldn't address this in the initial version of serialization either, because it's not essential.
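
    For illustration, a hypothetical sketch of that "setter + _check" approach for the magic1 field above (not the current generated code, which writes the literal directly; the exact exception type is an open question):

        public void _check() {
            // the user is expected to have set magic1 via its setter;
            // here we only verify that it holds the valid (expected) value
            if (!(Arrays.equals(magic1(), new byte[] { 80, 65, 67, 75, 45, 49 })))
                throw new ConsistencyError("magic1", magic1(), new byte[] { 80, 65, 67, 75, 45, 49 });
            // ...
        }

        public void _write() {
            // written from the user-set property instead of a hardcoded literal
            this._io.writeBytes(this.magic1);
            // ...
        }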

  • integer types — type: [us]1 (e.g. writeU1), type: [us][248]{,be,le} (e.g. writeS2le), nothing fancy here

  • bit-sized integers — type: bX

    • over 2 years ago, I wrote an initial version of the writeBitsInt method (KaitaiStream.java:431), but:

      • it supports only the big-endian direction (we have both readBitsInt{Be,Le} now, so there should be corresponding writeBitsInt{Be,Le} methods as well)
      • it relies on seeks, so it would work only in seekable streams (which is probably not desirable; readBitsInt{Be,Le} don't require a seekable stream, so writeBitsInt{Be,Le} should work in non-seekable ones too)
      • depending on how many bit-sized integers encroach on a particular byte, a single byte may need to be overwritten up to 8 times
      • I think it can be written much better overall; the existing version is probably slow and inefficient too
    • I think a better approach to writing bit-sized ints would be not to write partial bytes (which leads to having to rewrite bytes multiple times, which in turn requires seeking); instead, write a byte only once, when it's finished and you know it won't change (using the bits/bitsLeft properties to save state between calls, which makes it more similar to the current readBitsInt{Be,Le} implementation); see the sketch at the end of this bullet

      • this makes it a bit more difficult to use the writeBitsInt{Be,Le} methods from the generated code, because if the write of the last bit field member ends on an unaligned bit position (i.e. somewhere in the middle of a byte, not on a byte boundary), you suddenly have to explicitly ensure that this partial byte gets written too (otherwise it wouldn't be)

      • so the compiler will additionally have to insert alignToByte calls, usually at the end of the _write method — but again, we don't want to add them somewhere they don't belong, so the compiler will have to spend some time figuring out what might come next in the format (is it really something byte-aligned?) and where to insert the alignToByte call (inside the if body just before reading/writing the byte-aligned type? in the for loop? in the switch case block?). And if we need this analysis for functional writing anyway, we could finally fix the numerous issues with wrong alignToByte insertions when parsing too (https://gitter.im/kaitai_struct/Lobby?at=628eb63309eea00adeb191b0)

        • because isn't it a pity that we have such perfect readBitsInt{Be,Le} implementations (that's right, I'm confident enough to call them perfect, kaitai-io/kaitai_struct#949), and then the stupid compiler ruins everything just by stuffing alignToByte somewhere it doesn't belong? Let's make the compiler smarter about alignToByte insertions and Kaitai Struct will suddenly be actually usable for unaligned bit streams! (kaitai-io/kaitai_struct#576)
      • of course, this alignToByte for writing will need to do something quite different than what alignToByte currently does for reading — although clearing bits and bitsLeft probably makes sense for writing too, the main task is to actually write the partial byte, but we don't want to write anything in reading mode (so the runtime library will have to know whether we're in writing mode or not, or we'll just keep alignToByte for reading only and add separate writeAlignToByte for writing)
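
      A rough sketch of such a seek-free big-endian bit writer, just to illustrate the idea (definitely not the final implementation; it ignores the n + bitsLeft > 64 overflow case for brevity):

          private long bits = 0;     // pending bits, right-aligned, not yet written out
          private int bitsLeft = 0;  // how many pending bits there are

          public void writeBitsIntBe(int n, long val) {
              // append the n new bits below the already pending ones
              bits = (bits << n) | (val & ((1L << n) - 1));
              bitsLeft += n;
              // flush every byte whose 8 bits are now fully known, most significant first
              while (bitsLeft >= 8) {
                  bitsLeft -= 8;
                  writeU1((int) (bits >>> bitsLeft) & 0xff);
                  bits &= (1L << bitsLeft) - 1;  // keep only the still-pending bits
              }
          }

          public void writeAlignToByte() {
              // pad the unfinished byte with zero bits on the right and emit it
              if (bitsLeft > 0) {
                  writeU1((int) (bits << (8 - bitsLeft)) & 0xff);
                  bits = 0;
                  bitsLeft = 0;
              }
          }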

    • the reason why I wrote the writeBitsInt the way I did was an "idea" to make it ready even for use from positional instances

      • because back then, for some reason it seemed necessary to me that in order to use bit-sized integers from instances (which can seek anywhere into already written bytes, and you don't want to accidentally clear any of the written bits if the currently written bit-sized integer starts and/or ends at an unaligned bit position "somewhere inside a byte"), I needed to seek back by 1 byte to re-read the current byte, then seek to where the currently written bits int ends and read that byte too, and only then could the actual writing to a temporary byte array start (I know, it's indeed a really stupid implementation)

      • however, instances currently don't work for parsing type: bX in the first place (unless X is a multiple of 8, but that's usually not the case where you want to use a bit-sized integer type; kaitai-io/kaitai_struct#564), so this is pretty useless and I wouldn't worry about that for now

      • secondly, I've now realized that even with the new approach (the less moronic one that supports non-seekable streams, although we're currently discussing instances, which rely on seeking by definition), support for bit-sized integers in instances can be added too (so the only reason to use the old approach is void; things don't look good for it) — all you need to do is call writeAlignToByte after the write to ensure the partial byte at the end gets written too.

        Once we want to add bit addressing (I had a feeling that there was an issue about it, but I can't find it right now) on top of the byte addressing that we support now with the pos key, to allow instances to seek into the middle of a byte, it will be quite simple to achieve by adding a special "bit-level seek" method to the runtime library which changes the state of the bits/bitsLeft properties as if a previous writeBitsInt call had just occurred in _write of the seq structure (it will have to "peek" a byte to know what the bits are, which means reading a byte and then seeking back, but this is a "bit seeking" method anyway, so it's quite understandable that it needs seeking support).

      • I also remember that it was confusing for me that bits and maybe bitsLeft (both currently used for reading) would have to be repurposed by changing their value logic a bit for writing

        • so if you were to mix writes and reads within one byte, e.g. writeBitsInt(3) + readBitsInt(5), the readBitsInt would misinterpret what writeBitsInt has just stored in the bits and bitsLeft properties, so you would just get undefined behavior

        • in fact, the "true" stream byte position (_io.pos) will be kind of repurposed for writing bit integers too — when reading, it moves onto the next byte immediately after you read even just 1 bit of the current one (and the yet unprocessed leftover 7 bits are kept in bits, bitsLeft is set to 7), but if we want to avoid seeks for writing, we would keep _io.pos in the current byte and only move it to the next byte when we know what all 8 bits of the current byte will look like (so that we can write its final value into the stream)

        • but I conclude that this "value incompatibility" is not a real problem, because we already have it between readBitsInt{Be,Le} methods, I've written in the docs that BE and LE bit integers can't share the same byte, and once we implement the check detecting this situation (kaitai-io/kaitai_struct#155 (comment)), this will be totally fine.

          Besides that, unlike the bXbe - bXle situation which can occur in a real .ksy specification, consecutive mixed {write,read}BitsInt* calls should not happen (once everything is correctly implemented in the compiler). So this would be basically a concern not for compiler-generated code, but only for people using the runtime library standalone, and we should really not worry about that. The runtime library solely needs to be usable for the purposes of generated code, anything else is a side effect (but if someone finds the runtime library useful as a standalone library despite our lack of interest in that use case, well, good for them, of course).

  • enum types

    • a basic singular field (Enum0)

        - id: pet_1
          type: u4
          enum: animal
          public void _read() {
              this.pet1 = Animal.byId(this._io.readU4le());
              // ...
          }
      
          public void _write() {
              this._io.writeU4le((long) (pet1().id()));
              // ...
          }
    • repeated fields don't work (kaitai-io/kaitai_struct_compiler#181 attempts to fix it, but I'll have to replace the wrong Ast.CodeLiteral with new Ast.InternalName in the 0.10 codebase), see test DebugEnumName:

        - id: array_of_ints
          type: u1
          enum: test_enum2
          repeat: expr
          repeat-expr: 1
  • floating-point types — type: f[48]{be,le} (FloatingPoints)

  • byte arrays — per delimitation type:

    • fixed size — use writeBytes in _write, check that the length of the user-provided byte array equals the expected one in _check (SwitchBytearray)

          public void _read() {
              this.code = this._io.readBytes(1);
              // ...
          }
      
          public void _write() {
              this._io.writeBytes(this.code);
              // ...
          }
      
          public void _check() {
              if (code().length != 1)
                  throw new ConsistencyError("code", code().length, 1);
          }
    • terminator without size (TermBytes)

      • default options (consume: true, include: false)

          - id: s1
            terminator: 0x7c
            public void _read() {
                this.s1 = this._io.readBytesTerm((byte) 124, false, true, true);
                // ...
            }
        
            public void _write() {
                this._io.writeBytes(this.s1);
                this._io.writeU1(124);
                // ...
            }

        value consistency (terminator + include: false): field().indexOf(term) == -1

      • consume: false (+ include: false by default) — note: any sort of consistency check is missing here (there is no guarantee that the terminator byte occurs)

          - id: s2
            terminator: 0x7c
            consume: false
            public void _read() {
                // ...
                this.s2 = this._io.readBytesTerm((byte) 124, false, false, true);
                // ...
            }
        
            public void _write() {
                // ...
                this._io.writeBytes(this.s2);
                // ...
            }

        The first thing to clear up is that when parsing, consume: false already serves as an implicit opt-out from support for non-seekable streams (see the readBytesTerm implementations — usually you find something like if (!consumeTerm) { seek(pos() - 1) } inside anyway), so it's fine to afford seeking for anything that consume: false fields need to do.

        The fact that we don't know what field the terminator byte is in (it may not even be part of any field, if the attribute with consume: false is the last of all seq structures in the current substream — then the required terminator byte could never be written by any further _write) makes it impossible to do this as a simple pure "value" check — it needs to access the stream and read the particular byte where the terminator should be, so we can't do it in _check and it has to be in _write.

        The problem is that you won't always be able to do the check in the same type where the consume: false field is. You need to delay it until the last possible seq (meaning either the current type's seq if it has its own substream, or the parent type's seq, or the grandparent type's seq and so on, until you reach a type that is no longer in this substream) has been fully written into its substream — to be absolutely sure that the byte where the terminator is supposed to be has been written — and only then seek to that byte, read it, and if it doesn't equal terminator as it should, throw a ConsistencyError.

        Another issue would be if the consume: false field effectively is the very last field in the substream that gets written (then it's of course absurd to "expect" that some further field after this last one in this substream will come to the rescue and magically write the terminator byte that needs to be there, so this of course does not happen and users get the ConsistencyError that they can do nothing about). But since seeking is allowed, this has a very simple and (at least in hindsight) obvious workaround — after writing the consume: false field, you also write the terminator byte and then seek 1 byte back to have the stream position at this terminator byte.

        This is the ultimate one-size-fits-all solution — if other fields follow the current one, the "temporary" terminator byte will be rewritten and the _write of the topmost type still in the current substream will check if it still (after being overwritten) has the value that it must have. Otherwise, if the current consume: false field is the last one, the terminator byte as we wrote it will not be rewritten and will retain the correct value.

        value consistency (terminator + include: false): field().indexOf(term) == -1

      • include: true (+ consume: true by default)

          - id: s3
            terminator: 0x40
            include: true
            public void _read() {
                // ...
                this.s3 = this._io.readBytesTerm((byte) 64, true, true, true);
            }
        
            public void _write() {
                // ...
                this._io.writeBytes(this.s3);
            }
        
            public void _check() {
                if (s3()[(s3()).length - 1] != 64)
                    throw new ConsistencyError("s3", s3()[(s3()).length - 1], 64);
            }

        A small oversight with this _check is that it implicitly assumes s3().length != 0, which indeed must hold for the s3 value to be consistent, but the question is whether we simply let the ArrayIndexOutOfBounds occur in case the user causes s3().length == 0 (_check is specifically meant to check arbitrary values and report if they are invalid, so it can't make any assumptions about these arbitrary values beforehand). If the _check method may also throw these other kinds of exceptions besides ConsistencyError, it will be more difficult for the user to catch the exceptions related to consistency violations, because it will no longer be enough to just use try { obj._check(); } catch (ConsistencyError exc) { ... }.

        value consistency (terminator + include: true): field().length != 0 && field().indexOf(term) == field().length - 1
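
        Translated into Java, the two value-consistency expressions above could look roughly like this (a sketch, not actual generated code; byteArrayIndexOf is a made-up helper):

            private static int byteArrayIndexOf(byte[] arr, byte b) {
                for (int i = 0; i < arr.length; i++) {
                    if (arr[i] == b)
                        return i;
                }
                return -1;
            }

            public void _check() {
                // s1 (terminator: 0x7c, include: false): must not contain the terminator at all
                if (byteArrayIndexOf(s1(), (byte) 124) != -1)
                    throw new ConsistencyError("s1", byteArrayIndexOf(s1(), (byte) 124), -1);
                // s3 (terminator: 0x40, include: true): must be non-empty and the first
                // occurrence of the terminator must be its very last byte
                if (!(s3().length != 0 && byteArrayIndexOf(s3(), (byte) 64) == s3().length - 1))
                    throw new ConsistencyError("s3", byteArrayIndexOf(s3(), (byte) 64), s3().length - 1);
            }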

      • consume: false + include: true is a degenerate combination that doesn't make sense and should be forbidden (you suddenly get one byte that is included in two consecutive seq fields at the same time, violating the idea of seq that a field starts exactly where the previous one ends)

    • size with terminator and/or pad-right (BytesPadTerm) - use writeBytesLimit in _write, check that the byte array length does not exceed the maximum serializable size

      • the anatomy of serialized bytes is the following:

        • with terminator (pad-right: 0x00 is assumed if not given explicitly); the diagram is designed for the default setting include: false (implying that the terminator byte is outside user-provided bytes) and field().length < `size`:

                     specified `size` of the field in .ksy
            | <-------------------------------------------------> |
            |                                                     |
            | user-provided bytes | `terminator` |  `pad-right`   |______________
             \                   / \            / \                              \
              \ field().length  /   \    1     /   \ `size` - field().length - 1  \
          
        • without terminator (only pad-right), or also terminator with include: true

                  `size` of the field in .ksy
            | <----------------------------------> |
            |                                      |
            | user-provided bytes |  `pad-right`   |__________
             \                   / \                          \
              \ field().length  /   \ `size` - field().length  \
          
        • regardless of the options, note that the presence of the terminator byte is optional here (when size is specified), meaning that if the user-provided bytes fill the entire size, it's OK — field().length == `size`:

                  `size` of the field in .ksy
            | <----------------------------------> |
            |                                      |
            |         user-provided bytes          |
             \                                    /
              \          field().length          /
          
      • only pad-right — field().length can be at most size (padding is optional)

          - id: str_pad
            size: 20
            pad-right: 0x40
            public void _read() {
                this.strPad = KaitaiStream.bytesStripRight(this._io.readBytes(20), (byte) 64);
                // ...
            }
        
            public void _write() {
                this._io.writeBytesLimit(this.strPad, 20, (byte) 64, (byte) 64);
                // ...
            }
        
            public void _check() {
                if (strPad().length > 20)
                    throw new ConsistencyError("str_pad", strPad().length, 20);
                // ...
            }

        If we take "consistency" seriously, the last byte of user-provided bytes (warning: if there is a last byte; this user-provided byte array can have length 0 and it's fine) must not have the value of pad-right, because then parsing wouldn't be able to distinguish it from the padding. So likely there should be a consistency check for this.

        value consistency (size + only pad-right): field().length == 0 || field()[field().length - 1] != padByte
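
        A sketch of what such an additional check for str_pad could look like (hypothetical — not generated by the current PoC):

            public void _check() {
                if (strPad().length > 20)
                    throw new ConsistencyError("str_pad", strPad().length, 20);
                // hypothetical extra check: the last user-provided byte must not equal
                // the pad byte 0x40, otherwise parsing would strip it as padding
                if (strPad().length != 0 && strPad()[strPad().length - 1] == 64)
                    throw new ConsistencyError("str_pad", strPad()[strPad().length - 1], 64);
            }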

      • terminator and pad-right — field().length can be at most size (both terminator byte and padding are optional)

          - id: str_term_and_pad
            size: 20
            terminator: 0x40
            pad-right: 0x2b
            public void _read() {
                // ...
                this.strTermAndPad = KaitaiStream.bytesTerminate(
                  KaitaiStream.bytesStripRight(this._io.readBytes(20), (byte) 43),
                  (byte) 64,
                  false
                );
                // ...
            }
        
            public void _write() {
                // ...
                this._io.writeBytesLimit(this.strTermAndPad, 20, (byte) 64, (byte) 43);
                // ...
            }
        
            public void _check() {
                // ...
                if (strTermAndPad().length > 20)
                    throw new ConsistencyError("str_term_and_pad", strTermAndPad().length, 20);
                // ...
            }

        value consistency ([size +] terminator + include: false): field().indexOf(term) == -1

      • only terminator (+ include: false by default) — field().length can be at most size (note: the existing check if (field().length >= 20) { throw ... } in the generated code below is wrong, because it does not match this description — it has to be field().length > 20 instead) and it must not contain the terminator byte at all (not yet checked in _check); implicitly pad-right: 0x00

          - id: str_term
            size: 20
            terminator: 0x40
            # pad-right: 0x00 # implicit
            public void _read() {
                // ...
                this.strTerm = KaitaiStream.bytesTerminate(this._io.readBytes(20), (byte) 64, false);
                // ...
            }
        
            public void _write() {
                // ...
                this._io.writeBytesLimit(this.strTerm, 20, (byte) 64, (byte) 0);
                // ...
            }
        
            public void _check() {
                // ...
                if (strTerm().length >= 20)
                    throw new ConsistencyError("str_term", strTerm().length, 20);
                // ...
            }

        value consistency ([size +] terminator + include: false): field().indexOf(term) == -1

      • terminator with include: true — field().length can be at most size, field() must end immediately after the first occurrence of the terminator byte or may not contain terminator at all (because as mentioned earlier, the presence of terminator is always optional for size-delimited fields) — note that the current consistency check if (field()[(field()).length - 1] != 64) { throw ... } (see the generated code snippet below) doesn't match this characterization and is therefore incorrect

          - id: str_term_include
            size: 20
            terminator: 0x40
            include: true
            public void _read() {
                // ...
                this.strTermInclude = KaitaiStream.bytesTerminate(this._io.readBytes(20), (byte) 64, true);
            }
        
            public void _write() {
                // ...
                this._io.writeBytesLimit(this.strTermInclude, 20, (byte) 0, (byte) 0);
            }
            public void _check() {
                // ...
                if (strTermInclude().length > 20)
                    throw new ConsistencyError("str_term_include", strTermInclude().length, 20);
                if (strTermInclude()[(strTermInclude()).length - 1] != 64)
                    throw new ConsistencyError("str_term_include", strTermInclude()[(strTermInclude()).length - 1], 64);
            }

        value consistency ([size +] terminator + include: true): field().indexOf(term) == -1 || field().indexOf(term) == field().length - 1 (note: the field().indexOf(term) == -1 already implicitly matches the 0-length value, so the || operator will short-circuit and it won't get into evaluating the second operand, meaning that the second expression doesn't even have to worry about failing on the 0-length value)

      • TODO: add tests and analyze these with size-eos: true — update: this is important, because after doing a quick check, the terminator is not handled correctly here

  • byte arrays with process — again, if the terminator is used, encoded bytes from the process encoder must be scanned for a premature occurrence of the terminator byte

  • strings — everything should be the same as for byte arrays, only with the added encoding "string ⟶ bytes" conversion

    • terminator without size (TermStrz) - looks correct (git diff --no-index compiled/java/src/io/kaitai/struct/testwrite/Term{Bytes,Strz}.java)

           public void _write() {
      -        this._io.writeBytes(this.s1);
      +        this._io.writeBytes((s1()).getBytes(Charset.forName("UTF-8")));
               this._io.writeU1(124);
      -        this._io.writeBytes(this.s2);
      -        this._io.writeBytes(this.s3);
      +        this._io.writeBytes((s2()).getBytes(Charset.forName("UTF-8")));
      +        this._io.writeBytes((s3()).getBytes(Charset.forName("UTF-8")));
           }
      
           public void _check() {
      -        if (s3()[(s3()).length - 1] != 64)
      -            throw new ConsistencyError("s3", s3()[(s3()).length - 1], 64);
      +        if ((s3()).getBytes(Charset.forName("UTF-8"))[((s3()).getBytes(Charset.forName("UTF-8"))).length - 1] != 64)
      +            throw new ConsistencyError("s3", (s3()).getBytes(Charset.forName("UTF-8"))[((s3()).getBytes(Charset.forName("UTF-8"))).length - 1], 64);
           }
    • size with terminator and/or pad-right (StrPadTerm) FIXME: _check seems correct, but _write is wrong (it has to use the writeBytesLimit method too) — see git diff --word-diff --no-index compiled/java/src/io/kaitai/struct/testwrite/{Bytes,Str}PadTerm.java :

           public void _write() {
      -        this._io.writeBytesLimit(this.strPad, 20, (byte) 64, (byte) 64);
      -        this._io.writeBytesLimit(this.strTerm, 20, (byte) 64, (byte) 0);
      -        this._io.writeBytesLimit(this.strTermAndPad, 20, (byte) 64, (byte) 43);
      -        this._io.writeBytesLimit(this.strTermInclude, 20, (byte) 0, (byte) 0);
      +        this._io.writeBytes((strPad()).getBytes(Charset.forName("UTF-8")));
      +        this._io.writeBytes((strTerm()).getBytes(Charset.forName("UTF-8")));
      +        this._io.writeU1(64);
      +        this._io.writeBytes((strTermAndPad()).getBytes(Charset.forName("UTF-8")));
      +        this._io.writeU1(64);
      +        this._io.writeBytes((strTermInclude()).getBytes(Charset.forName("UTF-8")));
           }
      
           public void _check() {
      -        if (strPad().length > 20)
      -            throw new ConsistencyError("str_pad", strPad().length, 20);
      -        if (strTerm().length >= 20)
      -            throw new ConsistencyError("str_term", strTerm().length, 20);
      -        if (strTermAndPad().length > 20)
      -            throw new ConsistencyError("str_term_and_pad", strTermAndPad().length, 20);
      -        if (strTermInclude().length > 20)
      -            throw new ConsistencyError("str_term_include", strTermInclude().length, 20);
      -        if (strTermInclude()[(strTermInclude()).length - 1] != 64)
      -            throw new ConsistencyError("str_term_include", strTermInclude()[(strTermInclude()).length - 1], 64);
      +        if ((strPad()).getBytes(Charset.forName("UTF-8")).length > 20)
      +            throw new ConsistencyError("str_pad", (strPad()).getBytes(Charset.forName("UTF-8")).length, 20);
      +        if ((strTerm()).getBytes(Charset.forName("UTF-8")).length >= 20)
      +            throw new ConsistencyError("str_term", (strTerm()).getBytes(Charset.forName("UTF-8")).length, 20);
      +        if ((strTermAndPad()).getBytes(Charset.forName("UTF-8")).length > 20)
      +            throw new ConsistencyError("str_term_and_pad", (strTermAndPad()).getBytes(Charset.forName("UTF-8")).length, 20);
      +        if ((strTermInclude()).getBytes(Charset.forName("UTF-8")).length > 20)
      +            throw new ConsistencyError("str_term_include", (strTermInclude()).getBytes(Charset.forName("UTF-8")).length, 20);
      +        if ((strTermInclude()).getBytes(Charset.forName("UTF-8"))[((strTermInclude()).getBytes(Charset.forName("UTF-8"))).length - 1] != 64)
      +            throw new ConsistencyError("str_term_include", (strTermInclude()).getBytes(Charset.forName("UTF-8"))[((strTermInclude()).getBytes(Charset.forName("UTF-8"))).length - 1], 64);
           }
  • subtypes, no substreams (IfStruct)

        public void _read() {
            this.op1 = new Operation(this._io, this, _root);
            this.op1._read();
            // ...
        }
    
        public void _write() {
            this.op1._write(this._io);
            // ...
        }
        // ...
        public static class Operation extends KaitaiStruct.ReadWrite {
            // ...
            public void _write() { /* ... */ }
        }
    public class KaitaiStruct {
        // ...
        public abstract static class ReadWrite extends ReadOnly {
            // ...
            public void _write(KaitaiStream io) {
                this._io = io;
                _write();
            }
            // ...
        }
    }
  • subtypes with own substreams (BufferedStruct)

    • size key — as you can see, the substream length must be set manually for now (which is obviously not ideal, but it works):

        - id: len1
          type: u4
        - id: block1
          type: block
          size: len1
          public void _read() {
              this.len1 = this._io.readU4le();
              this._raw_block1 = this._io.readBytes(len1());
              KaitaiStream _io__raw_block1 = new ByteBufferKaitaiStream(_raw_block1);
              this.block1 = new Block(_io__raw_block1, this, _root);
              this.block1._read();
              // ...
          }
      
          public void _write() {
              this._io.writeU4le(this.len1);
              KaitaiStream _io__raw_block1 = new ByteBufferKaitaiStream(len1());
              this.block1._write(_io__raw_block1);
              this._io.writeStream(_io__raw_block1);
              // ...
          }
      public class ByteBufferKaitaiStream extends KaitaiStream {
          // ...
          @Override
          public void writeStream(KaitaiStream other) {
              bb.put(other.toByteArray());
          }
          // ...
      }
      public abstract class KaitaiStream implements Closeable {
          // ...
          abstract public void writeStream(KaitaiStream other);
          // ...
          public byte[] toByteArray() {
              long pos = pos();
              seek(0);
              byte[] r = readBytesFull();
              seek(pos);
              return r;
          }
          // ...
      }
    • terminator — actually, it seems that we're completely missing tests on wrapping a type in a substream created using terminator

      The thing I would like to test here is the consistency check scanning the substream bytes (after the subtype has been serialized there) for a forbidden occurrence of the terminator byte (because if it were silently allowed, the substream would be prematurely terminated there when parsing). If pad-right is used, it has its own consistency check as well, etc. There should be tests for that.

  • repetitions — _write is easy (once you can write one entry, you just repeat what you've just done), but in _check each type of repetition has its own ways in which consistency can be violated

    • repeat: expr — the easy one, check the number of elements in the user-provided array for equality with repeat-expr (RepeatNStruct)

        - id: qty
          type: u4
        - id: chunks
          type: chunk
          repeat: expr
          repeat-expr: qty
          public void _write() {
              this._io.writeU4le(this.qty);
              for (int i = 0; i < this.chunks.size(); i++) {
                  chunks().get((int) i)._write(this._io);
              }
          }
      
          public void _check() {
              if (chunks().size() != qty())
                  throw new ConsistencyError("chunks", chunks().size(), qty());
              for (int i = 0; i < this.chunks.size(); i++) {
              }
          }
    • repeat: until — in principle exactly the same problems as with terminator, so in short an element matching the repeat-until condition must be at the end and nowhere else.

      My pull request kaitai-io/kaitai_struct_compiler#183 adds the check that the last element must match repeat-until, but misses the checks that all non-last elements must not match the repeat-until condition (of course, the consistency would also be violated if any non-last element would match, since parsing would end after that element instead of the actual last one), as @UnePierre correctly pointed out in kaitai-io/kaitai_struct_compiler#183 (comment).

      Obviously, terminator has quite the same logic as repeat: until, so they should both have the same set of consistency checks.

      Unfortunately, as I mentioned in kaitai-io/kaitai_struct_compiler#183 (comment), it's sometimes tempting to use _io in the repeat-until expression in real formats, but it's not possible to reference _io in the _check method (and this actually applies to all expressions used in _check, not just repeat-until).
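
      Putting this together, a full consistency check for a repeat: until field could look roughly like the following sketch (with a made-up field records and condition _.marker == 0xff — not actual compiler output, and assuming the condition doesn't reference _io):

          public void _check() {
              if (records().size() == 0)
                  throw new ConsistencyError("records", records().size(), 1);
              for (int i = 0; i < records().size(); i++) {
                  boolean matches = records().get(i).marker() == 0xff;
                  // the last element must satisfy the repeat-until condition...
                  if (i == records().size() - 1 && !matches)
                      throw new ConsistencyError("records", i, records().size() - 1);
                  // ...and no earlier element may satisfy it, otherwise parsing would stop there
                  if (i != records().size() - 1 && matches)
                      throw new ConsistencyError("records", i, records().size() - 1);
              }
          }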

    • repeat: eos — there is currently no end-of-stream check to ensure consistency:

        - id: numbers
          type: u4
          repeat: eos
          public void _write() {
              for (int i = 0; i < this.numbers.size(); i++) {
                  this._io.writeU4le(numbers().get((int) i));
              }
          }
      
          public void _check() {
              for (int i = 0; i < this.numbers.size(); i++) {
              }
          }

      Since we will only deal with fixed-size streams in the initial serialization support anyway, we actually can check and require _io.eof to be true after writing the field with repeat: eos in _write (it's not unlike moving checks involving any _io from _check to _write — you can read about this later in these notes). If we're not at the end of the stream, throw a ConsistencyError — this should be enough to ensure consistency with parsing.
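
      A sketch of what that could look like for the numbers field above (the end-of-stream check is hypothetical, not part of the current PoC):

          public void _write() {
              for (int i = 0; i < this.numbers.size(); i++) {
                  this._io.writeU4le(numbers().get((int) i));
              }
              // hypothetical consistency check: after writing all elements of the
              // repeat: eos field, the (fixed-size) stream must be exhausted
              if (!this._io.isEof())
                  throw new ConsistencyError("numbers", this._io.pos(), this._io.size());
          }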

  • type switching - in principle relatively simple (SwitchManualInt)

            public void _read() {
                this.code = this._io.readU1();
                switch (code()) {
                case 73: {
                    this.body = new Intval(this._io, this, _root);
                    ((Intval) (this.body))._read();
                    break;
                }
                case 83: {
                    this.body = new Strval(this._io, this, _root);
                    ((Strval) (this.body))._read();
                    break;
                }
                }
            }
    
            public void _write() {
                this._io.writeU1(this.code);
                switch (code()) {
                case 73: {
                    ((Intval) (this.body))._write(this._io);
                    break;
                }
                case 83: {
                    ((Strval) (this.body))._write(this._io);
                    break;
                }
                }
            }
    
            // ...
            private KaitaiStruct.ReadWrite body;
    • in practice there are a few cases with missing type casts (resulting namely in errors like "error: incompatible types: Object cannot be converted to byte[]" and "error: incompatible types: Long cannot be converted to int"), but these should be straightforward to fix

    • perhaps in _check there could be a check asserting that the actual type of the value stored in the combined property matches the expected type according to the value we're switching on. In Java, though, if there was an incorrect type, the type cast chosen based on the control value would typically fail in _write anyway with a ClassCastException, so it's a question whether there is any benefit in doing the "same" work in _check.

      However, in dynamically-typed languages (JavaScript, Python, etc.), there will be no type conversion in _write that could fail — the mismatch would either go unnoticed or surface as some confusing error deep in the write, so an explicit check in _check would actually be useful there.

  • instances — quite problematic due to lazy evaluation and caching

    • value instances: they don't need setters (actually should not have them, the expected procedure is to change dependency values), but there must be a way to regenerate them after the values of properties they depend on change

      • in fact, some method to "invalidate" or unset the cached value of the instance would be enough, which in practice just sets the cache property to null and thus the next call of the instance will recalculate it

      • the naming suggestions that I can think of are _invalidate{Inst}, _reset{Inst} or _unset{Inst} — but "unset" is my least favorite; it sounds misleading and I'd rather avoid it. It feels like "the opposite of a setter", or "like a setter, but it just clears the property instead of setting it to a new value" — but these interpretations suggest that you're permanently (un)setting something and would read back "no value". That's incorrect, and it misses the point of why you'd want to invalidate the value instance: to allow it to fetch a new meaningful value upon future request.

      • vision for the future: value instances can actually be invalidated automatically already in setters of dependency values, and the methods for invalidating value instances could in turn invalidate other value instances in a recursive manner (however, in some cases, it may be a problem to reach all types that can be anywhere in the object structure, e.g. in all entries of a repeated field, etc.)

        • but this is again quite an advanced thing and can be done without it (by invalidating all dependent instances manually after changing something), so it's rather unlikely to be in the initial shot of serialization
      • simple example (Expr2) — if you change len_orig via a setter, len_mod must be invalidated, otherwise the serialized structure will be inconsistent (because the len_orig has been updated, but str would still use the old wrong cached len_mod)

        types:
          mod_str:
            seq:
              - id: len_orig
                type: u2
              - id: str
                type: str
                size: len_mod
                encoding: UTF-8
              - id: rest
                type: tuple
                size: 3
            instances:
              len_mod:
                value: len_orig - 3
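
        A minimal sketch (not actual compiler output) of what the generated Java code for the len_mod value instance and its invalidation method could look like, following the _invalidate{Inst} naming suggestion above:

            private Integer lenMod;
            public Integer lenMod() {
                if (this.lenMod != null)
                    return this.lenMod;
                // recalculate from the current dependency value
                this.lenMod = Integer.valueOf(lenOrig() - 3);
                return this.lenMod;
            }
            // clears the cached value so that the next lenMod() call recalculates it
            public void _invalidateLenMod() {
                this.lenMod = null;
            }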
    • actually, it now seems clear to me that "invalidating" is the correct operation to implement and "regenerating" is not, because only invalidation allows you to delay the evaluation until you actually want it to happen. The moment you know that the old cached value is no longer valid may not be the right time to load a new value.

        For example, consider saving _io.pos in a value instance (from https://github.com/kaitai-io/coreldraw_cdr.ksy/blob/333988b/coreldraw_cdr.ksy#L1451-L1469, slightly simplified):

        types:
          # ...
          points_list:
            params:
              - id: num_points
                type: s4
            seq:
              - size: 0
                if: ofs_points < 0
              - id: points
                type: point(point_types[_index])
                repeat: expr
                repeat-expr: num_points
            instances:
              ofs_points:
                value: _io.pos
              point_types:
                pos: ofs_points + point_size * num_points
                type: u1
                repeat: expr
                repeat-expr: num_points
              point_size:
                value: sizeof<s4> * 2

        When parsing, the ofs_points value instance remembers the stream position just where points start — this is nice because you have a fixed reference point. However, when editing the parsed file, the points_list may get repositioned in the stream, so ofs_points needs to be invalidated. But the important part is that you invalidate ofs_points in application code (after you change the length of any preceding field and you know this will eventually affect the offset of points), and it doesn't make sense to re-evaluate _io.pos at that same moment.

        In this phase of editing the parsed objects in application code (which will eventually be serialized, but only when everything is ready), _io would probably still be in the state in which parsing of this stream ended (maybe at the end of the stream, but not necessarily); or the input stream may already be closed (so _io.pos may even throw IOException) and writing will be done to a completely different output stream, which will be supplied as a parameter to _write and thus is not available at this point yet.

        So what you really want is to just invalidate ofs_points in application code and not touch it until the _write method of the points_list type is called, where ofs_points would be finally invoked and it would fetch the new _io.pos value valid in the output stream.

    • parse instances — quite problematic, it's difficult to foresee all the problems here

      • they need setters

      • the _write method meant as a counterpart of _read will not (and should not) serialize them, so they must have their own write mechanism

      • likewise, _check probably won't deal with them either (it should exactly accompany _write)

      • not all parse instances should be actually written: some are just lookaheads (purely parsing helpers), some are union members (so only one union member will be used) etc. — it's up to the user to deal with it

      • so the proposed interface would be as follows (assume a parse instance called inst):

        • setter

        • _checkInst — includes all kinds of consistency checks that _check would do, but only for the particular instance

        • _writeInst method — similar to _write, but only for the particular instance (and if you think about it, if the instance is of a user-defined type, the _writeInst would only deal with the seq structure of it, inner instances need to be written again explicitly)

          • the _write overload that allows overriding _io, i.e. the following:

            public class KaitaiStruct {
                // ...
                public abstract static class ReadWrite extends ReadOnly {
                    // ...
                    public void _write(KaitaiStream io) {
                        this._io = io;
                        _write();
                    }
                    // ...
                }
                // ...
            }

            will not be needed, because it's assumed that the _write of seq has already been called, so the current this._io is set to the correct output stream that the user wants (or it may also be a substream created by a parent _write).

            Parse instances also have the io key, so they may use a different stream than the _io of the current type, and overriding the stream that is set in io clearly makes no sense and can only cause inconsistency.


            Actually, the io key itself also rules out any naive ideas like "why not write all instances at the end of the _write method" — and to make it more interesting, let's pretend for a while that the already mentioned fact that not all instances are meant to be written isn't a problem (even though it is). The stream referenced in io can easily be a "future" one, i.e. it's not that hard to imagine this situation in practice (I've seen it several times):

            seq:
              - id: a
                type: foo
              - id: b
                size: 0x100
                type: slot
            types:
              foo:
                seq:
                  - id: baz
                    size: 4
                instances:
                  inst:
                    io: _parent.b._io
                    pos: 0x10
                    type: u1
              slot: {}

            Instances are lazy, so this will work seamlessly when parsing (well, unless we void the laziness e.g. by requiring the value of inst in some parsing decision in seq before the substream of b has been created and assigned — then it doesn't work, of course).

            If we were to naively write inst at the end of Foo._write(), it would be wrong (too soon), because the output substream for b hasn't been created yet (so depending on what _parent.b._io would be at this point, we would be either writing to an old substream left from parsing which will be garbage-collected after it's replaced by the output substream — so that any writes there will be lost, and that's when we are lucky that it wasn't read-only, because in principle it could be; or to a null stream if b is a fresh object with no _io set by the user via a setter, in which case we would usually get a NullPointerException and the _write would end here). It's basically the same reason why this instance must be parsed lazily and eager parsing doesn't work — as you can see, eager writing doesn't work either.

            So this just underlines the correctness of the decision that the user is in charge of writing instances one by one — there is no general easy way around it without very careful consideration.

      • the fact that writing of each parse instance is disconnected from the main _write actually creates room for inconsistencies, because I imagine it won't be hard for the user to forget to write some instance deep in the object tree (so parsing this instance in the serialized stream would read garbage). Unfortunately, I don't think we can help much with that in the general case, because we don't know what instances the user wants to write.

  • user-defined and implicit _root, _parent, _io and _is_le parameters of subtypes

    • they can always be set to an arbitrary value by instantiating the type in application code (for example, when creating a new file from scratch) and injecting it into the enclosing type via a setter, so the parameter values should be checked for consistency in the _check of the type in which the parametric type is placed (from the perspective of the parent type where this _check is performed, subtype._root and subtype._parent should be the same as what the _read method would normally pass to the corresponding parameters when instantiating the type; subtype._io will actually be overridden in _write, so it doesn't matter; user-defined parameters must be equal to what the expression used as the param value in the type invocation would evaluate to)
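
      For example, for the op1 field of type operation from the IfStruct snippet earlier, the parent's _check could contain roughly this (a sketch, not actual generated code):

          public void _check() {
              // op1 must be wired up with the same _root/_parent that _read would pass
              if (op1()._root() != _root())
                  throw new ConsistencyError("op1", op1()._root(), _root());
              if (op1()._parent() != this)
                  throw new ConsistencyError("op1", op1()._parent(), this);
              // user-defined parameters would be compared against the invocation
              // expressions from the .ksy in the same way
          }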

    • if the parametric type was created during parsing, it would be useful to be able to update its parameters (without having to recreate the type to set the new parameter value via constructor and then copy all the values from the old object to this new one)

      • <future-plans>

        I don't think classic setters for direct use by the user are the right answer, because the parameter value should always be the same as the result of the particular expression in the invocation of the parametric type in the parent type, not set to an independent arbitrary value. Although it can't silently cause any inconsistency, because _check will throw a ConsistencyError if the parameter value does not equal what is expected, it's probably not a great API to have a setter just so that the user can (in fact has to) set the only valid value at the time (i.e. why doesn't Kaitai Struct just set the value itself when it's completely determined?).

        A better API would be not to generate usual setters, but a new method _updateParams (naming completely random, of course), which would update _root, _parent, _io (actually, the value of _io doesn't matter at this point and can be null or anything else, because it will be replaced by a parent _write anyway), _is_le and user-defined parameters (using the same expressions as when invoking the type normally in _read or instance getters).

        The _updateParams method must be in the type that includes the parametric type in question — for simplicity, it would update parameters of all subtypes included via seq within the type (so that we don't have to worry about targeting a specific field, or even a specific entry index for repeated fields). OK, the name is now confusing, let's maybe call it _updateParams_Seq. For parametric subtypes in instances, _updateParams{Inst} would probably do. As for the implementation of _updateParams_Seq and _updateParams{Inst}, they are outside the type whose parameters they are supposed to update, so they can only use its public API (i.e. nothing more than the application code can use). Which means that the parametric type either needs to have public setters, or the _updateParams* methods would again have to resort to recreating a new instance of the parametric type from scratch and copying everything to it (which is ugly).

        </future-plans>

        Note: well, this initially simple idea quite grew in complexity, and in the end it looks like we'll need public setters for parameters anyway, so let's simply add them for now and call it a day. The user will use the parameter setters if needed, and the _check in the parent type will inspect the parameters set in its subtypes and report any inconsistencies found.

      • ignore the previous point — just generate setters and add consistency checks to parent _check for:

        • _root

        • _parent

        • _io — unneeded, it would end up not used anyway

          • in fact, the _write method already serves as the _io setter, so you can ask any type to serialize to any KaitaiStream you provide (it's obviously also used by the generated code to serialize all subtypes, as you may have noticed earlier):

            public class KaitaiStruct {
                // ...
                public abstract static class ReadWrite extends ReadOnly {
                    // ...
                    public void _write(KaitaiStream io) {
                        this._io = io;
                        _write();
                    }
                    // ...
                }
            }

            This ability will also be useful for purposes of calculating CRC-32 checksums in application code, for example — you take the object already filled with data that is enough to give you the entire input for CRC-32 once serialized, you _write it to a temporary stream, then extract the bytes it provided and pass them to the CRC-32 function. Since _write is deterministic (unless you do something funny with opaque types I guess), it will produce the same bytes for those values once the actual _write is performed, so the CRC-32 value calculated this way will be valid.
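
                For example (an application code sketch; Header and HEADER_SIZE are made-up names):

                    Header hdr = new Header();
                    // ... fill hdr with all values covered by the checksum ...
                    KaitaiStream tmp = new ByteBufferKaitaiStream(HEADER_SIZE); // fixed-size temporary stream
                    hdr._write(tmp);                                            // serialize into the temporary stream
                    java.util.zip.CRC32 crc = new java.util.zip.CRC32();
                    crc.update(tmp.toByteArray());
                    long checksum = crc.getValue(); // store this back into the object before the real _write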

        • _is_le (if inherited endianness is used)

        • user-defined parameters

  • calculated default endianness (meta/endian/{switch-on,cases})

    • no new conceptual problems, the only tricky thing seems to be the _is_le parameter (for inherited endianness) and parameters have been already addressed in the previous bullet point
  • using _io in expressions (and other places which ultimately have an impact on parsing decisions) and how it can interfere with the compiler's (in)ability to derive sizes automatically (in the future, so that users don't have to do the tedious calculations themselves) — this actually implies quite a fascinating logical problem, I realized

    • maybe a small introduction to the problem by showing how if expressions are used in _write (IfStruct)

              public void _read() {
                  this.opcode = this._io.readU1();
                  if (opcode() == 84) {
                      this.argTuple = new ArgTuple(this._io, this, _root);
                      this.argTuple._read();
                  }
                  if (opcode() == 83) {
                      this.argStr = new ArgStr(this._io, this, _root);
                      this.argStr._read();
                  }
              }
      
              public void _write() {
                  this._io.writeU1(this.opcode);
                  if (opcode() == 84) {
                      this.argTuple._write(this._io);
                  }
                  if (opcode() == 83) {
                      this.argStr._write(this._io);
                  }
              }

      Nothing too surprising, I'd say, but it's important to realize that the user's if expressions are used "as is" in _write, in the same form as in _read.

    • the following occurred to me when pondering the misery of users having to set all stream lengths themselves, and whether it could be fully automated in the future by the generated code while retaining all the features that Kaitai Struct has (including the use of _io in parsing decisions)

    • I realized that it's actually logically impossible in the following case:

      seq:
        - id: len_a
          type: u1
        - id: a
          size: len_a
          type: foo
      types:
        foo:
          seq:
            - id: prefix
              size: 4
            - id: optional_suffix
              size: 6
              if: not _io.eof

      Imagine that you want to derive len_a automatically (let's say we want the contents of the type to exactly fit into the length we calculate, meaning that the last field of the type will align with EOS and parsing even 1 byte from there would result in an EOF error).

      prefix is easy, it always needs 4 bytes, so len_a must be at least 4. Now we need to evaluate the expression "not _io.eof" to know if optional_suffix is present or not. Well, we somehow can't seem to know if we're already at EOF at this point (in order to be able to answer the _io.eof query), because we would have to know that all following fields in seq will occupy 0 bytes.

      Let's break down both answers to the _io.eof question:

      • truenot _io.eof is then false and therefore optional_suffix won't be parsed (so we set len_a = 4).

        A quick check whether this is consistent when parsing - after prefix, are we at EOF? Yes, because _io.pos == 4 and _io.size == 4, so _io.eof is indeed true.

      • falsenot _io.eof is then true and therefore optional_suffix will be parsed (so we would set len_a = 4 + 6 = 10).

        Consistency check - after prefix, are we at EOF? No, because _io.pos == 4 and _io.size == 10, so _io.eof is false.

      So note that if we attempt to calculate what the length must be (without having this size set in advance by the user), if: not _io.eof doesn't tell us anything - we can set it both ways and it always satisfies itself. It looks like it's not this if expression that makes decisions, it's the other way around - the decision that user makes (in this case whether to include optional_suffix or not) affects how not _io.eof will evaluate.

      In some cases, it also seems like you need to be able to reverse expressions to do the right thing:

      types:
        foo:
          seq:
            - id: prefix
              size: 4
            - id: suffix
              size: _io.size - _io.pos - 2

      Now the compiler would be probably supposed to derive that the right size for the stream would be 4 + suffix.length + 2 (because the unparsed two bytes at the end would remain when parsing as well).

      Of course, it doesn't take much work to come up with a number of even weirder cases, because the ability to use _io everywhere in expressions allows you to do a lot of things (and it's not even always possible to guess what the user wants, so even the most automatic implementation that can be would sometimes still have to ask the user to disambiguate).

  • following the previous point, I think it's actually a correct decision to require the precalculated lengths of substreams and only support fixed-size streams, because it automatically solves the problems with referring to the _io (when everything has a known size, it's basically as easy as parsing), and also allows users to reserve additional space for writing instances afterwards and whatnot

    • generally, it follows the whole idea of the initial serialization version - the assumption is that the user somehow knows a bunch of (quite low-level) things and gives them to the KS-generated serializer, which would not have come with these values itself but at least it's able to give feedback whether they're right or wrong (ideally, if _check methods succeed and _write doesn't throw EOF or other errors too, it should be a guarantee that the parser will read back the exact same values properly)

    • however, inside the _check method, we will have to be careful that none of the user expressions that may appear there refers to the _io, because we don't have _io available in _check.

      Simple example - I think the existing PoC implementation checks whether the size value matches the actual length of a byte array, so this would most likely be a problem:

      seq:
        - id: rest
          size: _io.size - _io.pos

      The above essentially does the same as size-eos: true. Nevertheless, I think it still makes sense to check the length of rest (the user must have provided the final _io.size because we only support fixed-size streams, and _io.pos is known as well); you just have to do it in _write just before writing the particular attribute, not in _check as usual. Only in _write do you have the required streams, and they will be in the correct state.

      So we need to analyze all expressions that would normally go into _check to see whether they reference any _io directly or indirectly (i.e. by referring to value instances, which again may refer to _io directly or indirectly), and if they do, these checks must be moved into _write before the particular attribute is written - a possible shape of such generated code is sketched below.
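
      For illustration, this is roughly how the generated Java code for the rest example above could look. It is speculative - the exact exception type, its constructor arguments and the overall method layout are assumptions rather than actual PoC output - but it shows the idea of the length check migrating from _check into _write:

          public void _check() {
              // no length check for `rest` here: its `size` expression refers to _io,
              // which is not available in _check()
          }

          public void _write() {
              // ... preceding attributes ...

              // check moved here from _check(): at this point the fixed-size output
              // stream is attached, so _io.size and _io.pos can be evaluated
              if (rest().length != (this._io.size() - this._io.pos()))
                  throw new ConsistencyError("rest", rest().length, this._io.size() - this._io.pos());
              this._io.writeBytes(this.rest);
          }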

Ideas

  • setters - per Java conventions, a getter should obviously be called get{Something} and a setter set{Something}. However, we already don't follow that for getters, so suddenly "following" it only for setters looks wrong to me:

        public long ofsTags() { return ofsTags; }
        public void setOfsTags(long _v) { ofsTags = _v; }
        public long numTags() { return numTags; }
        public void setNumTags(long _v) { numTags = _v; }
        public ArrayList<Tag> tags() { return tags; }
        public void setTags(ArrayList<Tag> _v) { tags = _v; }

    In our case, it would be more consistent to leave out the common set prefix for setters too in my opinion:

        public long ofsTags() { return ofsTags; }
        public void ofsTags(long _v) { ofsTags = _v; }
        public long numTags() { return numTags; }
        public void numTags(long _v) { numTags = _v; }
        public ArrayList<Tag> tags() { return tags; }
        public void tags(ArrayList<Tag> _v) { tags = _v; }

    Overloading like this is not a problem in Java, and it would even simplify the implementation a bit inside the compiler, as we wouldn't have to construct a special name for setters.

    Also it solves the question of setters for _root and _parent, which currently look weird, as if they were using "snake case" (and renaming them like set_Root doesn't help in my opinion):

        public SwitchBytearray _root() { return _root; }
        public void set_root(SwitchBytearray _v) { _root = _v; }
        public KaitaiStruct.ReadWrite _parent() { return _parent; }
        public void set_parent(KaitaiStruct.ReadWrite _v) { _parent = _v; }

    I don't think removing set makes the API any less enjoyable to use - this is what the asserts after parsing look like:

          assertEquals(r.len1(), 0x10);
          assertEquals(r.block1().number1(), 0x42);
          assertEquals(r.block1().number2(), 0x43);
    
          assertEquals(r.len2(), 0x8);
          assertEquals(r.block2().number1(), 0x44);
          assertEquals(r.block2().number2(), 0x45);
    
          assertEquals(r.finisher(), 0xee);

    and one would serialize the same values as follows:

            BufferedStruct r = new BufferedStruct();
            
            r.len1(0x10);
            
            BufferedStruct.Block block1 = new BufferedStruct.Block();
            block1.number1(0x42);
            block1.number2(0x43);
            r.block1(block1);
    
            r.len2(0x8);
    
            BufferedStruct.Block block2 = new BufferedStruct.Block();
            block2.number1(0x44);
            block2.number2(0x45);
            r.block2(block2);
    
            r.finisher(0xee);
            
            r._write(io);

    instead of:

            BufferedStruct r = new BufferedStruct();
            
            r.setLen1(0x10);
            
            BufferedStruct.Block block1 = new BufferedStruct.Block();
            block1.setNumber1(0x42);
            block1.setNumber2(0x43);
            r.setBlock1(block1);
    
            r.setLen2(0x8);
    
            BufferedStruct.Block block2 = new BufferedStruct.Block();
            block2.setNumber1(0x44);
            block2.setNumber2(0x45);
            r.setBlock2(block2);
    
            r.setFinisher(0xee);
    
            r._write(io);

    It's just a matter of habit.

    This set prefix would probably be specific to Java anyway, because other target languages that we support often use different approaches to setters.

  • testing

    • our test suite of .ksy format specs can of course be reused to test serialization (that's kind of the point), but it's not clear how such tests will be conducted

    • there are several options for what you can do; each option may expose a slightly different set of errors that may exist in the implementation — they correspond to different use cases of how you can use serialization (see kaitai-io/kaitai_struct#27 (comment))

    • naturally, we can take advantage of the parsing that we already have (to get the initial objects to edit and then serialize, or to make sure that the serialized output is consistent with the values that have been written) - it actually gives us interesting testing opportunities

    • on ensuring the consistency property (i.e. _check methods report violated constraints of user-provided values, and _write produces output that parses back to exactly the same values)

      • unchanged parsed objects are always consistent; inconsistencies arise from the user setting properties with constraints to arbitrary values

        • this allows us to do property-based testing, which is easy to implement and I think will be effective in finding bugs (because we already have pretty solid parsing, so if the serialization machinery is doing something wrong, I think we can rely on parsing to detect it):

          • one easy thing to do that might detect faulty _checks would be to run _checks everywhere in the object tree hierarchy of freshly parsed objects and they must report no errors

          • another easy thing to reveal faulty _writes would be to try writing unchanged parsed objects: if parsing the serialized bytes doesn't yield exactly the same values as what was written, you know that some _write messed up (see the sketch at the end of this section)

      • the general idea is this: suppose the user uses setters, constructors to create new objects, etc. to set any values that the actual data types of the properties in the target language accept (for example, in Java anything typed as KaitaiStruct would accept null, because it's just a reference type), and then calls _checks and _write in the way they're supposed to (of course, if you e.g. don't call the _write{Inst} method for a particular instance, it won't be written and thus you won't be able to read it back, but this is expected behavior and purely user error). If everything succeeds, and yet parsing of the serialized output does not exactly match the set values, it means that either there is an error in some of the _write methods (in case the values set by the user are actually consistent), or the user-provided values were not consistent and the appropriate _check method failed to report it.

      • following the previous point, it would be ideal to integrate fuzzing that would set some interesting values via the setters that the KS-generated classes expose. Inevitably, some set of values that the fuzzer provides will be inconsistent, and the idea is to reveal insufficient _checks by first finding a combination of values that passes the _check (I believe that coverage-based fuzzing can help here - as I imagine it, if the first check in _check fails, the fuzzer would try to adjust the value so that this first check passes, and when it does, it would keep it and try to fix the other reported ConsistencyErrors) and then doing a write-read roundtrip to see whether the accepted values were really consistent.

        I would like to avoid having to manually enumerate each specific hardcoded case (because it's too easy to miss important cases, especially when consistency of serialization will be so dependent on user interaction with the objects). Some special serialization use cases would probably have to be tested as traditional unit tests with all these asserts and different input values, but it would be nice to keep that to a minimum, as I believe that a fuzzer can do a better job (instead of a few hardcoded values that take a human a long time to figure out and provide little assurance that it works, a fuzzer has the potential to test millions of such values in almost no time).
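
    For illustration, here is a minimal sketch of the "run _check on freshly parsed objects" and "write unchanged parsed objects back" roundtrip checks described above, using the BufferedStruct format from the earlier example. The input file path is made up, and the fixed-size ByteBufferKaitaiStream constructor as well as the exact _read/_check/_write call sequence follow my understanding of the PoC API and may differ in detail:

        import static org.junit.Assert.assertEquals;

        import java.nio.file.Files;
        import java.nio.file.Paths;
        import io.kaitai.struct.ByteBufferKaitaiStream;

        // ... inside a test method ...
        byte[] original = Files.readAllBytes(Paths.get("src/buffered_struct.bin"));  // hypothetical path

        // parse the reference file (explicit _read(), as in the read-write PoC classes)
        BufferedStruct r = new BufferedStruct(new ByteBufferKaitaiStream(original));
        r._read();

        // 1. _check on a freshly parsed, unchanged object must not report any errors
        r._check();

        // 2. write the unchanged object into a fixed-size stream of the same length...
        ByteBufferKaitaiStream out = new ByteBufferKaitaiStream(original.length);
        r._write(out);

        // ...and parsing the result back must yield exactly the same values
        out.seek(0);
        BufferedStruct r2 = new BufferedStruct(out);
        r2._read();

        assertEquals(r.len1(), r2.len1());
        assertEquals(r.block1().number1(), r2.block1().number1());
        assertEquals(r.block1().number2(), r2.block1().number2());
        assertEquals(r.len2(), r2.len2());
        assertEquals(r.block2().number1(), r2.block2().number1());
        assertEquals(r.block2().number2(), r2.block2().number2());
        assertEquals(r.finisher(), r2.finisher());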
