@unknownbrackets
Created December 31, 2017 18:09

Valkyria Chronicles 3 - File Formats

DATA.BIN datafile

This is a PGD-encrypted CPK file. The file can be decrypted using widely available tools.

CPK file

The CPK itself follows CRIWARE's general format and can be worked with using CRIWARE's crifilesystem utilities. The game appears to accept CPKs generated by older versions of CRIWARE's software.

The general format seems to consist of:

  • CPK header.
  • TOC (file records, filenames) pointed to by header.
  • file data (in blocks) pointed to by TOC.

Everything is aligned to a block size and padded with NULs.

This description is based heavily on the work done by Halley's Comet Software (http://hcs64.com/) and Luigi Auriemma (http://aluigi.altervista.org/).

CPK header

The header is a single row of key-value pairs in the same format as the TOC.

The packet identifier is "CPK " (ending in a U+0020 SPACE.) The rest of the format is described below. Once read, the important values are:

  • ContentOffset: offset within the file to the start of data.
  • TocOffset: offset of TOC table header.
  • TocSize: size of TOC table packet.
  • EnabledPackedSize: double the sum size of all decompressed files.
  • EnabledDataSize: EnabledPackedSize less the bytes saved by compression.
  • Files: number of files in the TOC.
  • Align: block size alignment (2048.)

The last 6 bytes of the CPK header block are "(c)CRI".

CPK TOC

The TOC is multiple rows of key-value pairs in the common table format.

The packet identifier is "TOC " (ending in a U+0020 SPACE.) The rest of the format is described below. The most important keys in the table are:

  • FileName
  • FileSize (with compression)
  • ExtractSize
  • FileOffset

CPK table format

Each table has several sections:

  • Packet header (16 bytes)
  • Table header (8 bytes)
  • Table info (24 bytes)
  • Column schema
  • Row data
  • String data (null terminated)
  • Other data (not null terminated)

Everything is big endian unless otherwise specified.

Packet header

The packet header is the only thing in little endian.

char[4] magic;
uint32  unknown;     // always 0xFF
uint32  packet_size; // length of data after this header.
uint32  unknown;     // always 0

Table header

char[4] magic;       // always "@UTF"
uint32  table_size;  // always packet_size - 8 (size after this header.)

Table info

All offsets are from after the table header unless otherwise specified.

uint32  rows_offset;
uint32  strings_offset;
uint32  data_offset;
uint32  table_name;  // string pointer relative to `strings_offset`.
uint16  num_columns;
uint16  row_length;  // in bytes.
uint32  num_rows;
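
As a rough illustration of the three fixed-size sections above, here is a minimal Python sketch that unpacks them. The function name `read_table_info` and the `TableInfo` tuple are hypothetical names chosen for this example; the field layout and the endianness split (little endian packet header, big endian for the rest) are taken from the descriptions above.

```python
import struct
from typing import NamedTuple


class TableInfo(NamedTuple):
    rows_offset: int
    strings_offset: int
    data_offset: int
    table_name: int   # string pointer relative to strings_offset
    num_columns: int
    row_length: int   # in bytes
    num_rows: int


def read_table_info(packet: bytes):
    """Unpack the packet header, table header and table info of one packet."""
    # Packet header: 16 bytes, the only little endian part.
    magic, _always_ff, packet_size, _always_zero = struct.unpack_from("<4sIII", packet, 0)

    # Table header: 8 bytes, big endian.
    utf_magic, table_size = struct.unpack_from(">4sI", packet, 16)
    if utf_magic != b"@UTF":
        raise ValueError("expected @UTF table header")

    # Table info: 24 bytes, big endian, directly after the table header.
    info = TableInfo(*struct.unpack_from(">IIIIHHI", packet, 24))
    return magic, packet_size, table_size, info
```

The offsets in the returned `TableInfo` still need to be applied relative to the end of the table header (byte 24 of the packet in this sketch).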

Column schema

Each column entry is at least these five bytes, repeated num_columns times. An entry is longer than five bytes when the storage flag is 0x30, because the constant value immediately follows the name pointer.

uint8   flags;
uint32  name;        // string pointer relative to `strings_offset`.

The high four bits of flags are the storage method, and the low 4 bits are the data type.

Storage flags:

  • 0x10: Data has a zero or null value for all rows.
  • 0x30: Data is constant for all rows, and immediately follows name pointer.
  • 0x50: Data varies per row, starting at rows_offset.

Type flags:

  • 0x00, 0x01: uint8
  • 0x02, 0x03: uint16
  • 0x04, 0x05: uint32
  • 0x06, 0x07: uint64
  • 0x08: float (single precision, 32-bit)
  • 0x0A: uint32 string pointer relative to strings_offset.
  • 0x0B: two uint32s - a pointer relative to data_offset, and data length.

Other values may exist but aren't yet known.

Row data

After the column schema, the row data immediately follows. There will be num_rows * row_length bytes of data. It's stored as specified in the column schema above, each row with its columns in the same order as above.
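
To make the schema and row layout more concrete, the following Python sketch reads every row of a table into a dictionary. It is only a sketch under some assumptions: `read_utf_rows` is a hypothetical name, the `table` argument is assumed to start at the table info (right after the 8 byte table header), strings are decoded as ASCII for simplicity, and only the storage and type flags listed above are handled.

```python
import struct

# Type flag -> struct format character, as listed under "Type flags".
TYPE_FORMATS = {0x00: "B", 0x01: "B", 0x02: "H", 0x03: "H",
                0x04: "I", 0x05: "I", 0x06: "Q", 0x07: "Q", 0x08: "f"}


def read_utf_rows(table: bytes) -> list:
    """Return one dict per row; `table` starts right after the table header."""
    (rows_off, strings_off, data_off, _name,
     num_cols, row_len, num_rows) = struct.unpack_from(">IIIIHHI", table, 0)

    def cstring(ptr):
        start = strings_off + ptr
        end = table.index(b"\0", start)
        return table[start:end].decode("ascii", "replace")

    def read_value(type_flag, pos):
        """Read one value at pos; return (value, position after it)."""
        if type_flag in TYPE_FORMATS:
            fmt = ">" + TYPE_FORMATS[type_flag]
            return struct.unpack_from(fmt, table, pos)[0], pos + struct.calcsize(fmt)
        if type_flag == 0x0A:  # string pointer
            (ptr,) = struct.unpack_from(">I", table, pos)
            return cstring(ptr), pos + 4
        if type_flag == 0x0B:  # pointer + length into the data section
            ptr, size = struct.unpack_from(">II", table, pos)
            return table[data_off + ptr:data_off + ptr + size], pos + 8
        raise ValueError("unknown type flag 0x%02X" % type_flag)

    # Column schema starts right after the 24 byte table info.
    columns = []
    pos = 24
    for _ in range(num_cols):
        flags, name_ptr = struct.unpack_from(">BI", table, pos)
        pos += 5
        storage, type_flag = flags & 0xF0, flags & 0x0F
        constant = None  # storage 0x10: zero/null for all rows
        if storage == 0x30:  # constant value follows the name pointer
            constant, pos = read_value(type_flag, pos)
        columns.append((cstring(name_ptr), storage, type_flag, constant))

    rows = []
    for r in range(num_rows):
        pos = rows_off + r * row_len
        row = {}
        for name, storage, type_flag, constant in columns:
            if storage == 0x50:  # per-row value
                row[name], pos = read_value(type_flag, pos)
            else:
                row[name] = constant
        rows.append(row)
    return rows
```

Columns with storage 0x10 come back as `None` here; a real tool might prefer 0 or an empty string depending on the type.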

CPK compression

Some of the files within the CPK are compressed, as represented by FileSize being smaller than ExtractSize. These use a form of sliding-window compression.

CRILAYLA format

Everything in the data is little endian (unlike the big endian CPK.) The general format of the content is:

  • 16 byte header
  • compressed payload
  • 256 bytes of uncompressed data

Worth noting is that decompressing the file will give you the original data, but backwards. The 256 bytes of uncompressed data are actually the first part of the file.

It's unclear why this algorithm is used, because it's likely to be slower and less space efficient than e.g. zlib. A file smaller than 275 bytes cannot be made smaller using this algorithm.

The header format is:

char[8] magic;             // "CRILAYLA".
uint32  uncompressed_size; // not counting uncompressed portion.
uint32  compressed_size;   // bytes of compressed data.

Compressed payload

The data is stored at the bit level, in small patterns of backreferences or raw data. A decompressor should read bytes off the end of the file, and read bits from those bytes with the most significant bit first. Decompression also builds the file backwards.

Each pattern starts with a bit, is_backref.

When is_backref is 0, 8 bits of raw data immediately follow. It then resets to the is_backref state (next pattern.)

When is_backref is 1, 13 bits follow which indicate how far back in the uncompressed data buffer to go for the end of 3 bytes of data to repeat. For example, if the current position was 4, and the offset was 0, then bytes 2, 3, and 4 would repeat as bytes 5, 6, and 7 respectively.

After this, 2 bits follow indicating more bytes to copy (going forwards.) In our previous example, if these 2 bits had the value 1 (0b01), then byte 5 (which was just copied from byte 2) would be copied to byte 8.

If these 2 bits were both set (0b11), then more bits follow for more bytes to copy. This repeats, like so (starting from is_backref for clarity):

uint1   is_backref;
uint13  backref_pos;
uint2   more_bytes2;
uint3   more_bytes3;
uint5   more_bytes5;
uint8   more_bytes8;
...
uint8   more_bytesN;

It stops when not all of the bits in the value are set. That is, the last 8 bits repeat forever until they are not all set (e.g. < 255.) If more_bytes2 is < 3, then there will be no more_bytes3.

So for example, to express copying 12 bytes, it would be encoded as (after the backref_pos): initial 3 + 3 + 6, and then it'd stop. That would only require 5 bits (or 19 if you include is_backref and backref_pos.)

It's not clear if a chunk can overlap itself - that is, it's probably best to perform the copy byte by byte (like LZ77) instead of the entire batch of e.g. 255 bytes at a time.

After that, it returns to the is_backref state.

Note that a backreference can't go farther back than 2 ^ 13 - 1 + 3 or 8194 bytes.
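
Putting the pieces of this section together, a decompressor could look roughly like the Python sketch below. It is an illustration of the description above rather than a verified implementation: `crilayla_decompress` is a hypothetical name, the 256 uncompressed bytes are assumed to sit directly after the compressed payload, and there is no protection against malformed input. Positions are expressed in terms of the final (un-reversed) output, so the write position moves downwards and a backreference points at a higher address.

```python
import struct


def crilayla_decompress(blob: bytes) -> bytes:
    """Decompress one CRILAYLA stream (sketch, no error recovery)."""
    magic, tail_size, packed_size = struct.unpack_from("<8sII", blob, 0)
    if magic != b"CRILAYLA":
        raise ValueError("not a CRILAYLA stream")

    # The 256 raw bytes after the payload are the start of the file.
    head_start = 16 + packed_size
    out = bytearray(blob[head_start:head_start + 256]) + bytearray(tail_size)

    src = head_start - 1  # read payload bytes from the end, backwards
    bit_buf = 0
    bits_left = 0

    def read_bits(count):
        nonlocal src, bit_buf, bits_left
        value = 0
        for _ in range(count):
            if bits_left == 0:
                bit_buf = blob[src]
                src -= 1
                bits_left = 8
            bits_left -= 1  # most significant bit first
            value = (value << 1) | ((bit_buf >> bits_left) & 1)
        return value

    dst = len(out) - 1  # output is built backwards, down to byte 256
    while dst >= 256:
        if read_bits(1) == 0:  # is_backref == 0: one raw byte
            out[dst] = read_bits(8)
            dst -= 1
            continue

        ref = dst + read_bits(13) + 3  # backref_pos, minimum distance 3
        length = 3
        for width in (2, 3, 5, 8):  # more_bytes2 / 3 / 5 / 8
            extra = read_bits(width)
            length += extra
            if extra != (1 << width) - 1:
                break
        else:
            # Every field so far was all ones: keep reading 8 bit fields.
            while True:
                extra = read_bits(8)
                length += extra
                if extra != 255:
                    break

        for _ in range(length):  # byte by byte, in case of overlap
            out[dst] = out[ref]
            dst -= 1
            ref -= 1

    return bytes(out)
```

The copy is done one byte at a time so that a chunk overlapping itself still behaves like a classic LZ77 copy.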

Files inside the CPK

There are several types of files in the CPK, but they all use a common overall format. High level, it looks something like this:

  • packet header
    • packet data
    • packet header
      • packet data
  • packet header
    • packet data

The NAD files don't exactly conform to this format, though, for some reason.

Packet headers

Each packet has a 4 byte identifier, like MTPA. The header appears to follow this layout (in little endian):

char[4] magic;
uint32  packet_size; // not including header, round up to 16.
uint32  header_size;
uint32  flags;

If the header_size is 32 or greater (except for MSCR packets), the next 16 bytes are as follows:

uint32  unknown;
uint32  data_size;   // not including header, round up to 16.
uint32  unknown;
uint32  unknown;
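
For illustration, the header layout above could be read with a small Python helper like the one below. The names `read_packet_header` and `align16` are hypothetical, and treating "round up to 16" as rounding the stored size to the next multiple of 16 is an assumption based on the field comments.

```python
import struct


def align16(size: int) -> int:
    """Round a size up to the next multiple of 16 (assumed meaning of 'round up to 16')."""
    return (size + 15) & ~15


def read_packet_header(buf: bytes, offset: int = 0) -> dict:
    """Read the 16 byte packet header, plus data_size when a longer header is present."""
    magic, packet_size, header_size, flags = struct.unpack_from("<4sIII", buf, offset)
    header = {
        "magic": magic,
        "packet_size": packet_size,  # not including the header
        "header_size": header_size,
        "flags": flags,
    }
    # Headers of 32 bytes or more carry 16 extra bytes, except for MSCR packets.
    if header_size >= 32 and magic != b"MSCR":
        _unk1, data_size, _unk2, _unk3 = struct.unpack_from("<IIII", buf, offset + 16)
        header["data_size"] = data_size
    return header
```

The packet data would then start at `offset + header["header_size"]`.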

Packet data and sub-packets

For packets with headers of 32 bytes or more, there may be "sub-packets." These generally look like:

+------------------------------+
| Containing Packet Header     |
+------------------------------+
| Containing Packet Data       |
| ...                          |
|                              |
| +--------------------------+ |
| | Sub Packet #1 Header     | |
| +--------------------------+ |
| | Sub Packet #1 Data       | |
| |                          | |
| +--------------------------+ |
| | Sub Packet #2 Header     | |
| +--------------------------+ |
| | Sub Packet #2 Data       | |
| |                          | |
| +--------------------------+ |
+------------------------------+

Everything is aligned to 16 bytes, which is very convenient for hex editors. The data size or packet size can be 0, which means there's no data.

Sub-structures

Even though the files already have a nesting capability for the headers, sometimes there will be a data packet that is opaque, but itself is just another file formatted in this same way (with headers and nesting all over again.)

For example, MLX files (which contain graphics) have an IZCA packet that works exactly this way.

XOR encryption

Some files have their data segments encrypted using a rolling XOR. How it determines the first byte is not understood, but this is generally not needed because the files follow a consistent format.

If the "flags" uint32 in the header has its 19th bit set (0x40000), then this encryption is being used.

You can simply XOR each byte by the previous byte (pre-encrypted.) This is easy to decrypt and re-encrypt.

MTP files (MTPA packets)

MTPA packets are fairly simple and just have the Shift-JIS text with each byte incremented by one for no apparent reason.

Note that pointers within the data are generally relative to the header. That is, if the header is 32 bytes, then 0x20 would point to the beginning of the data.

struct info_header (16 bytes)
    uint32 unknown5         always 0x4000000f
    uint32 pointer_count    number of pointer records
    uint32 data_size        number of uint32s each data record is
    uint32 data_count       number of data records

struct unknown6[]           repeats data_size times
    uint32 unknown7         always <= 2

<pointer segment>
struct pointer_record[]     repeats pointer_count times
    uint32 data_pos         pointer into data record segment

<data_segment>
struct data_record[]        repeats data_count times
if data_size = 2
    uint32 id               id of voice data within OD_VOICE.AFS
    uint32 text_pos         position of text within text segment
if data_size = 4
    uint32 flags1?          unknown meaning, varies wildly
    uint32 id               id of voice data within OD_VOICE.AFS
    uint32 text_pos         position of text within text segment
    uint32 flags3?          0x00 or 0x01 with 4 mysterious unique exceptions

struct unknown8[]           always once?
    uint32 unknown9         unknown meaning
        ** EACH BYTE INCREMENTED **

<text_segment>
struct text_record[]        undetermined length?
    ubyte* shiftjis         text in shift jis, null terminated
        ** EACH BYTE INCREMENTED **

struct text_padding
    ubyte* padding          always 0x00 (padding to align to 4 bytes)
        ** EACH BYTE INCREMENTED **

struct footer_padding
    uint32 padding          always 0x00 (padding to align ENRS)
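
As a small illustration of the "each byte incremented" scheme, the Python sketch below recovers one string from the text segment. `decode_mtp_text` is a hypothetical helper; it assumes the null terminator is incremented along with the text (so it is stored as 0x01), and that `text_pos` has already been converted to an offset within the bytes passed in.

```python
def decode_mtp_text(segment: bytes, text_pos: int = 0) -> str:
    """Decode one null terminated Shift-JIS string stored with every byte + 1."""
    # Undo the increment, wrapping around at 256.
    plain = bytes((b - 1) & 0xFF for b in segment[text_pos:])

    # Cut at the (now restored) null terminator, if there is one.
    end = plain.find(b"\x00")
    if end != -1:
        plain = plain[:end]
    return plain.decode("shift_jis")
```

Re-encoding for translation work would be the reverse: encode as Shift-JIS, append the terminator and padding, then add one to each byte.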

MXE files (MXEC packets)

MXEC packets are quite complicated, but consistent.

Note that pointers within the data are generally relative to the header. That is, if the header is 32 bytes, then 0x20 would point to the beginning of the data.

    uint32 unknown                varies, xor, doesn't seem important
    uint32 unknown                always 0x60
    uint32 something4_header_ptr  0x00 or pointer to something4 header.
    uint32 something2_header_ptr  0x00 or pointer to something2 header.
    uint32 unknown                meaning unknown, 0x00/0x01.
    uint32 unknown                always 0x00 (ends at 24)
    uint32 unknown                sometimes 0x00 or 0x01? MAYBE something6_count??
    uint32 something6_ptr         pointer to something6 data.
    uint32[9] unknown             always 0x00 (ends at 68)
    uint32 something1_count       number of something1 records
    uint32 unknown                always 0xA0
    uint32[13] unknown            always 0x00

something1[]                      always something1_count of them.
    uint32 id                     seems like an id, counts up...
    uint32 type_ptr               points to ascii identifier in file.
    uint32 length                 length of data.
    uint32 data_ptr               points to beginning of data.

<something1 data>                 variable in size, ends at last data_ptr + last length.
    uint32[] varies               varies per record type.
(padded by 0x00 to a multiple of 16 bytes.)

<something4 header> (optional)
    uint32 unknown                always 0x00
    uint32 something4_count       number of something4 records.
    uint32 something4_ptr         pointer to something4 records.
    uint32 unknown                always 0x00
    uint32[12] unknown            always 0x00

something4[]                      always something4_count of them.
    uint32 unknown                increasing, appears pointer like?  unknown meaning.
    uint32 unknown                optional text pointer, sometimes 0x00.
    uint32 something5_count       count of sub-something5's inside the something4.
    uint32 something5_ptr         pointer to the something5 records.
    uint32[6] unknown             always 0x00
    uint32 unknown                usually 0x00, sometimes 0x01?
    uint32 weird_ptr              0x00 or pointer after text segment.
    uint32[4] unknown             always 0x00

something5[]                      always something5_count of them PER something4.
    uint32 text_ptr               pointer to an ascii identifier.
    uint32 unknown                seems like a number?  maybe value for text_ptr.
    uint32 data_ptr               points to some extra data.
    uint32 unknown                always 0x00?

<something5 data>                 variable size?
    uint32 unknown                one per pointer?

(padded to a multiple of 16 bytes after ALL the something5s.)

<something2 header> (optional)
    uint32 unknown                always 0x00
    uint32 something2_count       count of something2 records.
    uint32 something2_ptr         pointer to something2 records.
    uint32 something3_count       count of something3 records.
    uint32 something3_ptr         pointer to something3 records.
    uint32[3] unknown             always 0x00

something2[]                      always something2_count of them.
    uint32[2] unknown             unknown meaning.
    uint32 path_ptr               pointer to path string.
    uint32 filename_ptr           pointer to filename string.
    uint32[6] unknown             unknown meaning.

something3[]                      always something3_count of them.
    uint32 unknown                increases, seems like a pointer?  doesn't seem to match file?
(padded to a multiple of 16 bytes.)

<something6 data> (optional)
    uint32 unknown                unknown meaning.
(padded to a multiple of 16 bytes.)

<text starts here>
(padded to a multiple of 16 bytes.)

<weird data>
(padded to a multiple of 16 bytes.)

Not everything is understood. Some of the records have varying structure defined by their identifier pointer, and may have string pointers embedded within those structures.

The text itself is in Shift-JIS, null terminated.

Comparison with Valkyria Chronicles 2

Valkyria Chronicles 2 seems to use the same data files and formats in general. The primary difference is that, rather than PGD-encrypting the DATA.BIN file, it encrypts each file within the CPK.

Each file within the CPK has a 16 byte header which serves as a key.

The file is treated as a series of sets of 4 uint32s, and uses the following basic algorithm:

uint32[4] key;
uint32[] data;

for (int i = 0; i < data.length; i++)
{
    int key_i = i % 4;

    key[key_i] = key[key_i] * 3 + 1;
    data[i] ^= key[key_i];
}

However, when it hits EOFC packets or other boundaries, it appears to do something different, so this is not a complete description of the format.

TODO: Possibly these are int32s not uint32s, causing the discrepancy?
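
For experimenting with the basic algorithm above, here is a Python version. It is a sketch with several assumptions: `vc2_xor_words` is a hypothetical name, the words are read as little endian, the key is taken from the first 16 bytes and the data from everything after them, and the multiplication is wrapped to 32 bits as unsigned arithmetic would do. It does not handle the EOFC or boundary behavior mentioned above.

```python
import struct


def vc2_xor_words(blob: bytes) -> bytes:
    """Apply the rolling key XOR to everything after the 16 byte key header."""
    key = list(struct.unpack_from("<4I", blob, 0))
    body = blob[16:]

    count = len(body) // 4
    words = list(struct.unpack_from("<%dI" % count, body, 0))

    for i in range(count):
        k = i % 4
        key[k] = (key[k] * 3 + 1) & 0xFFFFFFFF  # wrap like a uint32
        words[i] ^= key[k]

    # Any trailing bytes that do not fill a whole word are passed through.
    return struct.pack("<%dI" % count, *words) + body[count * 4:]
```

Because the key stream depends only on the key, running the function again on the key followed by its own output gives back the original data.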
