This is a PGD-encrypted CPK file. The file can be decrypted using widely available tools.
The actual CPK has a general format, and can be worked with using CRIWARE's crifilesystem utilities. The game appears to accept CPKs generated by older versions of CRIWARE's software.
The general format seems to consist of:
- CPK header.
- TOC (file records, filenames) pointed to by header.
- file data (in blocks) pointed to by TOC.
Everything is aligned to a block size and padded with NULs.
This description is based heavily on the work done by Halley's Comet Software (http://hcs64.com/) and Luigi Auriemma (http://aluigi.altervista.org/).
The header is a single row of key-value pairs in the same format as the TOC.
The packet identifier is "CPK " (ending in a U+0020 SPACE.) The rest of the format is described below. Once read, the important values are:
- ContentOffset: offset within the file to the start of data.
- TocOffset: offset of TOC table header.
- TocSize: size of TOC table packet.
- EnabledPackedSize: double the sum size of all decompressed files.
- EnabledDataSize: EnabledPackedSize less the bytes saved by compression.
- Files: number of files in the TOC.
- Align: block size alignment (2048.)
The last 6 bytes of the CPK header block are "(c)CRI".
The TOC is multiple rows of key-value pairs in the common table format.
The packet identifier is "TOC " (ending in a U+0020 SPACE.) The rest of the format is described below. The most important keys in the table are:
- FileName
- FileSize (with compression)
- ExtractSize
- FileOffset
Each table has several sections:
- Packet header (16 bytes)
- Table header (8 bytes)
- Table info (24 bytes)
- Column schema
- Row data
- String data (null terminated)
- Other data (not null terminated)
Everything is big endian unless otherwise specified.
The packet header is the only thing in little endian.
char[4] magic;
uint32 unknown; // always 0xFF
uint32 packet_size; // length of data after this header.
uint32 unknown; // always 0
char[4] magic; // always "@UTF"
uint32 table_size; // always packet_size - 8 (size after this header.)
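The two headers above could be read along these lines. This is a minimal sketch, not a reference implementation: the function name `parse_table_headers` and the returned dictionary keys are made up for illustration, and `table_base` is just a convenient name for "the position right after the table header."

```python
import struct

def parse_table_headers(buf, offset=0):
    """Parse the 16-byte packet header and the 8-byte table header."""
    # Packet header: the only little-endian part of the table.
    magic, _unknown1, packet_size, _unknown2 = struct.unpack_from("<4sIII", buf, offset)
    # Table header: big endian, like everything that follows it.
    utf_magic, table_size = struct.unpack_from(">4sI", buf, offset + 16)
    if utf_magic != b"@UTF":
        raise ValueError("expected @UTF table header, got %r" % utf_magic)
    return {
        "magic": magic,              # e.g. b"CPK " or b"TOC "
        "packet_size": packet_size,
        "table_size": table_size,
        # Offsets in the table info are measured from this position.
        "table_base": offset + 24,
    }
```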
All offsets are from after the table header unless otherwise specified.
uint32 rows_offset;
uint32 strings_offset;
uint32 data_offset;
uint32 table_name; // string pointer relative to `strings_offset`.
uint16 num_columns;
uint16 row_length; // in bytes.
uint32 num_rows;
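As a rough illustration of the table info layout, the sketch below unpacks the 24 bytes and converts the three offsets to absolute positions. The names `TableInfo`, `parse_table_info`, `read_string` and `table_base` (the position right after the table header) are hypothetical helpers, and strings are returned as raw bytes because their encoding is not stated here.

```python
import struct
from collections import namedtuple

TableInfo = namedtuple(
    "TableInfo",
    "rows_offset strings_offset data_offset table_name num_columns row_length num_rows",
)

def parse_table_info(buf, table_base):
    """Read the 24-byte, big-endian table info found at table_base."""
    info = TableInfo(*struct.unpack_from(">IIIIHHI", buf, table_base))
    # Stored offsets are relative to the end of the table header.
    return info._replace(
        rows_offset=table_base + info.rows_offset,
        strings_offset=table_base + info.strings_offset,
        data_offset=table_base + info.data_offset,
    )

def read_string(buf, info, pointer):
    """Return the null-terminated string at strings_offset + pointer."""
    start = info.strings_offset + pointer
    end = buf.index(b"\x00", start)
    return buf[start:end]
```

The column schema starts at `table_base + 24`, immediately after these fields.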
These five bytes repeat for each column (num_columns times.) In some cases, the column is more than five bytes (e.g. when the storage bits of flags are 0x30, because the constant value follows the name pointer.)
uint8 flags;
uint32 name; // string pointer relative to `strings_offset`.
The high four bits of flags are the storage method, and the low 4 bits are the data type.
Storage flags:
- 0x10: Data has a zero or null value for all rows.
- 0x30: Data is constant for all rows, and immediately follows name pointer.
- 0x50: Data varies per row, starting at rows_offset.
Type flags:
- 0x00, 0x01: uint8
- 0x02, 0x03: uint16
- 0x04, 0x05: uint32
- 0x06, 0x07: uint64
- 0x08: float (single precision, 32-bit)
- 0x0A: uint32 string pointer relative to strings_offset.
- 0x0B: two uint32s - a pointer relative to data_offset, and data length.
Other values may exist but aren't yet known.
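The schema rules above might be walked like this. It is only a sketch: `TYPE_FORMATS` and `parse_columns` are illustrative names, and any type code outside the list above will raise a `KeyError` because its size is not known.

```python
import struct

# Big-endian struct formats for the known type codes.
TYPE_FORMATS = {
    0x00: ">B", 0x01: ">B", 0x02: ">H", 0x03: ">H",
    0x04: ">I", 0x05: ">I", 0x06: ">Q", 0x07: ">Q",
    0x08: ">f", 0x0A: ">I", 0x0B: ">II",
}

def parse_columns(buf, offset, num_columns):
    """Return (columns, offset just past the schema)."""
    columns = []
    for _ in range(num_columns):
        flags, name_ptr = struct.unpack_from(">BI", buf, offset)
        offset += 5
        storage = flags & 0xF0    # high four bits: storage method
        type_code = flags & 0x0F  # low four bits: data type
        constant = None
        if storage == 0x30:
            # Constant columns carry their value right after the name pointer.
            fmt = TYPE_FORMATS[type_code]
            values = struct.unpack_from(fmt, buf, offset)
            offset += struct.calcsize(fmt)
            constant = values[0] if len(values) == 1 else values
        columns.append({
            "name_ptr": name_ptr,
            "storage": storage,
            "type": type_code,
            "constant": constant,
        })
    return columns, offset
```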
After the column schema, the row data immediately follows. There will be num_rows * row_length bytes of data. It's stored as specified in the column schema above, each row with its columns in the same order as above.
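Reading the rows could then look roughly like the following. The tuple layout `(name, storage, type_code, constant)` for each column is an assumption made for this example, as are the names `read_rows` and `TYPE_FORMATS`. Only columns with storage 0x50 consume bytes from the row; 0x10 columns are reported as `None`.

```python
import struct

# Big-endian struct formats for the known type codes.
TYPE_FORMATS = {
    0x00: ">B", 0x01: ">B", 0x02: ">H", 0x03: ">H",
    0x04: ">I", 0x05: ">I", 0x06: ">Q", 0x07: ">Q",
    0x08: ">f", 0x0A: ">I", 0x0B: ">II",
}

def read_rows(buf, rows_offset, row_length, num_rows, columns):
    """Return one dict per row, keyed by column name."""
    rows = []
    for r in range(num_rows):
        pos = rows_offset + r * row_length
        row = {}
        for name, storage, type_code, constant in columns:
            if storage == 0x50:
                # Per-row value, stored in schema order.
                fmt = TYPE_FORMATS[type_code]
                values = struct.unpack_from(fmt, buf, pos)
                pos += struct.calcsize(fmt)
                row[name] = values[0] if len(values) == 1 else values
            elif storage == 0x30:
                row[name] = constant  # same value for every row
            else:
                row[name] = None      # 0x10: zero or null
        rows.append(row)
    return rows
```

String (0x0A) and data (0x0B) values come back as raw pointers here; resolving them against strings_offset and data_offset is left to the caller.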
Some of the files within the CPK are compressed, as represented by FileSize being smaller than ExtractSize. These use a form of sliding-window compression.
Everything in the data is little endian (unlike the big endian CPK.) The general format of the content is:
- 16 byte header
- compressed payload
- 256 bytes of uncompressed data
Worth noting is that decompressing the file will give you the original data, but backwards. The 256 bytes of uncompressed data are actually the first part of the file.
It's unclear why this algorithm is used, because it's likely to be slower and less space efficient than e.g. zlib. A file smaller than 275 bytes cannot be made smaller using this algorithm.
The header format is:
char[8] magic; // "CRILAYLA".
uint32 uncompressed_size; // not counting uncompressed portion.
uint32 compressed_size; // bytes of compressed data.
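A small sketch of reading this header follows. The function name and the derived fields (`raw_prefix_offset`, `extracted_size`) are illustrative; they simply restate the layout described above, with the 256 uncompressed bytes sitting right after the compressed payload.

```python
import struct

def parse_crilayla_header(blob):
    """Parse the 16-byte, little-endian CRILAYLA header."""
    magic, uncompressed_size, compressed_size = struct.unpack_from("<8sII", blob, 0)
    if magic != b"CRILAYLA":
        raise ValueError("not a CRILAYLA stream")
    return {
        "uncompressed_size": uncompressed_size,
        "compressed_size": compressed_size,
        "payload_offset": 16,
        # The 256 raw bytes follow the compressed payload.
        "raw_prefix_offset": 16 + compressed_size,
        # Total size once the raw bytes are put back at the front.
        "extracted_size": uncompressed_size + 256,
    }
```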
The data is stored at the bit level, in small patterns of backreferences or raw data. A decompressor should read bytes off the end of the file, and read bits from those bytes with the most significant bit first. Decompression also builds the file backwards.
Each pattern starts with a bit, is_backref.
When is_backref is 0, 8 bits of raw data immediately follow. It then resets to the is_backref state (next pattern.)
When is_backref is 1, 13 bits follow which indicate how far back in the uncompressed data buffer to go for the end of 3 bytes of data to repeat. For example, if the current position was 4, and the offset was 0, then bytes 2, 3, and 4 would repeat as bytes 5, 6, and 7 respectively.
After this, 2 bits follow indicating more bytes to copy (going forwards.) In our previous example, if these 2 bits were "1", then byte 5 (which was just copied from byte 2) would be copied to byte 8.
If these 2 bits were both set (0b11), then more bits follow for more bytes to copy. This repeats, like so (starting from is_backref for clarity):
uint1 is_backref;
uint13 backref_pos;
uint2 more_bytes2;
uint3 more_bytes3;
uint5 more_bytes5;
uint8 more_bytes8;
...
uint8 more_bytesN;
It stops when not all of the bits in the value are set. That is, the last 8 bits repeat forever until they are not all set (e.g. < 255.) If more_bytes2 is < 3, then there will be no more_bytes3.
So for example, to express copying 12 bytes, it would be encoded as (after the backref_pos): initial 3 + 3 + 6, and then it'd stop. That would only require 5 bits (or 19 if you include is_backref and backref_pos.)
It's not clear if a chunk can overlap itself - that is, it's probably best to perform the copy byte by byte (like LZ77) instead of the entire batch of e.g. 255 bytes at a time.
After that, it returns to the is_backref state.
Note that a backreference can't go farther back than 2 ^ 13 - 1 + 3 or 8194 bytes.
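Putting the pieces of this section together, a decompressor might look like the sketch below. It follows the description above rather than any official source, assumes the blob is laid out exactly as header, payload, then 256 raw bytes, and does no bounds checking on malformed input. The name `decompress_crilayla` is made up for this example. The copy is done byte by byte, so overlapping references behave like LZ77.

```python
import struct

def decompress_crilayla(blob):
    magic, out_size, comp_size = struct.unpack_from("<8sII", blob, 0)
    if magic != b"CRILAYLA":
        raise ValueError("not a CRILAYLA stream")
    prefix = blob[16 + comp_size:16 + comp_size + 256]
    if len(prefix) != 256:
        raise ValueError("missing 256-byte uncompressed block")
    out = bytearray(256 + out_size)
    out[:256] = prefix               # the raw bytes are the start of the file

    in_pos = 16 + comp_size - 1      # read bytes from the end of the payload
    bit_pool = 0
    bits_left = 0

    def read_bits(count):
        nonlocal in_pos, bit_pool, bits_left
        value = 0
        for _ in range(count):
            if bits_left == 0:
                bit_pool = blob[in_pos]
                in_pos -= 1
                bits_left = 8
            bits_left -= 1           # most significant bit first
            value = (value << 1) | ((bit_pool >> bits_left) & 1)
        return value

    write_pos = len(out) - 1         # the output is built backwards
    while write_pos > 255:
        if read_bits(1) == 0:        # raw byte
            out[write_pos] = read_bits(8)
            write_pos -= 1
            continue
        src = write_pos + read_bits(13) + 3
        length = 3
        for width in (2, 3, 5, 8):
            extra = read_bits(width)
            length += extra
            if extra != (1 << width) - 1:
                break
        else:
            # Every field so far was all ones: keep reading 8-bit fields.
            while True:
                extra = read_bits(8)
                length += extra
                if extra != 255:
                    break
        for _ in range(length):
            out[write_pos] = out[src]
            write_pos -= 1
            src -= 1
    return bytes(out)
```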
There are several types of files in the CPK, but they all use a common overall format. High level, it looks something like this:
- packet header
- packet data
- packet header
- packet data
- packet header
- packet data
The NAD files don't exactly conform to this format, though, for some reason.
Each packet has a 4 byte identifier, like MTPA. The header appears to follow this format (in little endian):
char[4] magic;
uint32 packet_size; // not including header, round up to 16.
uint32 header_size;
uint32 flags;
If the header_size is 32 or greater (except for MSCR packets), the next 16 bytes are as follows:
uint32 unknown;
uint32 data_size; // not including header, round up to 16.
uint32 unknown;
uint32 unknown;
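The header layout above could be parsed roughly as follows. This is a sketch with made-up names (`parse_packet_header`, the dictionary keys), and it assumes the packet data begins `header_size` bytes after the start of the header.

```python
import struct

def parse_packet_header(buf, offset=0):
    """Parse a little-endian packet header at the given offset."""
    magic, packet_size, header_size, flags = struct.unpack_from("<4sIII", buf, offset)
    header = {
        "magic": magic,
        "packet_size": packet_size,
        "header_size": header_size,
        "flags": flags,
        "data_size": None,
        # Assumption: the data begins right after the header.
        "data_offset": offset + header_size,
    }
    if header_size >= 32 and magic != b"MSCR":
        # Extended part of the header; only data_size is understood.
        _unk1, data_size, _unk2, _unk3 = struct.unpack_from("<IIII", buf, offset + 16)
        header["data_size"] = data_size
    return header
```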
For packets with headers of 32 bytes or more, there may be "sub-packets." These generally look like:
+------------------------------+
| Containing Packet Header |
+------------------------------+
| Containing Packet Data |
| ... |
| |
| +--------------------------+ |
| | Sub Packet #1 Header | |
| +--------------------------+ |
| | Sub Packet #1 Data | |
| | | |
| +--------------------------+ |
| | Sub Packet #2 Header | |
| +--------------------------+ |
| | Sub Packet #2 Data | |
| | | |
| +--------------------------+ |
+------------------------------+
Everything is aligned to 16 bytes, which is very convenient for hex editors. The data size or packet size can be 0, which means there's no data.
Even though the files already have a nesting capability for the headers, sometimes there will be a data packet that is opaque, but itself is just another file formatted in this same way (with headers and nesting all over again.)
For example, MLX files (which contain graphics) have an IZCA packet that works exactly this way.
Some files have their data segments encrypted using a rolling XOR. How it determines the first byte is not understood, but that is generally not needed because the files follow a consistent format.
If the "flags" uint32 in the header has its 19th bit set (0x40000), then this encryption is being used.
You can simply XOR each byte by the previous byte (pre-encrypted.) This is easy to decrypt and re-encrypt.
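One way to read the description above is sketched here. It rests on two assumptions that are not confirmed: that "the previous byte" means the previous byte as stored in the encrypted file, and that the first byte is left as-is because its derivation is unknown. All three function names are hypothetical.

```python
XOR_FLAG = 0x40000  # 19th bit of the header's flags field

def is_xor_encrypted(flags):
    return (flags & XOR_FLAG) != 0

def xor_decrypt(data):
    """Assumed scheme: plain[i] = cipher[i] ^ cipher[i - 1]."""
    out = bytearray(data)
    # Walk backwards so out[i - 1] is still the encrypted byte when used.
    for i in range(len(out) - 1, 0, -1):
        out[i] ^= out[i - 1]
    return bytes(out)

def xor_encrypt(data):
    """Inverse of xor_decrypt; the first byte is left untouched."""
    out = bytearray(data)
    # Walk forwards so out[i - 1] is already the encrypted byte.
    for i in range(1, len(out)):
        out[i] ^= out[i - 1]
    return bytes(out)
```

If the real scheme chains on the decrypted byte instead, the two loops simply swap roles.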
MTPA packets are fairly simple and just have the Shift-JIS text with each byte incremented by one for no apparent reason.
Note that pointers within the data are generally relative to the header. That is, if the header is 32 bytes, then 0x20 would point to the beginning of the data.
struct info_header (16 bytes)
uint32 unknown5 always 0x4000000f
uint32 pointer_count number of pointer records
uint32 data_size number of uint32s each data record is
uint32 data_count number of data records
struct unknown6[] repeats data_size times
uint32 unknown7 always <= 2
<pointer segment>
struct pointer_record[] repeats pointer_count times
uint32 data_pos pointer into data record segment
<data_segment>
struct data_record[] repeats data_count times
if data_size = 2
uint32 id id of voice data within OD_VOICE.AFS
uint32 text_pos position of text within text segment
if data_size = 4
uint32 flags1? unknown meaning, varies wildly
uint32 id id of voice data within OD_VOICE.AFS
uint32 text_pos position of text within text segment
uint32 flags3? 0x00 or 0x01 with 4 mysterious unique exceptions
struct unknown8[] always once?
uint32 unknown9 unknown meaning
** EACH BYTE INCREMENTED **
<text_segment>
struct text_record[] undetermined length?
ubyte* shiftjis text in shift jis, null terminated
** EACH BYTE INCREMENTED **
struct text_padding
ubyte* padding always 0x00 (padding to align to 4 bytes)
** EACH BYTE INCREMENTED **
struct footer_padding
uint32 padding always 0x00 (padding to align ENRS)
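Undoing the byte increment on an MTPA text record might look like this. It is a minimal sketch: `decode_mtpa_text` is a hypothetical helper that takes the raw bytes starting at a record's text_pos and assumes the terminating NUL was incremented along with the text.

```python
def decode_mtpa_text(raw):
    """Subtract one from each byte, then decode up to the first NUL."""
    plain = bytes((b - 1) & 0xFF for b in raw)
    end = plain.find(b"\x00")
    if end != -1:
        plain = plain[:end]
    return plain.decode("shift_jis")
```

Re-encoding is the reverse: encode as Shift-JIS, append the NUL, and add one to each byte.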
MXEC packets are quite complicated, but consistent.
Note that pointers within the data are generally relative to the header. That is, if the header is 32 bytes, then 0x20 would point to the beginning of the data.
uint32 unknown varies, xor, doesn't seem important
uint32 unknown always 0x60
uint32 something4_header_ptr 0x00 or pointer to something4 header.
uint32 something2_header_ptr 0x00 or pointer to something2 header.
uint32 unknown meaning unknown, 0x00/0x01.
uint32 unknown always 0x00 (ends at 24)
uint32 unknown sometimes 0x00 or 0x01? MAYBE something6_count??
uint32 something6_ptr pointer to something6 data.
uint32[9] unknown always 0x00 (ends at 68)
uint32 something1_count number of something1 records
uint32 unknown always 0xA0
uint32[13] unknown always 0x00
something1[] always something1_count of them.
uint32 id seems like an id, counts up...
uint32 type_ptr points to ascii identifier in file.
uint32 length length of data.
uint32 data_ptr points to beginning of data.
<something1 data> variable in size, ends at last data_ptr + last length.
uint32[] varies varies per record type.
(padded by 0x00 to a multiple of 16 bytes.)
<something4 header> (optional)
uint32 unknown always 0x00
uint32 something4_count number of something4 records.
uint32 something4_ptr pointer to something4 records.
uint32 unknown always 0x00
uint32[12] unknown always 0x00
something4[] always something4_count of them.
uint32 unknown increasing, appears pointer like? unknown meaning.
uint32 unknown optional text pointer, sometimes 0x00.
uint32 something5_count count of sub-something5's inside the something4.
uint32 something5_ptr pointer to the something5 records.
uint32[6] unknown always 0x00
uint32 unknown usually 0x00, sometimes 0x01?
uint32 weird_ptr 0x00 or pointer after text segment.
uint32[4] unknown always 0x00
something5[] always something5_count of them PER something4.
uint32 text_ptr pointer to an ascii identifier.
uint32 unknown seems like a number? maybe value for text_ptr.
uint32 data_ptr points to some extra data.
uint32 unknown always 0x00?
<something5 data> variable size?
uint32 unknown one per pointer?
(padded to a multiple of 16 bytes after ALL the something5s.)
<something2 header> (optional)
uint32 unknown always 0x00
uint32 something2_count count of something2 records.
uint32 something2_ptr pointer to something2 records.
uint32 something3_count count of something3 records.
uint32 something3_ptr pointer to something3 records.
uint32[3] unknown always 0x00
something2[] always something2_count of them.
uint32[2] unknown unknown meaning.
uint32 path_ptr pointer to path string.
uint32 filename_ptr pointer to filename string.
uint32[6] unknown unknown meaning.
something3[] always something3_count of them.
uint32 unknown increases, seems like a pointer? doesn't seem to match file?
(padded to a multiple of 16 bytes.)
<something6 data> (optional)
uint32 unknown unknown meaning.
(padded to a multiple of 16 bytes.)
<text starts here>
(padded to a multiple of 16 bytes.)
<weird data>
(padded to a multiple of 16 bytes.)
Not everything is understood. Some of the records have varying structure defined by their identifier pointer, and may have string pointers embedded within those structures.
The text itself is in Shift-JIS, null terminated.
Valkyria Chronicles 2 seems to use the same data files and format, in general. The primary difference is that rather than PGD-encrypting the DATA.BIN file, it encrypts each file within the CPK.
Each file within the CPK has a 16 byte header which serves as a key.
The file is treated as a series of sets of 4 uint32s, and uses the following basic algorithm:
uint32[4] key;
uint32[] data;
for (int i = 0; i < data.length; i++)
{
int key_i = i % 4;
key[key_i] = key[key_i] * 3 + 1;
data[i] ^= key[key_i];
}
However, when it hits EOFC packets or other boundaries, it appears to do something different, so this is not a complete description of the format.
TODO: Possibly these are int32s not uint32s, causing the discrepancy?
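For experimenting with the basic algorithm, here is a sketch of the same loop in Python. Several things are assumptions rather than known facts: the words are read as little endian, the 16-byte key is not part of the data being XORed, arithmetic wraps as unsigned 32-bit, and any trailing bytes that do not fill a uint32 are passed through unchanged. It does not handle the EOFC or boundary behavior mentioned above, and the name `vc2_decrypt` is made up.

```python
import struct

def vc2_decrypt(blob):
    """Apply the key stream to everything after the 16-byte key."""
    key = list(struct.unpack_from("<4I", blob, 0))
    body = blob[16:]
    count = len(body) // 4
    words = list(struct.unpack_from("<%dI" % count, body, 0))
    for i in range(count):
        k = i % 4
        # Python integers do not overflow, so wrap to 32 bits by hand.
        key[k] = (key[k] * 3 + 1) & 0xFFFFFFFF
        words[i] ^= key[k]
    return struct.pack("<%dI" % count, *words) + body[count * 4:]
```

Because the operation is a plain XOR against a key stream, running it again with the same key restores the input, which makes re-encryption straightforward.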