@OswaldHurlem
Created November 10, 2017 07:37
Question: What's a good technique for giving format versions to your data files?
Answer:
Hah now there’s a question :)
Depends a lot on what you want: do you want backwards compat? (almost certainly). forward compat? (less important usually, but sometimes useful). a self-describing format (ie can a dumb tool 'parse' it without lots of special knowledge)? is size on disk/wire important? is serialisation speed important?
options you should look at for inspiration, relating to versioning: the golang gob format (which goes to great lengths to be self describing); something like json, which is very simple & good for inter-operation between tools, also self describing, but costly in space/time; and cap'n proto, which emphasises speed and is essentially a better protocol buffers
protobuf & capnproto both take the approach of having a limited 'data description' format, aka a schema, which you must have to understand the data stream. but the schema is simple enough to 'compile' that you can easily write tools to parse/compile it and thus understand data.
so there's a spectrum of choices. tearaway took the route of something like golang gobs.
before tearaway, there was what I'll call the LBP method. and despite it having some major drawbacks, it is so simple, yet so powerful, I keep coming back to it.
I tried various extensions in early dreams, but current dreams is pure lbp method.
I'll describe it now; take it or leave it. it's very opinionated and has some sharp issues, but it's very simple. it has the highest bang for buck of any versioning system I've come across.
you have a single version number, that starts at 1; every time you make ANY change to ANY structure, you have to bump it.
you define your data structures recursively. so for basic types, like int, float, vectors, matrices, you just dump them out.
this is a 0 overhead format. there are no fields at all. just the data. it is the opposite of self describing: you need the C code to read/write it. the C code, is the data format.
to save a structure, you just write out each of its fields in turn.
now comes the twist.
we declare a C(++) function for each type, called Serialise(R r, D &d) { }
R is the structure that holds the file handle or buffer, or whatever. D is the thing we are serialising
so you have Serialise(R r, float &f)
Serialise(R r, int &i)
Serialise(R r, char *&string)
etc for every type in your entire game.
sorry yes R is a ref too
my bad
you even have Serialise(R r, <every container type ever>)
and pointers, and whatever you want.
each of these serialise functions can do what they want, but we use the same function for reading and writing.
that is, r has a member IsReading() and IsWriting(); the leafiest functions turn into if (r.IsReading()) { fread(r.file, ...blah); } else { fwrite(r.file, ...blah); }
kind of thing.
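to make that concrete, here's a minimal sketch of what R and a couple of leaf functions might look like (the names here are illustrative guesses, not the actual LBP/Dreams code):

#include <stdio.h>

struct R {
    FILE *file;
    int version;     // latest version when writing; the file's version when reading
    bool reading;
    bool IsReading() const { return reading; }
    bool IsWriting() const { return !reading; }
};

// the leafiest functions: one function per basic type, doing both directions
void Serialise(R &r, int &i)   { if (r.IsReading()) fread(&i, sizeof i, 1, r.file); else fwrite(&i, sizeof i, 1, r.file); }
void Serialise(R &r, float &f) { if (r.IsReading()) fread(&f, sizeof f, 1, r.file); else fwrite(&f, sizeof f, 1, r.file); }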
now versioning.
r also has a .version integer field. when writing, you set this to the 'current' latest version.
when reading, you load it from the header of the file. this way we are going to have perfect backwards compatibility, but no forwards compat. (that is, the code must be at least as new as the file it is reading)
to implement Serialise for some structure, you just call Serialise on each member.
but if you happen to add a field, you just say if (r.revision>5) Serialise(r, d.mynewfield)
ie you use C to control the 'upgrade'.
we wrap this in a macro called ADD, #define ADD(revision, field) if (r.version>=revision) Serialise(r, d.field);
you can also write REM(added, removed, type, field, default) type field=default; if (r.version>=added && r.version<removed) Serialise(r, field);
with ADD and REM, you basically document the addition and removal of any fields. note that REM declares a local variable that receives either the old data (that was removed), or a default value.
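written out as actual macros, that might look something like this (a sketch based on the definitions above; it assumes the struct parameter inside each Serialise function is called d, and renames the 'default' parameter to default_value for clarity):

#define ADD(revision, field) \
    if (r.version >= revision) Serialise(r, d.field);

#define REM(added, removed, type, field, default_value) \
    type field = default_value; \
    if (r.version >= added && r.version < removed) Serialise(r, field);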
now that's it. but those building blocks are insanely powerful
as an example of something you just can't do with protobufs, imagine you had two bools, and you want to remove them and replace them with a single integer bitfield.
say you added b1 on rev 3, added b2 on rev 5, and then want to remove both on rev 7, and replace them with a bitfield;
you'd write
REM(3,7,bool,b1,false);
REM(5,7,bool,b2,true); // b2 defaults true if the file doesn't have it
if (r.revision<7) d.newbitmask=(b1?1:0)|(b2?2:0)
this is defining the functions that literally load and save your data. every time.
finally, ADD(7,newbitmask)
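pulled together inside one Serialise function, that upgrade might read like this (a sketch; 'Widget' is just an illustrative type name, and I'm using r.version for the version field as in the ADD macro above):

void Serialise(R &r, Widget &d) {
    // ... earlier fields ...
    REM(3, 7, bool, b1, false);   // b1 existed in revs 3..6
    REM(5, 7, bool, b2, true);    // b2 existed in revs 5..6; defaults true if the file doesn't have it
    if (r.version < 7)
        d.newbitmask = (b1 ? 1 : 0) | (b2 ? 2 : 0);   // upgrade: fold the old bools into the new bitfield
    ADD(7, newbitmask);           // the bitfield itself, present from rev 7 onwards
}

note that when writing, r.version is the latest revision, so the REMs and the upgrade branch are all skipped and only ADD(7, newbitmask) does anything.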
basically the power is that you just have a function per type, two macros (ADD, REM) and the opportunity to write arbitrary upgrade code in C. as a versioning system, the power per concept is insanely high.
because you don't want to write two functions for every type in your whole system, and have them go out of sync
here is some real code from dreams; most of these macros expand to a simple C function definition
SERIALISE_TYPE(RayHitSeq) {
ADD(SR_HOVER_WIRES, id);
ADD(SR_HOVER_WIRES, dist);
ADD(SR_HOVER_WIRES, spline_t);
} SERIALISE_END
SERIALISE_TYPE becomes void Serialise(R &r, type &d)
SERIALISE_END is nothing. ADD is as above. SR_ is just an enum of revisions.
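so those two wrapper macros might be defined roughly like this (my guess from the description, not the real definitions):

#define SERIALISE_TYPE(T)  void Serialise(R &r, T &d)
#define SERIALISE_END      // expands to nothing; it just marks the end of the block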
we are at revision over 900. LittleBigPlanet 3 is backwards compatible with LittleBigPlanet 1, over several thousand versions.
at runtime, your callstack becomes the nesting of your data structures. so if you put a breakpoint in your file read (fread) function, you see a callstack that looks like Serialise(...int...) called by Serialise(...some structure...) called by Serialise(....some bigger structure...) and so on
and the leafiest Serialise is doing if (reading) fread() else fwrite()
here's a random paste of a slightly more complex case:
ADD(SR_PHYSICAL_COLLIDABLE, Movable)
ADD(SR_PHYSICAL_COLLIDABLE, Collidable)
ADD(SR_IMP_INTERACTION, ImpInteraction)
if(r.IsReading() && r.revision < SR_PHYSICAL_COLLIDABLE)
{
    t->Movable = ((int)t->h.stamped_from & 0x01) != 0; // was physical_deprecated;
    t->Collidable = ((int)t->h.stamped_from & 0x10) != 0; // was collidable_deprecated;
    t->h.stamped_from = EInventoryType::EditModeInventory;
}
ADD(SR_PHYSICAL_DENSITY, Density)
ADD(0, Power)
ADD_CONDITIONAL(0, !t->Movable || r.revision>=SR_PHYSICS_FIX,Pos) // DS: I deeply regret that this was done! the 'physical' flag was being used wildly inconsistently! It means 'glued to group'
ADD(0, Colour)
REM(0, SR_BRUSHSTROKE_SIZES, BubblesBrushing)
here's our SR enum: (start of...) enum {
SR_INITIAL=0,
SR_LATE_POSITIONS=4,// 4 ae: move positions into later part of file
SR_STRIP_MPH=5,// ldv: add strip_mph option + mph arrays into game chunks
SR_REMOVE_THINGFLAGS=6,// ae: remove thingflags lol
SR_SYNC_HASH=7,// ae: add sync_hash
SR_PLAYERMASK_SELECTED=8,// ae: playermask_selected
SR_PARENT_THINGIDX=9,// ae: parents are thingidx
SR_REMOVE_SALT=10,// ae: remove salt
SR_UNDOREDO_COUNTER=11,// ldv: add undo redo counter
SR_GO_TO_SNAPSHOT=12,// ldv: add snapshot idx to switch to next game update
SR_RAYHIT_TYPE=13,// ldv: change rayhit from Seq+dist to RayHit, capable of hitting bokeh
ok so the sharp issues are:
you need to compile our C code base to make sense of the binary format. got a binary file for dreams? you need a C compiler and a few thousand lines of C code to be able to read it.
(workaround: compile that code into a generic binary<->json converter. when you need to inter-operate with dumb tools, use that tool as a translator to and from dumb-land)
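one way that could be wired up (purely a sketch of the idea, not the actual Dreams converter): give R a third mode, so the exact same Serialise functions that define the binary format also drive a text/json dump (the json-to-binary direction would need a reading counterpart, omitted here):

#include <stdio.h>

struct R {
    FILE *file;
    int version;
    enum Mode { READ_BIN, WRITE_BIN, WRITE_JSON } mode;
    bool IsReading() const { return mode == READ_BIN; }
    bool IsWriting() const { return mode != READ_BIN; }
};

void Serialise(R &r, int &i) {
    switch (r.mode) {
    case R::READ_BIN:   fread(&i, sizeof i, 1, r.file);  break;
    case R::WRITE_BIN:  fwrite(&i, sizeof i, 1, r.file); break;
    case R::WRITE_JSON: fprintf(r.file, "%d,\n", i);     break;   // dumb-land output
    }
}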
second sharp issue: you absolutely must preserve a simple linear ordering of revisions.
you have to 'grab' the next revision. if your collaborator next to you wants to use that revision and publish some data, tough shit.
this puts a limit on team size. it's less of an issue than it looks. it scales fine to 20 coders, but probably not beyond.
third sharp issue: no forward compat. if the game you have was compiled for revision 900, and you hand it a file from revision 901, it can tell it's from the future but cannot load it at all. the file is not self describing *at all*. it needs the code for revision 901 to understand it, so all it can do is give up and say 'cant load file from the future'. (actual dreams error message)
other than that, it's fucking awesome.
completely arbitrary upgrades/downgrades, and believe me, we do some pretty drastic re-organisations and it's never ever been an issue. like, total restructuring of entire data structures. no other system copes with that. next: 0 bytes wasted on keys/structure/etc. it's just pure data, all the way. last: it's relatively fast: not as fast as 'load directly into ram and fix pointers because I know all the data on disk matches the in-ram format perfectly', but faster than almost any 'look up key names in hash tables and shit'.
we can churn through a few gigs a second on a ps4 and we haven't had to optimise it as a result.
ie it's faster than our hard drives and network, so it's not the bottleneck
sorry - you asked a deep question :) and I happen to have been around this particular issue a *lot*.
still recommend looking at the golang gob design (there's a nice doc online) as a counterpoint, for something self describing.
funnily enough, after LBP2, where we tried to have branchable revision numbers (don't do it) and then I tried per-structure revisions (don't do it) and then tearaway did full self describing (totally limiting in what you can now upgrade), I was trying so hard to find a 'fancier' / 'better' system that tackled some of the thorny issues. but so far, every alternative ends up more complex. and in the end, the LBP system is just. so damned simple. you can write it in an afternoon.
the beauty is that latest code can read old data, older custom LBP levels get converted 'on the fly' as a side effect of the code.
when the code reads 'fresh' data, almost every if (revision<blah) will *not* be taken, because the revision from the file will == the highest revision ever seen.
we just add 'em in, and if you don't do it right (it's not hard, but beginners do fuck it up), the game just falls apart instantly.
it fails early & hard.
what I do is insert, every few hundred bytes or so, a magic integer that just increments. the loader expects to see those, and checks they are incrementing. if they are, all good. else bail: SOMEONE COCKED UP SERIALISATION, ASSERT ASSERT
in other words, every now and then (actually: at the end of every 'big' structure), I effectively scatter this code { int check=r.counter; Serialise(r,check); ASSERT(check==r.counter, "uhoh"); r.counter++; }
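written as a macro you can drop at the end of each big structure's Serialise, that might look like this (a sketch; counter is assumed to be another int on R, starting at 0 for both reader and writer, and SomeBigStruct/foo/bar are made-up names):

#define SERIALISE_CHECK() do { \
    int check = r.counter;      /* writer: writes the running counter into the file */ \
    Serialise(r, check);        /* reader: overwrites check with whatever the file says */ \
    ASSERT(check == r.counter, "someone cocked up serialisation"); \
    r.counter++; \
} while (0)

SERIALISE_TYPE(SomeBigStruct) {
    ADD(SR_INITIAL, foo);
    ADD(SR_SYNC_HASH, bar);
    SERIALISE_CHECK();          /* end-of-structure marker */
} SERIALISE_END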
if the assert fires, it means someone cocked up the upgrade code or changed a structure without bumping the revision, and the culprit will be between this check and the last one.
which means you pretty much instantly get feedback that someone made a mistake, and you get an assert close to the mistake. so it takes a few minutes to find and fix. but we get these maybe once a month at most at mm, with 10 coders actively making changes.
I had a look the other day, we're currently bumping revision about 7 times a day between us.
the flow goes like this
edit master_version_list.h, add a new line SR_ALEX_COOL_SHIT=666; add lines to serialise functions: ADD(SR_ALEX_COOL_SHIT, newhotness); REM(SR_OLD, SR_ALEX_COOL_SHIT, int, olddumbness, 0); etc
then update from source control. get a conflict in master_version_list.h. fuck, liam took 666!
never mind, I'll make mine be 667. but! now I must throw away my test data that assumed 666 was mine. but that's no big sweat.