colstrom/README.md Secret

## README.md

      
Discovery Process for Undocumented Binary Data Formats

First, look at the file extension. Is it something normal-looking? Maybe there's a
crate you can use instead of writing your own. If not, check it with the
`file` command (from libmagic) or the
`infer` crate. Maybe it's a common format
with a weird extension. Maybe it's compressed? If so, see if there's a crate for
whatever compression format it uses. If so, decompress it and repeat this process.
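
If you just want a quick first check before reaching for `file` or a crate, a minimal sketch like the one below compares the start of the file against a handful of well-known compression magic numbers. The function name `sniff_compression` and the choice of formats are illustrative assumptions, and the list is nowhere near complete.

```rust
/// Checks the first bytes of `data` against a few well-known compression
/// magic numbers. This is only a tiny, hand-picked list (an assumption for
/// illustration), not a replacement for a real detection tool.
fn sniff_compression(data: &[u8]) -> Option<&'static str> {
    const MAGICS: &[(&[u8], &str)] = &[
        (&[0x1F, 0x8B], "gzip"),
        (b"BZh", "bzip2"),
        (&[0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00], "xz"),
        (&[0x28, 0xB5, 0x2F, 0xFD], "zstd"),
    ];
    for &(magic, name) in MAGICS {
        if data.starts_with(magic) {
            return Some(name);
        }
    }
    None
}
```
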
If not...

Run `xxd` on the file and just visually scan
the output. You're looking for long runs of zeroes, which might be padding. At
some particular output width, these might stand out more than others. Play
around with the `-c` option (try 32 and 64, for example).

If you can see a few clearly padded sections early on, that's a good sign that
it's not a compressed format (since that would be a waste of space) or an
encrypted format (which would generally look more like noise).
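
Once the eyeballing gets tedious, the same search for zero runs can be done in code. This is a rough sketch, assuming the whole file fits in memory; `zero_runs` and its `min_len` parameter are names made up for this example.

```rust
/// Returns `(start_offset, length)` for every run of zero bytes that is at
/// least `min_len` bytes long. `min_len` is a hypothetical tuning knob:
/// short runs are usually just small values, not padding.
fn zero_runs(data: &[u8], min_len: usize) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &b) in data.iter().enumerate() {
        match (b == 0, start) {
            // A zero byte outside a run opens a new run.
            (true, None) => start = Some(i),
            // A non-zero byte closes the current run.
            (false, Some(s)) => {
                if i - s >= min_len {
                    runs.push((s, i - s));
                }
                start = None;
            }
            _ => {}
        }
    }
    // A run may extend all the way to the end of the file.
    if let Some(s) = start {
        if data.len() - s >= min_len {
            runs.push((s, data.len() - s));
        }
    }
    runs
}
```
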

If you're lucky, a consistent byte alignment may be visible in the output.
Otherwise, guessing is fine. 4 bytes is a pretty reasonable guess, since most
stuff doesn't need really big numbers. The age of the program that creates or
consumes the data may provide a hint here, as may the context of what the
format is used for. 64-bit floats might be more useful in an audio
file than a model file, for example.

Now look at the sections you have. Remember that some program needs to read this
data, and it doesn't do so randomly. Most likely, one of two things is true:

1. The file is so rigidly structured that whatever reads it can rely on
   everything always being in the same place.
2. The file provides some way for whatever reads it to know where to find
   things that it wants to find. There's a finite number of things it needs
   to find, but how many of each may vary from file to file.

If you have multiple samples available, you can look at the sizes of each file.

- If all the sizes are the same (in a sufficiently large set of samples), then
  Scenario 1 is more likely.
- If the sizes are all completely different, then Scenario 2 is more likely.
- If some sizes differ, but there are a limited number of variations (say all
  files are EITHER 20 KB OR 28 KB OR 40 KB, for example), then Scenario 1 is more
  likely, but you may have multiple versions of a format that evolved over time.
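
The size comparison can be sketched in a few lines. Everything here is an assumption for illustration: the `Layout` enum, the `guess_layout` name, and especially the cutoff for "a limited number of variations" (at least four samples per distinct size), which is arbitrary and worth tuning to your sample set.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum Layout {
    /// Scenario 1: every sample has the same size.
    LikelyFixed,
    /// Scenario 1, but probably several versions of the format.
    LikelyVersioned,
    /// Scenario 2: sizes vary freely.
    LikelyVariable,
}

/// Guesses the layout style from the sizes (in bytes) of the sample files.
/// With zero or one sample this trivially reports `LikelyFixed`, which
/// means nothing; the guess is only as good as the sample set.
fn guess_layout(sizes: &[u64]) -> Layout {
    let distinct: BTreeSet<u64> = sizes.iter().copied().collect();
    if distinct.len() <= 1 {
        Layout::LikelyFixed
    } else if distinct.len() * 4 <= sizes.len() {
        // Arbitrary cutoff: few distinct sizes relative to the sample count.
        Layout::LikelyVersioned
    } else {
        Layout::LikelyVariable
    }
}
```
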

You can also use some intuition here by thinking about what the file is used for.


- For example, saved game data files often have a fixed size and layout, so they
  can pre-allocate disk space on first save, so that the game doesn't have to
  worry (as much) about the system running out of space mid-game.

- A model file, for instance, is more likely to have a variable number of whatever
  things it has, like vertices or curves or shapes or stuff like that. So it's
  more likely to be Scenario 2. But that format may ALSO need to support new
  features as the data format evolved with the program that produces it.


If you have multiple samples available, compare the rough shape of the data,
such as where sections are, how big they are, where the data is and where the
padding is, etc.

Note that we are not looking at the specific data here, just the general shape of it.

With enough samples to compare, it can be fairly obvious when it's Scenario 1.
If so, the data types of each section are likely to be consistent. Look at the
byte alignment for hints.

Otherwise, if it's Scenario 2, focus on the sections early on in the file.
Remember that whatever needs to read this type of file needs to know where to
find things. At least that data needs to be somewhere consistent, and that's
usually at the beginning of the file, at some fixed location. Look for sections
that start at the same position in all the samples you have, even if the total
size of that section varies from file to file.

Thinking from the perspective of whatever program reads this file, a few key
pieces of information are required to be able to read things.

1. Which types of things are in this file.
2. How big a thing of that type is.
3. Where to find things of that type.
   - Worst case, where each thing specifically is.
   - More likely, things of the same type are stored together. This is more
     efficient to access, and easier to implement. If you're rolling your own data format...
4. How many of each thing there are.

Now, items 1 and 2 can be known to the program itself, and may not be present in
the data. Items 3 and 4 cannot be known to the program until it reads the data,
since this data format does not have a fixed number of slots for each data type.

So we're looking for two sections. One that describes offsets, and one that
describes counts. These sections are related, so it's likely that among the
sections you found, two of them have the same number of entries.

But how to tell which is which?

Look at the minimum and maximum values within each section. Whichever section
has the offsets should be pointing to things that come after it, so the minimum
value should be no less than the position of the last entry in that section. The
maximum value should be no greater than the total length of the file.

We can also reason that the offset section could contain the offset of the
counts section, but not vice-versa, so it is likely to be the earlier section.

Once you have identified which section contains the offsets, see if any of those
match with the start of the counts section. If we're correct so far, one should.

With these two pieces of information in hand, we have what we need to find
everything else in the file, just not to know what any of those things are.
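
The min/max test for the offsets section is easy to mechanize. The sketch below assumes 4-byte, little-endian, unsigned entries, which the file may or may not actually use; the helper names `read_u32s`, `looks_like_offsets`, and `points_at` are invented for this example.

```rust
/// Reads consecutive little-endian u32 values from `bytes`
/// (assumption: 4-byte little-endian entries; trailing bytes are ignored).
fn read_u32s(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// True if every value is no less than the position of the section's last
/// entry and no greater than the total length of the file.
fn looks_like_offsets(values: &[u32], section_start: usize, file_len: usize) -> bool {
    if values.is_empty() {
        return false;
    }
    let last_entry = section_start + (values.len() - 1) * 4;
    values
        .iter()
        .all(|&v| (v as usize) >= last_entry && (v as usize) <= file_len)
}

/// True if one of the offsets is exactly `target`, for example the start
/// of the section you suspect holds the counts.
fn points_at(offsets: &[u32], target: usize) -> bool {
    offsets.iter().any(|&v| v as usize == target)
}
```

Run `looks_like_offsets` over both candidate sections; ideally only one of them passes, and `points_at` then confirms that it refers to the other.
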

As an aside, check the offsets of the sections you were eyeballing earlier in
the process. How many of them line up with one of the entries in the
offsets section? How many don't? We don't need to eyeball it anymore, since we
have a proper map now, but it can still be nice to see that the rough map was
pointing in the right direction, even if not perfectly.

Now, what we need to do is figure out what sort of data each of those sections
contains.

We have a list of offsets for each section, right? And by definition, each must
end before the next one begins. This tells us the maximum size of each section.
There may be some padding at the end of a section, for byte alignment and such.

What we don't yet have is a size of each entry in a section.
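
Working out the maximum size of each section from the offsets could look something like this. It is a sketch under the assumption that offsets fit in a `u32` and that the last section runs to the end of the file; `max_section_sizes` is a made-up name.

```rust
/// Given section start offsets (in any order) and the total file length,
/// returns `(start, max_size)` pairs sorted by start. Each section is
/// assumed to end where the next one begins; the last one is assumed to
/// run to the end of the file. Sizes may include trailing padding.
fn max_section_sizes(offsets: &[u32], file_len: u32) -> Vec<(u32, u32)> {
    let mut sorted = offsets.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    sorted
        .iter()
        .enumerate()
        .map(|(i, &start)| {
            let end = sorted.get(i + 1).copied().unwrap_or(file_len);
            (start, end.saturating_sub(start))
        })
        .collect()
}
```
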

We do have a list of counts, but which count applies to which section?

Now, one of two things is likely true:

- The offset list and count list are both in the same order, and the program
  only needs to know how many entries are in both lists (which is the same).
- The offset list and count list are ordered differently, in which case ALL
  programs expected to read this data need to know that order, or how to find it.

We can do a quick sanity check here by looking at the offset list, because we
know that it refers to positions within the file. Is each entry in the offset
list greater than the last?

- If so, they are listed in order. It is likely that the counts are as well.
- If not, they are out of order, and it would be very inconvenient if the
  counts were ordered differently.
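
That ordering check is a one-liner. The function name is an assumption; the only real logic is comparing each offset with the one before it.

```rust
/// True if each offset is greater than the one before it.
fn is_strictly_increasing(offsets: &[u32]) -> bool {
    offsets.windows(2).all(|pair| pair[0] < pair[1])
}
```
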

We can then check our assumed count order by dividing the size of each section
by the count, and seeing if the numbers look reasonable.

- Would that give us something weird like entries that are 3 bytes wide, or are
  any of the counts larger than the number of bytes in that section? If so, the
  fields are probably out of order.
  - In that case, we can step through the counts one at a time, mapping out
    which sections they can or can't apply to, somewhat like a game of Sudoku.
    - This may end up with a few that are ambiguous, such as multiple sections
      with the same count.
    - You can likely resolve this ambiguity if you have enough data files.
      Focus on ones that have the same sized offset section. Remember, a program
      needs to read this data. The order isn't random. It's likely consistent,
      at least for a given version of the data format.
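
Here is a rough sketch of the divide-and-check step and the Sudoku-style elimination. What counts as a "reasonable" entry size is an assumption of this example (1, 2, or a multiple of 4 bytes), and both `entry_size` and `candidates` are hypothetical names.

```rust
/// Returns the implied entry size if `count` entries could plausibly live
/// in a section of `section_size` bytes. Integer division tolerates some
/// trailing padding. "Plausible" here means 1, 2, or a multiple of 4 bytes,
/// which is a guess, not a rule.
fn entry_size(section_size: u32, count: u32) -> Option<u32> {
    if count == 0 || count > section_size {
        return None;
    }
    let size = section_size / count;
    if size == 1 || size == 2 || size % 4 == 0 {
        Some(size)
    } else {
        None
    }
}

/// For each count, lists the indices of the sections it could apply to.
/// A count with exactly one candidate is settled; the rest need more
/// samples to disambiguate.
fn candidates(section_sizes: &[u32], counts: &[u32]) -> Vec<Vec<usize>> {
    counts
        .iter()
        .map(|&count| {
            section_sizes
                .iter()
                .enumerate()
                .filter(|&(_, &size)| entry_size(size, count).is_some())
                .map(|(index, _)| index)
                .collect()
        })
        .collect()
}
```
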


Once we've determined the size of the entries in each section, we have all four
pieces of information needed to read everything else in the file.

Next, we need to figure out what these things are.

Let's split out data types into things that are numbers, and things that are not.
The size gives us a clue. If it's really big, it's probably not just a number.
The `xxd` output from earlier can hint at this. These are often Strings.

For the numbers, we need to figure out what TYPE of number it is. There are some
heuristics we can use to make non-terrible guesses, but it is still guesswork at
this stage. Once we have context, we can refine these guesses, and with
sufficient sample data, we can even be reasonably accurate.

Let's assume for convenience that all the numbers are aligned to 4-byte boundaries.
Take all the numbers in a given section, and parse them as the three most
likely types of that size: `u32`, `i32`, and `f32`.

Keep track of the largest and smallest value found in the whole section as both
integers (`i64` can fit `u32::MAX`) and floats.
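
A minimal sketch of that triple parse is below. It assumes little-endian data, and it tracks the unsigned and signed ranges separately rather than in a single `i64` pair, which is just one way to do the bookkeeping. The `Ranges` struct and `scan` function are invented for this example.

```rust
/// Minimum and maximum of a section under each 4-byte interpretation.
#[derive(Debug)]
struct Ranges {
    u_min: u32,
    u_max: u32,
    i_min: i32,
    i_max: i32,
    f_min: f32,
    f_max: f32,
    /// How many values parsed as NaN or an infinity.
    non_finite: usize,
}

/// Parses every 4-byte chunk as u32, i32, and f32 (little-endian is an
/// assumption) and records the range seen for each interpretation.
/// For an empty section the ranges keep their inverted starting values.
fn scan(section: &[u8]) -> Ranges {
    let mut r = Ranges {
        u_min: u32::MAX,
        u_max: u32::MIN,
        i_min: i32::MAX,
        i_max: i32::MIN,
        f_min: f32::INFINITY,
        f_max: f32::NEG_INFINITY,
        non_finite: 0,
    };
    for c in section.chunks_exact(4) {
        let raw = [c[0], c[1], c[2], c[3]];
        let (u, i, f) = (
            u32::from_le_bytes(raw),
            i32::from_le_bytes(raw),
            f32::from_le_bytes(raw),
        );
        r.u_min = r.u_min.min(u);
        r.u_max = r.u_max.max(u);
        r.i_min = r.i_min.min(i);
        r.i_max = r.i_max.max(i);
        if f.is_finite() {
            r.f_min = r.f_min.min(f);
            r.f_max = r.f_max.max(f);
        } else {
            // NaN and infinities are counted, not folded into the range.
            r.non_finite += 1;
        }
    }
    r
}
```
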

- Did any of the values parse as `f32::NAN`, `f32::INFINITY`, or `f32::NEG_INFINITY`?
  - If so, this section probably doesn't contain floats.
- Were most of the values really small? Like 0.0000000000001 kind of small?
  - If so, this section probably doesn't contain floats.
- Was the minimum value negative?
  - The section is probably signed integers or really big unsigned integers.
  - Look at the rest of the values for clues.
- Was the minimum 0 and the maximum 1?
  - The section probably contains `bool` values.
- Was the minimum 0?
  - The section probably does not contain signed integers.
- Were the minimum and maximum values both between the appropriate N-bit `::MIN` and `::MAX`?
  - The section probably contains either `uN` or `iN` depending on other heuristics.
- Were the minimum and maximum values both 0?
  - It's really hard to tell what this is, then. It looks like padding, but it's
    within one of the sections included in the offsets list. Check other samples to
    get a better idea.
  - It may be a `bool` that's just usually false.
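
Several of these heuristics can be strung together into a first-pass guesser. This is a sketch, not a classifier to trust: the order of the checks, the `TINY` cutoff, the "more than half" threshold, and the `Guess` and `guess_type` names are all assumptions of this example. It takes raw 32-bit words that have already been decoded from the file, and it does not attempt the narrower `uN`/`iN` width check.

```rust
#[derive(Debug, PartialEq)]
enum Guess {
    /// Minimum and maximum are both 0: padding, or a bool that is usually false.
    AllZero,
    Bool,
    /// Nothing ruled floats out. That is not proof that they are floats.
    FloatPlausible,
    SignedOrHugeUnsigned,
    Unsigned,
}

/// Arbitrary cutoff for "really small" float values.
const TINY: f32 = 1e-12;

/// Returns a first guess for a section of raw 32-bit words, or `None`
/// if the section is empty.
fn guess_type(words: &[u32]) -> Option<Guess> {
    let u_min = *words.iter().min()?;
    let u_max = *words.iter().max()?;
    // Reinterpret the same bits as signed to find the signed minimum.
    let i_min = words.iter().map(|&w| w as i32).min()?;

    if u_max == 0 {
        return Some(Guess::AllZero);
    }
    if u_min == 0 && u_max == 1 {
        return Some(Guess::Bool);
    }

    let floats: Vec<f32> = words.iter().map(|&w| f32::from_bits(w)).collect();
    let non_finite = floats.iter().any(|f| !f.is_finite());
    let tiny = floats
        .iter()
        .filter(|&&f| f != 0.0 && f.abs() < TINY)
        .count();
    // Floats are ruled out by any NaN or infinity, or by mostly tiny values.
    let floats_ruled_out = non_finite || tiny * 2 > floats.len();

    Some(if !floats_ruled_out {
        Guess::FloatPlausible
    } else if i_min < 0 {
        Guess::SignedOrHugeUnsigned
    } else {
        Guess::Unsigned
    })
}
```

Small integers reinterpreted as `f32` come out as extremely tiny values, and small negative integers come out as NaN, which is why those two checks do most of the work here.
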