Skip to content

Instantly share code, notes, and snippets.

@colstrom
Last active November 16, 2023 03:25
Show Gist options
  • Save colstrom/f671d1583662de47b505a42a75b3a44b to your computer and use it in GitHub Desktop.
Save colstrom/f671d1583662de47b505a42a75b3a44b to your computer and use it in GitHub Desktop.
Discovery Process for Undocumented Binary Data Formats

Discovery Process for Undocumented Binary Data Formats

First, look at the file extension. Is it something normal-looking? Maybe there's a crate you can use instead of writing your own. If not, check it with the file command (from libmagic) or the infer crate. Maybe it's a common format with a weird extension. Maybe it's compressed? If so, see if there's a crate for whatever compression format it uses. If so, decompress it and repeat this process.

If not...

Run xxd on the file and just visually scan the output. You're looking for long runs of zeroes, which might be padding. At some particular output width, these might stand out more than others. Play around with the -c option (try 32 and 64, for example).

If you can see a few clearly padded sections early on, that's a good sign that it's not a compressed format (since that would be a waste of space) or an encrypted format (which would generally look more like noise).

If you're lucky, a consistent byte alignment may be visible in the output. Otherwise, guessing is fine. 4 bytes is a pretty reasonable guess, since most stuff doesn't need really big numbers. The age of the program that creates or consumes the data may provide a hint here, plus inferring from context about what the format is used for. 64-bit floats might be more useful in an audio file than a model file, for example.

Now look at the sections you have. Remember that some program needs to read this data, and it doesn't do so randomly. Mostly likely, one of two things is true:

  1. The file is so rigidly structured that whatever reads it can rely on everything always being in the same place.
  2. The file provides some way for whatever reads it to know where to find things that it wants to find. There's a finite number of things it needs to find, but how many of each may vary from file to file.

If you have multiple samples available, you can look at the sizes of each file.

  • If all the sizes are the same (in a sufficiently large set of samples), then Scenario 1 is more likely.
  • If the sizes are all completely different, then Scenario 2 is more likely.
  • If some sizes differ, but there are a limited number of variations (say all files are EITHER 20kb OR 28kb OR 40kb, for example), then Scenario 1 is more likely, but you may have multiple versions of a format that evolved over time.

You can also use some intuition here by thinking about what the file is used for.

  • For example, saved game data files often have a fixed size and layout, so they can pre-allocate disk space on first save, so that the game doesn't have to worry (as much) about the system running out of space mid-game.

  • A model file, for instance, is more likely to have a variable number of whatever things it has, like vertices or curves or shapes or stuff like that. So it's more likely to be Scenario 2. But that format may ALSO need to support new features as the data format evolved with the program that produces it.

If you have multiple samples available, compare the rough shape of the data, such as where sections are, how big they are, where the data is and where the padding is, etc.

Note that we are not looking at the specific data here, just the general shape of it.

With enough samples to compare, it can be fairly obvious when it's Scenario 1. If so, the data types of each section are likely to be consistent. Look at the byte alignment for hints.

Otherwise, focus on the sections early on in the file if it's Scenario 2. Remember that whatever needs to read this type of file needs to know where to find things. At least that data needs to be somewhere consistent, and that's usually at the beginning of the file, at some fixed location. Look for sections that start at the same position in all the samples you have, even if the total size of that section varies from file to file.

Thinking from the perspective of whatever program reads this file, a few key pieces of information are required to be able to read things.

  1. Which types of things are in this file.
  2. How big a thing of that type is.
  3. Where to find things of that type.
    1. Worst case, where each thing specifically is.
    2. More likely, things of the same type are stored together. This is more efficient to access, and easier to implement. If you're rolling your own data format...
  4. How many of each thing there are.

Now, 1 and 2 can be known to the program itself, and may not be present in the data itself. 3 and 4 cannot be known to the program until it reads the data, since this data format does not have a fixed number of slots for each data type.

So we're looking for two sections. One that describes offsets, and one that describes counts. These sections are related, so it's likely that among the sections you found, two of them have the same number of entries.

But how to tell which is which?

Look at the minimum and maximum values within each section. Whichever section has the offsets should be pointing to things that come after it, so the minimum value should be no less than the position of the last entry in that section. The maximum value should be no greater than the total length of the file.

We can also reason that the offset section could contain the offset of the counts section, but not vice-versa, so it is likely to be the earlier section.

Once you have identified which section contains the offsets, see if any of those match with the start of the counts section. If we're correct so far, one should.

With these two pieces of information in hand, we have what we need to find everything else in the file, just not to know what any of those things are.


As an aside, check the offsets of the sections you were eyeballing earlier in the process. How many of them line up with one of one of the entries in the offsets section? How many don't? We don't need to eyeball it anymore, since we have a proper map now, but it can still be nice to see that the rough map was pointing in the right direction, even if not perfectly.


Now, what we need to do is figure out what sort of data each of those sections contains.

We have a list of offsets for each section, right? And by definition, each must end before the next one begins. This tells us the maximum size of each section. There may be some padding at the end of a section, for byte alignment and such.

What we don't yet have is a size of each entry in a section.

We do have a list of counts, but which count applies to which section?

Now, one of two things is likely true:

  1. The offset list and count list are both in the same order, and the program only needs to know how many entries are in both lists (which is the same).
  2. The offset list and count list are ordered differently, in which case ALL programs expected to read this data need to know that order, or how to find it.

We can do a quick sanity check here by looking at the offset list, because we know that it refers to positions within the file. Is each entry in the offset list greater than the last?

  • If so, they are listed in order. It is likely that the counts are as well.
  • If not, they are out of order, and it would be very inconvenient if the counts were ordered differently.

We can then check our assumed count order by dividing the size of each section by the count, and seeing if the numbers look reasonable.

  • Would that give us something weird like entries that are 3 bytes wide, or are any of the counts larger than the number of bytes in that section? If so, the fields are probably out of order.
    • In that case, we can step through the counts one at a time, mapping out which sections they can or can't apply to, somewhat like a game of Sudoku.
      • This may end up with a few that are ambiguous, such as multiple sections with the same count.
      • You can likely resolve this ambiguity if you have enough data files. Focus on ones that have the same sized offset section. Remember, a program needs to read this data. The order isn't random. It's likely consistent, at least for a given version of the data format.

Once we've determined the size of the entries in each section, we have all four pieces of information needed to read everything else in the file.


Next, we need to figure out what these things are.

Let's split out data types into things that are numbers, and things that are not.

The size gives us a clue. If it's really big, it's probably not just a number. The xxd output from earlier can hint at this. These are often Strings.

For the numbers, we need to figure out what TYPE of number it is. There are some heuristics we can use to make non-terrible guesses, but it is still guesswork at this stage. Once we have context, we can refine these guesses, and with sufficient sample data, we can even be reasonably accurate.

Let's assume for convenience that all the numbers were aligned to 4-byte boundaries.

Take all the numbers in a given section, and parse them as the three most likely types of that size: u32, i32, and f32.

Keep track of the largest and smallest value found in the whole section as both integers (i64 can fit u32::MAX) and floats.

Did any of the values parse as f32::NaN, f32::INFINITY, or f32::NEG_INFINITY?

  • If so, this section probably doesn't contain floats.

Were most of the values really small? Like 0.0000000000001 kind of small?

  • If so, this section probably doesn't contain floats.

Was the minimum value negative?

  • The section is probably signed integers or really big unsigned integers.
    • Look at the rest of the values for clues.

Was the minimum 0 and the maximum 1?

  • The section probably contains bool values.

Was the minimum 0?

  • The section probably does not contain signed integers.

Were the minimum and maximum values both between the appropriate N-bit ::MIN and ::MAX?

  • The section probably contains either uN or iN depending on other heuristics.

Were the minimum and maximum values both 0?

  • It's really hard to tell what this is, then. It looks like padding, but it's within one of the sections included in the offsets list. Check other samples to get a better idea.
  • It may be a bool that's just usually false.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment