@sg-s
Last active May 28, 2020 20:09
Falsehoods I believed about real data, and the price I paid for it

The data exists

If data is split across multiple sequentially numbered files, it is foolish to believe that every file exists. Some files can go missing or be corrupted.
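A minimal sketch of the check I wish I had run first, assuming the files follow a naming scheme like `data_<number>.bin` (that pattern is made up here, so swap in your own). It lists the sequence numbers that are absent or zero bytes long. It can only see gaps between the lowest and highest number present, so files missing from either end go unnoticed.

```python
import re
from pathlib import Path


def find_missing_indices(directory, pattern=r"data_(\d+)\.bin"):
    """Return the sorted sequence numbers that are missing from `directory`.

    `pattern` is a hypothetical naming scheme; its single capture group
    must match the sequence number.
    """
    regex = re.compile(pattern)
    present = set()
    for path in Path(directory).iterdir():
        match = regex.fullmatch(path.name)
        # Count zero-byte files as missing: an empty file is not data.
        if match and path.is_file() and path.stat().st_size > 0:
            present.add(int(match.group(1)))
    if not present:
        return []
    expected = range(min(present), max(present) + 1)
    return [index for index in expected if index not in present]
```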

Files with the same name and the same size contain the same data

I learnt the hard way that one version can be corrupted while the other is fine, and it's all too easy to replace the good version with the corrupted one.
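One way to guard against this is to compare contents rather than names and sizes. The sketch below hashes both files with SHA-256 from Python's standard `hashlib`; the function names are my own. A mismatch only tells you the files differ, not which one is the good copy, so it helps most if you recorded a checksum back when the file was known to be fine.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large recordings need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """True only if both files hash to the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```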

The sampling frequency is going to be the same across a data set

Yeah, it can change, because a fixed sampling rate would make your analysis too easy.

The sampling frequency is going to be some round number

Did I expect the sampling interval to be a whole number of microseconds? Yes. Was it, in reality? No. In some random subset of the data, it can be 1.0004 microseconds.
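If the data carries per-sample timestamps (an assumption, since not every format does), both sampling falsehoods can be checked by measuring the intervals instead of trusting the nominal rate. This is a rough sketch; `expected_dt` and `rel_tol` are parameters I made up for illustration.

```python
def sampling_interval_report(timestamps, expected_dt, rel_tol=1e-4):
    """Summarise how the spacing between samples deviates from `expected_dt`.

    `timestamps` is a sequence of sample times in any consistent unit.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    intervals = [later - earlier
                 for earlier, later in zip(timestamps, timestamps[1:])]
    # Indices of intervals that stray from the nominal spacing.
    off_nominal = [index for index, dt in enumerate(intervals)
                   if abs(dt - expected_dt) > rel_tol * expected_dt]
    return {
        "min_dt": min(intervals),
        "max_dt": max(intervals),
        "n_off_nominal": len(off_nominal),
        "first_off_index": off_nominal[0] if off_nominal else None,
    }
```

A nonzero `n_off_nominal` is the cue to stop assuming a single sampling rate and to look at where `first_off_index` falls in the recording.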

Metadata exists

Ha ha, joke's on you. Who needs metadata?

Metadata is accurate

If you find a channel that is labelled "temperature", do not assume that it measures temperature. It could be literally anything.

Channels have a well-defined number

Just for fun, the data collector can decide to add some data channels. Or throw some away.
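A cheap sanity check, assuming you can already read the number of channels out of each file (how you do that depends on your format and is not shown): tally the counts and flag every file that disagrees with the most common one. The function name and the input mapping are hypothetical.

```python
from collections import Counter


def channel_count_outliers(channels_per_file):
    """Return (typical_count, {file_name: count}) for files that deviate.

    `channels_per_file` maps a file name to its number of channels.
    """
    if not channels_per_file:
        raise ValueError("no files given")
    tally = Counter(channels_per_file.values())
    typical, _ = tally.most_common(1)[0]
    outliers = {name: count for name, count in channels_per_file.items()
                if count != typical}
    return typical, outliers
```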

Channels have a well-defined name

Remember that channel called "temperature"? It can be accurate in the first half of the data, but wrong in the second half.
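Since a label can go stale partway through, it helps to check plausibility block by block rather than once per file. Below is a minimal sketch that flags blocks whose median falls outside a range you consider physically sensible. The bounds and block size are assumptions you have to supply, and a block that passes is not proof the label is right.

```python
import statistics


def implausible_blocks(values, low, high, block_size):
    """Return start indices of blocks whose median lies outside [low, high]."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    flagged = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        if not low <= statistics.median(block) <= high:
            flagged.append(start)
    return flagged
```

For a channel labelled "temperature" in a room-temperature experiment, something like `implausible_blocks(trace, 0, 50, 1000)` would show where the values stop looking like degrees Celsius.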
