Our goal for this section is to parse a NetCDF file. NetCDF is a binary format, which is most often used for climatology/geoscience data. Because it is a custom binary format, we’re going to write a custom parser for it by hand.
Tip
If you have a format based on JSON or XML to parse, then you should start with JSON.jl or julia-xml-library, rather than writing it by hand.
Tip
NetCDF was designed to store large arrays of data efficiently. If that’s a task you need to do, take a look at HDF5.jl. HDF5 is a more recent file format with similar goals; in fact, current versions of NetCDF (NetCDF-4) are built on top of HDF5. HDF5.jl can read and write files produced by other languages, and can read and write .jld files (which can express Julia values, including their types).
Every NetCDF file has headers that describe what data it has, followed by the data itself. In order to do anything with the file, we need to first read the headers. The headers tell us about the variables stored in the file; they give the name and metadata of each variable, plus where its value is located and how big it is. Once we’ve understood the headers, we can find and read the value of any or all variables stored in the file.
The NetCDF file headers tell you what variables the file holds, their size and dimensions, and where in the file they are stored. There are four parts to the header. Each one answers a different question:
- Which version of NetCDF are we using? (either 1 or 2)
- What are the dimensions in this file? (name and length of each, in order)
- What are the attributes of this file? (metadata about the source of the data, I think)
- What are the variables in this file? (name, attributes, size, location in file)
The first item in that list is different from (and simpler than) the others, so first we’re going to parse the version header, and then we’re going to talk about the rest of them.
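To keep these four parts straight as we go, here is one hypothetical way the parsed header could be organized in Julia (shown in current `struct` syntax; all type and field names below are my own invention, not from any NetCDF library):

```julia
# A sketch of how the parsed header information might be held.
# These type and field names are hypothetical, not part of any NetCDF library.
struct NcDim
    name::String
    length::Int
end

struct NcVar
    name::String
    dimids::Vector{Int}          # indices into the file's dimension list
    attributes::Dict{String,Any}
    offset::Int                  # where this variable's data starts in the file
end

struct NcHeader
    version::UInt8
    dimensions::Vector{NcDim}
    attributes::Dict{String,Any} # file-level metadata
    variables::Vector{NcVar}
end

# A tiny example instance: version 1, one dimension named "x" of length 6.
header = NcHeader(0x01, [NcDim("x", 6)], Dict{String,Any}(), NcVar[])
```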
The version header consists of the characters C, D, F, and then a byte containing the version value (either 0x01 or 0x02). This does not mean CDF1, but CDF followed by an unprintable character whose bits are 00000001.
Let’s write some code to read this:
julia> file = open("simple_xy.nc","r") # a simple NetCDF file
IOStream()
julia> read(file,Char) # C
'C'
julia> read(file,Char) # D
'D'
julia> read(file,Char) # F
'F'
julia> version = read(file,Uint8) # 1
0x01
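If you don’t have a NetCDF file handy, you can simulate those same four bytes with an IOBuffer and read them back the same way (a sketch; UInt8 is the current spelling of Uint8):

```julia
# Simulating the first four header bytes without a file.
buf = IOBuffer(UInt8[0x43, 0x44, 0x46, 0x01])  # 'C', 'D', 'F', 1

magic   = String([read(buf, Char) for _ in 1:3])
version = read(buf, UInt8)

println(magic)        # CDF
println(Int(version)) # 1
```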
The read function takes a stream and then a type that we want to read from it. At the bottom, streams are always binary, because that’s how computers represent all data. We have to tell read how many bits to read and how to interpret them. Each 1 or 0 is a bit; a byte is 8 bits. You can interpret these bits in various ways.
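To see that the interpretation really is up to us, here is a sketch that reads the same four header bytes back as different types (again using an IOBuffer, with the current UInt8/UInt32 spellings):

```julia
bytes = UInt8[0x43, 0x44, 0x46, 0x01]

# Read them one byte at a time...
io = IOBuffer(bytes)
a = read(io, UInt8)   # 0x43 -- the byte for 'C'
b = read(io, UInt8)   # 0x44 -- the byte for 'D'

# ...or all four at once, as a single 32-bit integer.
# The numeric value depends on your machine's byte order;
# on a little-endian machine (most machines) it is 0x01464443.
word = read(IOBuffer(bytes), UInt32)
```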
An unsigned int, like Uint8, corresponds to the standard way humans are taught to interpret base-2 numbers. If you want to convert from base-2 into decimal (base-10), then you can sum up the value of each column. A zero in any column is worth zero. A one in the right-most column is worth 1; a one in the column left of that is worth 2. As you move left, the value of a column doubles each time. Thus, 8 is 1000, 12 is 1100, and 15 is 1111. The 8 in Uint8 indicates that it has 8 bits in it, which is 1 byte. Let’s look at the binary representation of that version number we just read.
julia> bits(version)
"00000001"
The bits function shows us the binary representation of a value, as a string of zeroes and ones. There are eight characters in the string because version is a Uint8. What if we do some math with version?
julia> bits(version + 1)
"0000000000000000000000000000000000000000000000000000000000000010"
julia> bits(version * 2)
"0000000000000000000000000000000000000000000000000000000000000010"
That’s a lot more zeroes! This is happening because the + and * functions convert to my computer’s native Int type, Int64, before doing the computation. This type conversion during arithmetic can be annoying; we can get back to a Uint8 by using the function uint8.
julia> bits(uint8(version * 2))
"00000010"
julia> bits(uint8(version * 5))
"00000101"
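As a sanity check on the base-2 reading described earlier, we can sum the column values by hand and let Julia do the same conversion (current Julia spells this parse with base=2; older versions had a parseint function):

```julia
# "1100" read as base-2: the columns are worth 8, 4, 2, 1 from left to right.
column_values = [8, 4, 2, 1]
digits_1100   = [1, 1, 0, 0]
by_hand = sum(column_values .* digits_1100)  # 8 + 4 + 0 + 0 = 12

# Julia agrees:
parse(Int, "1100", base=2)  # 12
parse(Int, "1111", base=2)  # 15
```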
A Char is represented the same way as a Uint8: 8 bits, which is 1 byte. The difference is that each value maps to a different character, rather than being considered a number. We can look at the bits of characters, too:
julia> 'A'
'A'
julia> bits('A')
"00000000000000000000000001000001"
julia> bits('B')
"00000000000000000000000001000010"
julia> bits(' ')
"00000000000000000000000000100000"
Well, that looks strange. Those strings are 32 characters long (I checked them with length), but characters are supposed to be represented by 1 byte each. There is another function, sizeof, which will tell us the size in bytes of a type, or the number of bytes in a string. Let’s see what it says:
julia> typeof('A') # what is the type of A?
Char
julia> sizeof(Char) # how many bytes are in a Char?
4
julia> sizeof("CDF") # how many bytes are in this string?
3
I’m not sure what’s going on here, but clearly each character in the string only takes up 1 byte. Maybe it has something to do with Julia’s unicode support?
julia> typeof('☃')
Char
julia> sizeof("☃")
3
julia> sizeof("CDF☃")
6
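For what it’s worth, this does come down to Julia’s Unicode support: a Char holds a full Unicode code point and always occupies 4 bytes, while a String is UTF-8-encoded, so each character takes 1 to 4 bytes. A small demonstration (the snowman is just one example of a 3-byte character):

```julia
# A Char always occupies 4 bytes (a full Unicode code point)...
sizeof('A')          # 4

# ...but in a UTF-8 String, each character takes 1 to 4 bytes.
sizeof("A")          # 1 -- ASCII characters need only 1 byte
sizeof("\u2603")     # 3 -- the snowman '☃' needs 3 bytes
sizeof("CDF\u2603")  # 6 -- three 1-byte characters plus one 3-byte character
```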