Writing a Parser
Our goal for this section is to parse a NetCDF file. NetCDF is a binary format, which is most often used for climatology/geoscience data. Because it is a custom binary format, we’re going to write a custom parser for it by hand.
If you have a format based on JSON or XML to parse, then you should start with JSON.jl or julia-xml-library, rather than writing it by hand.
NetCDF was designed to store large arrays of data efficiently. If that’s a task you need to do, take a look at HDF5.jl. HDF5 is a file format that is more recent than NetCDF and that has similar goals; it is the successor to NetCDF. HDF5.jl can read and write files produced by other languages and can read and write .jld files (which can express Julia values, including their types).
Every NetCDF file has headers that describe what data it has, followed by the data itself. In order to do anything with the file, we need to first read the headers. The headers tell us about the variables stored in the file; they’ll tell us names and metadata for each variable, plus where the value is located and how big it is. Once we’ve understood the headers, we can find and read the value of any or all variables stored in the file.
Parsing NetCDF Headers
The NetCDF file headers tell you what variables the file holds, their size and dimensions, and where in the file they are stored. There are four parts to the header. Each one answers a different question:
Which version of NetCDF are we using? (either 1 or 2)
What are the dimensions in this file? (name, ordered list of dimensions)
What are the attributes of this file? (metadata about source of the data, I think)
What are the variables in this file? (name, attributes, size, location in file)
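These four answers are what the header parser ultimately has to produce. As a sketch of where we’re heading, we might collect them in types like these (the names NcDim, NcVar, and NcHeader are mine, not from any library, and the exact fields are an assumption):

```julia
# A sketch of types to hold parsed header information.
# NcDim, NcVar, and NcHeader are hypothetical names, not from a library.
type NcDim
    name::String
    length::Int
end

type NcVar
    name::String
    dimids::Vector{Int}           # indices into the file's dimension list
    attributes::Dict{String,Any}  # per-variable metadata
    offset::Int                   # where in the file the data starts
end

type NcHeader
    version::Uint8
    dimensions::Vector{NcDim}
    attributes::Dict{String,Any}  # file-level metadata
    variables::Vector{NcVar}
end
```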
The first one in that list is different (and simpler) than the others, so first we’re going to parse the version header, and then we’re going to talk about the rest of them.
Parsing the Version Header
The version header consists of the characters C, D, F, and then a byte containing the version value (either 1 or 2). This does not mean the printable character 1 or 2; it means CDF and then an unprintable character whose bits are 00000001 or 00000010.
Let’s write some code to read this:
julia> file = open("simple_xy.nc","r") # a simple NetCDF file
IOStream()
julia> read(file,Char) # C
'C'
julia> read(file,Char) # D
'D'
julia> read(file,Char) # F
'F'
julia> version = read(file,Uint8) # 1
0x01
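Before digging into what read is doing, note that this check could be wrapped in a small helper. This is only a sketch; read_magic is my own name, not a library function, and the error messages are assumptions:

```julia
# A sketch: validate the magic bytes and read the version.
# read_magic is a hypothetical helper, not part of any library.
function read_magic(file::IO)
    magic = [read(file, Char), read(file, Char), read(file, Char)]
    magic == ['C', 'D', 'F'] || error("not a NetCDF file")
    version = read(file, Uint8)
    version == 0x01 || version == 0x02 || error("unknown NetCDF version")
    version
end
```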
The read function takes a stream and then a type that we want to read from it. Streams will always, at the bottom, be represented as binary, because that’s what computers represent all data as. We have to tell read how many bits to read and how to interpret them. Each 0 or 1 is a bit; a byte is 8 bits. You can interpret these bits in various ways.
An unsigned int, like Uint8, corresponds to the standard way humans are taught to interpret base-2 numbers. If you want to convert from base-2 into decimal (base-10), then you can sum up the value of each column. A zero in any column is worth zero. A one in the right-most column is worth 1; a one in the column left of that is worth 2. As you move left, the value of a column doubles each time. Thus, 8 is 1000, 12 is 1100, and 15 is 1111. The 8 in Uint8 indicates that it has 8 bits in it, which is 1 byte. Let’s look at the binary representation of that version number we just read.
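As a quick aside, the column-doubling rule is easy to check in the REPL; in the Julia version used here, parseint parses a string of digits in a given base:

```julia
julia> parseint("1000", 2)  # 8
8
julia> parseint("1100", 2)  # 8 + 4
12
julia> parseint("1111", 2)  # 8 + 4 + 2 + 1
15
```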
julia> bits(version)
"00000001"
The bits function shows us the binary representation of a value, as a string of zeroes and ones. There are eight characters in the string because version is a Uint8. What if we do some math with version?
julia> bits(version + 1)
"0000000000000000000000000000000000000000000000000000000000000010"
julia> bits(version * 2)
"0000000000000000000000000000000000000000000000000000000000000010"
That’s a lot more zeroes! This is happening because the + and * functions are converting the Uint8 to my computer’s native Int64 before doing the computation. This type conversion when doing arithmetic can be annoying; we can get back to a Uint8 by using the uint8 function.
julia> bits(uint8(version * 2))
"00000010"
julia> bits(uint8(version * 5))
"00000101"
A Char is represented the same way as a Uint8: 8 bits, which is 1 byte. The difference is that each value maps to a different character, rather than being considered a number. We can look at the bits of characters, too:
julia> 'A'
'A'
julia> bits('A')
"00000000000000000000000001000001"
julia> bits('B')
"00000000000000000000000001000010"
julia> bits(' ')
"00000000000000000000000000100000"
Well, that looks strange. Those strings are 32 characters long (I checked them with length), but characters are supposed to be represented by 1 byte each. There is another function, sizeof, which will tell us the size in bytes of a type, or the number of bytes in a string. Let’s see what it says:
julia> typeof('A') # what is the type of A?
Char
julia> sizeof(Char) # how many bytes are in a Char?
4
julia> sizeof("CDF") # how many bytes are in this string?
3
I’m not sure what’s going on here, but clearly the string realizes that it only takes up 1 byte per character. Maybe it has something to do with Julia’s unicode support?
julia> typeof('')
Char
julia> sizeof("")
3
julia> sizeof("CDF")
6
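The unicode guess seems right, and a quick experiment supports it: Julia strings are UTF-8 encoded, so a character can occupy anywhere from 1 to 4 bytes inside a string, while a standalone Char is always stored in 32 bits (which also explains the 32-character output of bits on a Char above):

```julia
julia> sizeof("e")    # a plain ASCII character: 1 byte in UTF-8
1
julia> sizeof("é")    # an accented character: 2 bytes in UTF-8
2
julia> sizeof(Char)   # but a standalone Char is always 4 bytes
4
```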