Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
NetCDF section for Learning Julia (with full script); not planning to include in book

Writing a Parser

Our goal for this section is to parse a NetCDF file. NetCDF is a binary format, which is most often used for climatology/geoscience data. Because it is a custom binary format, we’re going to write a custom parser for it by hand.

Tip

If you have a format based on JSON or XML to parse, then you should start with JSON.jl or julia-xml-library, rather than writing it by hand.

Tip

NetCDF was designed to store large arrays of data efficiently. If that’s a task you need to do, take a look at HDF5.jl. HDF5 is a file format that is more recent than NetCDF and that has similar goals; it is the successor to NetCDF. HDF5.jl can read and write files produced by other languages and can read and write .jld files (which can express Julia values, including their types).

Every NetCDF file has headers that decribe what data is has, and then the data. In order to do anything with the file, we need to first read the headers. The headers tell us about variables stored in the file; they’ll tell us names and metadata for each variable, plus where the value is located and how big it is. Once we’ve understood the headers, we can find and read the value of any/all variables stored in the file.

Parsing NetCDF Headers

The NetCDF file headers tell you what variables the file holds, their size and dimentions, and where in the file they are stored. There are four parts to the header. Each one answers a different question:

  1. Which version of NetCDF are we using? (either 1 or 2)

  2. What are the dimensions in this file? (name, ordered list of dimensions)

  3. What are the attributes of this file? (metadata about source of the data, I think)

  4. What are the variables in this file? (name, attributes, size, location in file)

The first one in that list is different (and simpler) than the others, so first we’re going to parse the version header, and then we’re going to talk about the rest of them.

Parsing the Version Header

The version header consists of the characters C, D, F, and then a byte containing version value (either 0x01 or 0x02). This does not mean CDF1, but CDF and then an unprintable character whose bits are 00000001. Let’s write some code to read this:

julia> file = open("simple_xy.nc","r") # a simple NetCDF file
IOStream()

julia> read(file,Char) # C
'C'

julia> read(file,Char) # D
'D'

julia> read(file,Char) # F
'F'

julia> version = read(file,Uint8) # 1
0x01

The read function takes a stream and then a type that we want to read from it. Streams will always, at the bottom, be represented as binary, because that’s what computers represent all data as. We have to tell read how many bits to read and how to interpret them. Each 1 or 0 is a bit; a byte is 8 bits. You can interpret these in various ways.

An unsigned int, like Uint8, corresponds to the standard way humans are taught to interpret base-2 numbers. If you want to convert from base-2 into decimal (base-10), then you can sum up the value of each column. A zero in any column is worth zero. A one in the right-most column is worth 1; a one in the column left of that is worth 2. As you move left, the value of a column doubles each time. Thus, 8 is 1000, 12 is 1100, and 15 is 1111. The 8 in Uint8 indicates that it has 8 bits in it, which is 1 byte. Let’s look at the binary representation of that version number we just read.

julia> bits(version)
"00000001"

The bits function shows us the binary representation of a value, as a string of zeroes and ones. There are eight characters in the string because version is a Uint8. What if we do some math with version?

julia> bits(version + 1)
"0000000000000000000000000000000000000000000000000000000000000010"

julia> bits(version * 2)
"0000000000000000000000000000000000000000000000000000000000000010"

That’s a lot more zeroes! This is happening because the + and * functions are converting to my computer’s native Int type, Int64, before doing the computation. This type conversion when doing arithemetic can be annoying; we can get back to a Uint8 by using the function uint8.

julia> bits(uint8(version * 2))
"00000010"

julia> bits(uint8(version * 5))
"00000101"

A Char is represented the same way as a Uint8: 8 bits, which is 1 byte. The difference is that each value maps to a difference character, rather than being considered a number. We can look at the bits of characters, too:

julia> 'A'
'A'

julia> bits('A')
"00000000000000000000000001000001"

julia> bits('B')
"00000000000000000000000001000010"

julia> bits(' ')
"00000000000000000000000000100000"

Well, that looks strange. Those strings are 32 characters long (I checked them with length), but characters are represented by 1 byte each. There is another function sizeof which will tell use the size in bytes of a type or the number of bytes in a string. Let’s see what it says:

julia> typeof('A') # what is the type of A?
Char

julia> sizeof(Char) # how many bytes are in a Char?
4

julia> sizeof("CDF") # how many bytes are in this string?
3

I’m not sure what’s going on here, but clearly the string realizes that it only takes up 1 byte per character. Maybe it has something to do with Julia’s unicode support?

julia> typeof('')
Char

julia> sizeof("")
3

julia> sizeof("CDF")
6

Pulling Out the Data

See Our Work Pay Off

#
# NetCDF is a binary file format for representing numeric data.
#
# It was a predecessor of HD5F. It is mostly used for geoscience data.
#
# netCDF using big-endian numbers; we need to us `ntoh`
# to make sure they're in the correct byte-order for our computer
# Read the first four bytes of the file
# to determine the CDF version
function readversion(f::IO)
# Every CDF file starts with these characters
@assert read(f,Char) == 'C'
@assert read(f,Char) == 'D'
@assert read(f,Char) == 'F'
# The version is one byte
version = read(f,Uint8) # 0x01 == 32-bit offset
# 0x02 == 64-bit offset
# There are only two possible versions
@assert version == 0x01 || version == 0x02
return version
end
# Represents a variable from the file
type Variable
name::String
dimensions
attributes
typ::Type
bytes::Int
offset::Int
end
function Base.show(io::IO, v::Variable)
println(io, v.name)
print(io, "\t")
print(io, join([d[2] for d in v.dimensions],"x"))
println(io, " ", length(v.dimensions) == 1 ? " element vector": " matrix")
println(io, "\t", v.typ, " type")
println(io, "\tAttributes: ", v.attributes)
println(io, "\tstarts at ", v.offset, " and continues for ", v.bytes, " bytes.")
end
# Each section of the headers has a tag
const NC_VARIABLE = [0x00, 0x00, 0x00, 0x0B]
const NC_DIMENSION = [0x00,0x00,0x00,0x0A]
const NC_ATTRIBUTE = [0x00, 0x00, 0x00, 0x0C]
# Each data type has a tag
# A type tage is four bytes, but only the last one is different
const NC_BYTE = 0x01
const NC_CHAR = 0x02
const NC_SHORT = 0x03
const NC_INT = 0x04
const NC_FLOAT = 0x05
const NC_DOUBLE = 0x06
# Read a type tag from the file
# This involves reading 4 bytes
# The first three should be zero
# The last one will be one of the type tag constants above
function readtype(f::IO)
bytes = [read(f,Uint8) for x=1:4]
for i = 1:3
@assert bytes[i] == 0x00
end
tag = bytes[4]
if tag == NC_BYTE
return Uint8
elseif tag == NC_CHAR
return Char
elseif tag == NC_SHORT
return Int16
elseif tag == NC_INT
return Int32
elseif tag == NC_FLOAT
return Float32
elseif tag == NC_DOUBLE
return Float64
end
error("Unknown type $bytes")
end
# Read n values of type t
#
# This involves reading n * sizeof(t) bytes, plus padding.
# If the number of bytes of data read is not a multiple of four,
# there will be enough zeros of padding to make the whole thing a multiple of four.
# The result will be a Vector{t}.
function readvalues{T <: Real}(f::IO,t::Type{T},n::Number)
results = [read(f,T) for i=1:n]
datasize = sizeof(results[1])
if datasize % 4 != 0
remainder = (4 - ((datasize * n) % 4)) % 4
for i=1:remainder
@assert read(f,Uint8) == 0x00
end
end
results = [ntoh(v) for v in results]
return results
end
# Read n Chars from f, plus any padding.
# If n is not a multiple of 4, there will be 1-3 bytes of padding.
# Because a Vector{Char} is usually meant to be a string, we return a ByteString.
function readvalues(f::IO,t::Type{Char},n::Number)
results = [read(f,Uint8) for i=1:n]
remainder = (4 - (n % 4)) % 4
padding = [@assert read(f,Uint8) == 0x00 for i=1:remainder]
return bytestring(results)
end
# Read records header.
# Currently this only reads the number of records, and hopes it's zero.
# TODO: handle records.
function readrecords(f::IO)
num_records = ntoh(read(f,Int32))
if num_records > 0
error("Don't know how to handle records. There are $num_records records.")
end
return num_records
end
# Read a name. Names of variables and attributes have the same format.
function readname(f::IO)
name_length = ntoh(read(f,Int32))
name = readvalues(f,Char,name_length)
return name
end
# Read header tag (4 bytes) or absent (8 bytes)
function readtag(f, tag::Vector{Uint8})
const empty = [0x00 for i=1:4]
read_tag = [read(f,Uint8) for i=1:4]
if tag == read_tag
return true
else
((read_tag == empty && [read(f,Uint8) for i=1:4] == empty) || error("Wrong header tag. Expected zeros"))
return false
end
end
# Read the list of dimensions.
# The dimensions header has a 4-byte tag to indicate that it is present.
# If the tag is all zeros, then it will be followed by 4 more zeros.
# This means that the dimension header is intentionally absent.
# If the tag is NC_DIMENSION, then there is a dimension header to read.
# The header tells us how many dimensions there are, then lists them.
# Each dimension is a name and a length.
function readdimensions(f::IO)
dimensions = (String,Int)[]
if readtag(f, NC_DIMENSION)
num_dims = ntoh(read(f,Int32))
for i=1:num_dims
name = readname(f)
dim_size = ntoh(read(f,Int32))
push!(dimensions,(name,dim_size))
end
end
return dimensions
end
# Read list of attributes
function readattributes(f::IO)
attributes = {}
if readtag(f, NC_ATTRIBUTE)
num_attrs = ntoh(read(f,Int32))
for i=1:num_attrs
name = readname(f)
attr_type = readtype(f)
num_elems = ntoh(read(f,Int32))
values = readvalues(f,attr_type,num_elems)
push!(attributes, (name, values))
end
end
return attributes
end
# Read list of Variables
function readvariables(f::IO, version, dimensions)
variables = Variable[]
if readtag(f, NC_VARIABLE)
num_vars = ntoh(read(f,Int32))
for i=1:num_vars
name = readname(f)
# Read dimensionality & dimensions
num_dims = ntoh(read(f,Int32))
i_dims = [ntoh(read(f,Int32)) for i=1:num_dims]
dims = [dimensions[i + 1] for i in i_dims]
attrs = readattributes(f)
var_type = readtype(f)
size = ntoh(read(f,Int32))
offset = version == 0x01 ? ntoh(read(f,Int32)) : ntoh(read(f,Int64))
push!(variables,Variable(name,dims,attrs,var_type,size,offset))
end
end
return variables
end
# Read data
function readdata(f, v::Variable)
seek(f,v.offset)
dims = [x[2] for x in v.dimensions]
values = readvalues(f,v.typ,prod(dims))
return (reshape(values,reverse(dims)...))'
end
type Headers
version::Uint8
records
dimensions::Vector{(String,Int32)}
attributes
variables::Vector{Variable}
end
function readheaders(f)
version = readversion(f) # 1 or 2
records = readrecords(f)
dimensions = readdimensions(f)
attributes = readattributes(f)
variables = readvariables(f, version, dimensions)
return Headers(version, records, dimensions, attributes, variables)
end
##
# Equivalent to "Main" function in non-scripting languages
##
# Get filename as commandline argument
filename = ARGS[1]
f = open(filename, "r")
headers = readheaders(f)
println("Dimensions")
for (name,len) in headers.dimensions
println("\t", name, " ", len)
end
println("Variables")
for var in headers.variables
display(var)
display(readdata(f,var))
println("\n")
end
println()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment