Writing a Parser
Our goal for this section is to parse a NetCDF file. NetCDF is a binary format, which is most often used for climatology/geoscience data. Because it is a custom binary format, we’re going to write a custom parser for it by hand.
If you have a format based on JSON or XML to parse, then you should start with JSON.jl or julia-xml-library, rather than writing it by hand.
NetCDF was designed to store large arrays of data efficiently. If that’s a task you need to do, take a look at HDF5.jl. HDF5 is a file format that is more recent than NetCDF and that has similar goals; it is the successor to NetCDF. HDF5.jl can read and write files produced by other languages and can read and write .jld files (which can express Julia values, including their types).
Every NetCDF file has headers that describe what data it has, followed by the data itself. In order to do anything with the file, we need to first read the headers. The headers tell us about the variables stored in the file; they’ll tell us names and metadata for each variable, plus where the value is located and how big it is. Once we’ve understood the headers, we can find and read the value of any or all variables stored in the file.
Parsing NetCDF Headers
The NetCDF file headers tell you what variables the file holds, their size and dimensions, and where in the file they are stored. There are four parts to the header. Each one answers a different question:
Which version of NetCDF are we using? (either 1 or 2)
What are the dimensions in this file? (name, ordered list of dimensions)
What are the attributes of this file? (metadata about source of the data, I think)
What are the variables in this file? (name, attributes, size, location in file)
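These four answers are what the header parser ultimately has to produce. As a sketch of where we’re heading, we might collect them in types like these (the names NcDim, NcVar, and NcHeader are mine, not from any library, and the exact fields are an assumption):

```julia
# A sketch of types to hold parsed header information.
# NcDim, NcVar, and NcHeader are hypothetical names, not from a library.
type NcDim
    name::String
    length::Int
end

type NcVar
    name::String
    dimids::Vector{Int}           # indices into the file's dimension list
    attributes::Dict{String,Any}  # per-variable metadata
    offset::Int                   # where in the file the data starts
end

type NcHeader
    version::Uint8
    dimensions::Vector{NcDim}
    attributes::Dict{String,Any}  # file-level metadata
    variables::Vector{NcVar}
end
```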
The first one in that list is different (and simpler) than the others, so first we’re going to parse the version header, and then we’re going to talk about the rest of them.
Parsing the Version Header
The version header consists of the characters C, D, F, and then a byte containing the version value (either 1 or 2). This does not mean the printable character 1 or 2; it means CDF and then an unprintable character whose bits are 00000001 or 00000010.
Let’s write some code to read this:
julia> file = open("simple_xy.nc","r") # a simple NetCDF file
IOStream()
julia> read(file,Char) # C
'C'
julia> read(file,Char) # D
'D'
julia> read(file,Char) # F
'F'
julia> version = read(file,Uint8) # 1
0x01
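Before digging into what read is doing, note that this check could be wrapped in a small helper. This is only a sketch; read_magic is my own name, not a library function, and the error messages are assumptions:

```julia
# A sketch: validate the magic bytes and read the version.
# read_magic is a hypothetical helper, not part of any library.
function read_magic(file::IO)
    magic = [read(file, Char), read(file, Char), read(file, Char)]
    magic == ['C', 'D', 'F'] || error("not a NetCDF file")
    version = read(file, Uint8)
    version == 0x01 || version == 0x02 || error("unknown NetCDF version")
    version
end
```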
The read function takes a stream and then a type that we want to read from it. Streams will always, at the bottom, be represented as binary, because that’s what computers represent all data as. We have to tell read how many bits to read and how to interpret them. Each 0 or 1 is a bit; a byte is 8 bits. You can interpret these bits in various ways.
An unsigned int, like Uint8, corresponds to the standard way humans are taught to interpret base-2 numbers. If you want to convert from base-2 into decimal (base-10), then you can sum up the value of each column. A zero in any column is worth zero. A one in the right-most column is worth 1; a one in the column left of that is worth 2. As you move left, the value of a column doubles each time. Thus, 8 is 1000, 12 is 1100, and 15 is 1111. The 8 in Uint8 indicates that it has 8 bits in it, which is 1 byte. Let’s look at the binary representation of that version number we just read.
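As a quick aside, the column-doubling rule is easy to check in the REPL; in the Julia version used here, parseint parses a string of digits in a given base:

```julia
julia> parseint("1000", 2)  # 8
8
julia> parseint("1100", 2)  # 8 + 4
12
julia> parseint("1111", 2)  # 8 + 4 + 2 + 1
15
```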
julia> bits(version)
"00000001"
The bits function shows us the binary representation of a value, as a string of zeroes and ones. There are eight characters in the string because version is a Uint8. What if we do some math with version?
julia> bits(version + 1)
"0000000000000000000000000000000000000000000000000000000000000010"
julia> bits(version * 2)
"0000000000000000000000000000000000000000000000000000000000000010"
That’s a lot more zeroes! This is happening because the + and * functions are converting the Uint8 to my computer’s native Int64 before doing the computation. This type conversion when doing arithmetic can be annoying; we can get back to a Uint8 by using the uint8 function.
julia> bits(uint8(version * 2))
"00000010"
julia> bits(uint8(version * 5))
"00000101"
A Char is represented the same way as a Uint8: 8 bits, which is 1 byte. The difference is that each value maps to a different character, rather than being considered a number. We can look at the bits of characters, too:
julia> 'A'
'A'
julia> bits('A')
"00000000000000000000000001000001"
julia> bits('B')
"00000000000000000000000001000010"
julia> bits(' ')
"00000000000000000000000000100000"
Well, that looks strange. Those strings are 32 characters long (I checked them with length), but characters are supposed to be represented by 1 byte each. There is another function, sizeof, which will tell us the size in bytes of a type, or the number of bytes in a string. Let’s see what it says:
julia> typeof('A') # what is the type of A?
Char
julia> sizeof(Char) # how many bytes are in a Char?
4
julia> sizeof("CDF") # how many bytes are in this string?
3
I’m not sure what’s going on here, but clearly the string realizes that it only takes up 1 byte per character. Maybe it has something to do with Julia’s unicode support?
julia> typeof('')
Char
julia> sizeof("")
3
julia> sizeof("CDF")
6
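The unicode guess seems right, and a quick experiment supports it: Julia strings are UTF-8 encoded, so a character can occupy anywhere from 1 to 4 bytes inside a string, while a standalone Char is always stored in 32 bits (which also explains the 32-character output of bits on a Char above):

```julia
julia> sizeof("e")    # a plain ASCII character: 1 byte in UTF-8
1
julia> sizeof("é")    # an accented character: 2 bytes in UTF-8
2
julia> sizeof(Char)   # but a standalone Char is always 4 bytes
4
```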