mbauman/MatlabClasses.ipynb

## MatlabClasses.ipynb
{
 "metadata": {
  "language": "Julia",
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Parsing MAT files with class objects in them"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "February 20, 2014. Matt Bauman `[first initial][last name] (at) [gmail]`. MIT license."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Matlab saves class objects in `.mat` files in a crazy undocumented scheme.  As far as I can tell, few attempts have been made to understand how this information is stored.  Matlab itself once allowed loading of unknown class objects as structures, but has recently decided to forbid that behavior (due to class loading rules) and so now everybody in all languages are subjected to the strange representation of these objects:\n",
      "\n",
      "    % In Matlab:\n",
      "    >> load('simple.mat','obj')\n",
      "    Warning: Variable 'obj' originally saved as a SimpleClass cannot be instantiated\n",
      "    as an object and will be read in as a uint32. \n",
      "    obj =\n",
      "      [3707764736; 2; 1; 1; 1; 1]\n",
      "\n",
      "    # Or Python (SciPy):\n",
      "    MatlabOpaque([ (b'obj', b'MCOS', b'SimpleClass', [[3707764736], [2], [1], [1], [1], [1]])], \n",
      "      dtype=[('s0', 'O'), ('s1', 'O'), ('s2', 'O'), ('arr', 'O')])\n",
      "\n",
      "Wholly unhelpful.  And very strange.  An array of six unsigned integers?  Where's the data?  And what do those numbers mean?\n",
      "\n",
      "This chronicles my attempts at parsing this information out of Matfiles.  It's a work in progress as I blindly reverse engineer this, and may eventually become a part of Julia's MAT.jl package.  If you find any matfiles that don't conform to these expectations, email them to me! Or better yet: don't store your data in any undocumented format!"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "The Matfile subsystem for version 5.0"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As [some](https://github.com/scipy/scipy/blob/master/scipy/io/matlab/mio5.py#L30-L59) [folks](https://mailman.cae.wisc.edu/pipermail/octave-maintainers/2007-May/006599.html) have already discovered, when this happens there's a hidden, unnamed matrix filled with unsigned int8s stored beyond the bounds of the documented Matfile. SciPy calls it the `__function_workspace__`, but it's much more general than that -- it's where all the data is for class objects (it just so happens that function workspaces are one such thing stored there).  Matlab makes one small mention of this; it's called a subsystem and the only thing official that they tell us is that it lives at the end of some matfiles at a specified offset.\n",
      "\n",
      "The bytes in that subsystem matrix? They conform almost exactly (with some very minor caveats) to a matfile themselves and can be parsed quite easily into a bunch of objects.  Here's what a simple one looks like in Julia:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "using MAT, MAT_v5 # with some special sauce from PR MAT.jl#23 (https://github.com/simonster/MAT.jl/pull/23)\n",
      "f = matopen(\"simple.mat\")\n",
      "summarize(f.subsystem) # summarize and xxd are defined in the appendix at the bottom"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Dict{ASCIIString,Any}: \n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  \"_i1\"=>Dict{ASCIIString,Any}: \n",
        "      \"MCOS\"=>(\"FileWrapper__\",6x1 Array{Any,2}: "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "            [1] 352x1 Array{Uint8,2}: ["
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0x02,0x00,0x00,0x00,0x06,0x00,0x00,0x00,0x70,0x00,\u2026]\n",
        "            [2] 0x0 Array{None,2}: "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[]\n",
        "            [3] 1x4 Array{Any,2}: \n",
        "                 [1] Float64: 1.0\n",
        "                 [2] ASCIIString: \"one\"\n",
        "                 [3] Float64: 2.0\n",
        "                 [4] ASCIIString: \"two\"\n",
        "            [4] ASCIIString: \"another_char\"\n",
        "            [5] 1x4 Array{Any,2}: \n",
        "                 [1] Float64: 3.0\n",
        "                 [2] ASCIIString: \"three\"\n",
        "                 [3] Float64: 4.0\n",
        "                 [4] ASCIIString: \"four\"\n",
        "            [6] 3x1 Array{Any,2}: \n",
        "                 [1] Dict{ASCIIString,Any}: {}\n",
        "                 [2] Dict{ASCIIString,Any}: \n",
        "                      \"array_field_2\"=>1x6 Array{Float64,2}: "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[0.0,1.0,2.0,3.0,4.0,5.0]\n",
        "                      \"char_field_1\"=>ASCIIString: \"char_field\"\n",
        "                 [3] Dict{ASCIIString,Any}: \n",
        "                      \"a\"=>Float64: 1.0)\n",
        "  \"_i2\"=>Dict{ASCIIString,Any}: \n",
        "      \"_i1\"=>Dict{ASCIIString,Any}: \n",
        "          \"MCOS\"=>0x0 Array{None,2}: []"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The `_i1` and `_i2` keys have no meaning; it's simply how I've chosen to name elements that don't contain a name (as there may be more than one such array and they would clobber the previous dictionary entry). So this subsystem has one variable, named \"MCOS\", and it has a bunch of different elements.  We can clearly see, though, this is where our data lives! But man is it jumbled. The first element is interesting, though: 352 raw bytes.  And that very last element, too... apparently this subsystem has an unnamed subsystem-like matrix at the end of it, too (although there's no offset given in the header information; it just acts like one).  But it's empty, so who knows what it'd be used for. I've never seen it populated.\n",
      "\n",
      "So, now we've got to figure out how to connect our 6 element array of uint32s with the real data.  What's that `obj` variable look like again?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "summarize(read(f))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Dict{ASCIIString,Any}: "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "  \"obj3\"=>(\"AnotherClass\",6-element Array{Uint32,1}: ["
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0xdd000000,0x00000002,0x00000001,0x00000001,0x00000003,0x00000002])\n",
        "  \"obj\"=>(\"SimpleClass\",6-element Array{Uint32,1}: [0xdd000000,0x00000002,0x00000001,0x00000001,0x00000001,0x00000001])\n",
        "  \"obj2\"=>(\"SimpleClass\",6-element Array{Uint32,1}: [0xdd000000,0x00000002,0x00000001,0x00000001,0x00000002,0x00000001])"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As far as I've seen, the first four elements are always `[0xdd000000, 2, 1, 1]`. They may be reserved for future features or perhaps they're features that I don't use or haven't happened to trigger yet. It might be possible that there'd be more than one `FileWrapper__` object \u2014 perhaps one of those elements is its index. If you ever see something different, here or anywhere else, send me those mat files!\n",
      "\n",
      "The last two elements are, respectively, the `object_id` and `class_id`.  Pretty simple.  But the information on how to connect that to the data in our subsystem is hidden away in that opaque byte array."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "The `FileWrapper__` byte array"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The first element of the `FileWrapper__` is a large byte array, with a very unique format.  It's generally a whole bunch of `Int32`s, and easiest to read if we treat it as an IO-like stream.  I'd guess that the first Int32 is some sort of version number. Here's what the beginning of the data look like:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "mcos = f.subsystem[\"_i1\"][\"MCOS\"][2]\n",
      "data = vec(mcos[1])\n",
      "fdata = IOBuffer(data)\n",
      "\n",
      "xxd(data,1,0x80)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "00"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "00: 02000000 06000000    ........            2            6 \n",
        "0008: 70000000 a0000000    p.......          112          160 \n",
        "0010: a0000000 00010000    ........          160          256 \n",
        "0018: 40010000 60010000    @...`...          320          352 \n",
        "0020: 00000000 00000000    ........            0            0 \n",
        "0028: 63656c6c 5f666965    cell_fie   1819043171   1701406303 \n",
        "0030: 6c645f33 0053696d    ld_3.Sim    861889644   1835619072 \n",
        "0038: 706c6543 6c617373    pleClass   1130720368   1936941420 \n",
        "0040: 00636861 725f6669    .char_fi   1634231040   1768316786 \n",
        "0048: 656c645f 31006172    eld_1.ar   1600416869   1918959665 \n",
        "0050: 7261795f 6669656c    ray_fiel   1601790322   1818585446 \n",
        "0058: 645f3200 416e6f74    d_2.Anot      3301220   1953459777 \n",
        "0060: 68657243 6c617373    herClass   1131570536   1936941420 \n",
        "0068: 00610000 00000000    .a......        24832            0 \n",
        "0070: 00000000 00000000    ........            0            0 \n",
        "0078: 00000000 00000000    ........            0            0 \n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A bunch of little-endian integers, followed by some ASCII data.  That first Int32 is always a 2, probably a version or id number, and the second is the number of strings that you see starting at `0x28` (bizarre!). The next six Int32s are segment offsets into this data block - you can see that the first segment offset is the first multiple of 8 bytes after the ASCII strings stop. There are two (perhaps reserved) Int32 zeros and then the strings start at 0x28."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "function parse_header(f)\n",
      "    id = read(f,Uint32) # First element is a version number? Always 2?\n",
      "    id == 2 || error(\"unknown first field (version/id?): \", id)\n",
      "    \n",
      "    # Second element is the number of strings\n",
      "    n_strs = read(f,Uint32)\n",
      "    \n",
      "    # Followed by up to 6 section offsets (the last two sections seem to be unused)\n",
      "    offsets = read(f,Uint32,6)\n",
      "    \n",
      "    # And two reserved fields\n",
      "    all(read(f,Int32,2) .== 0) || error(\"reserved header fields nonzero\")\n",
      "    \n",
      "    # And now we're at the string data section\n",
      "    @assert position(f) == 0x28\n",
      "    strs = Array(ASCIIString,n_strs)\n",
      "    for i = 1:n_strs\n",
      "        # simply delimited by nulls\n",
      "        strs[i] = readuntil(f, '\\0')[1:end-1] # drop the trailing null byte\n",
      "    end\n",
      "    \n",
      "    (offsets,strs)\n",
      "end\n",
      "\n",
      "seek(fdata,0)\n",
      "segments, strs = parse_header(fdata)\n",
      "summarize(strs)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "6-element Array{ASCIIString,1}: \n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  [1] ASCIIString: \"cell_field_3\"\n",
        "  [2] ASCIIString: \"SimpleClass\"\n",
        "  [3] ASCIIString: \"char_field_1\"\n",
        "  [4] ASCIIString: \"array_field_2\"\n",
        "  [5] ASCIIString: \"AnotherClass\"\n",
        "  [6] ASCIIString: \"a\""
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Segment 1: Class information"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The first demarcated segment seems to describe the class information. I've not managed to save fancy enough classes that expose all of these fields, but it at least enumerates the classes, their names, and their package names (using the indexes into that heap of strings)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "function parse_class_info(f,strs,section_end)\n",
      "    # The first four int32s unknown. Always 0? Or is this simply an empty slot for another class?\n",
      "    all(read(f,Int32,4) .== 0) || error(\"unknown header to class information\")\n",
      "    \n",
      "    classes = Array((ASCIIString,ASCIIString),0)\n",
      "    while position(f) < section_end\n",
      "        package_idx = read(f,Uint32)\n",
      "        package = package_idx > 0 ? strs[package_idx] : \"\"\n",
      "        name_idx = read(f,Uint32)\n",
      "        name = name_idx > 0 ? strs[name_idx] : \"\"\n",
      "        unknowns = read(f,Uint32,2)\n",
      "        all(unknowns .== 0) || error(\"discovered a nonzero class property for \",name)\n",
      "        push!(classes,(package, name))\n",
      "    end\n",
      "    classes\n",
      "end\n",
      "\n",
      "seek(fdata,segments[1])\n",
      "classes = parse_class_info(fdata,strs, segments[2])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 5,
       "text": [
        "2-element Array{(ASCIIString,ASCIIString),1}:\n",
        " (\"\",\"SimpleClass\") \n",
        " (\"\",\"AnotherClass\")"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Segment 2: Object properties that contain other objects"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The second segment is only sometimes there (e.g. `offsets[2] == offsets[3]`). When it is, it contains informations about each object's properties. Each set has a variable number of subelements, one for each property.  But for this matfile, it is empty as there are no properties that contain other objects."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "function parse_properties(f::IO,names,heap,section_end)\n",
      "    props = Array(Dict{ASCIIString,Any},0)\n",
      "    position(f) >= section_end && return props\n",
      "    all(read(fdata,Int32,2) .== 0) || error(\"unknown header to properties segment\")\n",
      "    \n",
      "    # sizehint: 8 int32s would be 2 props per object; this is overly generous\n",
      "    sizehint(props,iceil((section_end-position(f))/(8*4)))\n",
      "    \n",
      "    while position(f) < section_end\n",
      "        # For each class, there is first a Int32 describing the number of properties\n",
      "        start_offset = position(f)\n",
      "        nprops = read(f,Int32)\n",
      "        d = Dict{ASCIIString,Any}()\n",
      "        sizehint(d,nprops)\n",
      "        for i=1:nprops\n",
      "            # For each property, there is an index into our strings\n",
      "            name_idx = read(f,Int32)\n",
      "            # A flag describing how the heap_idx is to be interpreted\n",
      "            flag = read(f,Int32)\n",
      "            # And a value; often an index into some data structure\n",
      "            heap_idx = read(f,Int32)\n",
      "            \n",
      "            if flag == 0\n",
      "                # This means that the property is stored in the names array\n",
      "                d[names[name_idx]] = names[heap_idx]\n",
      "            elseif flag == 1\n",
      "                # The property is stored in the MCOS FileWrapper__ heap\n",
      "                d[names[name_idx]] = heap[heap_idx+3] # But... the index is off by 3!? Crazy.\n",
      "            elseif flag == 2\n",
      "                # The property is a boolean, and the heap_idx itself is the value\n",
      "                @assert 0 <= heap_idx <= 1 \"boolean flag has a value other than 0 or 1\"\n",
      "                d[names[name_idx]] = bool(heap_idx)\n",
      "            else\n",
      "                error(\"unknown flag \",flag, \" for property \",names[name_idx], \" with heap index \",heap_idx)\n",
      "            end\n",
      "        end\n",
      "        push!(props,d)\n",
      "        \n",
      "        # Jump to the next 8-byte aligned offset\n",
      "        if position(f) % 8 != 0\n",
      "            seek(f,iceil(position(f)/8)*8)\n",
      "        end\n",
      "    end\n",
      "    props\n",
      "end\n",
      "\n",
      "seek(fdata,segments[2])\n",
      "seg2_props = parse_properties(fdata,strs,mcos,segments[3])\n",
      "summarize(seg2_props)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0-element Array{Dict{ASCIIString,Any},1}: "
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Segment 3: Object information"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This section has one element per object. Theres an index into the class structure, followed by a few unknown fields. Then there are two fields that describe where the property information is stored -- either in segment 2 or segment 4. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "function parse_object_info(f, section_end)\n",
      "    # The first six int32s unknown. Always 0? Or perhaps reserved space for an extra elt?\n",
      "    all(read(f,Int32,6) .== 0) || error(\"unknown header to object information\")\n",
      "    \n",
      "    object_info = Array((Int,Int,Int,Int),0)\n",
      "    while position(f) < section_end\n",
      "        class_idx = read(f,Int32)\n",
      "        unknown1 = read(f,Int32)\n",
      "        unknown2 = read(f,Int32)\n",
      "        segment1_idx = read(f,Int32) # The index into segment 2\n",
      "        segment2_idx = read(f,Int32) # The index into segment 4\n",
      "        obj_id = read(f,Int32)\n",
      "        \n",
      "        @assert unknown1 == unknown2 == 0 \"discovered a nonzero object property\"\n",
      "        push!(object_info,(class_idx,segment1_idx,segment2_idx,obj_id))\n",
      "    end\n",
      "    object_info\n",
      "end\n",
      "\n",
      "seek(fdata,segments[3])\n",
      "obj_info = parse_object_info(fdata,segments[4])\n",
      "# Let's map the class_idx to the classname so it's a bit more readable\n",
      "summarize(map(x -> (classes[x[1]][2],x[2],x[3],x[4]), obj_info))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3-element Array{(ASCIIString,Int64,Int64,Int64),1}: \n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  [1] (\"SimpleClass\",0,1,1)\n",
        "  [2] (\"SimpleClass\",0,2,2)\n",
        "  [3] (\"AnotherClass\",0,3,3)"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Segment 4: More properties!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Just like segment 2, except these properties contain things that aren't class objects. Strange that these two segments aren't adjacent..."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "seek(fdata,segments[4])\n",
      "seg4_props = parse_properties(fdata,strs,mcos,segments[5])\n",
      "summarize(seg4_props)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3-element Array{Dict{ASCIIString,Any},1}: \n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  [1] Dict{ASCIIString,Any}: \n",
        "       \"cell_field_3\"=>1x4 Array{Any,2}: \n",
        "           [1] Float64: 1.0\n",
        "           [2] ASCIIString: \"one\"\n",
        "           [3] Float64: 2.0\n",
        "           [4] ASCIIString: \"two\"\n",
        "  [2] Dict{ASCIIString,Any}: \n",
        "       \"cell_field_3\"=>1x4 Array{Any,2}: \n",
        "           [1] Float64: 3.0\n",
        "           [2] ASCIIString: \"three\"\n",
        "           [3] Float64: 4.0\n",
        "           [4] ASCIIString: \"four\"\n",
        "       \"char_field_1\"=>ASCIIString: \"another_char\"\n",
        "  [3] Dict{ASCIIString,Any}: {}"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Segment 5: Empty?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "I've never seen this populated, so I have no idea what is going on here."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "function parse_segment5(f, segment_end)\n",
      "    seg5 = read(f,Uint8,segment_end-position(f))\n",
      "    if any(seg5 .!= 0)\n",
      "        xxd(seg5)\n",
      "    end\n",
      "    \n",
      "    @assert segment_end == position(f) && eof(f) \"there's more data to be had!\"\n",
      "end\n",
      "\n",
      "seek(fdata,segments[5])\n",
      "parse_segment5(fdata, segments[6])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Putting it all together"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We're still missing quite a bit here: the object properties from segment 4 are still incomplete! And they're incomplete in strange ways. We've got three objects of two different classes: `SimpleClass` should have properties `char_field_1`, `array_field_2` and `cell_field_3`, and `AnotherClass` just has one property, `a`.  But the two `SimpleClass` objects have different fields populated! The `array_field_2` property is totally missing from both, and the `AnotherClass` object is totally empty! \n",
      "\n",
      "There's something we haven't used yet in the `FileWrapper__` array: the last element."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "println(\"The last element of FileWrapper__'s array:\")\n",
      "print(\"  \")\n",
      "summarize(mcos[end],\"  \")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "The last element of FileWrapper__'s array:\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  3x1 Array{Any,2}: \n",
        "    [1] Dict{ASCIIString,Any}: {}\n",
        "    [2] Dict{ASCIIString,Any}: \n",
        "         \"array_field_2\"=>1x6 Array{Float64,2}: [0.0,1.0,2.0,3.0,4.0,5.0]\n",
        "         \"char_field_1\"=>ASCIIString: \"char_field\"\n",
        "    [3] Dict{ASCIIString,Any}: \n",
        "         \"a\"=>Float64: 1.0"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Those look like shared/default properties for each class! Ordered in class order.  But we're off by one? Again? Oy (I have a hunch that Matlab's implementation was coded in C by someone so very accustomed to 1-indexed arrays that they pretend index 0 doesn't exist\u2026).  It's also important to note that these shared properties are not related to class property defaults.  Let's merge these default values with the properties we got from segments 2 and 4 above:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "objs = Array(Dict{ASCIIString,Any},length(obj_info))\n",
      "for (i,info) in enumerate(obj_info)\n",
      "    # Get the property from either segment 2 or segment 4\n",
      "    props = info[2] > 0 ? seg2_props[info[2]] : seg4_props[info[3]]\n",
      "    # And merge it with the matfile defaults for this class\n",
      "    objs[i] = merge(mcos[end][info[1]+1],props)\n",
      "end\n",
      "summarize(objs)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3-element Array{Dict{ASCIIString,Any},1}: \n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  [1] Dict{ASCIIString,Any}: \n",
        "       \"cell_field_3\"=>1x4 Array{Any,2}: \n",
        "           [1] Float64: 1.0\n",
        "           [2] ASCIIString: \"one\"\n",
        "           [3] Float64: 2.0\n",
        "           [4] ASCIIString: \"two\"\n",
        "       \"array_field_2\"=>1x6 Array{Float64,2}: [0.0,1.0,2.0,3.0,4.0,5.0]\n",
        "       \"char_field_1\"=>ASCIIString: \"char_field\"\n",
        "  [2] Dict{ASCIIString,Any}: \n",
        "       \"cell_field_3\"=>1x4 Array{Any,2}: \n",
        "           [1] Float64: 3.0\n",
        "           [2] ASCIIString: \"three\"\n",
        "           [3] Float64: 4.0\n",
        "           [4] ASCIIString: \"four\"\n",
        "       \"array_field_2\"=>1x6 Array{Float64,2}: [0.0,1.0,2.0,3.0,4.0,5.0]\n",
        "       \"char_field_1\"=>ASCIIString: \"another_char\"\n",
        "  [3] Dict{ASCIIString,Any}: \n",
        "       \"a\"=>Float64: 1.0"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "We did it! (Well... for this file)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here's what Matlab says these objects should be:\n",
      "\n",
      "    >> load('simple.mat')\n",
      "    >> display(obj); display(obj2); display(obj3)\n",
      "    obj = \n",
      "      SimpleClass with properties:\n",
      "    \n",
      "         char_field_1: 'char_field'\n",
      "        array_field_2: [0 1 2 3 4 5]\n",
      "         cell_field_3: {[1]  'one'  [2]  'two'}\n",
      "         \n",
      "    obj2 = \n",
      "      SimpleClass with properties:\n",
      "    \n",
      "         char_field_1: 'another_char'\n",
      "        array_field_2: [0 1 2 3 4 5]\n",
      "         cell_field_3: {[3]  'three'  [4]  'four'}\n",
      "         \n",
      "    obj3 = \n",
      "      AnotherClass with properties:\n",
      "    \n",
      "        a: 1\n",
      "\n",
      "\n",
      "What a mess, though. I guess it's not terribly surprising that this is undocumented."
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Appendix"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# More complicated files\n",
      "f = matopen(\"fiobj.mat\")\n",
      "mcos = f.subsystem[\"_i1\"][\"MCOS\"][2]\n",
      "data = vec(mcos[1])\n",
      "fdata = IOBuffer(data)\n",
      "\n",
      "seek(fdata,0)\n",
      "segments, strs = parse_header(fdata)\n",
      "\n",
      "seek(fdata,segments[1])\n",
      "classes = parse_class_info(fdata,strs, segments[2])\n",
      "\n",
      "seek(fdata,segments[2])\n",
      "seg2_props = parse_properties(fdata,strs,mcos,segments[3])\n",
      "\n",
      "seek(fdata,segments[3])\n",
      "obj_info = parse_object_info(fdata,segments[4])\n",
      "\n",
      "seek(fdata,segments[4])\n",
      "seg4_props = parse_properties(fdata,strs,mcos,segments[5])\n",
      "\n",
      "seek(fdata,segments[5])\n",
      "parse_segment5(fdata, segments[6])\n",
      "\n",
      "objs = Array(Dict{ASCIIString,Any},length(obj_info))\n",
      "for (i,info) in enumerate(obj_info)\n",
      "    # Get the property from either segment 2 or segment 4\n",
      "    props = info[2] > 0 ? seg2_props[info[2]] : seg4_props[info[3]]\n",
      "    # And merge it with the matfile defaults for this class\n",
      "    objs[i] = merge(mcos[end][info[1]+1],props)\n",
      "end\n",
      "summarize(objs)\n",
      "println()\n",
      "summarize(mcos)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1-element Array{Dict{ASCIIString,Any},1}: \n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  [1] Dict{ASCIIString,Any}: \n",
        "       \"Scaling\"=>ASCIIString: \"BinaryPoint\"\n",
        "       \"RoundingMethod\"=>ASCIIString: \"Nearest\"\n",
        "       \"SumBias\"=>Float64: 0.0\n",
        "       \"Bias\"=>Float64: 0.0\n",
        "       \"ProductFixedExponent\"=>Float64: -30.0\n",
        "       \"SumWordLength\"=>Float64: 32.0\n",
        "       \"CastBeforeSum\"=>Bool: true\n",
        "       \"ProductFractionLength\"=>Float64: 30.0\n",
        "       \"Signed\"=>Bool: true\n",
        "       \"MaxProductWordLength\"=>Float64: 65535.0\n",
        "       \"SumMode\"=>ASCIIString: \"FullPrecision\"\n",
        "       \"nunderflows\"=>Float64: 0.0\n",
        "       \"minlog\"=>Float64: 1.7976931348623157e308\n",
        "       \"DataType\"=>ASCIIString: \"Fixed\"\n",
        "       \"ProductBias\"=>Float64: 0.0\n",
        "       \"Logging\"=>Bool: false\n",
        "       \"maxlog\"=>Float64: -1.7976931348623157e308\n",
        "       \"ProductMode\"=>ASCIIString: \"FullPrecision\"\n",
        "       \"SumSlopeAdjustmentFactor\"=>Float64: 1.0\n",
        "       \"fimathislocal\"=>Bool: false\n",
        "       \"SumSlope\"=>Float64: 9.313225746154785e-10\n",
        "       \"intarray\"=>1x51 Array{Int16,2}: "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[0,328,655,983,1311,1638,1966,2294,2621,2949,\u2026]\n",
        "       \"DataTypeOverride\"=>ASCIIString: \"Inherit\"\n",
        "       \"SumFractionLength\"=>Float64: 30.0\n",
        "       \"ProductWordLength\"=>Float64: 32.0\n",
        "       \"ProductSlope\"=>Float64: 9.313225746154785e-10\n",
        "       \"ProductSlopeAdjustmentFactor\"=>Float64: 1.0\n",
        "       \"SlopeAdjustmentFactor\"=>Float64: 1.0\n",
        "       \"FixedExponent\"=>Float64: -15.0\n",
        "       \"OverflowAction\"=>ASCIIString: \"Saturate\"\n",
        "       \"MaxSumWordLength\"=>Float64: 65535.0\n",
        "       \"noverflows\"=>Float64: 0.0\n",
        "       \"SumFixedExponent\"=>Float64: -30.0\n",
        "       \"WordLength\"=>Float64: 16.0\n",
        "28x1 Array{Any,2}: \n",
        "  [1] 1104x1 Array{Uint8,2}: [0x02,0x00,0x00,0x00,0x2a,0x00,0x00,0x00,0x48,0x02,\u2026]\n",
        "  [2] 0x0 Array{None,2}: []\n",
        "  [3] Bool: true\n",
        "  [4] Float64: 16.0\n",
        "  [5] Float64: -15.0\n",
        "  [6] Float64: 1.0\n",
        "  [7] Float64: 0.0\n",
        "  [8] 1x51 Array{Int16,2}: [0,328,655,983,1311,1638,1966,2294,2621,2949,\u2026]\n",
        "  [9] Float64: 0.0\n",
        "  [10] Float64: 0.0\n",
        "  [11] Float64: -1.7976931348623157e308\n",
        "  [12] Float64: 1.7976931348623157e308\n",
        "  [13] Float64: 32.0\n",
        "  [14] Float64: 32.0\n",
        "  [15] Float64: 65535.0\n",
        "  [16] Float64: 65535.0\n",
        "  [17] Float64: 30.0\n",
        "  [18] Float64: -30.0\n",
        "  [19] Float64: 9.313225746154785e-10\n",
        "  [20] Float64: 1.0\n",
        "  [21] Float64: 0.0\n",
        "  [22] Float64: 30.0\n",
        "  [23] Float64: -30.0\n",
        "  [24] Float64: 9.313225746154785e-10\n",
        "  [25] Float64: 1.0\n",
        "  [26] Float64: 0.0\n",
        "  [27] Bool: true\n",
        "  [28] 2x1 Array{Any,2}: \n",
        "       [1] Dict{ASCIIString,Any}: {}\n",
        "       [2] Dict{ASCIIString,Any}: {}"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Simple utitilies for viewing of hex and big nested data structures\n",
      "cleanascii!{N}(A::Array{Uint8,N}) = (A[(A .< 0x20) | (A .> 0x7e)] = uint8('.'); A)\n",
      "function xxd(x, start=1, stop=length(x))\n",
      "    for i=div(start-1,8)*8+1:8:stop\n",
      "        row = i:i+7\n",
      "        # hexadecimal\n",
      "        @printf(\"%04x: \",i-1)\n",
      "        for r=row\n",
      "            start <= r <= stop ? @printf(\"%02x\",x[r]) : print(\"  \")\n",
      "            r % 4 == 0 && print(\" \")\n",
      "        end\n",
      "        # ASCII\n",
      "        print(\"   \",ascii(cleanascii!(x[i:min(i+7,end)])),\" \")\n",
      "        # Int32\n",
      "        for j=i:4:i+7\n",
      "            start <= j && j+3 <= stop ? @printf(\"% 12d \",reinterpret(Int32,x[j:j+3])[1]) : print(\" \"^12)\n",
      "        end\n",
      "        # Float64:\n",
      "        # start <= i && i+7 <= stop ? @printf(\"%.3e\",reinterpret(Float64,x[row])[1]) : nothing\n",
      "        println()\n",
      "    end\n",
      "end\n",
      "# Summarize - smartly display large nested data structures for some datatypes\n",
      "summarize(x::Any,prefix=\"\") = print(string(summary(x)))\n",
      "summarize(x::String,prefix=\"\") = print(string(summary(x),\": \\\"\", x, \"\\\"\"))\n",
      "summarize(x::Real,prefix=\"\") = print(string(summary(x),\": \", x))\n",
      "function summarize(x::Tuple,prefix=\"\")\n",
      "    print(\"(\")\n",
      "    i = start(x);\n",
      "    while !done(x,i)\n",
      "        t,i = next(x,i)\n",
      "        if isa(t,String)\n",
      "            print(\"\\\"\",t,\"\\\"\")\n",
      "        elseif isa(t,Real)\n",
      "            print(t)\n",
      "        else\n",
      "            summarize(t,string(prefix,\"  \"))\n",
      "        end\n",
      "        !done(x,i) && print(\",\")\n",
      "    end\n",
      "    print(\")\")\n",
      "end\n",
      "function summarize(x::Dict,prefix=\"\")\n",
      "    print(string(summary(x),\": \",(isempty(x) ? \"{}\" : \"\")))\n",
      "    i = start(x)\n",
      "    while !done(x,i)\n",
      "        (v,i) = next(x,i)\n",
      "        if typeof(v[1])<:String\n",
      "            println()\n",
      "            print(prefix,\"  \\\"\",v[1],\"\\\"=>\")\n",
      "            summarize(v[2],string(prefix,\"    \"))\n",
      "        else\n",
      "            println()\n",
      "            print(prefix,\"  \",summarize(v[1]),\"=>\")\n",
      "            summarize(v[2],string(prefix,\"    \"))\n",
      "        end\n",
      "    end\n",
      "end\n",
      "function summarize{T,N}(x::AbstractArray{T,N},prefix=\"\")\n",
      "    print(string(summary(x),\": \"))\n",
      "    if T<:Real\n",
      "        truncate = length(x) > 10\n",
      "        maxelt = truncate ? 10 : length(x)\n",
      "        # This is very wrong, but it works for the purposes above...\n",
      "        Base.show_comma_array(STDOUT,x[1:min(length(x),maxelt)],\"[\",(truncate ? \",\u2026]\" : \"]\"))\n",
      "    else\n",
      "        i = start(x)\n",
      "        while !done(x,i)\n",
      "            (v,i) = next(x,i)\n",
      "            println()\n",
      "            print(prefix,\"  [$(i-1)] \")\n",
      "            summarize(v,string(prefix,\"     \"))\n",
      "        end\n",
      "    end\n",
      "end;"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "    Copyright (C) 2014 Matt Bauman, [first initial][last name] (at) [gmail]\n",
      "    \n",
      "    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n",
      "    \n",
      "    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n",
      "    \n",
      "    THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}