csabahenk/py-jsonfun.md

## py-jsonfun.md

      
# Some fun with Python and JSON

This document will help you introspect complex Python data.

## Requirements

We need some JSON goodness to start with.

### Required

- JSONPickle: a Python module that gives you a complete dump of Python objects and structures as JSON

### Optional

This writeup relies on the following tools because I fancy them, but you can replace them with alternatives (should there be any you fancy more).

My choice of JSON parser is YAJL-Ruby.

- The primary reason to love it is its ability to parse JSON streams, i.e. a concatenated series of JSON objects (e.g. `{}[]`). Most JSON parsers barf on this because of "trailing characters" and can't take subsequent action on `{}` and `[]`.
- The other reason to love it is its very lightweight and intuitive API. Much more likable than that of the similar Python binding!

Of course, in turn, you need Yajl and Ruby installed for this.
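
To see what parsing a JSON stream involves, here is a minimal Python 3 sketch built on the standard library's `json.JSONDecoder.raw_decode`, which returns a decoded value together with the position where it ended. The helper name `iter_json_stream` is made up for illustration; this is not how YAJL-Ruby works internally, and it assumes the whole input fits in memory as a string.

```python
import json

def iter_json_stream(text):
    """Yield each top-level JSON value from a concatenated stream."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Decode one value and remember where it stopped.
        value, pos = decoder.raw_decode(text, pos)
        yield value

print(list(iter_json_stream('{}[]')))  # prints [{}, []]
```

Calling `json.loads('{}[]')` on the same input raises an error about extra data, which is the "trailing characters" complaint mentioned above.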

My choice of JSON formatter is underscore-cli.

- The primary reason to love it is that it formats the data with awareness of width. Most JSON formatters would format `{"a":1,"b":2}` either as

  ```json
  {"a": 1, "b": 2}
  ```

  or

  ```json
  {
      "a": 1,
      "b": 2
  }
  ```

  depending on whether indentation was asked for or not. _underscore-cli_ decides about formatting depending on data and screen size: it applies one-line formatting to small data and indented formatting to larger data, which keeps the output readable.
- Other reasons to love it:
    - it features a functional API for manipulating and extracting information from JSON
    - it features colorized output

It's written for Node.js, so you'll need to have that installed.
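
The idea of width-aware formatting can be sketched in a few lines of Python 3. This is only an illustration of the principle, not underscore-cli's actual algorithm: the function name `format_json` and its parameters are hypothetical, the width is passed in rather than read from the terminal, and the fit check counts indentation but ignores the length of a preceding key.

```python
import json

def format_json(value, width=80, indent=2, level=0):
    """Return value as JSON, on one line if it fits in width, else indented."""
    flat = json.dumps(value)
    pad = " " * (indent * level)
    is_container = isinstance(value, (dict, list)) and len(value) > 0
    # Scalars and empty containers cannot be broken up; short values stay flat.
    if not is_container or len(pad) + len(flat) <= width:
        return flat
    inner = " " * (indent * (level + 1))
    if isinstance(value, dict):
        parts = [json.dumps(key) + ": " + format_json(item, width, indent, level + 1)
                 for key, item in value.items()]
        opener, closer = "{", "}"
    else:
        parts = [format_json(item, width, indent, level + 1) for item in value]
        opener, closer = "[", "]"
    # One child per line, each indented one level deeper than the parent.
    body = ",\n".join(inner + part for part in parts)
    return opener + "\n" + body + "\n" + pad + closer

small = {"a": 1, "b": 2}
print(format_json(small))            # fits in 80 columns, stays on one line
print(format_json(small, width=10))  # too wide for 10 columns, gets indented
```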

## Set up JSON dumping

Add the following code to your `sitecustomize.py` (or `usercustomize.py`):
```python
# Python 2 code: it relies on dict.iteritems() and the print statement.
import os
import time
from gzip import GzipFile
from threading import Lock, current_thread
from string import Template
import distutils.dir_util as du
import jsonpickle

class Jlog(object):

    def __init__(self, ftmp):
        self.pathtemp = Template(ftmp)  # path template, e.g. "/tmp/jlog/${pid}.json.gz"
        self.lock = Lock()              # serializes writes across threads
        self.jreg = {}                  # maps log path -> open GzipFile

    # template parameters; callables are evaluated each time a path is built
    params = {'pid': os.getpid}

    def getpath(self):
        pd = {}
        for k,v in self.params.iteritems():
            if callable(v):
                v = v()
            pd[k] = v
        return self.pathtemp.substitute(pd)

    @property
    def jhandle(self):
        # open the log file for the current path on first use
        path = self.getpath()
        if not self.jreg.get(path):
            du.mkpath(os.path.dirname(path))
            self.jreg[path] = GzipFile(path, 'wb')
        return self.jreg[path]

    def jlog(self, *a, **kw):
        # positional arguments are stored under the keys data00, data01, ...
        for i in range(len(a)):
            kw["data%02d" % i] = a[i]
        d = {'data': kw, 'time': time.time(), 'pid': os.getpid(), 'thread': current_thread().getName()}
        with self.lock:
            self.jhandle.write(jsonpickle.encode(d))
            self.jhandle.flush()
        # print a short summary to stdout: each value is replaced by the
        # length of its JSON encoding
        ld = {}
        for k,v in d['data'].iteritems():
            ld[k] = [len(jsonpickle.encode(v))]
        d['data'] = ld
        print 'JLOG ' + jsonpickle.encode(d)

jlog = Jlog("/tmp/jlog/${pid}.json.gz").jlog
```
This sets up a canonical `Jlog` instance for any Python script you execute and exposes its `jlog` method as a module-level function. Then in your Python code you can just add

```python
from sitecustomize import jlog
...
jlog(<args>, <keywords>)
```

For example:
```python
from threading import Thread
from sitecustomize import jlog

class Rectangle(object):
    def __init__(self,h,w):
        self.height = h
        self.width = w

def sayhirect(h,w):
    jlog("hello", "shape", shape=Rectangle(h,w))

t = Thread(target=sayhirect, args=(3,4))
t.start()
sayhirect(5,6)
```
This prints a short summary to stdout as a hint of what's going on, something like:

```
JLOG {"pid": 19869, "data": {"shape": [60], "data00": [7], "data01": [7]}, "thread": "MainThread", "time": 1378916569.103762}
JLOG {"pid": 19869, "data": {"shape": [60], "data00": [7], "data01": [7]}, "thread": "Thread-1", "time": 1378916569.100782}
```
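
The bracketed numbers in the summary are the lengths of each value's JSON encoding (the `[len(jsonpickle.encode(v))]` step in `jlog`). A minimal Python 3 sketch of that step follows; it assumes the standard library's `json.dumps` in place of `jsonpickle.encode`, so it only handles plain JSON types, and the name `summarize` is made up for illustration.

```python
import json

def summarize(data):
    """Replace each value by a one-element list holding its encoded length."""
    return {key: [len(json.dumps(value))] for key, value in data.items()}

# '"hello"' and '"shape"' are 7 characters each once the quotes are counted.
print(summarize({"data00": "hello", "data01": "shape"}))
# prints {'data00': [7], 'data01': [7]}
```

The same arithmetic explains the `[60]` for `shape`: that is the length of the encoded `Rectangle` as it appears in the log file.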

The actual log is written to `/tmp/jlog/<pid>.json.gz`, so in this particular example, to `/tmp/jlog/19869.json.gz`, as a gzipped JSON stream. The JSON stream looks like:

```json
{"pid": 19869, "data": {"shape": {"py/object": "__main__.Rectangle", "width": 4, "height": 3}, "data00": "hello", "data01": "shape"}, "thread": "Thread-1", "time": 1378916569.100782}{"pid": 19869, "data": {"shape": {"py/object": "__main__.Rectangle", "width": 6, "height": 5}, "data00": "hello", "data01": "shape"}, "thread": "MainThread", "time": 1378916569.103762}
```

We can get it by running zcat(1) on the file, but that's not the best way to view it.

## Read the JSON dump like a pro

So basically we want to feed the JSON dump to underscore-cli to get a royal view; alas, it can't handle JSON streams. A little snippet of Ruby with Yajl to the rescue!
```ruby
#!/usr/bin/env ruby

require 'yajl'

sel = $*.map { |i|
  case i
  when /\A(-?\d+)\.\.(\.?)(-?\d+)\Z/
    Range.new *([$1, $3].map { |j| Integer j } << ($2 == "."))
  else
    Integer i
  end
}

w = []
Yajl::Parser.new.parse(STDIN) { |o| w << o }
w = w.values_at *sel unless sel.empty?

Yajl::Encoder.encode w, STDOUT
```
Save it as `jwrap.rb`, place it on your `$PATH` and make it executable (or not, but the following examples will assume that). `jwrap.rb` wraps the elements of the JSON stream into a single JSON array and outputs that to stdout. Besides:

- If you pass integer arguments, it selects only the objects at the given indices; negative indices are accepted. Thus `jwrap.rb 0 1` selects the first two objects, while `jwrap.rb -1` selects the last object.
- You can also pass command line arguments in Ruby range syntax, where `i..j` represents an inclusive range (e.g. `3..6` consists of 3, 4, 5, 6), while `i...j` represents an exclusive range (e.g. `3...6` consists of 3, 4, 5); negative range bounds are accepted. Thus `jwrap.rb -10..-1` selects the last ten objects.

Beware that this code is optimized for simplicity, not efficiency: if you happen to jlog gigabytes of data, you'll have to come up with a smarter version. It will do for everyday introspection.
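
For readers more at home in Python, the selection rules above can be sketched as follows (Python 3). The names `parse_selector` and `select` are hypothetical, and the sketch only mirrors the index and range semantics described here; its handling of out-of-range values (an `IndexError` for single indices, clamping for ranges) may differ from what Ruby's `values_at` does.

```python
import re

# Same shape as the Ruby pattern: "i..j" is inclusive, "i...j" is exclusive.
RANGE_RE = re.compile(r"\A(-?\d+)\.\.(\.?)(-?\d+)\Z")

def parse_selector(arg):
    """Turn a command line argument into an int or a (first, last, exclusive) tuple."""
    match = RANGE_RE.match(arg)
    if match:
        return (int(match.group(1)), int(match.group(3)), match.group(2) == ".")
    return int(arg)

def select(items, selectors):
    """Pick items by index or range; negative values count from the end."""
    n = len(items)
    picked = []
    for sel in selectors:
        if isinstance(sel, tuple):
            first, last, exclusive = sel
            start = first + n if first < 0 else first
            stop = last + n if last < 0 else last
            if not exclusive:
                stop += 1  # an inclusive range also takes the last element
            picked.extend(items[max(start, 0):max(stop, 0)])
        else:
            picked.append(items[sel])
    return picked

items = list(range(12))
print(select(items, [parse_selector("3..6")]))   # prints [3, 4, 5, 6]
print(select(items, [parse_selector("3...6")]))  # prints [3, 4, 5]
print(select(items, [parse_selector("-1")]))     # prints [11]
```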

Back to our example above, we can now do the following:

```sh
$ zcat /tmp/jlog/19869.json.gz | jwrap.rb | underscore print
```

which gives:

```json
[
  {
    "pid": 19869,
    "data": {
      "shape": { "py/object": "__main__.Rectangle", "width": 4, "height": 3 },
      "data00": "hello",
      "data01": "shape"
    },
    "thread": "Thread-1",
    "time": 1378916569.100782
  },
  {
    "pid": 19869,
    "data": {
      "shape": { "py/object": "__main__.Rectangle", "width": 6, "height": 5 },
      "data00": "hello",
      "data01": "shape"
    },
    "thread": "MainThread",
    "time": 1378916569.103762
  }
]
```

Or, if we are interested only in the last dump entry:

```sh
$ zcat /tmp/jlog/19869.json.gz | jwrap.rb -1 | underscore print
```

which gives:

```json
[
  {
    "pid": 19869,
    "data": {
      "shape": { "py/object": "__main__.Rectangle", "width": 6, "height": 5 },
      "data00": "hello",
      "data01": "shape"
    },
    "thread": "MainThread",
    "time": 1378916569.103762
  }
]
```