emaadmanzoor/00-CDM_StreamSpot_Translation.md

## 00-CDM_StreamSpot_Translation.md

      
    Raw
  

              00-CDM_StreamSpot_Translation.md
            
          
    Contents


README

StreamSpot format
CDM format

CDM events
CDM subjects, objects and principals
CDM simple edges


Combining CDM records into a StreamSpot edge
Assigning a graph ID to each StreamSpot edge


CDM to StreamSpot translation script
Visualise StreamSpot graph script
Avro to JSON translation script

Contact


emanzoor@cs.stonybrook.edu


## 01-README.md

      
    Raw
  

              01-README.md
            
          
    Converting from CDM to StreamSpot Format

StreamSpot Format

StreamSpot takes input as a TSV file containing on each line the following fields:
source_id source_type destination_id destination_type edge_type graph_id

The fields are data-typed as follows:

source_id and destination_id: Integer greater than 1 (0 stands for NA, like anonymous memory maps).
source_type, destination_type, edge_type: Single byte (we've been using ASCII characters so far).
graph_id: Integer greater than 0.

An example record in StreamSpot's format is: 4	a	5	c	p	0.
The data it currently uses is available here.
CDM Format

We are using the CDM 12 schema.
CDM Events

CDM-formatted JSON data contains events in the following format:
{"datum":{"com.bbn.tc.schema.avro.Event":{"uuid":"=ÖmÖä<8f><99>\u0015øNe¢÷<84>\f®ïC<94>é®<85>\u0001j\u0000 Ç¹s\u001AÓï","sequence":65362,"type":"EVENT_OPEN","threadId":28005,"source":"SOURCE_LINUX_AUDIT_TRACE","timestampMicros":{"long":1450217860803},"name":null,"parameters":null,"location":null,"size":null,"programPoint":null,"properties":{"map":{"eventId":"65362"}}}},"CDMVersion":"12"}

Unlike edges, CDM events are not self-contained: the only thing available is the event type (EVENT_OPEN in the example above). The rest of the metadata is contained in other non-event records occuring both previously and after the event record in the CDM data. Let's look at these records now.
CDM Subjects, Objects and Principals

Subjects correspond to entities like processes, threads and units.
{"datum":{"com.bbn.tc.schema.avro.Subject":{"uuid":"~#<92>Å7<9f>ÜUèÓ\ryEQ¬¾\"Ëõ]<96>xb3#}UcÏåÿ<90>","type":"SUBJECT_PROCESS","pid":28005,"ppid":27999,"source":"SOURCE_LINUX_AUDIT_TRACE","startTimestampMicros":null,"unitId":{"int":0},"endTimestampMicros":null,"cmdLine":null,"importedLibraries":null,"exportedLibraries":null,"pInfo":null,"properties":{"map":{"uid":"0","programName":"sudo","group":"1003"}}}},"CDMVersion":"12"}

Objects correspond to entities like files, sockets or memory addresses.
{"datum":{"com.bbn.tc.schema.avro.FileObject":{"uuid":"Ý<9c>¨©<96>°<98>\r\u0012p <87>Í¦¤õ\u0005L<91>D<8c>Â<96><97>\\;\u0011µbW<82>S","baseObject":{"source":"SOURCE_LINUX_AUDIT_TRACE","permission":null,"lastTimestampMicros":null,"properties":null},"url":"file:///etc/login.defs","isPipe":false,"version":0,"size":null}},"CDMVersion":"12"}

Principals correspond to entities like local and remote users, which StreamSpot doesn't use as of now.
{"datum":{"com.bbn.tc.schema.avro.Principal":{"uuid":"\u001DÕ\u0019.ÖÑ8,¸ïB:»lò\u0011<96>?3Æ\rÌYï\u001C\u0013É7²Õðç","type":"PRINCIPAL_LOCAL","userId":0,"groupIds":[1003],"source":"SOURCE_LINUX_AUDIT_TRACE","properties":{"map":{"euid":"0","egid":"1003"}}}},"CDMVersion":"12"}

All 3 of the above CDM records appeared before the event record in the file.
CDM Simple Edges

CDM edges are what connect events to subjects, objects and principals. The CDM edge types we care about are:

EDGE_EVENT_ISGENERATEDBY_SUBJECT: When an event is generated by a process/thread/unit.
EDGE_EVENT_AFFECTS_*: When an event writes to a file/registry/socket or forks a process.
EDGE_*_AFFECTS_EVENT: When an event reads from a file/registry/socket.

The entire CDM edge type list is here.
Continuing the example from above, the following CDM edge records following the event provide the metadata we need:

{"datum":{"com.bbn.tc.schema.avro.SimpleEdge":{"fromUuid":"=ÖmÖä<8f><99>\u0015øNe¢÷<84>\f®ïC<94>é®<85>\u0001j\u0000 Ç¹s\u001AÓï","toUuid":"Ý<9c>¨©<96>°<98>\r\u0012p <87>Í¦¤õ\u0005L<91>D<8c>Â<96><97>\\;\u0011µbW<82>S","type":"EDGE_FILE_AFFECTS_EVENT","timestamp":1450217860803,"properties":null}},"CDMVersion":"12"}

This tells us that a specific file object (identified by the UUID) was read during the event; so it provides us the destination type. By following this CDM edge to the object UUID, we can find out the destination ID (by mapping the filename to an ID).

{"datum":{"com.bbn.tc.schema.avro.SimpleEdge":{"fromUuid":"=ÖmÖä<8f><99>\u0015øNe¢÷<84>\f®ïC<94>é®<85>\u0001j\u0000 Ç¹s\u001AÓï","toUuid":"~#<92>Å7<9f>ÜUèÓ\ryEQ¬¾\"Ëõ]<96>xb3#}UcÏåÿ<90>","type":"EDGE_EVENT_ISGENERATEDBY_SUBJECT","timestamp":1450217860803,"properties":null}},"CDMVersion":"12"}

This tells us which process/thread/unit subject (identified by the UUID) generated the event. By following this edge to the subject UUID, we can find out the source type and source ID.
Combining CDM records into a StreamSpot edge

By reading all the records preceding the event record and following the event record until the next event record, we can collect all the metadata needed to (almost) construct a StreamSpot edge (for the example above):

Edge type: EVENT_OPEN, directly from the event record.
Source type: SUBJECT_PROCESS, from a CDM record following the event record.
Source ID: "pid":28005, from a CDM record preceding the event record.
Destination type: EDGE_FILE_AFFECTS_EVENT, from a CDM record following the event record.
Destination ID: "url":"file:///etc/login.defs", from a CDM record preceding the event record.
Note that the source and destination may need to be swapped here since the file is read and information flows to the process.

We still need to apply the following transformations:

Map source, destination and type names to single bytes: this can be done using the schema.
Map each destination ID (filename, memory address, URL) to an integer index.

We will maintain these maps in memory. What's left now is assigning graph ID's.
Assigning a graph ID to each StreamSpot edge

We maintain a mapping from process/unit/thread ID to graph ID. Subsequent rules assume processes. The rules of graph ID assignment are:

Two process ID's will have the same graph ID if one was forked from the other.
Two process ID's will have different graph ID's if neither was forked from the other.
Two process ID's will have different graph ID's if the parent forked then exec'd.
Graph ID's start from 0, so the first process is assigned graph ID 0.

The mechanics will work as follows:

When a new SUBJECT_PROCESS CDM record is read:

Check if its pid already has an assigned graph ID.
If a graph ID is already assigned to the pid, then continue. (this case should not occur, ideally)
If a graph ID is not assigned to the pid, then check if its ppid already has an assigned graph ID:

If a graph ID is assigned to the ppid, assign the same graph ID to the pid.
If a graph ID is not assigned to the ppid, assign the next available graph ID to the pid.


All subsequent events generated by this pid will use this assigned graph ID for the StreamSpot edge.


When a new EVENT_EXECUTE CDM record is read:

Note the threadId field of the event: this is the pid of the child process.
Assign the next available graph ID to this pid, replacing any existing assignment.
All subsequent events generated by this pid will use this assigned graph ID for the StreamSpot edge.


## 02-translate_cdm_to_streamspot.py
#!/usr/bin/env python

import argparse
import json
import pdb
import sys

# CDM record type constants
CDM_TYPE_PRINCIPAL = 'com.bbn.tc.schema.avro.Principal'
CDM_TYPE_SUBJECT = 'com.bbn.tc.schema.avro.Subject'
CDM_TYPE_FILE = 'com.bbn.tc.schema.avro.FileObject'
CDM_TYPE_MEM = 'com.bbn.tc.schema.avro.MemoryObject'
CDM_TYPE_SOCK = 'com.bbn.tc.schema.avro.NetFlowObject'
CDM_TYPE_EDGE = 'com.bbn.tc.schema.avro.SimpleEdge'
CDM_TYPE_EVENT = 'com.bbn.tc.schema.avro.Event'

parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', help='Input CDM/JSON file', required=True)
parser.add_argument('-o','--output', help='Output StreamSpot edge file', required=True)
args = vars(parser.parse_args())

input_file = args['input']
output_file = args['output']

uuid_to_pid = {}
uuid_to_pname = {}
uuid_to_url = {}
uuid_to_sockid = {}
uuid_to_addr = {}

pid_to_graph_id = {}
current_graph_id = 0

with open(input_file, 'r') as f:
    event_metadata_buffer = {} # filled/cleared on every new event
    streamspot_edge = {'event_uuid': None,
                       'source_id': None,
                       'source_name': None,
                       'source_type': None,
                       'dest_id': None,
                       'dest_name': 'NA',
                       'dest_type': None,
                       'edge_type': None,
                       'graph_id': None
                      } # filled/cleared on every new event
    lineno = 0
    for line in f:
        line = line.strip()
        lineno += 1
        cdm_record = json.loads(line)
        cdm_record_type = cdm_record['datum'].keys()[0]
        cdm_record_values = cdm_record['datum'][cdm_record_type]

        #print cdm_record

        if cdm_record_type == CDM_TYPE_PRINCIPAL:
            continue # we don't care about principals

        elif cdm_record_type == CDM_TYPE_SUBJECT:
            uuid = cdm_record_values['uuid']
            type = cdm_record_values['type']
            if type == 'SUBJECT_PROCESS':
                pid = cdm_record_values['pid']          # source ID
                ppid = cdm_record_values['ppid']        # needed to assign graph ID
                pname = cdm_record_values['properties']['map']['programName']
                unitid = cdm_record_values['unitId']['int']
                #uuid_to_pid[uuid] = str(pid) + '/' + pname + '/' + str(unitid)
                uuid_to_pid[uuid] = pid
                uuid_to_pname[uuid] = pname
                if not pid in pid_to_graph_id: # pid has no gid assigned
                    if not ppid in pid_to_graph_id: # ppid has no gid assigned
                        pid_to_graph_id[pid] = current_graph_id
                        pid_to_graph_id[ppid] = current_graph_id
                        current_graph_id += 1
                    else: # parent has a gid assigned
                        pid_to_graph_id[pid] = pid_to_graph_id[ppid]
            else:
                print "Unknown subject type:", type
                sys.exit(-1)

        elif cdm_record_type == CDM_TYPE_FILE:
            uuid = cdm_record_values['uuid']
            url = cdm_record_values['url']              # destination ID
            uuid_to_url[uuid] = url

        elif cdm_record_type == CDM_TYPE_SOCK:
            uuid = cdm_record_values['uuid']

            src = cdm_record_values['srcAddress']
            dest = cdm_record_values['destAddress']
            src_port = cdm_record_values['srcPort']
            dest_port =  cdm_record_values['destPort']
            sock_id = src + ':' + str(src_port) + ':' + dest + ':' + str(dest_port)

            uuid_to_sockid[uuid] = sock_id

        elif cdm_record_type == CDM_TYPE_MEM:
            uuid = cdm_record_values['uuid']
            addr = cdm_record_values['memoryAddress']
            uuid_to_addr[uuid] = addr

        elif cdm_record_type == CDM_TYPE_EVENT:
            # print previous streamspot edge if it is ready
            if not None in streamspot_edge.values():
                print str(streamspot_edge['source_id']) + '\t' +\
                      str(streamspot_edge['source_name']) + '\t' +\
                      str(streamspot_edge['source_type']) + '\t' +\
                      str(streamspot_edge['dest_id']) + '\t' +\
                      str(streamspot_edge['dest_name']) + '\t' +\
                      str(streamspot_edge['dest_type']) + '\t' +\
                      str(streamspot_edge['edge_type']) + '\t' +\
                      str(streamspot_edge['graph_id'])

                # clear old edge data
                streamspot_edge = {'event_uuid': None,
                                   'source_id': None,
                                   'source_name': None,
                                   'source_type': None,
                                   'dest_id': None,
                                   'dest_name': 'NA',
                                   'dest_type': None,
                                   'edge_type': None,
                                   'graph_id': None
                                  }

            uuid = cdm_record_values['uuid']
            type = cdm_record_values['type']
            streamspot_edge['edge_type'] = type         # streamspot edge type
            streamspot_edge['event_uuid'] = uuid        # to map metadata

        elif cdm_record_type == CDM_TYPE_EDGE:
            type = cdm_record_values['type']
            if type == 'EDGE_SUBJECT_HASLOCALPRINCIPAL':
                pass
            elif type == 'EDGE_FILE_AFFECTS_EVENT':
                # HACK! FIXME
                # Special case for
                #   - EVENT_UPDATE
                #   - EVENT_RENAME
                if streamspot_edge['edge_type'] == 'EVENT_UPDATE' or \
                   streamspot_edge['edge_type'] == 'EVENT_RENAME':
                    assert cdm_record_values['toUuid'] == \
                            streamspot_edge['event_uuid']
                    from_uuid = cdm_record_values['fromUuid']
                    url = uuid_to_url[from_uuid]
                    streamspot_edge['dest_id'] = url
                    streamspot_edge['dest_type'] = 'FILE'
                else:
                    assert cdm_record_values['fromUuid'] == \
                            streamspot_edge['event_uuid']
                    to_uuid = cdm_record_values['toUuid']
                    url = uuid_to_url[to_uuid]
                    streamspot_edge['dest_id'] = url
                    streamspot_edge['dest_type'] = 'FILE'
            elif type == 'EDGE_EVENT_AFFECTS_FILE':
                assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']

                to_uuid = cdm_record_values['toUuid']
                url = uuid_to_url[to_uuid]
                streamspot_edge['dest_id'] = url
                streamspot_edge['dest_type'] = 'FILE'
            elif type == 'EDGE_MEMORY_AFFECTS_EVENT':
                assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']

                to_uuid = cdm_record_values['toUuid']
                addr = uuid_to_addr[to_uuid]
                streamspot_edge['dest_id'] = addr
                streamspot_edge['dest_type'] = 'MEM'
            elif type == 'EDGE_EVENT_AFFECTS_MEMORY':
                assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']

                to_uuid = cdm_record_values['toUuid']
                addr = uuid_to_addr[to_uuid]
                streamspot_edge['dest_id'] = addr
                streamspot_edge['dest_type'] = 'MEM'
            elif type == 'EDGE_EVENT_AFFECTS_NETFLOW':
                assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']

                to_uuid = cdm_record_values['toUuid']
                sock_id = uuid_to_sockid[to_uuid]
                streamspot_edge['dest_id'] = sock_id
                streamspot_edge['dest_type'] = 'SOCK'
            elif type == 'EDGE_EVENT_AFFECTS_SUBJECT' or \
                    type == 'EDGE_EVENT_ISGENERATEDBY_SUBJECT':
                assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']

                to_uuid = cdm_record_values['toUuid']
                pid = uuid_to_pid[to_uuid]
                pname = uuid_to_pname[to_uuid]

                if type == 'EDGE_EVENT_AFFECTS_SUBJECT':
                    streamspot_edge['dest_id'] = pid
                    streamspot_edge['dest_name'] = pname
                    streamspot_edge['dest_type'] = 'PROCESS'
                elif type == 'EDGE_EVENT_ISGENERATEDBY_SUBJECT':
                    streamspot_edge['source_id'] = pid
                    streamspot_edge['source_name'] = pname
                    streamspot_edge['source_type'] = 'PROCESS'

                # graph ID assignment to streamspot edge
                if type == 'EDGE_EVENT_ISGENERATEDBY_SUBJECT':
                    streamspot_edge['graph_id'] = \
                            pid_to_graph_id[streamspot_edge['source_id']]

                    # handle graph ID change on EXECUTE
                    if streamspot_edge['edge_type'] == 'EVENT_EXECUTE':
                        pid_to_graph_id[streamspot_edge['source_id']] = \
                                current_graph_id # change graph ID of caller process
                        current_graph_id += 1

            else:
                print 'Unknown edge type:', type
                sys.exit(-1)

        else:
            print 'Unknown CDM record type', cdm_record_type
            sys.exit(-1)


    # last event in buffer
    print str(streamspot_edge['source_id']) + '\t' +\
          str(streamspot_edge['source_name']) + '\t' +\
          str(streamspot_edge['source_type']) + '\t' +\
          str(streamspot_edge['dest_id']) + '\t' +\
          str(streamspot_edge['dest_name']) + '\t' +\
          str(streamspot_edge['dest_type']) + '\t' +\
          str(streamspot_edge['edge_type']) + '\t' +\
          str(streamspot_edge['graph_id'])

## 03-visualise_streamspot_graph.py
#!/usr/bin/env python

"""
    Visualises streamspot/infoleak_small_units.ss
    python visualise_streamspot_graph.py
"""

from graph_tool.all import *
import sys

process_color = [179/256.,205/256.,227/256.,0.8]
file_color = [251/256.,180/256.,174/256.,0.8]
mem_color = [204/256.,235/256.,197/256.,0.8]
sock_color = [222/256.,203/256.,228/256.,0.8]

group_map = {'PROCESS': 0,
             'FILE': 1,
             'MEM': 2,
             'SOCK': 3}

def create_new_graph():
    g = Graph()

    gid = g.new_graph_property('int')

    vid = g.new_vertex_property('string')
    vtype = g.new_vertex_property('string')
    vlabel = g.new_vertex_property('string')
    vgroup = g.new_vertex_property('int32_t')
    vcolor = g.new_vertex_property('vector<double>')

    etype = g.new_edge_property('string')
    elabel = g.new_edge_property('string')
    ecolor = g.new_edge_property('vector<double>')

    g.gp.id = gid

    g.vp.id = vid
    g.vp.type = vtype
    g.vp.label = vlabel
    g.vp.group = vgroup
    g.vp.color = vcolor

    g.ep.type = etype
    g.ep.label = elabel
    g.ep.color = ecolor

    return g

lno = 0
graphs = {} # from gid to graph
with open('streamspot/infoleak_small_units.ss', 'r') as f:
    for line in f:
        lno += 1
        #if lno > 20:
        #    break
        line = line.strip()
        fields = line.split('\t')
        source_id = fields[0]
        source_name = fields[1]
        source_type = fields[2]
        dest_id = fields[3]
        dest_name = fields[4]
        dest_type = fields[5]
        edge_type = fields[6]
        graph_id = int(fields[7])

        if edge_type == 'EVENT_UNIT':
            continue

        #if edge_type == 'EVENT_EXECUTE':
        #    continue # do not add this edge

        if not graph_id in graphs:
            graphs[graph_id] = create_new_graph()
        g = graphs[graph_id]

        # check if source vertex exists
        matches = find_vertex(g, g.vp.id, source_id)
        if len(matches) == 0:
            u = g.add_vertex()

            g.vp.id[u] = source_id
            g.vp.type[u] = source_type
            g.vp.label[u] = source_id + '\\' + source_name
            g.vp.group[u] = group_map[source_type]

            if source_type == 'PROCESS':
                g.vp.color[u] = process_color
            elif source_type == 'FILE':
                g.vp.color[u] = file_color
            elif source_type == 'MEM':
                g.vp.color[u] = mem_color
            elif source_type == 'SOCK':
                g.vp.color[u] = sock_color
            else:
                print 'unknown type', source_type
        else:
            u = matches[0]

        # check if destination vertex exists
        matches = find_vertex(g, g.vp.id, dest_id)
        if len(matches) == 0:
            v = g.add_vertex()

            g.vp.id[v] = dest_id
            g.vp.type[v] = dest_type
            g.vp.label[v] = dest_id + '\\' + dest_name
            g.vp.group[v] = group_map[dest_type]

            if dest_type == 'PROCESS':
                g.vp.color[v] = process_color
            elif dest_type == 'FILE':
                g.vp.color[v] = file_color
                g.vp.label[v] = dest_id.split('/')[-1]
            elif dest_type == 'MEM':
                g.vp.color[v] = mem_color
            elif dest_type == 'SOCK':
                g.vp.color[v] = sock_color
            else:
                print 'unknown type', source_type
        else:
            v = matches[0]

        e = g.add_edge(u,v)
        g.ep.type[e] = edge_type
        g.ep.label[e] = str(lno) + ':' + edge_type.split('_')[1]

        if edge_type in ['EVENT_EXECUTE', 'EVENT_CLONE']:
            g.ep.color[e] = process_color
        elif edge_type in ['EVENT_OPEN', 'EVENT_READ', 'EVENT_WRITE',
                           'EVENT_UPDATE']:
            g.ep.color[e] = file_color
        elif edge_type in ['EVENT_BIND', 'EVENT_ACCEPT', 'EVENT_CONNECT']:
            g.ep.color[e] = sock_color
        else:
            g.ep.color[e] = [0.179, 0.203,0.210, 0.8]

for graph_id, g in graphs.iteritems():
    pos = sfdp_layout(g, groups=g.vp.group)
    graph_draw(g, pos, output_size=(2000, 2000),
               vertex_text=g.vp.label, edge_text=g.ep.label,
               vertex_shape='circle',
               #vertex_size=5.0,
               vertex_color=g.vp.color,
               vertex_fill_color=[1.0,1.0,1.0,0.8],
               vertex_pen_width=5.0,
               edge_pen_width=5.0,
               edge_font_size=16,
               edge_font_weight=1.0,
               #nodesfirst=True,
               output='infoleaks_small_units_gid' + str(graph_id) + '.pdf')

## 04-convert_avro_to_json.sh
#!/usr/bin/env bash

# Requires avro-tools-1.8.1.jar in the working directory (and a working JRE)
# Requires Avro files in subdirectory cdm/ in the working directory
# Writes JSON files to subdirectory json/ in the working directory

mkdir json;
for file in $(ls cdm);
do
  java -jar avro-tools-1.8.1.jar tojson cdm/$file > json/${file};
  mv "json/${file}" "json/${file/%.avro/.json}"
done
	#!/usr/bin/env python

	import argparse
	import json
	import pdb
	import sys

	# CDM record type constants
	CDM_TYPE_PRINCIPAL = 'com.bbn.tc.schema.avro.Principal'
	CDM_TYPE_SUBJECT = 'com.bbn.tc.schema.avro.Subject'
	CDM_TYPE_FILE = 'com.bbn.tc.schema.avro.FileObject'
	CDM_TYPE_MEM = 'com.bbn.tc.schema.avro.MemoryObject'
	CDM_TYPE_SOCK = 'com.bbn.tc.schema.avro.NetFlowObject'
	CDM_TYPE_EDGE = 'com.bbn.tc.schema.avro.SimpleEdge'
	CDM_TYPE_EVENT = 'com.bbn.tc.schema.avro.Event'

	parser = argparse.ArgumentParser()
	parser.add_argument('-i', '--input', help='Input CDM/JSON file', required=True)
	parser.add_argument('-o','--output', help='Output StreamSpot edge file', required=True)
	args = vars(parser.parse_args())

	input_file = args['input']
	output_file = args['output']

	uuid_to_pid = {}
	uuid_to_pname = {}
	uuid_to_url = {}
	uuid_to_sockid = {}
	uuid_to_addr = {}

	pid_to_graph_id = {}
	current_graph_id = 0

	with open(input_file, 'r') as f:
	event_metadata_buffer = {} # filled/cleared on every new event
	streamspot_edge = {'event_uuid': None,
	'source_id': None,
	'source_name': None,
	'source_type': None,
	'dest_id': None,
	'dest_name': 'NA',
	'dest_type': None,
	'edge_type': None,
	'graph_id': None
	} # filled/cleared on every new event
	lineno = 0
	for line in f:
	line = line.strip()
	lineno += 1
	cdm_record = json.loads(line)
	cdm_record_type = cdm_record['datum'].keys()[0]
	cdm_record_values = cdm_record['datum'][cdm_record_type]

	#print cdm_record

	if cdm_record_type == CDM_TYPE_PRINCIPAL:
	continue # we don't care about principals

	elif cdm_record_type == CDM_TYPE_SUBJECT:
	uuid = cdm_record_values['uuid']
	type = cdm_record_values['type']
	if type == 'SUBJECT_PROCESS':
	pid = cdm_record_values['pid'] # source ID
	ppid = cdm_record_values['ppid'] # needed to assign graph ID
	pname = cdm_record_values['properties']['map']['programName']
	unitid = cdm_record_values['unitId']['int']
	#uuid_to_pid[uuid] = str(pid) + '/' + pname + '/' + str(unitid)
	uuid_to_pid[uuid] = pid
	uuid_to_pname[uuid] = pname
	if not pid in pid_to_graph_id: # pid has no gid assigned
	if not ppid in pid_to_graph_id: # ppid has no gid assigned
	pid_to_graph_id[pid] = current_graph_id
	pid_to_graph_id[ppid] = current_graph_id
	current_graph_id += 1
	else: # parent has a gid assigned
	pid_to_graph_id[pid] = pid_to_graph_id[ppid]
	else:
	print "Unknown subject type:", type
	sys.exit(-1)

	elif cdm_record_type == CDM_TYPE_FILE:
	uuid = cdm_record_values['uuid']
	url = cdm_record_values['url'] # destination ID
	uuid_to_url[uuid] = url

	elif cdm_record_type == CDM_TYPE_SOCK:
	uuid = cdm_record_values['uuid']

	src = cdm_record_values['srcAddress']
	dest = cdm_record_values['destAddress']
	src_port = cdm_record_values['srcPort']
	dest_port = cdm_record_values['destPort']
	sock_id = src + ':' + str(src_port) + ':' + dest + ':' + str(dest_port)

	uuid_to_sockid[uuid] = sock_id

	elif cdm_record_type == CDM_TYPE_MEM:
	uuid = cdm_record_values['uuid']
	addr = cdm_record_values['memoryAddress']
	uuid_to_addr[uuid] = addr

	elif cdm_record_type == CDM_TYPE_EVENT:
	# print previous streamspot edge if it is ready
	if not None in streamspot_edge.values():
	print str(streamspot_edge['source_id']) + '\t' +\
	str(streamspot_edge['source_name']) + '\t' +\
	str(streamspot_edge['source_type']) + '\t' +\
	str(streamspot_edge['dest_id']) + '\t' +\
	str(streamspot_edge['dest_name']) + '\t' +\
	str(streamspot_edge['dest_type']) + '\t' +\
	str(streamspot_edge['edge_type']) + '\t' +\
	str(streamspot_edge['graph_id'])

	# clear old edge data
	streamspot_edge = {'event_uuid': None,
	'source_id': None,
	'source_name': None,
	'source_type': None,
	'dest_id': None,
	'dest_name': 'NA',
	'dest_type': None,
	'edge_type': None,
	'graph_id': None
	}

	uuid = cdm_record_values['uuid']
	type = cdm_record_values['type']
	streamspot_edge['edge_type'] = type # streamspot edge type
	streamspot_edge['event_uuid'] = uuid # to map metadata

	elif cdm_record_type == CDM_TYPE_EDGE:
	type = cdm_record_values['type']
	if type == 'EDGE_SUBJECT_HASLOCALPRINCIPAL':
	pass
	elif type == 'EDGE_FILE_AFFECTS_EVENT':
	# HACK! FIXME
	# Special case for
	# - EVENT_UPDATE
	# - EVENT_RENAME
	if streamspot_edge['edge_type'] == 'EVENT_UPDATE' or \
	streamspot_edge['edge_type'] == 'EVENT_RENAME':
	assert cdm_record_values['toUuid'] == \
	streamspot_edge['event_uuid']
	from_uuid = cdm_record_values['fromUuid']
	url = uuid_to_url[from_uuid]
	streamspot_edge['dest_id'] = url
	streamspot_edge['dest_type'] = 'FILE'
	else:
	assert cdm_record_values['fromUuid'] == \
	streamspot_edge['event_uuid']
	to_uuid = cdm_record_values['toUuid']
	url = uuid_to_url[to_uuid]
	streamspot_edge['dest_id'] = url
	streamspot_edge['dest_type'] = 'FILE'
	elif type == 'EDGE_EVENT_AFFECTS_FILE':
	assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']

	to_uuid = cdm_record_values['toUuid']
	url = uuid_to_url[to_uuid]
	streamspot_edge['dest_id'] = url
	streamspot_edge['dest_type'] = 'FILE'
	elif type == 'EDGE_MEMORY_AFFECTS_EVENT':
	assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']

	to_uuid = cdm_record_values['toUuid']
	addr = uuid_to_addr[to_uuid]
	streamspot_edge['dest_id'] = addr
	streamspot_edge['dest_type'] = 'MEM'
	elif type == 'EDGE_EVENT_AFFECTS_MEMORY':
	assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']

	to_uuid = cdm_record_values['toUuid']
	addr = uuid_to_addr[to_uuid]
	streamspot_edge['dest_id'] = addr
	streamspot_edge['dest_type'] = 'MEM'
	elif type == 'EDGE_EVENT_AFFECTS_NETFLOW':
	assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']

	to_uuid = cdm_record_values['toUuid']
	sock_id = uuid_to_sockid[to_uuid]
	streamspot_edge['dest_id'] = sock_id
	streamspot_edge['dest_type'] = 'SOCK'
	elif type == 'EDGE_EVENT_AFFECTS_SUBJECT' or \
	type == 'EDGE_EVENT_ISGENERATEDBY_SUBJECT':
	assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']

	to_uuid = cdm_record_values['toUuid']
	pid = uuid_to_pid[to_uuid]
	pname = uuid_to_pname[to_uuid]

	if type == 'EDGE_EVENT_AFFECTS_SUBJECT':
	streamspot_edge['dest_id'] = pid
	streamspot_edge['dest_name'] = pname
	streamspot_edge['dest_type'] = 'PROCESS'
	elif type == 'EDGE_EVENT_ISGENERATEDBY_SUBJECT':
	streamspot_edge['source_id'] = pid
	streamspot_edge['source_name'] = pname
	streamspot_edge['source_type'] = 'PROCESS'

	# graph ID assignment to streamspot edge
	if type == 'EDGE_EVENT_ISGENERATEDBY_SUBJECT':
	streamspot_edge['graph_id'] = \
	pid_to_graph_id[streamspot_edge['source_id']]

	# handle graph ID change on EXECUTE
	if streamspot_edge['edge_type'] == 'EVENT_EXECUTE':
	pid_to_graph_id[streamspot_edge['source_id']] = \
	current_graph_id # change graph ID of caller process
	current_graph_id += 1

	else:
	print 'Unknown edge type:', type
	sys.exit(-1)

	else:
	print 'Unknown CDM record type', cdm_record_type
	sys.exit(-1)


	# last event in buffer
	print str(streamspot_edge['source_id']) + '\t' +\
	str(streamspot_edge['source_name']) + '\t' +\
	str(streamspot_edge['source_type']) + '\t' +\
	str(streamspot_edge['dest_id']) + '\t' +\
	str(streamspot_edge['dest_name']) + '\t' +\
	str(streamspot_edge['dest_type']) + '\t' +\
	str(streamspot_edge['edge_type']) + '\t' +\
	str(streamspot_edge['graph_id'])
	#!/usr/bin/env python

	"""
	Visualises streamspot/infoleak_small_units.ss
	python visualise_streamspot_graph.py
	"""

	from graph_tool.all import *
	import sys

	process_color = [179/256.,205/256.,227/256.,0.8]
	file_color = [251/256.,180/256.,174/256.,0.8]
	mem_color = [204/256.,235/256.,197/256.,0.8]
	sock_color = [222/256.,203/256.,228/256.,0.8]

	group_map = {'PROCESS': 0,
	'FILE': 1,
	'MEM': 2,
	'SOCK': 3}

	def create_new_graph():
	g = Graph()

	gid = g.new_graph_property('int')

	vid = g.new_vertex_property('string')
	vtype = g.new_vertex_property('string')
	vlabel = g.new_vertex_property('string')
	vgroup = g.new_vertex_property('int32_t')
	vcolor = g.new_vertex_property('vector<double>')

	etype = g.new_edge_property('string')
	elabel = g.new_edge_property('string')
	ecolor = g.new_edge_property('vector<double>')

	g.gp.id = gid

	g.vp.id = vid
	g.vp.type = vtype
	g.vp.label = vlabel
	g.vp.group = vgroup
	g.vp.color = vcolor

	g.ep.type = etype
	g.ep.label = elabel
	g.ep.color = ecolor

	return g

	lno = 0
	graphs = {} # from gid to graph
	with open('streamspot/infoleak_small_units.ss', 'r') as f:
	for line in f:
	lno += 1
	#if lno > 20:
	# break
	line = line.strip()
	fields = line.split('\t')
	source_id = fields[0]
	source_name = fields[1]
	source_type = fields[2]
	dest_id = fields[3]
	dest_name = fields[4]
	dest_type = fields[5]
	edge_type = fields[6]
	graph_id = int(fields[7])

	if edge_type == 'EVENT_UNIT':
	continue

	#if edge_type == 'EVENT_EXECUTE':
	# continue # do not add this edge

	if not graph_id in graphs:
	graphs[graph_id] = create_new_graph()
	g = graphs[graph_id]

	# check if source vertex exists
	matches = find_vertex(g, g.vp.id, source_id)
	if len(matches) == 0:
	u = g.add_vertex()

	g.vp.id[u] = source_id
	g.vp.type[u] = source_type
	g.vp.label[u] = source_id + '\\' + source_name
	g.vp.group[u] = group_map[source_type]

	if source_type == 'PROCESS':
	g.vp.color[u] = process_color
	elif source_type == 'FILE':
	g.vp.color[u] = file_color
	elif source_type == 'MEM':
	g.vp.color[u] = mem_color
	elif source_type == 'SOCK':
	g.vp.color[u] = sock_color
	else:
	print 'unknown type', source_type
	else:
	u = matches[0]

	# check if destination vertex exists
	matches = find_vertex(g, g.vp.id, dest_id)
	if len(matches) == 0:
	v = g.add_vertex()

	g.vp.id[v] = dest_id
	g.vp.type[v] = dest_type
	g.vp.label[v] = dest_id + '\\' + dest_name
	g.vp.group[v] = group_map[dest_type]

	if dest_type == 'PROCESS':
	g.vp.color[v] = process_color
	elif dest_type == 'FILE':
	g.vp.color[v] = file_color
	g.vp.label[v] = dest_id.split('/')[-1]
	elif dest_type == 'MEM':
	g.vp.color[v] = mem_color
	elif dest_type == 'SOCK':
	g.vp.color[v] = sock_color
	else:
	print 'unknown type', source_type
	else:
	v = matches[0]

	e = g.add_edge(u,v)
	g.ep.type[e] = edge_type
	g.ep.label[e] = str(lno) + ':' + edge_type.split('_')[1]

	if edge_type in ['EVENT_EXECUTE', 'EVENT_CLONE']:
	g.ep.color[e] = process_color
	elif edge_type in ['EVENT_OPEN', 'EVENT_READ', 'EVENT_WRITE',
	'EVENT_UPDATE']:
	g.ep.color[e] = file_color
	elif edge_type in ['EVENT_BIND', 'EVENT_ACCEPT', 'EVENT_CONNECT']:
	g.ep.color[e] = sock_color
	else:
	g.ep.color[e] = [0.179, 0.203,0.210, 0.8]

	for graph_id, g in graphs.iteritems():
	pos = sfdp_layout(g, groups=g.vp.group)
	graph_draw(g, pos, output_size=(2000, 2000),
	vertex_text=g.vp.label, edge_text=g.ep.label,
	vertex_shape='circle',
	#vertex_size=5.0,
	vertex_color=g.vp.color,
	vertex_fill_color=[1.0,1.0,1.0,0.8],
	vertex_pen_width=5.0,
	edge_pen_width=5.0,
	edge_font_size=16,
	edge_font_weight=1.0,
	#nodesfirst=True,
	output='infoleaks_small_units_gid' + str(graph_id) + '.pdf')
	#!/usr/bin/env bash

	# Requires avro-tools-1.8.1.jar in the working directory (and a working JRE)
	# Requires Avro files in subdirectory cdm/ in the working directory
	# Writes JSON files to subdirectory json/ in the working directory

	mkdir json;
	for file in $(ls cdm);
	do
	java -jar avro-tools-1.8.1.jar tojson cdm/$file > json/${file};
	mv "json/${file}" "json/${file/%.avro/.json}"
	done