@emaadmanzoor
Last active July 4, 2016 06:25
CDM to StreamSpot Format Translation

Converting from CDM to StreamSpot Format

StreamSpot Format

StreamSpot takes input as a TSV file containing on each line the following fields:

source_id source_type destination_id destination_type edge_type graph_id

The fields are data-typed as follows:

  • source_id and destination_id: Integer greater than or equal to 1 (0 stands for NA, like anonymous memory maps).
  • source_type, destination_type, edge_type: Single byte (we've been using ASCII characters so far).
  • graph_id: Integer greater than or equal to 0.

An example record in StreamSpot's format is: 4 a 5 c p 0.
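
As a minimal sketch of the format described above, the following parses one tab-separated record and checks the field types. The names `StreamSpotEdge` and `parse_streamspot_line` are hypothetical and not part of StreamSpot; the check treats 0 as a legal value in every ID field, since 0 stands for NA and the example record uses graph ID 0.

```python
from typing import NamedTuple


class StreamSpotEdge(NamedTuple):
    source_id: int
    source_type: str
    destination_id: int
    destination_type: str
    edge_type: str
    graph_id: int


def parse_streamspot_line(line: str) -> StreamSpotEdge:
    """Parse one tab-separated StreamSpot record and check its field types."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 6:
        raise ValueError('expected 6 tab-separated fields, got %d' % len(fields))
    src, src_type, dst, dst_type, edge_type, graph_id = fields
    # The three type fields must each fit in a single byte.
    for name, value in (('source_type', src_type),
                        ('destination_type', dst_type),
                        ('edge_type', edge_type)):
        if len(value.encode('utf-8')) != 1:
            raise ValueError('%s must be a single byte: %r' % (name, value))
    edge = StreamSpotEdge(int(src), src_type, int(dst), dst_type,
                          edge_type, int(graph_id))
    if min(edge.source_id, edge.destination_id, edge.graph_id) < 0:
        raise ValueError('IDs must not be negative')
    return edge


# The example record from the text, with its fields separated by tabs.
edge = parse_streamspot_line('4\ta\t5\tc\tp\t0')
print(edge)
```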

The data it currently uses is available here.

CDM Format

We are using the CDM 12 schema.

CDM Events

CDM-formatted JSON data contains events in the following format:

{"datum":{"com.bbn.tc.schema.avro.Event":{"uuid":"=ÖmÖä<8f><99>\u0015øNe¢÷<84>\f®ïC<94>é®<85>\u0001j\u0000 ǹs\u001AÓï","sequence":65362,"type":"EVENT_OPEN","threadId":28005,"source":"SOURCE_LINUX_AUDIT_TRACE","timestampMicros":{"long":1450217860803},"name":null,"parameters":null,"location":null,"size":null,"programPoint":null,"properties":{"map":{"eventId":"65362"}}}},"CDMVersion":"12"}

Unlike StreamSpot edges, CDM events are not self-contained: the only thing an event record gives us directly is the event type (EVENT_OPEN in the example above). The rest of the metadata is contained in other non-event records occurring both before and after the event record in the CDM data. Let's look at these records now.
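
To make the record layout concrete, here is a small sketch that pulls the record type and the event type out of one CDM JSON line. The helper name `cdm_record_kind` and the shortened UUID `event-uuid-1` are placeholders for illustration; real UUIDs are binary strings like the one shown above.

```python
import json

CDM_NAMESPACE = 'com.bbn.tc.schema.avro.'


def cdm_record_kind(line):
    """Return (short record type, record body) for one CDM JSON line."""
    record = json.loads(line)
    # "datum" holds exactly one key: the fully qualified record type.
    full_type, body = next(iter(record['datum'].items()))
    if full_type.startswith(CDM_NAMESPACE):
        full_type = full_type[len(CDM_NAMESPACE):]
    return full_type, body


# A trimmed-down event record with a placeholder UUID (assumption).
sample = json.dumps({
    'datum': {
        'com.bbn.tc.schema.avro.Event': {
            'uuid': 'event-uuid-1',
            'sequence': 65362,
            'type': 'EVENT_OPEN',
            'threadId': 28005,
        }
    },
    'CDMVersion': '12',
})

kind, body = cdm_record_kind(sample)
print(kind, body['type'], body['threadId'])
```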

CDM Subjects, Objects and Principals

Subjects correspond to entities like processes, threads and units.

{"datum":{"com.bbn.tc.schema.avro.Subject":{"uuid":"~#<92>Å7<9f>ÜUèÓ\ryEQ¬¾\"Ëõ]<96>xb3#}UcÏåÿ<90>","type":"SUBJECT_PROCESS","pid":28005,"ppid":27999,"source":"SOURCE_LINUX_AUDIT_TRACE","startTimestampMicros":null,"unitId":{"int":0},"endTimestampMicros":null,"cmdLine":null,"importedLibraries":null,"exportedLibraries":null,"pInfo":null,"properties":{"map":{"uid":"0","programName":"sudo","group":"1003"}}}},"CDMVersion":"12"}

Objects correspond to entities like files, sockets or memory addresses.

{"datum":{"com.bbn.tc.schema.avro.FileObject":{"uuid":"Ý<9c>¨©<96>°<98>\r\u0012p <87>ͦ¤õ\u0005L<91>D<8c>Â<96><97>\\;\u0011µbW<82>S","baseObject":{"source":"SOURCE_LINUX_AUDIT_TRACE","permission":null,"lastTimestampMicros":null,"properties":null},"url":"file:///etc/login.defs","isPipe":false,"version":0,"size":null}},"CDMVersion":"12"}

Principals correspond to entities like local and remote users, which StreamSpot doesn't use as of now.

{"datum":{"com.bbn.tc.schema.avro.Principal":{"uuid":"\u001DÕ\u0019.ÖÑ8,¸ïB:»lò\u0011<96>?3Æ\rÌYï\u001C\u0013É7²Õðç","type":"PRINCIPAL_LOCAL","userId":0,"groupIds":[1003],"source":"SOURCE_LINUX_AUDIT_TRACE","properties":{"map":{"euid":"0","egid":"1003"}}}},"CDMVersion":"12"}

All 3 of the above CDM records appeared before the event record in the file.

CDM Simple Edges

CDM edges are what connect events to subjects, objects and principals. The CDM edge types we care about are:

  • EDGE_EVENT_ISGENERATEDBY_SUBJECT: When an event is generated by a process/thread/unit.
  • EDGE_EVENT_AFFECTS_*: When an event writes to a file/registry/socket or forks a process.
  • EDGE_*_AFFECTS_EVENT: When an event reads from a file/registry/socket.

The entire CDM edge type list is here.

Continuing the example above, these CDM edge records, which follow the event record, provide the metadata we need:

  • {"datum":{"com.bbn.tc.schema.avro.SimpleEdge":{"fromUuid":"=ÖmÖä<8f><99>\u0015øNe¢÷<84>\f®ïC<94>é®<85>\u0001j\u0000 ǹs\u001AÓï","toUuid":"Ý<9c>¨©<96>°<98>\r\u0012p <87>ͦ¤õ\u0005L<91>D<8c>Â<96><97>\\;\u0011µbW<82>S","type":"EDGE_FILE_AFFECTS_EVENT","timestamp":1450217860803,"properties":null}},"CDMVersion":"12"}

This tells us that a specific file object (identified by the UUID) was read during the event; so it provides us the destination type. By following this CDM edge to the object UUID, we can find out the destination ID (by mapping the filename to an ID).

  • {"datum":{"com.bbn.tc.schema.avro.SimpleEdge":{"fromUuid":"=ÖmÖä<8f><99>\u0015øNe¢÷<84>\f®ïC<94>é®<85>\u0001j\u0000 ǹs\u001AÓï","toUuid":"~#<92>Å7<9f>ÜUèÓ\ryEQ¬¾\"Ëõ]<96>xb3#}UcÏåÿ<90>","type":"EDGE_EVENT_ISGENERATEDBY_SUBJECT","timestamp":1450217860803,"properties":null}},"CDMVersion":"12"}

This tells us which process/thread/unit subject (identified by the UUID) generated the event. By following this edge to the subject UUID, we can find out the source type and source ID.
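
The lookup described above can be sketched as follows, assuming the subject and file records seen so far have been stored in dictionaries keyed by UUID. The function `resolve_edge`, the dictionary layout and the short UUIDs (`event-1`, `subject-1`, `file-1`) are hypothetical stand-ins for the binary UUIDs in the real data.

```python
# Records seen before the event, indexed by (placeholder) UUID.
subjects = {'subject-1': {'type': 'SUBJECT_PROCESS', 'pid': 28005}}
files = {'file-1': {'url': 'file:///etc/login.defs'}}


def resolve_edge(edge, event_uuid, subjects, files):
    """Return the StreamSpot fields contributed by one SimpleEdge record."""
    if edge['fromUuid'] != event_uuid:
        raise ValueError('edge does not belong to the current event')
    target = edge['toUuid']
    if edge['type'] == 'EDGE_EVENT_ISGENERATEDBY_SUBJECT':
        subject = subjects[target]
        return {'source_type': subject['type'], 'source_id': subject['pid']}
    if edge['type'] == 'EDGE_FILE_AFFECTS_EVENT':
        return {'dest_type': 'FILE', 'dest_id': files[target]['url']}
    raise ValueError('unhandled edge type: ' + edge['type'])


# The two edges from the example, in the order they follow the event.
edges = [
    {'fromUuid': 'event-1', 'toUuid': 'file-1',
     'type': 'EDGE_FILE_AFFECTS_EVENT'},
    {'fromUuid': 'event-1', 'toUuid': 'subject-1',
     'type': 'EDGE_EVENT_ISGENERATEDBY_SUBJECT'},
]

metadata = {'edge_type': 'EVENT_OPEN'}
for edge in edges:
    metadata.update(resolve_edge(edge, 'event-1', subjects, files))
print(metadata)
```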

Combining CDM records into a StreamSpot edge

By reading all the records preceding the event record and following the event record until the next event record, we can collect all the metadata needed to (almost) construct a StreamSpot edge (for the example above):

  • Edge type: EVENT_OPEN, directly from the event record.
  • Source type: SUBJECT_PROCESS, from the subject record preceding the event record (reached through the EDGE_EVENT_ISGENERATEDBY_SUBJECT edge that follows the event record).
  • Source ID: "pid":28005, from a CDM record preceding the event record.
  • Destination type: EDGE_FILE_AFFECTS_EVENT, from a CDM record following the event record.
  • Destination ID: "url":"file:///etc/login.defs", from a CDM record preceding the event record. Note that the source and destination may need to be swapped here since the file is read and information flows to the process.
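
One possible way to decide when to swap is sketched below. It assumes that every EDGE_*_AFFECTS_EVENT type is a read, so information flows from the object to the process. The helpers `is_read_edge` and `orient` are hypothetical; the text above only says a swap may be needed, so treat this as one candidate policy rather than settled behaviour.

```python
def is_read_edge(cdm_edge_type):
    """True for EDGE_*_AFFECTS_EVENT types, where an object feeds the event."""
    return (cdm_edge_type.startswith('EDGE_')
            and cdm_edge_type.endswith('_AFFECTS_EVENT'))


def orient(process, obj, cdm_edge_type):
    """Return (source, destination) in the direction information flows."""
    if is_read_edge(cdm_edge_type):
        return obj, process  # read: object -> process
    return process, obj      # write/fork: process -> object


process = ('SUBJECT_PROCESS', 28005)
login_defs = ('FILE', 'file:///etc/login.defs')

source, destination = orient(process, login_defs, 'EDGE_FILE_AFFECTS_EVENT')
print(source, '->', destination)
```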

We still need to apply the following transformations:

  • Map source, destination and type names to single bytes: this can be done using the schema.
  • Map each destination ID (filename, memory address, URL) to an integer index.

We will maintain these maps in memory. What's left now is assigning graph IDs.
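
A minimal sketch of the two in-memory maps follows. The classes `Interner` and `ByteCodes` are hypothetical, and two assumptions are baked in: byte codes are handed out in order of first appearance (rather than derived from the schema, as suggested above), and integer indices start at 1 because 0 is reserved for NA.

```python
import string


class Interner:
    """Assign consecutive integers, starting at `first`, to unseen keys."""

    def __init__(self, first=1):
        self._ids = {}
        self._next = first

    def __call__(self, key):
        if key not in self._ids:
            self._ids[key] = self._next
            self._next += 1
        return self._ids[key]


class ByteCodes:
    """Assign one ASCII letter per distinct name, in order of appearance."""

    def __init__(self, alphabet=string.ascii_letters):
        self._alphabet = alphabet
        self._codes = {}

    def __call__(self, name):
        if name not in self._codes:
            if len(self._codes) >= len(self._alphabet):
                raise ValueError('ran out of single-byte codes')
            self._codes[name] = self._alphabet[len(self._codes)]
        return self._codes[name]


dest_index = Interner()
type_code = ByteCodes()

first = dest_index('file:///etc/login.defs')
second = dest_index('file:///etc/passwd')
again = dest_index('file:///etc/login.defs')
codes = [type_code(name) for name in
         ('SUBJECT_PROCESS', 'FILE', 'EVENT_OPEN', 'FILE')]
print(first, second, again, codes)
```

Keeping both maps as plain dictionaries means a repeated filename or type name always maps back to the same value, which is what StreamSpot needs to recognise the same node across edges.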

Assigning a graph ID to each StreamSpot edge

We maintain a mapping from process/unit/thread ID to graph ID. The rules below assume processes. The rules of graph ID assignment are:

  • Two process IDs will have the same graph ID if one was forked from the other.
  • Two process IDs will have different graph IDs if neither was forked from the other.
  • Two process IDs will have different graph IDs if the child was forked and then exec'd.
  • Graph IDs start from 0, so the first process is assigned graph ID 0.

The mechanics will work as follows:

  • When a new SUBJECT_PROCESS CDM record is read:
    • Check if its pid already has an assigned graph ID.
    • If a graph ID is already assigned to the pid, then continue. (this case should not occur, ideally)
    • If a graph ID is not assigned to the pid, then check if its ppid already has an assigned graph ID:
      • If a graph ID is assigned to the ppid, assign the same graph ID to the pid.
      • If a graph ID is not assigned to the ppid, assign the next available graph ID to the pid.
    • All subsequent events generated by this pid will use this assigned graph ID for the StreamSpot edge.
  • When a new EVENT_EXECUTE CDM record is read:
    • Note the threadId field of the event: this is the pid of the child process.
    • Assign the next available graph ID to this pid, replacing any existing assignment.
    • All subsequent events generated by this pid will use this assigned graph ID for the StreamSpot edge.
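
The mechanics listed above can be sketched as a small class. `GraphIdAssigner` and its method names are hypothetical; the logic follows the bullets literally, and the parent pids 1 and 12 in the walk-through are made up for illustration.

```python
class GraphIdAssigner:
    """Track which graph ID each pid currently belongs to."""

    def __init__(self):
        self.pid_to_graph_id = {}
        self.next_graph_id = 0  # graph IDs start from 0

    def _fresh(self):
        graph_id = self.next_graph_id
        self.next_graph_id += 1
        return graph_id

    def on_subject_process(self, pid, ppid):
        """Handle a SUBJECT_PROCESS record; return the pid's graph ID."""
        if pid not in self.pid_to_graph_id:
            if ppid in self.pid_to_graph_id:
                # Forked from a known parent: share the parent's graph.
                self.pid_to_graph_id[pid] = self.pid_to_graph_id[ppid]
            else:
                self.pid_to_graph_id[pid] = self._fresh()
        return self.pid_to_graph_id[pid]

    def on_execute(self, pid):
        """Handle an EVENT_EXECUTE whose threadId is `pid`."""
        self.pid_to_graph_id[pid] = self._fresh()  # replaces any old ID
        return self.pid_to_graph_id[pid]


assigner = GraphIdAssigner()
parent = assigner.on_subject_process(pid=27999, ppid=1)
child = assigner.on_subject_process(pid=28005, ppid=27999)
after_exec = assigner.on_execute(pid=28005)
unrelated = assigner.on_subject_process(pid=30000, ppid=12)
print(parent, child, after_exec, unrelated)
```

In this walk-through the forked child first shares graph 0 with its parent, moves to graph 1 once it calls exec, and the unrelated process gets graph 2.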

#!/usr/bin/env python
"""Translate CDM/JSON records into tab-separated StreamSpot edges."""
import argparse
import json
import sys

# CDM record type constants
CDM_TYPE_PRINCIPAL = 'com.bbn.tc.schema.avro.Principal'
CDM_TYPE_SUBJECT = 'com.bbn.tc.schema.avro.Subject'
CDM_TYPE_FILE = 'com.bbn.tc.schema.avro.FileObject'
CDM_TYPE_MEM = 'com.bbn.tc.schema.avro.MemoryObject'
CDM_TYPE_SOCK = 'com.bbn.tc.schema.avro.NetFlowObject'
CDM_TYPE_EDGE = 'com.bbn.tc.schema.avro.SimpleEdge'
CDM_TYPE_EVENT = 'com.bbn.tc.schema.avro.Event'

# Output columns, in the order they are written
EDGE_FIELDS = ('source_id', 'source_name', 'source_type',
               'dest_id', 'dest_name', 'dest_type',
               'edge_type', 'graph_id')


def new_edge():
    """Return an empty edge buffer (filled/cleared on every new event)."""
    edge = dict.fromkeys(EDGE_FIELDS)
    edge['event_uuid'] = None  # used to match edge records to the event
    edge['dest_name'] = 'NA'
    return edge


def write_edge(edge, out):
    """Write the edge as one TSV line, but only if every field is filled."""
    if None not in edge.values():
        out.write('\t'.join(str(edge[k]) for k in EDGE_FIELDS) + '\n')


def fail(message):
    """Report an unexpected record on stderr and stop."""
    sys.stderr.write(message + '\n')
    sys.exit(-1)


parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', help='Input CDM/JSON file', required=True)
parser.add_argument('-o', '--output', help='Output StreamSpot edge file', required=True)
args = vars(parser.parse_args())

input_file = args['input']
output_file = args['output']

# Lookup tables filled from non-event records, keyed by UUID
uuid_to_pid = {}
uuid_to_pname = {}
uuid_to_url = {}
uuid_to_sockid = {}
uuid_to_addr = {}

pid_to_graph_id = {}
current_graph_id = 0

with open(input_file, 'r') as f, open(output_file, 'w') as out:
    streamspot_edge = new_edge()

    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cdm_record = json.loads(line)
        cdm_record_type = list(cdm_record['datum'].keys())[0]
        cdm_record_values = cdm_record['datum'][cdm_record_type]

        if cdm_record_type == CDM_TYPE_PRINCIPAL:
            continue  # we don't care about principals
        elif cdm_record_type == CDM_TYPE_SUBJECT:
            uuid = cdm_record_values['uuid']
            subject_type = cdm_record_values['type']
            if subject_type == 'SUBJECT_PROCESS':
                pid = cdm_record_values['pid']  # source ID
                ppid = cdm_record_values['ppid']  # needed to assign graph ID
                pname = cdm_record_values['properties']['map']['programName']
                uuid_to_pid[uuid] = pid
                uuid_to_pname[uuid] = pname

                if pid not in pid_to_graph_id:  # pid has no gid assigned
                    if ppid not in pid_to_graph_id:  # ppid has no gid assigned
                        pid_to_graph_id[pid] = current_graph_id
                        pid_to_graph_id[ppid] = current_graph_id
                        current_graph_id += 1
                    else:  # parent has a gid assigned
                        pid_to_graph_id[pid] = pid_to_graph_id[ppid]
            else:
                fail('Unknown subject type: %s' % subject_type)
        elif cdm_record_type == CDM_TYPE_FILE:
            uuid = cdm_record_values['uuid']
            url = cdm_record_values['url']  # destination ID
            uuid_to_url[uuid] = url
        elif cdm_record_type == CDM_TYPE_SOCK:
            uuid = cdm_record_values['uuid']
            src = cdm_record_values['srcAddress']
            dest = cdm_record_values['destAddress']
            src_port = cdm_record_values['srcPort']
            dest_port = cdm_record_values['destPort']
            sock_id = src + ':' + str(src_port) + ':' + dest + ':' + str(dest_port)
            uuid_to_sockid[uuid] = sock_id
        elif cdm_record_type == CDM_TYPE_MEM:
            uuid = cdm_record_values['uuid']
            addr = cdm_record_values['memoryAddress']
            uuid_to_addr[uuid] = addr
        elif cdm_record_type == CDM_TYPE_EVENT:
            # write the previous streamspot edge if it is ready
            write_edge(streamspot_edge, out)

            # clear old edge data and start collecting for this event
            streamspot_edge = new_edge()
            streamspot_edge['edge_type'] = cdm_record_values['type']  # streamspot edge type
            streamspot_edge['event_uuid'] = cdm_record_values['uuid']  # to map metadata
        elif cdm_record_type == CDM_TYPE_EDGE:
            cdm_edge_type = cdm_record_values['type']
            if cdm_edge_type == 'EDGE_SUBJECT_HASLOCALPRINCIPAL':
                pass
            elif cdm_edge_type == 'EDGE_FILE_AFFECTS_EVENT':
                # HACK! FIXME
                # Special case for EVENT_UPDATE and EVENT_RENAME, where the
                # file is the "from" end of the edge instead of the "to" end
                if streamspot_edge['edge_type'] in ('EVENT_UPDATE', 'EVENT_RENAME'):
                    assert cdm_record_values['toUuid'] == streamspot_edge['event_uuid']
                    file_uuid = cdm_record_values['fromUuid']
                else:
                    assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']
                    file_uuid = cdm_record_values['toUuid']
                streamspot_edge['dest_id'] = uuid_to_url[file_uuid]
                streamspot_edge['dest_type'] = 'FILE'
            elif cdm_edge_type == 'EDGE_EVENT_AFFECTS_FILE':
                assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']
                to_uuid = cdm_record_values['toUuid']
                streamspot_edge['dest_id'] = uuid_to_url[to_uuid]
                streamspot_edge['dest_type'] = 'FILE'
            elif cdm_edge_type in ('EDGE_MEMORY_AFFECTS_EVENT',
                                   'EDGE_EVENT_AFFECTS_MEMORY'):
                assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']
                to_uuid = cdm_record_values['toUuid']
                streamspot_edge['dest_id'] = uuid_to_addr[to_uuid]
                streamspot_edge['dest_type'] = 'MEM'
            elif cdm_edge_type == 'EDGE_EVENT_AFFECTS_NETFLOW':
                assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']
                to_uuid = cdm_record_values['toUuid']
                streamspot_edge['dest_id'] = uuid_to_sockid[to_uuid]
                streamspot_edge['dest_type'] = 'SOCK'
            elif cdm_edge_type in ('EDGE_EVENT_AFFECTS_SUBJECT',
                                   'EDGE_EVENT_ISGENERATEDBY_SUBJECT'):
                assert cdm_record_values['fromUuid'] == streamspot_edge['event_uuid']
                to_uuid = cdm_record_values['toUuid']
                pid = uuid_to_pid[to_uuid]
                pname = uuid_to_pname[to_uuid]

                if cdm_edge_type == 'EDGE_EVENT_AFFECTS_SUBJECT':
                    streamspot_edge['dest_id'] = pid
                    streamspot_edge['dest_name'] = pname
                    streamspot_edge['dest_type'] = 'PROCESS'
                else:  # EDGE_EVENT_ISGENERATEDBY_SUBJECT
                    streamspot_edge['source_id'] = pid
                    streamspot_edge['source_name'] = pname
                    streamspot_edge['source_type'] = 'PROCESS'

                    # graph ID assignment to streamspot edge
                    streamspot_edge['graph_id'] = pid_to_graph_id[pid]

                    # handle graph ID change on EXECUTE: the caller process
                    # moves to a fresh graph for all later events
                    if streamspot_edge['edge_type'] == 'EVENT_EXECUTE':
                        pid_to_graph_id[pid] = current_graph_id
                        current_graph_id += 1
            else:
                fail('Unknown edge type: %s' % cdm_edge_type)
        else:
            fail('Unknown CDM record type: %s' % cdm_record_type)

    # last event in buffer
    write_edge(streamspot_edge, out)

#!/usr/bin/env python
"""
Visualises streamspot/infoleak_small_units.ss
python visualise_streamspot_graph.py
"""
from graph_tool.all import *

# RGBA colors per entity type
process_color = [179/256., 205/256., 227/256., 0.8]
file_color = [251/256., 180/256., 174/256., 0.8]
mem_color = [204/256., 235/256., 197/256., 0.8]
sock_color = [222/256., 203/256., 228/256., 0.8]

# layout group per entity type
group_map = {'PROCESS': 0,
             'FILE': 1,
             'MEM': 2,
             'SOCK': 3}


def create_new_graph():
    """Create an empty graph with the property maps used below."""
    g = Graph()

    gid = g.new_graph_property('int')

    vid = g.new_vertex_property('string')
    vtype = g.new_vertex_property('string')
    vlabel = g.new_vertex_property('string')
    vgroup = g.new_vertex_property('int32_t')
    vcolor = g.new_vertex_property('vector<double>')

    etype = g.new_edge_property('string')
    elabel = g.new_edge_property('string')
    ecolor = g.new_edge_property('vector<double>')

    g.gp.id = gid

    g.vp.id = vid
    g.vp.type = vtype
    g.vp.label = vlabel
    g.vp.group = vgroup
    g.vp.color = vcolor

    g.ep.type = etype
    g.ep.label = elabel
    g.ep.color = ecolor

    return g


lno = 0
graphs = {}  # from gid to graph
with open('streamspot/infoleak_small_units.ss', 'r') as f:
    for line in f:
        lno += 1
        #if lno > 20:
        #    break
        line = line.strip()
        fields = line.split('\t')

        source_id = fields[0]
        source_name = fields[1]
        source_type = fields[2]
        dest_id = fields[3]
        dest_name = fields[4]
        dest_type = fields[5]
        edge_type = fields[6]
        graph_id = int(fields[7])

        if edge_type == 'EVENT_UNIT':
            continue

        #if edge_type == 'EVENT_EXECUTE':
        #    continue # do not add this edge

        if graph_id not in graphs:
            graphs[graph_id] = create_new_graph()
        g = graphs[graph_id]

        # check if source vertex exists
        matches = find_vertex(g, g.vp.id, source_id)
        if len(matches) == 0:
            u = g.add_vertex()
            g.vp.id[u] = source_id
            g.vp.type[u] = source_type
            g.vp.label[u] = source_id + '\\' + source_name
            g.vp.group[u] = group_map[source_type]
            if source_type == 'PROCESS':
                g.vp.color[u] = process_color
            elif source_type == 'FILE':
                g.vp.color[u] = file_color
            elif source_type == 'MEM':
                g.vp.color[u] = mem_color
            elif source_type == 'SOCK':
                g.vp.color[u] = sock_color
            else:
                print('unknown type ' + source_type)
        else:
            u = matches[0]

        # check if destination vertex exists
        matches = find_vertex(g, g.vp.id, dest_id)
        if len(matches) == 0:
            v = g.add_vertex()
            g.vp.id[v] = dest_id
            g.vp.type[v] = dest_type
            g.vp.label[v] = dest_id + '\\' + dest_name
            g.vp.group[v] = group_map[dest_type]
            if dest_type == 'PROCESS':
                g.vp.color[v] = process_color
            elif dest_type == 'FILE':
                g.vp.color[v] = file_color
                g.vp.label[v] = dest_id.split('/')[-1]  # show only the file name
            elif dest_type == 'MEM':
                g.vp.color[v] = mem_color
            elif dest_type == 'SOCK':
                g.vp.color[v] = sock_color
            else:
                print('unknown type ' + dest_type)
        else:
            v = matches[0]

        e = g.add_edge(u, v)
        g.ep.type[e] = edge_type
        g.ep.label[e] = str(lno) + ':' + edge_type.split('_')[1]
        if edge_type in ['EVENT_EXECUTE', 'EVENT_CLONE']:
            g.ep.color[e] = process_color
        elif edge_type in ['EVENT_OPEN', 'EVENT_READ', 'EVENT_WRITE',
                           'EVENT_UPDATE']:
            g.ep.color[e] = file_color
        elif edge_type in ['EVENT_BIND', 'EVENT_ACCEPT', 'EVENT_CONNECT']:
            g.ep.color[e] = sock_color
        else:
            g.ep.color[e] = [0.179, 0.203, 0.210, 0.8]

# draw one PDF per graph ID
for graph_id, g in graphs.items():
    pos = sfdp_layout(g, groups=g.vp.group)
    graph_draw(g, pos, output_size=(2000, 2000),
               vertex_text=g.vp.label, edge_text=g.ep.label,
               vertex_shape='circle',
               #vertex_size=5.0,
               vertex_color=g.vp.color,
               vertex_fill_color=[1.0, 1.0, 1.0, 0.8],
               vertex_pen_width=5.0,
               edge_pen_width=5.0,
               edge_font_size=16,
               edge_font_weight=1.0,
               #nodesfirst=True,
               output='infoleaks_small_units_gid' + str(graph_id) + '.pdf')

#!/usr/bin/env bash
# Requires avro-tools-1.8.1.jar in the working directory (and a working JRE)
# Requires Avro files in subdirectory cdm/ in the working directory
# Writes JSON files to subdirectory json/ in the working directory
mkdir -p json
for path in cdm/*; do
    [ -e "$path" ] || continue  # nothing to do if cdm/ is empty
    file=$(basename "$path")
    # Replace a trailing .avro with .json in the output name
    java -jar avro-tools-1.8.1.jar tojson "$path" > "json/${file/%.avro/.json}"
done