@fairlight1337
Created September 17, 2014 18:16
Convert Weka J48 decision tree output to structured JSON
## Copyright (c) 2014, Jan Winkler <winkler@cs.uni-bremen.de>
## All rights reserved.
##
## Redistribution and use in source and binary forms, with or without
## modification, are permitted provided that the following conditions are met:
##
## * Redistributions of source code must retain the above copyright
## notice, this list of conditions and the following disclaimer.
## * Redistributions in binary form must reproduce the above copyright
## notice, this list of conditions and the following disclaimer in the
## documentation and/or other materials provided with the distribution.
## * Neither the name of the Institute for Artificial Intelligence/
## Universitaet Bremen nor the names of its contributors may be used to
## endorse or promote products derived from this software without specific
## prior written permission.
##
## THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
## AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
## IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
## ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
## LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
## CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
## SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
## INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
## CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
## ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
## POSSIBILITY OF SUCH DAMAGE.
#!/usr/bin/python
import fileinput
import re
import json

# Collect the tree lines from the Weka output: start after the
# "J48 pruned tree" header and stop at the first blank line.
lines = []
at_tree = False
line_skip = 0
for line_raw in fileinput.input():
    if line_skip > 0:
        line_skip -= 1
    else:
        line = line_raw.rstrip()
        if not at_tree:
            if line == "J48 pruned tree":
                at_tree = True
                line_skip = 2  # skip the separator line and the blank line after it
        else:
            if line == "":
                break
            lines.append(line)

# Turn every tree line into a dict of its parts plus its nesting level.
data_lines = []
for line in lines:
    level = line.count("|")
    unlevelled_line = line[level * 4:]  # each level is indented by "|   "
    # Leaf line: "<variable> <operator> <value>: <class> (<counts>)".
    # The class label may contain spaces, e.g. "Not Phishing".
    m = re.match(r"(?P<variable>[\w\-]+) (?P<operator>[<=>!]+) (?P<value>[0-9a-zA-Z.\-_]+): (?P<result>.+?) \((?P<occurrences>\S+)\)$", unlevelled_line)
    if not m:  # this is not a result line, re-evaluate as normal line
        m = re.match(r"(?P<variable>[\w\-]+) (?P<operator>[<=>!]+) (?P<value>[0-9a-zA-Z.\-_]+)", unlevelled_line)
    if not m:  # skip lines that do not look like a tree node
        continue
    data = dict(m.groupdict(), level=level)
    data_lines.append(data)


def recTB(data_lines, level=0):
    """Recursively group the flat line list into a tree by nesting level."""
    children = []
    index = 0
    for data_line in data_lines:
        if data_line["level"] == level:
            # this one is on the current level - add it to the children
            children.append({"data": data_line, "children": []})
            index += 1
        elif data_line["level"] > level:
            # this is a child of the current level, recurse
            (intres, intindex) = recTB(data_lines[index:], level + 1)
            index += intindex
            for intr in intres:
                children[-1]["children"].append(intr)
        else:
            # this is the end of our level, return.
            break
    return (children, index)


def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False


def formatDTree(branch_children):
    """Convert the grouped tree into the JSON output structure."""
    formatted = []
    for child in branch_children:
        append_data = {}
        append_data["relation"] = {}
        val = child["data"]["value"]
        if is_number(val):
            if "." in val:
                val = float(val)
            else:
                val = int(val)
        append_data["relation"][child["data"]["operator"]] = {"value": val,
                                                              "variable": child["data"]["variable"]}
        if "result" in child["data"]:
            append_data["true"] = [{"result": child["data"]["result"]}]
        else:
            append_data["true"] = formatDTree(child["children"])
        formatted.append(append_data)
    return formatted


# json.dump writes text, so open the output file in text mode.
with open("o.json", "w") as f:
    (items, index) = recTB(data_lines)
    dtree = formatDTree(items)
    json.dump(dtree, f)
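
The script takes its input through `fileinput`, so the Weka classifier output is either named on the command line or piped to standard input; nothing is pasted into the code itself. It only starts collecting lines after the `J48 pruned tree` header and stops at the first blank line. The sketch below isolates that extraction step on an in-memory string. The function name `extract_tree_lines` and the hand-written sample text are assumptions for illustration, not part of the gist.

```python
def extract_tree_lines(weka_output):
    """Return the lines between the 'J48 pruned tree' header and the next blank line."""
    lines = []
    at_tree = False
    skip = 0
    for raw in weka_output.splitlines():
        if skip > 0:
            skip -= 1
            continue
        line = raw.rstrip()
        if not at_tree:
            if line == "J48 pruned tree":
                at_tree = True
                skip = 2  # separator line and the blank line after it
        elif line == "":
            break  # the tree ends at the first blank line
        else:
            lines.append(line)
    return lines


# Hand-written sample in the shape of Weka's text output (illustrative only).
sample = """\
=== Classifier model (full training set) ===

J48 pruned tree
------------------

outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)

Number of Leaves  : 3
"""

tree_lines = extract_tree_lines(sample)
print(tree_lines)
```

With the gist saved under a hypothetical name such as `j48_to_json.py`, running `python j48_to_json.py weka_output.txt` (the input file name is also hypothetical) would read the classifier output from that file and write `o.json` to the current directory.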
@imeiliasantoso

Hello, where should I add my pruned tree result or file in this code?

My result is like this:

rank_country <= -1
| Length_of_url <= 52
| | safebrowsing <= 0: Phishing (231.0/1.0)
| | safebrowsing > 0
| | | path_token_count <= 1: Unknown (2.0)
| | | path_token_count > 1
| | | | No_of_dots <= 3: Phishing (13.0/1.0)
| | | | No_of_dots > 3: Unknown (2.0)
| Length_of_url > 52
| | safebrowsing <= 0
| | | rank_host <= 1270769
| | | | sec_sen_word_cnt <= 0: Phishing (10.0)
| | | | sec_sen_word_cnt > 0: Unknown (3.0/1.0)
| | | rank_host > 1270769: Unknown (4.0)
| | safebrowsing > 0: Unknown (32.0)
rank_country > -1
| rank_country <= 787: Not Phishing (515.0)
| rank_country > 787
| | No_of_dots <= 1: Not Phishing (13.0)
| | No_of_dots > 1: Unknown (5.0)

or

digraph J48Tree {
N0 [label="rank_country" ]
N0->N1 [label="<= -1"]
N1 [label="Length_of_url" ]
N1->N2 [label="<= 52"]
N2 [label="safebrowsing" ]
N2->N3 [label="<= 0"]
N3 [label="Phishing (231.0/1.0)" shape=box style=filled ]
N2->N4 [label="> 0"]
N4 [label="path_token_count" ]
N4->N5 [label="<= 1"]
N5 [label="Unknown (2.0)" shape=box style=filled ]
N4->N6 [label="> 1"]
N6 [label="No_of_dots" ]
N6->N7 [label="<= 3"]
N7 [label="Phishing (13.0/1.0)" shape=box style=filled ]
N6->N8 [label="> 3"]
N8 [label="Unknown (2.0)" shape=box style=filled ]
N1->N9 [label="> 52"]
N9 [label="safebrowsing" ]
N9->N10 [label="<= 0"]
N10 [label="rank_host" ]
N10->N11 [label="<= 1270769"]
N11 [label="sec_sen_word_cnt" ]
N11->N12 [label="<= 0"]
N12 [label="Phishing (10.0)" shape=box style=filled ]
N11->N13 [label="> 0"]
N13 [label="Unknown (3.0/1.0)" shape=box style=filled ]
N10->N14 [label="> 1270769"]
N14 [label="Unknown (4.0)" shape=box style=filled ]
N9->N15 [label="> 0"]
N15 [label="Unknown (32.0)" shape=box style=filled ]
N0->N16 [label="> -1"]
N16 [label="rank_country" ]
N16->N17 [label="<= 787"]
N17 [label="Not Phishing (515.0)" shape=box style=filled ]
N16->N18 [label="> 787"]
N18 [label="No_of_dots" ]
N18->N19 [label="<= 1"]
N19 [label="Not Phishing (13.0)" shape=box style=filled ]
N18->N20 [label="> 1"]
N20 [label="Unknown (5.0)" shape=box style=filled ]
}

Thank you
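
Two details of a tree pasted like the text version above can trip up a line parser: the indentation after each `|` may have been collapsed to a single space, and class labels such as `Not Phishing` contain a space. What follows is a minimal sketch of a line parser that tolerates both, assuming the depth of a node equals the number of leading `|` markers. `parse_tree_line` and `LINE_RE` are hypothetical names, not part of the gist.

```python
import re

# Leading "|" markers (with any spacing), then "<variable> <operator> <value>",
# optionally followed by ": <class label> (<counts>)" on leaf lines.
LINE_RE = re.compile(
    r"^(?P<prefix>(?:\|\s*)*)"
    r"(?P<variable>[\w\-]+) (?P<operator>[<=>!]+) (?P<value>[\w.\-]+)"
    r"(?:: (?P<result>.+?) \((?P<occurrences>[^)]+)\))?$"
)


def parse_tree_line(line):
    """Parse one tree line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.rstrip())
    if m is None:
        return None
    # Keep only the groups that matched; branch lines have no result/occurrences.
    data = {k: v for k, v in m.groupdict().items()
            if v is not None and k != "prefix"}
    data["level"] = m.group("prefix").count("|")
    return data


print(parse_tree_line("| rank_country <= 787: Not Phishing (515.0)"))
print(parse_tree_line("| | safebrowsing > 0"))
```

Counting only the pipes in the matched prefix, rather than slicing a fixed four characters per level, keeps the parse independent of how many spaces survived the paste.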
