My current target uses a deterministic pattern when calling C++ constructors, so I can use the CFG to identify object instantiation. Here are my notes about how to use Ghidra's decompiler to get the sizes of objects to be created:
We can use the parameter of operator_new() to find the size of the objects. Instead of parsing the instructions of the relevant basic blocks (and hoping that we don't run into some unexpected instruction sequences generated by the compiler), we can use the decompiler to get the association between the call to operator_new() and its parameter.
Ghidra/Features/Decompiler/ghidra_scripts/ShowCCallsScript.java contains a nice example of how to use the Decompiler API. First, an instance of DecompInterface must be created, as shown in setUpDecompiler(). Note that this method doesn't call openProgram() on the returned DecompInterface object, which is necessary to run decompilation! The decompileFunction() method works as expected: the returned DecompileResults object contains the "C Code" and "High Level" representations of the target function. We will use the former to get the decompiled representation of the interesting call.
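A minimal sketch of the setup described above, written as a standalone Ghidra script. The class name DecompileCurrentFunction and the 30 second timeout are my own choices for illustration and do not come from ShowCCallsScript; the script only runs inside Ghidra's scripting environment.

```java
import ghidra.app.decompiler.ClangTokenGroup;
import ghidra.app.decompiler.DecompInterface;
import ghidra.app.decompiler.DecompileOptions;
import ghidra.app.decompiler.DecompileResults;
import ghidra.app.script.GhidraScript;
import ghidra.program.model.listing.Function;

// Hypothetical script name; run it from Ghidra's Script Manager.
public class DecompileCurrentFunction extends GhidraScript {
    @Override
    public void run() throws Exception {
        DecompInterface decomp = new DecompInterface();
        decomp.setOptions(new DecompileOptions());
        decomp.toggleCCode(true);       // request the "C Code" markup
        decomp.toggleSyntaxTree(true);  // request the "High Level" syntax tree
        decomp.setSimplificationStyle("decompile");

        // The step that is easy to miss: the program must be opened explicitly.
        if (!decomp.openProgram(currentProgram)) {
            println("Decompiler error: " + decomp.getLastMessage());
            return;
        }
        try {
            Function func = getFunctionContaining(currentAddress);
            if (func == null) {
                println("No function at " + currentAddress);
                return;
            }
            // 30 seconds is an arbitrary timeout chosen for this sketch.
            DecompileResults decompRes = decomp.decompileFunction(func, 30, monitor);
            if (!decompRes.decompileCompleted()) {
                println("Decompilation failed: " + decompRes.getErrorMessage());
                return;
            }
            // Root of the "C Code" tree for the whole function.
            ClangTokenGroup root = decompRes.getCCodeMarkup();
            println(func.getName() + ": markup root has " + root.numChildren() + " children");
        } finally {
            decomp.dispose();
        }
    }
}
```

Disposing the interface in a finally block releases the decompiler process even when decompilation fails.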
At this point the most important objects we have are:
- The root node of the "C Code" representation obtained by decompRes.getCCodeMarkup(): C code is represented as a tree of nodes (ClangNode), and this one binds all the pieces of the function together. We will recursively traverse the tree from here to find the target C statements (ClangStatement).
- The Reference object from the call site to operator_new(): This will give us the Address marker to bind the C statement with the disassembled instructions.
The tree traversal is implemented in the printCall() method of the example. A simple loop goes through all children of the current ClangNode and invokes printCall() on them recursively. If we find a ClangNode that is a ClangStatement, and the last address associated with the node is the same as the last address of our CALL of interest, we have found the desired statement. The toString() method of the example can be reused to get a string representation of any ClangStatement, but a more elegant way to dissect the node is to analyze its subnodes further. My approach is to collect all contained ClangVariableToken and ClangFuncNameToken nodes in a list, so that it only includes relevant data, without whitespace, comments, etc. Basic sanity checks can be done on the list elements based on their string representations (their toString() implementations work as expected), while more complex analysis can be done on P-Code (ClangToken.getPcodeOp()).
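The traversal and token filtering can be sketched without Ghidra on the classpath. In the sketch below, Node, Statement and Token are hypothetical stand-ins that only mimic the shape of ClangNode, ClangStatement and the token classes; addresses are plain long values, and the sample tree and its token strings are made up. It also assumes that the size literal shows up among the collected tokens right after operator_new.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StatementFinder {
    // Stand-in for ClangNode: child nodes plus the highest address covered.
    static class Node {
        final List<Node> children;
        final Long maxAddress; // null if no address is associated with the node

        Node(Long maxAddress, Node... children) {
            this.maxAddress = maxAddress;
            this.children = Arrays.asList(children);
        }
    }

    // Stand-in for ClangStatement.
    static class Statement extends Node {
        Statement(Long maxAddress, Node... children) {
            super(maxAddress, children);
        }
    }

    // Stand-in for a token; 'relevant' marks what would be a
    // ClangVariableToken or ClangFuncNameToken in Ghidra.
    static class Token extends Node {
        final String text;
        final boolean relevant;

        Token(String text, boolean relevant) {
            super((Long) null);
            this.text = text;
            this.relevant = relevant;
        }
    }

    // Depth-first search for the statement whose last address is the call site.
    static Statement findStatement(Node node, long callAddress) {
        if (node instanceof Statement && node.maxAddress != null
                && node.maxAddress == callAddress) {
            return (Statement) node;
        }
        for (Node child : node.children) {
            Statement hit = findStatement(child, callAddress);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Keep only the relevant tokens, dropping whitespace, punctuation, etc.
    static void collectTokens(Node node, List<String> out) {
        if (node instanceof Token && ((Token) node).relevant) {
            out.add(((Token) node).text);
        }
        for (Node child : node.children) {
            collectTokens(child, out);
        }
    }

    // Returns the constant following operator_new, or -1 if it is not a literal.
    static long parseSize(List<String> tokens) {
        int i = tokens.indexOf("operator_new");
        if (i < 0 || i + 1 >= tokens.size()) {
            return -1;
        }
        try {
            return Long.decode(tokens.get(i + 1));
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // Made-up tree: two statements, the second one allocates 0x28 bytes.
    static Node sample() {
        Statement other = new Statement(0x401004L,
            new Token("iVar2", true), new Token(" = ", false), new Token("0", true));
        Statement alloc = new Statement(0x401010L,
            new Token("pvVar1", true), new Token(" = ", false),
            new Token("operator_new", true), new Token("(", false),
            new Token("0x28", true), new Token(")", false));
        return new Node(0x401010L, other, alloc);
    }

    public static void main(String[] args) {
        Statement stmt = findStatement(sample(), 0x401010L);
        List<String> tokens = new ArrayList<>();
        collectTokens(stmt, tokens);
        // prints "[pvVar1, operator_new, 0x28] -> size 40"
        System.out.println(tokens + " -> size " + parseSize(tokens));
    }
}
```

Returning -1 for a non-literal argument keeps the sanity check explicit: such call sites need the P-Code based analysis instead of string parsing.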