@jtpaasch
Created February 16, 2024 19:13
Parsing C++ ASTs with Clang

Clang AST parsing

Quick start for using Clang to parse C++ and get an AST.

Some useful resources on this matter:

Installing clang

Download the latest Ubuntu docker image.

nerdctl pull ubuntu:latest

Run the docker container:

cd ~
nerdctl run --rm -ti -v $(pwd):/external -w /external ubuntu:latest bash

Update packages:

apt update
apt install build-essential

Download the latest llvm-clang package from https://github.com/llvm/llvm-project/releases/.

Uncompress the folder, and move it somewhere, e.g.:

mv ~/Downloads/clang+llvm-16.0.4... /usr/local/clang-16

Set your PATH so you can see the installation, e.g.:

export PATH="/usr/local/clang-16/bin":${PATH}

Create a directory to do some clang foo in, e.g.:

mkdir -p ~/code/clang-foo
cd ~/code/clang-foo

Make sure you have clang and can build with it. E.g., create a file toy.cpp:

int main() {
    return 3;
}

Compile and run it:

clang++ -o toy.exe toy.cpp
./toy.exe
echo $?  # should be 3

Another thing to do is dump the AST:

clang++ -Xclang -ast-dump -fsyntax-only toy.cpp

Make sure you can build with clang's library. E.g., create a file main.cpp:

#include <iostream>
#include <clang-c/Index.h>

int main() {
    std::cout << "Running..." << std::endl;
}

Compile and run it, telling clang where the include files are:

clang++ -I /usr/local/clang-16/include -o main.exe main.cpp
./main.exe

Using Clang to parse and walk an AST

Change main.cpp to this:

#include <iostream>
#include <clang-c/Index.h>

int main() {

    // Create an index. An index is the index of translation units
    // that go together to make up an executable or library.
    int IGNORE_LOCAL_DECLS = 0;
    int DIAGNOSTICS = 0;
    CXIndex index = clang_createIndex(IGNORE_LOCAL_DECLS, DIAGNOSTICS);

    // Parse a C++ file. This yields a translation unit if it succeeds,
    // or a null pointer if it fails. To get an error code that says why,
    // use `clang_parseTranslationUnit2`. Here we keep it simple.
    CXTranslationUnit unit = clang_parseTranslationUnit(
        index, // The index to register this translation unit's AST with.
        "toy.cpp", // The file to parse and build an AST for.
        nullptr, // We have no command line args to pass to Clang.
        0, // Num of command line args to pass to Clang is 0.
        nullptr, // We have no unsaved (in-memory) files to supply.
        0, // Num of unsaved files is 0.
        CXTranslationUnit_None // No special options.
        );

    // If clang failed to parse `toy.cpp`, show a message and exit.
    if (unit == nullptr) {
        std::cerr << "Can't parse toy.cpp." << std::endl;
        clang_disposeIndex(index); // clean up
        return 1;
    }

    // Clean up.
    clang_disposeTranslationUnit(unit);
    clang_disposeIndex(index);

}

Compile it, telling clang where the include files are, where the library files are, and the name of the library to link against:

clang++ -I /usr/local/clang-16/include -L /usr/local/clang-16/lib -l clang -o main.exe main.cpp

Run it, telling the OS where the library can be found:

LD_LIBRARY_PATH=/usr/local/clang-16/lib ./main.exe
echo $?  # should be 0

Let's modify main.cpp now, and walk the AST. Here is an example:

#include <iostream>
#include <clang-c/Index.h>

// Print info about what kind of node k is.
void printNameOfNodeKind(int k) {
    switch (k) {
        case CXCursor_FunctionDecl:
            std::cout << "Function declaration" << std::endl;
            break;
        case CXCursor_CompoundStmt:
            std::cout << "Compound statement" << std::endl;
            break;
        case CXCursor_ReturnStmt:
            std::cout << "Return statement" << std::endl;
            break;
        case CXCursor_IntegerLiteral:
            std::cout << "Integer literal" << std::endl;
            break;
        default:
            std::cout << "Unknown kind" << std::endl;
            break;
    }
}

// Visit the node the cursor is pointing at.
CXChildVisitResult visitor(
    CXCursor child,
    CXCursor parent,
    CXClientData data
) {
    int kind = clang_getCursorKind(child);
    printNameOfNodeKind(kind);
    return CXChildVisit_Recurse; // Recurse down to the node's children.
}

int main() {

    // Create an index. An index is the index of translation units
    // that go together to make up an executable or library.
    int IGNORE_LOCAL_DECLS = 0;
    int DIAGNOSTICS = 0;
    CXIndex index = clang_createIndex(IGNORE_LOCAL_DECLS, DIAGNOSTICS);

    // Parse a C++ file. This yields a translation unit if it succeeds,
    // or a null pointer if it fails. To get an error code that says why,
    // use `clang_parseTranslationUnit2`. Here we keep it simple.
    CXTranslationUnit unit = clang_parseTranslationUnit(
        index, // The index to register this translation unit's AST with.
        "toy.cpp", // The file to parse and build an AST for.
        nullptr, // We have no command line args to pass to Clang.
        0, // Num of command line args to pass to Clang is 0.
        nullptr, // We have no unsaved (in-memory) files to supply.
        0, // Num of unsaved files is 0.
        CXTranslationUnit_None // No special options.
        );

    // If clang failed to parse `toy.cpp`, show a message and exit.
    if (unit == nullptr) {
        std::cerr << "Can't parse toy.cpp." << std::endl;
        clang_disposeIndex(index); // clean up
        return 1;
    }

    // Get a cursor. The cursor returned here points to the root of
    // the AST, i.e., the translation unit itself.
    CXCursor cursor = clang_getTranslationUnitCursor(unit);

    // Visit the children of the cursor.
    clang_visitChildren(
        cursor,  // The cursor whose children we want to visit.
        visitor, // Call this at each node the cursor visits.
        nullptr  // We have no data to pass to the visitor.
        );

    // Clean up.
    clang_disposeTranslationUnit(unit);
    clang_disposeIndex(index);

}

Here we get a cursor, and we use the clang_visitChildren function to visit each child node of the cursor. We provide a custom visitor function, which clang calls at each node. In that function, we do little more than print out what kind of node it is. You can find out more about different kinds of nodes in the documentation, e.g., at https://clang.llvm.org/doxygen/group__CINDEX.html (Ctrl+f "CXCursorKind").

Compile the above file and run it:

clang++ -I /usr/local/clang-16/include -L /usr/local/clang-16/lib -l clang -o main.exe main.cpp
LD_LIBRARY_PATH=/usr/local/clang-16/lib ./main.exe

It should print out something like this:

Function declaration
Compound statement
Return statement
Integer literal

This makes sense. The walk recursed through the AST of toy.cpp: the top-level node is a function declaration (main()), whose child is a compound statement (the body of the function). That, in turn, has a return statement as its child, and the return statement has the returned expression, the integer literal 3, as its child.
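
To see why returning CXChildVisit_Recurse matters, here is a toy model of the traversal that does not use libclang at all. Everything in it (VisitResult, Node, visitChildren, makeToyAst, kindsVisited) is a hypothetical stand-in written for this illustration. It only mimics the documented meaning of the three visit results: break stops the walk, continue moves on to the next sibling, and recurse descends into the children first.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for libclang's CXChildVisit_Break,
// CXChildVisit_Continue, and CXChildVisit_Recurse.
enum class VisitResult { Break, Continue, Recurse };

// Hypothetical toy AST node: a kind name plus child nodes.
struct Node {
    std::string kind;
    std::vector<Node> children;
};

// Visit each child of `parent` in order, honoring the visitor's result.
// Returns true if the walk was cut short by Break.
template <typename Visitor>
bool visitChildren(const Node &parent, Visitor &&visit) {
    for (const Node &child : parent.children) {
        VisitResult result = visit(child, parent);
        if (result == VisitResult::Break) {
            return true;
        }
        if (result == VisitResult::Recurse && visitChildren(child, visit)) {
            return true;
        }
        // Continue: skip this child's subtree, go to the next sibling.
    }
    return false;
}

// Build the shape of toy.cpp's AST under a root (translation unit) node.
Node makeToyAst() {
    Node literal{"IntegerLiteral", {}};
    Node ret{"ReturnStmt", {literal}};
    Node body{"CompoundStmt", {ret}};
    Node fn{"FunctionDecl", {body}};
    return Node{"TranslationUnit", {fn}};
}

// Collect the kinds seen when every visit returns `result`.
std::vector<std::string> kindsVisited(VisitResult result) {
    std::vector<std::string> seen;
    visitChildren(makeToyAst(), [&](const Node &child, const Node &) {
        seen.push_back(child.kind);
        return result;
    });
    return seen;
}
```

In this model, kindsVisited(VisitResult::Recurse) yields FunctionDecl, CompoundStmt, ReturnStmt, IntegerLiteral, in the same order as the output above. With VisitResult::Continue or VisitResult::Break, it yields only FunctionDecl, because the walk never descends below the root's direct child.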

To confirm that this is indeed what Clang sees as the AST, check the AST:

clang++ -Xclang -ast-dump -fsyntax-only toy.cpp

Note in particular the last four lines of the dump:

`-FunctionDecl 0x55b17aae1270 <toy.cpp:1:1, line:3:1> line:1:5 main 'int ()'
  `-CompoundStmt 0x55b17aae13c0 <col:12, line:3:1>
    `-ReturnStmt 0x55b17aae13b0 <line:2:5, col:12>
      `-IntegerLiteral 0x55b17aae1390 <col:12> 'int' 3