@efric
Last active September 10, 2023 20:12
GSoC 2023: GCC static analyzer plugin for CPython extension modules

Google Summer of Code 2023

Project Description

One crucial use case of the gcc-python-plugin was its role in supporting cpychecker, a static analysis tool for CPython extension module code. Cpychecker's primary purpose was to help programmers writing extensions identify common coding mistakes. The gcc-python-plugin has bitrotted over the years and, in particular, cpychecker became non-functional in some of the later GCC releases. The main objective of this project was to reintroduce the features of cpychecker, but as a GCC static analyzer plugin.

Contributions

While my initial proposal outlined several areas of interest, my mentor, David Malcolm, and I collectively chose to concentrate on the reference count checking feature of the project. We believed this would be the most beneficial to extension module developers. Briefly, the reference count of a PyObject* object in CPython represents the number of references held to that object; while the count is above zero the object stays alive, and when it drops to zero the object can be deallocated. Reference counts are meant to be manipulated explicitly through the Python/C API with entrypoints such as Py_INCREF and Py_DECREF. For a more detailed and up-to-date reference, please consult the Python/C API Reference Manual.
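
As a rough illustration of that lifecycle, here is a minimal toy model in C. It is only a sketch: ToyObject, toy_incref, toy_decref and toy_demo are made-up names that mimic the idea behind Py_INCREF and Py_DECREF, not CPython's actual implementation, and "deallocation" is modeled by setting a flag rather than freeing memory.

```c
#include <assert.h>

/* Toy stand-in for PyObject: only the reference count is modeled.
   (Hypothetical type; the field is merely named after CPython's ob_refcnt.) */
typedef struct ToyObject {
    long ob_refcnt;
    int  freed;      /* set instead of calling free(), so callers can inspect it */
} ToyObject;

/* Counterpart of Py_INCREF: record one more owner. */
void toy_incref(ToyObject *op) {
    op->ob_refcnt++;
}

/* Counterpart of Py_DECREF: drop one owner; "deallocate" when none remain. */
void toy_decref(ToyObject *op) {
    assert(op->ob_refcnt > 0);
    if (--op->ob_refcnt == 0)
        op->freed = 1;
}

/* Walks one object through its life; returns 1 if it stayed alive while
   referenced and was released exactly when the count reached zero. */
int toy_demo(void) {
    ToyObject obj = { 1, 0 };   /* a new object starts with one reference */
    toy_incref(&obj);           /* second owner: count is 2 */
    toy_decref(&obj);           /* count is 1, object still alive */
    int alive_while_referenced = !obj.freed;
    toy_decref(&obj);           /* count is 0, object released */
    return alive_while_referenced && obj.freed;
}
```

A forgotten toy_decref in this model corresponds to a leak, and an extra one to a premature deallocation; these are the two classes of mistake a reference count checker looks for.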

The overarching strategy I employed for tracking the reference counts of PyObject* objects is derived from the original cpychecker. The plugin attempts to analyze all paths through each function, keeping track of the various PyObject* objects along the way. Upon reaching the end of the function, it compares the expected reference counts with the observed ones and flags any PyObject* object whose count does not match.

Below is an example of a reference count mismatch diagnostic in the plugin's present iteration:

rc3.c:23:10: warning: expected ‘item’ to have reference count: N + ‘1’ but ob_refcnt field is N + ‘2’
   23 |   return list;
      |          ^~~~
  ‘create_py_object’: events 1-4
    |
    |    4 |   PyObject* item = PyLong_FromLong(3);
    |      |                    ^~~~~~~~~~~~~~~~~~
    |      |                    |
    |      |                    (1) when ‘PyLong_FromLong’ succeeds
    |    5 |   PyObject* list = PyList_New(1);
    |      |                    ~~~~~~~~~~~~~
    |      |                    |
    |      |                    (2) when ‘PyList_New’ succeeds
    |......
    |   14 |   PyList_Append(list, item);
    |      |   ~~~~~~~~~~~~~~~~~~~~~~~~~
    |      |   |
    |      |   (3) when ‘PyList_Append’ succeeds, without moving underlying buffer
    |......
    |   23 |   return list;
    |      |          ~~~~
    |      |          |
    |      |          (4) here`
    |
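
The diagnostic above can be read as a bookkeeping check performed at the return statement. The sketch below is a toy version of that comparison, not the plugin's code: TrackedObject, check_at_return and demo_rc3_path are hypothetical names, and the counts in the demo are one reading of the events listed in the diagnostic (one reference from PyLong_FromLong, one more from PyList_Append, and only the list's reference expected to survive).

```c
#include <assert.h>
#include <stddef.h>

/* One object followed along a single path through a function.
   (Hypothetical bookkeeping record, for illustration only.) */
typedef struct TrackedObject {
    const char *name;
    long observed;   /* references added along the path: the k in "N + k" */
    long expected;   /* references that should outlive the function */
} TrackedObject;

/* Returns how many objects mismatch; stores the first offender's name in
   *first_bad, or NULL if every count matches. */
size_t check_at_return(const TrackedObject *objs, size_t n,
                       const char **first_bad) {
    assert(objs != NULL && first_bad != NULL);
    size_t mismatches = 0;
    *first_bad = NULL;
    for (size_t i = 0; i < n; i++) {
        if (objs[i].observed != objs[i].expected) {
            if (mismatches == 0)
                *first_bad = objs[i].name;
            mismatches++;
        }
    }
    return mismatches;
}

/* The success path from the diagnostic, written as bookkeeping entries. */
size_t demo_rc3_path(const char **first_bad) {
    TrackedObject objs[] = {
        /* item: +1 when created, +1 when appended; only the list's
           reference should remain once the function returns. */
        { "item", 2, 1 },
        /* list: +1 when created; that reference is handed to the caller. */
        { "list", 1, 1 },
    };
    return check_at_return(objs, sizeof objs / sizeof objs[0], first_bad);
}
```

In this model the mismatch on item goes away once the function releases its own reference before returning, which lowers the observed count to match the expected one.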

The plugin can also detect leaks of newly created PyObject* objects.

rc3.c:23:10: warning: leak of ‘item’ [CWE-401] [-Wanalyzer-malloc-leak]
   23 |   return list;
      |          ^~~~
  ‘create_py_object’: events 1-5
    |
    |    4 |   PyObject* item = PyLong_FromLong(3);
    |      |                    ^~~~~~~~~~~~~~~~~~
    |      |                    |
    |      |                    (1) allocated here
    |      |                    (2) when ‘PyLong_FromLong’ succeeds
    |    5 |   PyObject* list = PyList_New(1);
    |      |                    ~~~~~~~~~~~~~
    |      |                    |
    |      |                    (3) when ‘PyList_New’ succeeds
    |......
    |   14 |   PyList_Append(list, item);
    |      |   ~~~~~~~~~~~~~~~~~~~~~~~~~
    |      |   |
    |      |   (4) when ‘PyList_Append’ fails
    |......
    |   23 |   return list;
    |      |          ~~~~
    |      |          |
    |      |          (5) ‘item’ leaks here; was allocated at (1)

To "teach" the analyzer to recognize and accurately simulate the behaviors of the API entrypoints highlighted above, known function subclasses were implemented for PyList_New, PyLong_FromLong, and PyList_Append. Additionally, various enhancements were made to the core analyzer to accommodate the features required by the plugin. For instance, the core analyzer was updated with a hook that allows analyzer plugins to look up types and globals by name in the C frontend. This gives the plugin access to the tree structures representing Python object-specific fields in the source code, such as ob_item in PyListObject, which makes a detailed simulation possible.

For those interested, please find a detailed list of patches I've contributed here.

Using the plugin

At the time of writing, the plugin is still considered experimental. It lives in the GCC testsuite at gcc/testsuite/gcc.dg/plugin/analyzer_cpython_plugin.c. To test the plugin, you can build GCC from source and follow the guidelines provided in the GCC wiki on building plugins. For reference, my Makefile is structured as:

HOST_GCC=g++-dev
TARGET_GCC=gcc-dev
CPYTHON_SOURCE_FILES= analyzer_cpython_plugin.c
GCC_DIR = /your/path/to/gcc
GCCPLUGINS_DIR:= $(shell $(TARGET_GCC) -print-file-name=plugin)
CXXFLAGS+= -I$(GCC_DIR) -I$(GCCPLUGINS_DIR)/include -fPIC -fno-rtti -g

analyzer_cpython_plugin.so: $(CPYTHON_SOURCE_FILES)
	$(HOST_GCC) -shared $(CXXFLAGS) $^ -o $@

To use the plugin, pass the option that activates the static analyzer (-fanalyzer), specify the plugin (-fplugin=/path/to/plugin.so), and provide the path to your desired Python header files (e.g., -I/usr/include/python3.9) alongside the file you wish to analyze. For instance, from the gcc subdirectory of the build directory:

./xgcc -B. -S -fanalyzer -fplugin=/path/to/analyzer_cpython_plugin.so -I/usr/include/python3.9 <filename.c>

Future Work

I initially aimed to reproduce the behaviors of specific Python/C API entrypoints for static analysis. However, the sheer number and intricacy of these APIs soon made it evident that this approach would not scale effectively. A potential avenue for future exploration is creating custom function attributes. If accepted, these attributes could express the essential behaviors of the entrypoints, which the plugin could then use for static analysis. Though this alternative would not achieve the precision of hardcoding every entrypoint's behavior, it promises greater scalability. For more information regarding this proposition, please refer to the ongoing thread here.
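
To give a feel for what attribute-based annotations might look like, here is a purely illustrative sketch. The attribute names (returns_new_ref, steals_ref_to_arg), the feature macro and the toy functions are all hypothetical; neither GCC nor the plugin defines them, and the proposal discussed in the thread may look quite different. The macros expand to nothing unless the made-up feature macro is defined, so the code compiles with any C compiler.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL attributes, for illustration only. */
#if defined(WITH_HYPOTHETICAL_REFCOUNT_ATTRIBUTES)
#  define RETURNS_NEW_REF       __attribute__((returns_new_ref))
#  define STEALS_REF_TO_ARG(n)  __attribute__((steals_ref_to_arg(n)))
#else
#  define RETURNS_NEW_REF
#  define STEALS_REF_TO_ARG(n)
#endif

/* Toy stand-in for PyObject (not CPython's type). */
typedef struct ToyObject { long ob_refcnt; } ToyObject;

/* Annotation says: the caller owns the returned reference. */
RETURNS_NEW_REF ToyObject *toy_new(ToyObject *storage);

/* Annotation says: ownership of argument 2 passes to the container slot. */
STEALS_REF_TO_ARG(2) int toy_set_slot(ToyObject **slot, ToyObject *item);

ToyObject *toy_new(ToyObject *storage) {
    assert(storage != NULL);
    storage->ob_refcnt = 1;   /* new reference, owned by the caller */
    return storage;
}

int toy_set_slot(ToyObject **slot, ToyObject *item) {
    assert(slot != NULL);
    *slot = item;             /* reference is taken over, so no increment */
    return 0;
}
```

With annotations of this kind, a checker would only need the declarations to know that the result of toy_new must eventually be released or handed off, and that passing an object to toy_set_slot counts as handing it off, without a hand-written model of each function body.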

Misc Learnings

Participating in GSoC was not only a rich technical experience but also provided invaluable lessons in project management and workflow efficiency. For instance, while I did achieve a precise analysis of PyList_Append, it consumed significant time that could perhaps have been better spent on other components of the plugin. Such in-depth pursuits also often produced sprawling exploratory code, which later demanded substantial cleanup to become patch-ready.

It became evident that smaller, incremental steps and more frequent patch submissions would have been beneficial. Such an approach makes incremental cleanup for patch submission easier and also makes the work more transparent, allowing the community to better understand and engage with it.
