nascheme/gist:6a79b28736136764c90274a2e114a2ba

## gistfile1.txt
See:
  https://public.etherpad-mozilla.org/p/cpython-dev-sprint-2017
  https://mail.python.org/pipermail/python-dev/2017-September/149307.html

Motivation
----------

Python objects that participate in cyclic GC (things like lists, dicts,
sets but not strings, ints and floats) have extra memory overhead.  I
think it is possible to mostly eliminate this overhead.  Also, while
the GC is running, this GC state is mutated.  That hurts copy-on-write
memory page optimizations because the next/prev/refs memory locations are
mutated.  This change would mostly fix that issue.

Background
----------

All objects that participate in cyclic GC have the Py_TPFLAGS_HAVE_GC
bit set in their type.  That causes an extra chunk of memory to be
allocated *before* the ob_refcnt struct member.  This is the PyGC_Head
struct.

The whole object looks like this in memory (PyObject pointer is at
arrow):

    union __gc_head *gc_next;
    union __gc_head *gc_prev;
    Py_ssize_t gc_refs;
-->
    Py_ssize_t ob_refcnt
    struct _typeobject *ob_type;
    [rest of PyObject members]


So, 24 bytes of overhead on a 64-bit machine.  The smallest Python
object that can have a pointer to another object (e.g. a single PyObject
* member) is 48 bytes.  Removing PyGC_Head would cut the size of these
objects in half.

Carl Shaprio questioned me today on why we use a double linked-list and
not the memory bitmap.  I think the answer is that there is no good
reason. We use a double linked list only due to historical constraints
that are no longer present.

Long ago, Python objects could be allocated using the system malloc or
other memory allocators.  Since we could not control the memory
location, bitmaps would be inefficient.  Today, we allocate all Python
objects via our own function.  Python objects under a certain size are
allocated using our own malloc, obmalloc, and are stored in memory
blocks known "arenas".

The PyGC_Head struct performs four main functions.  First, it allows
the GC to find all Python objects that will be checked for cycles
(i.e. follow the linked list).  Second, it stores a single logical
bit of information to let the GC know if it is safe to traverse the
object, set with PyObject_GC_Track().  It has a scratch area to
compute the effective reference count while tracing refs (gc_refs).
Finally, it allows the set of GC tracked objects to be partitioned
into sets (e.g. reachable, finalizer reachable).

Prototype design
----------------

Here is a sketch of how we can remove the PyGC_Head struct for small
objects (say less than 512 bytes).  Large objects or objects created by
a different memory allocator will still have the PyGC_Head overhead.

* Have memory arenas that contain only objects with the
  Py_TPFLAGS_HAVE_GC flag.  Objects like ints, strings, etc will be
  in different arenas, not have bitmaps, not be looked at by the
  cyclic GC.

* For those arenas, add a memory bitmap.  The bitmap is a bit array that
  has a bit for each fixed size object in the arena.  The memory used by
  the bitmap is a fraction of what is needed by PyGC_Head.  E.g. an
  arena that holds up to 1024 objects of 48 bytes in size would have a
  bitmap of 1024 bits.

* Maybe need two bits per object, e.g. 0 = untracked, 1 = gen1,
  2 = gen2, 3 = gen3

* The bits will be set and cleared by PyObject_GC_Track/Untrack()

* We also need an array of Py_ssize_t to take over the job of gc_refs.
  That could be allocated only when GC is working and it only needs to
  be the size of the number of true bits in the bitmap.  Or, it could be
  allocated when the arena is allocated and be sized for the full arena.

* Objects that are too large would still get the PyGC_Head struct
  allocated "in front" of the PyObject.  Because they are big, the
  overhead is not so bad.

* The GC process would work nearly the same as it does now.  Rather than
  only traversing the linked list, we would also have to crawl over the
  GC object arenas, check blocks of memory that have the tracked bit
  set.

There are a lot of smaller details to work out but I see no reason
why the idea should not work.  It should significantly reduce memory
usage.  Also, because the bitmap and gc_refs are contiguous in
memory, locality will be improved.  Łukasz Langa has mentioned that
the current GC causes issues with copy-on-write memory in big
applications.  This change should solve that issue.

To implement, I think the easiest path is to create new malloc to be
used by small GC objects, e.g. gcmalloc.c.  It would be similar to
obmalloc but have the features needed to keep track of the bitmap.
obmalloc has some quirks that makes it hard to use for this purpose.

The make the first version even easier, leave the gc_refs part of
the head on small objects.  Just remove the prev/next points and
use the bitmap to find and iterate over the objects.

Once the idea is proven, gcmalloc could be merged or made to be a
variation of obmalloc.  Or, maybe just optimized and remain
separate.  obmalloc is complicated and highly optimized.  So, adding
additional functionality to it will be challenging.

I believe this change would be ABI compatible.
	See:
	https://public.etherpad-mozilla.org/p/cpython-dev-sprint-2017
	https://mail.python.org/pipermail/python-dev/2017-September/149307.html

	Motivation
	----------

	Python objects that participate in cyclic GC (things like lists, dicts,
	sets but not strings, ints and floats) have extra memory overhead. I
	think it is possible to mostly eliminate this overhead. Also, while
	the GC is running, this GC state is mutated. That hurts copy-on-write
	memory page optimizations because the next/prev/refs memory locations are
	mutated. This change would mostly fix that issue.

	Background
	----------

	All objects that participate in cyclic GC have the Py_TPFLAGS_HAVE_GC
	bit set in their type. That causes an extra chunk of memory to be
	allocated before the ob_refcnt struct member. This is the PyGC_Head
	struct.

	The whole object looks like this in memory (PyObject pointer is at
	arrow):

	union __gc_head *gc_next;
	union __gc_head *gc_prev;
	Py_ssize_t gc_refs;
	-->
	Py_ssize_t ob_refcnt
	struct _typeobject *ob_type;
	[rest of PyObject members]


	So, 24 bytes of overhead on a 64-bit machine. The smallest Python
	object that can have a pointer to another object (e.g. a single PyObject
	* member) is 48 bytes. Removing PyGC_Head would cut the size of these
	objects in half.

	Carl Shaprio questioned me today on why we use a double linked-list and
	not the memory bitmap. I think the answer is that there is no good
	reason. We use a double linked list only due to historical constraints
	that are no longer present.

	Long ago, Python objects could be allocated using the system malloc or
	other memory allocators. Since we could not control the memory
	location, bitmaps would be inefficient. Today, we allocate all Python
	objects via our own function. Python objects under a certain size are
	allocated using our own malloc, obmalloc, and are stored in memory
	blocks known "arenas".

	The PyGC_Head struct performs four main functions. First, it allows
	the GC to find all Python objects that will be checked for cycles
	(i.e. follow the linked list). Second, it stores a single logical
	bit of information to let the GC know if it is safe to traverse the
	object, set with PyObject_GC_Track(). It has a scratch area to
	compute the effective reference count while tracing refs (gc_refs).
	Finally, it allows the set of GC tracked objects to be partitioned
	into sets (e.g. reachable, finalizer reachable).

	Prototype design
	----------------

	Here is a sketch of how we can remove the PyGC_Head struct for small
	objects (say less than 512 bytes). Large objects or objects created by
	a different memory allocator will still have the PyGC_Head overhead.

	* Have memory arenas that contain only objects with the
	Py_TPFLAGS_HAVE_GC flag. Objects like ints, strings, etc will be
	in different arenas, not have bitmaps, not be looked at by the
	cyclic GC.

	* For those arenas, add a memory bitmap. The bitmap is a bit array that
	has a bit for each fixed size object in the arena. The memory used by
	the bitmap is a fraction of what is needed by PyGC_Head. E.g. an
	arena that holds up to 1024 objects of 48 bytes in size would have a
	bitmap of 1024 bits.

	* Maybe need two bits per object, e.g. 0 = untracked, 1 = gen1,
	2 = gen2, 3 = gen3

	* The bits will be set and cleared by PyObject_GC_Track/Untrack()

	* We also need an array of Py_ssize_t to take over the job of gc_refs.
	That could be allocated only when GC is working and it only needs to
	be the size of the number of true bits in the bitmap. Or, it could be
	allocated when the arena is allocated and be sized for the full arena.

	* Objects that are too large would still get the PyGC_Head struct
	allocated "in front" of the PyObject. Because they are big, the
	overhead is not so bad.

	* The GC process would work nearly the same as it does now. Rather than
	only traversing the linked list, we would also have to crawl over the
	GC object arenas, check blocks of memory that have the tracked bit
	set.

	There are a lot of smaller details to work out but I see no reason
	why the idea should not work. It should significantly reduce memory
	usage. Also, because the bitmap and gc_refs are contiguous in
	memory, locality will be improved. Łukasz Langa has mentioned that
	the current GC causes issues with copy-on-write memory in big
	applications. This change should solve that issue.

	To implement, I think the easiest path is to create new malloc to be
	used by small GC objects, e.g. gcmalloc.c. It would be similar to
	obmalloc but have the features needed to keep track of the bitmap.
	obmalloc has some quirks that makes it hard to use for this purpose.

	The make the first version even easier, leave the gc_refs part of
	the head on small objects. Just remove the prev/next points and
	use the bitmap to find and iterate over the objects.

	Once the idea is proven, gcmalloc could be merged or made to be a
	variation of obmalloc. Or, maybe just optimized and remain
	separate. obmalloc is complicated and highly optimized. So, adding
	additional functionality to it will be challenging.

	I believe this change would be ABI compatible.