Golang GC study

GC

Getting to Go: The Journey of Go's Garbage Collector

General

mgc.go

The Go Garbage Collector

// mgc.go

// Garbage collector (GC).
//
// The GC runs concurrently with mutator threads, is type accurate (aka precise), allows multiple
// GC thread to run in parallel. It is a concurrent mark and sweep that uses a write barrier. It is
// non-generational and non-compacting. Allocation is done using size segregated per P allocation
// areas to minimize fragmentation while eliminating locks in the common case.

// The algorithm decomposes into several steps.
// This is a high level description of the algorithm being used. For an overview of GC a good
// place to start is Richard Jones' gchandbook.org.

// 1. GC performs sweep termination.
//
//    a. Stop the world. This causes all Ps to reach a GC safe-point.
//
//    b. Sweep any unswept spans. There will only be unswept spans if
//    this GC cycle was forced before the expected time.
//
// 2. GC performs the mark phase.
//
//    a. Prepare for the mark phase by setting gcphase to _GCmark
//    (from _GCoff), enabling the write barrier, enabling mutator
//    assists, and enqueueing root mark jobs. No objects may be
//    scanned until all Ps have enabled the write barrier, which is
//    accomplished using STW.
//
//    b. Start the world. From this point, GC work is done by mark
//    workers started by the scheduler and by assists performed as
//    part of allocation. The write barrier shades both the
//    overwritten pointer and the new pointer value for any pointer
//    writes (see mbarrier.go for details). Newly allocated objects
//    are immediately marked black.
//
//    c. GC performs root marking jobs. This includes scanning all
//    stacks, shading all globals, and shading any heap pointers in
//    off-heap runtime data structures. Scanning a stack stops a
//    goroutine, shades any pointers found on its stack, and then
//    resumes the goroutine.
//
//    d. GC drains the work queue of grey objects, scanning each grey
//    object to black and shading all pointers found in the object
//    (which in turn may add those pointers to the work queue).
//
//    e. Because GC work is spread across local caches, GC uses a
//    distributed termination algorithm to detect when there are no
//    more root marking jobs or grey objects (see gcMarkDone). At this
//    point, GC transitions to mark termination.
//
// 3. GC performs mark termination.
//
//    a. Stop the world.
//
//    b. Set gcphase to _GCmarktermination, and disable workers and
//    assists.
//
//    c. Perform housekeeping like flushing mcaches.
//
// 4. GC performs the sweep phase.
//
//    a. Prepare for the sweep phase by setting gcphase to _GCoff,
//    setting up sweep state and disabling the write barrier.
//
//    b. Start the world. From this point on, newly allocated objects
//    are white, and allocating sweeps spans before use if necessary.
//
//    c. GC does concurrent sweeping in the background and in response
//    to allocation. See description below.
//
// 5. When sufficient allocation has taken place, replay the sequence
// starting with 1 above. See discussion of GC rate below.

The Tricolour Abstraction

[Dijkstra et al., 1976, 1978]

  • Initially, all objects are white.
  • Scan and find all reachable objects, mark them grey, and put them into a queue of objects pending processing.
  • Take grey objects out of the queue, mark the objects they reference as grey and enqueue them, then mark the grey objects themselves black.
  • Watch for memory writes with the write barrier and re-mark the affected objects' colours or put them back into the queue.
  • After all scanning and marking work is finished, the remaining objects can only be white or black, representing objects pending reclamation and live objects respectively; sweeping only needs to reclaim the memory of the white objects.
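
A minimal single-goroutine sketch of these colour transitions (a hypothetical object graph, no write barrier), just to make the algorithm concrete:

package main

import "fmt"

type color int

const (
	white color = iota // not yet visited: candidate for collection
	grey               // reachable, but its references are not scanned yet
	black              // reachable and fully scanned
)

func (c color) String() string { return [...]string{"white", "grey", "black"}[c] }

type object struct {
	name  string
	refs  []*object
	color color
}

// mark runs tricolour marking from the given roots.
func mark(roots []*object) {
	var queue []*object
	for _, r := range roots {
		r.color = grey
		queue = append(queue, r)
	}
	for len(queue) > 0 {
		obj := queue[0]
		queue = queue[1:]
		for _, ref := range obj.refs {
			if ref.color == white {
				ref.color = grey
				queue = append(queue, ref)
			}
		}
		obj.color = black
	}
	// Anything still white after marking is unreachable and can be swept.
}

func main() {
	a, b, c := &object{name: "a"}, &object{name: "b"}, &object{name: "c"}
	a.refs = []*object{b} // c is unreachable
	mark([]*object{a})
	for _, o := range []*object{a, b, c} {
		fmt.Println(o.name, o.color) // a and b end up black, c stays white
	}
}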

GC Controller

mgc.go

// mgc.go

// gcController implements the GC pacing controller that determines when to trigger concurrent 
// garbage collection and how much marking work to do in mutator assists and background marking.

// It uses a feedback control algorithm to adjust the memstats.next_gc trigger based on the heap 
// growth and GC CPU utilization each cycle.

// This algorithm optimizes for heap growth to match GOGC and for CPU utilization between assist 
// and background marking to be 25% of GOMAXPROCS.

// The high-level design of this algorithm is documented at https://golang.org/s/go15gcpacing.
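
As a rough sketch of the relationship the pacer maintains (simplified, not the runtime's code; the names are illustrative):

package main

import "fmt"

// pacingSketch derives the heap goal and the GC trigger from the heap
// retained by the previous mark phase, the GOGC value, and the
// feedback-adjusted trigger ratio (kept below GOGC/100).
func pacingSketch(heapMarked uint64, gogc int, triggerRatio float64) (goal, trigger uint64) {
	// Goal: let the heap grow by GOGC percent over the last marked heap.
	goal = heapMarked + heapMarked*uint64(gogc)/100
	// Trigger: start the next cycle early enough that concurrent marking
	// can finish before the heap reaches the goal.
	trigger = uint64(float64(heapMarked) * (1 + triggerRatio))
	return goal, trigger
}

func main() {
	// With 4 MB marked, GOGC=100 and a trigger ratio of 7/8, the next
	// cycle is triggered at 7.5 MB and the goal is 8 MB.
	goal, trigger := pacingSketch(4<<20, 100, 7/8.0)
	fmt.Println(goal>>20, "MB goal,", float64(trigger)/(1<<20), "MB trigger")
}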

Mutator Assist

mgc.go

// mgc.go

// Concurrent marking happens through four different mechanisms. One
// is mutator assists, which happen in response to allocations and are
// not scheduled. The other three are variations in the per-P mark
// workers and are distinguished by gcMarkWorkerMode.

// gcMarkWorkerMode represents the mode that a concurrent mark worker
// should operate in.

In some cases, objects may be allocated much faster than background marking can keep up with. This leads to a series of bad effects, such as runaway heap growth, and in the worst case the GC may never complete.

In this case it becomes necessary to make the user goroutines participate in background marking. By triggering some amount of GC work during object allocation, allocation and reclamation are kept in balance and the process stays in a healthy state.

Initialization

mgc.go

Initialization is simple; the key point is setting the gcpercent and next_gc thresholds.

// mgc.go

// GC rate.
// Next GC is after we've allocated an extra amount of memory proportional to
// the amount already in use. The proportion is controlled by GOGC environment variable
// (100 by default). If GOGC=100 and we're using 4M, we'll GC again when we get to 8M
// (this mark is tracked in next_gc variable). This keeps the GC cost in linear
// proportion to the allocation cost. Adjusting GOGC just changes the linear constant
// (and also the amount of extra memory used).

// heapminimum is the minimum heap size at which to trigger GC.
// For small heaps, this overrides the usual GOGC*live set rule.
//
// When there is a very small live set but a lot of allocation, simply
// collecting when the heap reaches GOGC*live results in many GC
// cycles and high total per-GC overhead. This minimum amortizes this
// per-GC overhead while keeping the heap reasonably small.
//
// During initialization this is set to 4MB*GOGC/100. In the case of
// GOGC==0, this will set heapminimum to 0, resulting in constant
// collection even when the heap size is small, which is useful for
// debugging.
var heapminimum uint64 = defaultHeapMinimum

// defaultHeapMinimum is the value of heapminimum for GOGC==100.
const defaultHeapMinimum = 4 << 20 // 4MB

// Initialized from $GOGC.  GOGC=off means no GC.
var gcpercent int32

func gcinit() {
	// No sweep on the first cycle.
	mheap_.sweepdone = 1
    
	// Set a reasonable initial GC trigger.
	memstats.triggerRatio = 7 / 8.0
	
	// Fake a heap_marked value so it looks like a trigger at
	// heapminimum is the appropriate growth from heap_marked. 
	// This will go into computing the initial GC goal. 
	memstats.heap_marked = uint64(float64(heapminimum) / (1 + memstats.triggerRatio))
    	
	// Set gcpercent from the environment. This will also compute
	// and set the GC trigger and goal.
	_ = setGCPercent(readgogc())
}

func readgogc() int32 {
	p := gogetenv("GOGC")
	if p == "" {
		return 100
	}
	if p == "off" {
		return -1
	}
	return int32(atoi(p))
}

func setGCPercent(in int32) (out int32) {
	out = gcpercent
	if in < 0 {
		in = -1
	}
	gcpercent = in
	heapminimum = defaultHeapMinimum * uint64(gcpercent) / 100
	// Update pacing in response to gcpercent change.
	gcSetTriggerRatio(memstats.triggerRatio)
	return out
}

// gcSetTriggerRatio sets the trigger ratio and updates everything
// derived from it: the absolute trigger, the heap goal, mark pacing,
// and sweep pacing.
func gcSetTriggerRatio(triggerRatio float64) {
	// Compute the next GC goal, which is when the allocated heap
	// has grown by GOGC/100 over the heap marked by the last
	// cycle.
	goal := ^uint64(0)
	if gcpercent >= 0 {
		goal = memstats.heap_marked + memstats.heap_marked*uint64(gcpercent)/100
	}

	// Set the trigger ratio, capped to reasonable bounds.
	if triggerRatio < 0 {
		// This can happen if the mutator is allocating very
		// quickly or the GC is scanning very slowly.
		triggerRatio = 0
	} else if gcpercent >= 0 {
		// Ensure there's always a little margin so that the
		// mutator assist ratio isn't infinity.
		maxTriggerRatio := 0.95 * float64(gcpercent) / 100
		if triggerRatio > maxTriggerRatio {
			triggerRatio = maxTriggerRatio
		}
	}
	memstats.triggerRatio = triggerRatio

	// Compute the absolute GC trigger from the trigger ratio.
	//
	// We trigger the next GC cycle when the allocated heap has
	// grown by the trigger ratio over the marked heap size.
	trigger := ^uint64(0)
	if gcpercent >= 0 {
		trigger = uint64(float64(memstats.heap_marked) * (1 + triggerRatio))
		// Don't trigger below the minimum heap size.
		minTrigger := heapminimum
		if trigger < minTrigger {
			trigger = minTrigger
		}

		if trigger > goal {
			// The trigger ratio is always less than GOGC/100, but
			// other bounds on the trigger may have raised it.
			// Push up the goal, too.
			goal = trigger
		}
	}

	// Commit to the trigger and goal.
	memstats.gc_trigger = trigger
	memstats.next_gc = goal
	// ...
}

With the defaults from gcinit (GOGC=100), the initial values work out as follows:

// gcpercent = 100
heapminimum = defaultHeapMinimum * uint64(gcpercent) / 100 // heapminimum = 4 MB

memstats.triggerRatio = 7 / 8.0
memstats.heap_marked = uint64(float64(heapminimum) / (1 + memstats.triggerRatio)) // heap_marked = 32/15 MB ≈ 2.13 MB

goal = memstats.heap_marked + memstats.heap_marked*uint64(gcpercent)/100 // goal ≈ 4.27 MB

trigger = uint64(float64(memstats.heap_marked) * (1 + triggerRatio)) // trigger = heap_marked * 15/8 = 4 MB

memstats.next_gc = goal // next_gc ≈ 4.27 MB

// so the initial threshold of next_gc is roughly 4 MB
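
The gcpercent knob can also be changed while the program is running through runtime/debug.SetGCPercent (a negative value disables GC, like GOGC=off). A minimal usage sketch:

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Let the heap grow by 200% over the live set before the next GC,
	// instead of the default 100%. SetGCPercent returns the previous value.
	old := debug.SetGCPercent(200)
	fmt.Println("previous GOGC:", old)

	// debug.SetGCPercent(-1) would disable GC entirely, like GOGC=off.
}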

Start

malloc.go mgc.go

During allocation, the mallocgc function checks whether the calling goroutine should assist the GC and, after allocating the memory chunk, whether the GC trigger condition is met so that a new cycle should be started.

// malloc.go

// Allocate an object of size bytes.
// Small objects are allocated from the per-P cache's free lists.
// Large objects (> 32 kB) are allocated straight from the heap.
func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
	
	// assistG is the G to charge for this allocation, or nil if
	// GC is not currently active.
	var assistG *g
	if gcBlackenEnabled != 0 {
		// Charge the current user G for this allocation.
		assistG = getg()
		if assistG.m.curg != nil {
			assistG = assistG.m.curg
		}
		// Charge the allocation against the G. We'll account
		// for internal fragmentation at the end of mallocgc.
		assistG.gcAssistBytes -= int64(size)

		if assistG.gcAssistBytes < 0 {
			// This G is in debt. Assist the GC to correct
			// this before allocating. This must happen
			// before disabling preemption.
			gcAssistAlloc(assistG)
		}
	}
	
	
	// ... allocate the memory chunk for the object: tiny, small or large ...
	// nextFree is called inside this allocation path ...
	
	// Allocate black during GC.
	// All slots hold nil so no scanning is needed.
	// This may be racing with GC so do it atomically if there can be
	// a race marking the bit.
	if gcphase != _GCoff {
		gcmarknewobject(uintptr(x), size, scanSize)
	}
	
	if shouldhelpgc {
		if t := (gcTrigger{kind: gcTriggerHeap}); t.test() {
			gcStart(t)
		}
	}
}


// nextFree returns the next free object from the cached span if one is available.
// Otherwise it refills the cache with a span with an available object and
// returns that object along with a flag indicating that this was a heavy
// weight allocation. If it is a heavy weight allocation the caller must
// determine whether a new GC cycle needs to be started or if the GC is active
// whether this goroutine needs to assist the GC.
func (c *mcache) nextFree(spc spanClass) (v gclinkptr, s *mspan, shouldhelpgc bool) {
	s = c.alloc[spc]
	shouldhelpgc = false
	freeIndex := s.nextFreeIndex()
	if freeIndex == s.nelems {
		// The span is full.
		if uintptr(s.allocCount) != s.nelems {
			println("runtime: s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
			throw("s.allocCount != s.nelems && freeIndex == s.nelems")
		}
		c.refill(spc)
		// return `shouldhelpgc` to `mallocgc` and call `gcStart` there
		shouldhelpgc = true
		s = c.alloc[spc]

		freeIndex = s.nextFreeIndex()
	}
	// ... compute v from freeIndex and return
}

// mgc.go

// gcStart starts the GC. It transitions from _GCoff to _GCmark (if
// debug.gcstoptheworld == 0) or performs all of GC (if
// debug.gcstoptheworld != 0).
//
// This may return without performing this transition in some cases,
// such as when called on a system stack or with locks held.
func gcStart(trigger gcTrigger) {
	// In gcstoptheworld debug mode, upgrade the mode accordingly.
	// We do this after re-checking the transition condition so
	// that multiple goroutines that detect the heap trigger don't
	// start multiple STW GCs.
	mode := gcBackgroundMode
	// selected by the GODEBUG=gcstoptheworld env var:
	// 1: disable concurrent mark
	// 2: disable concurrent mark and concurrent sweep
	if debug.gcstoptheworld == 1 {
		mode = gcForceMode
	} else if debug.gcstoptheworld == 2 {
		mode = gcForceBlockMode
	}
	
	// ... GC preparation ...
	gcBgMarkStartWorkers()
	
	systemstack(gcResetMarkState)
	
	// Finish sweep before we start concurrent scan.
	systemstack(func() {
		finishsweep_m()
	})
	// clearpools before we start the GC. If we wait, the memory will not be
	// reclaimed until the next GC cycle.
	clearpools()

	work.cycles++

	gcController.startCycle()
	work.heapGoal = memstats.next_gc
	
	// synchronized block mode
	// In STW mode, disable scheduling of user Gs. This may also
	// disable scheduling of this goroutine, so it may block as
	// soon as we start the world again.
	if mode != gcBackgroundMode {
		schedEnableUser(false)
	}

	
	// ... gc(mode) ... (background mode -> concurrent mode)
		
	// synchronized block mode
	// In STW mode, we could block the instant systemstack
	// returns, so don't do anything important here. Make sure we
	// block rather than returning to user code.
	if mode != gcBackgroundMode {
		Gosched()
	}
	
}


// gcBgMarkStartWorkers prepares background mark worker goroutines.
// These goroutines will not run until the mark phase, but they must
// be started while the work is not stopped and from a regular G
// stack. The caller must hold worldsema.
func gcBgMarkStartWorkers() {
	// Background marking is performed by per-P G's. Ensure that
	// each P has a background GC G.
	for _, p := range allp {
		if p.gcBgMarkWorker == 0 {
			go gcBgMarkWorker(p)
			notetsleepg(&work.bgMarkReady, -1)
			noteclear(&work.bgMarkReady)
		}
	}
}

gcBackgroundMode -> concurrent mode

gcForceMode or gcForceBlockMode -> synchronized (STW) mode

set by the GODEBUG=gcstoptheworld env var:

  • gcForceMode: concurrent mark disabled (gcstoptheworld=1)
  • gcForceBlockMode: concurrent mark and concurrent sweep disabled (gcstoptheworld=2)
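
For example, assuming a built binary named ./test, the synchronized modes can be selected without rebuilding:

GODEBUG="gcstoptheworld=1" ./test   # gcForceMode: concurrent mark disabled
GODEBUG="gcstoptheworld=2" ./test   # gcForceBlockMode: concurrent mark and concurrent sweep disabled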

GC cycle

mgc.go

// mgc.go

// GC runs a garbage collection and blocks the caller until the
// garbage collection is complete. It may also block the entire
// program.
func GC() {
	// We consider a cycle to be: sweep termination, mark, mark
	// termination, and sweep. This function shouldn't return
	// until a full cycle has been completed, from beginning to
	// end. Hence, we always want to finish up the current cycle
	// and start a new one. That means:
	//
	// 1. In sweep termination, mark, or mark termination of cycle
	// N, wait until mark termination N completes and transitions
	// to sweep N.
	//
	// 2. In sweep N, help with sweep N.
	//
	// At this point we can begin a full cycle N+1.
	//
	// 3. Trigger cycle N+1 by starting sweep termination N+1.
	//
	// 4. Wait for mark termination N+1 to complete.
	//
	// 5. Help with sweep N+1 until it's done.
	//
	// This all has to be written to deal with the fact that the
	// GC may move ahead on its own. For example, when we block
	// until mark termination N, we may wake up in cycle N+2.

	// Wait until the current sweep termination, mark, and mark
	// termination complete.
	n := atomic.Load(&work.cycles)
	gcWaitOnMark(n)

	// We're now in sweep N or later. Trigger GC cycle N+1, which
	// will first finish sweep N if necessary and then enter sweep
	// termination N+1.
	gcStart(gcTrigger{kind: gcTriggerCycle, n: n + 1})

	// Wait for mark termination N+1 to complete.
	gcWaitOnMark(n + 1)

	// Finish sweep N+1 before returning. We do this both to
	// complete the cycle and because runtime.GC() is often used
	// as part of tests and benchmarks to get the system into a
	// relatively stable and isolated state.
	for atomic.Load(&work.cycles) == n+1 && sweepone() != ^uintptr(0) {
		sweep.nbgsweep++
		Gosched()
	}

	// Callers may assume that the heap profile reflects the
	// just-completed cycle when this returns (historically this
	// happened because this was a STW GC), but right now the
	// profile still reflects mark termination N, not N+1.
	//
	// As soon as all of the sweep frees from cycle N+1 are done,
	// we can go ahead and publish the heap profile.
	//
	// First, wait for sweeping to finish. (We know there are no
	// more spans on the sweep queue, but we may be concurrently
	// sweeping spans, so we have to wait.)
	for atomic.Load(&work.cycles) == n+1 && atomic.Load(&mheap_.sweepers) != 0 {
		Gosched()
	}

	// Now we're really done with sweeping, so we can publish the
	// stable heap profile. Only do this if we haven't already hit
	// another mark termination.
	mp := acquirem()
	cycle := atomic.Load(&work.cycles)
	if cycle == n+1 || (gcphase == _GCmark && cycle == n+2) {
		mProf_PostSweep()
	}
	releasem(mp)
}
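
As the comments above note, runtime.GC() is mostly useful in tests and benchmarks to bring the heap into a relatively stable, isolated state before measuring. A minimal sketch (a hypothetical benchmark in a _test.go file):

package main

import (
	"runtime"
	"testing"
)

func BenchmarkWithStableHeap(b *testing.B) {
	// Force a full GC cycle so garbage left over from setup does not
	// distort the allocation figures measured below.
	runtime.GC()
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = make([]byte, 1024)
	}
}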

Mark

mgcmark.go

Concurrent mark is divided into two procedures:

  1. Scanning: traverse the relevant memory regions, find reachable objects through their pointers, mark them grey, and put them into a queue.
  2. Marking: take grey objects out of the queue, mark the objects they reference as grey, and mark the grey objects themselves black.

Scanning
// mgcmark.go

// gcMarkRootPrepare queues root scanning jobs (stacks, globals, and
// some miscellany) and initializes scanning-related state.
//
// The caller must have called gcCopySpans().
//
// The world must be stopped.
//
//go:nowritebarrier
func gcMarkRootPrepare() {
	// ... see source code comments
}
Marking
// mgcmark.go

// markroot scans the i'th root.
//
// Preemption must be disabled (because this uses a gcWork).
//
// nowritebarrier is only advisory here.
//
//go:nowritebarrier
func markroot(gcw *gcWork, i uint32) {
	// ... see source code comments
}


// scanblock scans b as scanobject would, but using an explicit
// pointer bitmap instead of the heap bitmap.
//
// This is used to scan non-heap roots, so it does not update
// gcw.bytesMarked or gcw.scanWork.
//
// If stk != nil, possible stack pointers are also reported to stk.putPtr.
//go:nowritebarrier
func scanblock(b0, n0 uintptr, ptrmask *uint8, gcw *gcWork, stk *stackScanState) {
	// Use local copies of original parameters, so that a stack trace
	// due to one of the throws below shows the original block
	// base and extent.
	b := b0
	n := n0

	for i := uintptr(0); i < n; {
		// Find bits for the next word.
		bits := uint32(*addb(ptrmask, i/(sys.PtrSize*8)))
		// no mark, ignore
		if bits == 0 {
			i += sys.PtrSize * 8
			continue
		}
		for j := 0; j < 8 && i < n; j++ {
			// have `bitPointer` mark
			if bits&1 != 0 {
				// Same work as in scanobject; see comments there.
				// read ptr content, target object address
				p := *(*uintptr)(unsafe.Pointer(b + i))
				// make sure ptr is valid
				if p != 0 {
					if obj, span, objIndex := findObject(p, b, i); obj != 0 {
						// mark as grey object
						greyobject(obj, b, i, span, gcw, objIndex)
					} else if stk != nil && p >= stk.stack.lo && p < stk.stack.hi {
						stk.putPtr(p)
					}
				}
			}
			bits >>= 1
			i += sys.PtrSize
		}
	}
}

// obj is the start of an object with mark mbits.
// If it isn't already marked, mark it and enqueue into gcw.
// base and off are for debugging only and could be removed.
//
// See also wbBufFlush1, which partially duplicates this logic.
//
//go:nowritebarrierrec
func greyobject(obj, base, off uintptr, span *mspan, gcw *gcWork, objIndex uintptr) {
	
	// mark unmarked object as grey object and put it into the queue
	if hbits.isCheckmarked(span.elemsize) {
		return // marked, do nothing and return
	}
	hbits.setCheckmarked(span.elemsize) // mark unmarked object
	
	// Queue the obj for scanning. The PREFETCH(obj) logic has been removed but
	// seems like a nice optimization that can be added back in.
	// There needs to be time between the PREFETCH and the use.
	// Previously we put the obj in an 8 element buffer that is drained at a rate
	// to give the PREFETCH time to do its work.
	// Use of PREFETCHNTA might be more appropriate than PREFETCH
	if !gcw.putFast(obj) {
		gcw.put(obj)
	}
}

Finally, after all these scanning steps, valid pointers are identified by checking the pointer bitmap in the scanblock function; their target objects are marked grey and added to the queue.

work is built from purpose-designed, high-performance queues: the global queues (work.full, work.empty) and the local queues (gcWork). A gcWork is made up of workbufs. When a workbuf fills up with objects it is transferred to work.full, and when it becomes empty it is transferred to work.empty; the next time a workbuf is needed, one is taken from work.empty for reuse.

mgc.go

// mgc.go

var work struct {
	full  lfstack          // lock-free list of full blocks workbuf
	empty lfstack          // lock-free list of empty blocks workbuf
	pad0  cpu.CacheLinePad // prevents false-sharing between full/empty and nproc/nwait

	// lock-free list of partially filled blocks workbuf
	wbufSpans struct {
		lock mutex
		// free is a list of spans dedicated to workbufs, but
		// that don't currently contain any workbufs.
		free mSpanList
		// busy is a list of all spans containing workbufs on
		// one of the workbuf lists.
		busy mSpanList
	}
	
	// assistQueue is a queue of assists that are blocked because
	// there was neither enough credit to steal or enough work to
	// do.
	assistQueue struct {
		lock mutex
		q    gQueue
	}

	// sweepWaiters is a list of blocked goroutines to wake when
	// we transition from mark termination to sweep.
	sweepWaiters struct {
		lock mutex
		list gList
	}
}

gcWork is designed to store grey objects. A workbuf is a lock-free stack node (an lfnode of an lfstack) and is itself a buffer container (its obj array member).

lfstack.go

// lfstack.go

// lfstack is the head of a lock-free stack.
//
// The zero value of lfstack is an empty list.
//
// This stack is intrusive. Nodes must embed lfnode as the first field.
//
// The stack does not keep GC-visible pointers to nodes, so the caller
// is responsible for ensuring the nodes are not garbage collected
// (typically by allocating them from manually-managed memory).
type lfstack uint64

func (head *lfstack) push(node *lfnode) {
	// accumulated counter
	node.pushcnt++
	
	// use pointer + pushcnt to get the unique serial number
	new := lfstackPack(node, node.pushcnt)
	
	// reversely unpack the serial number for error checking
	if node1 := lfstackUnpack(new); node1 != node {
		print("runtime: lfstack.push invalid packing: node=", node, " cnt=", hex(node.pushcnt), " packed=", hex(new), " -> node=", node1, "\n")
		throw("lfstack.push")
	}
	
	// endless-loop work as spinning, retry until succeed
	for {
		// atomic load original head node serial number (multi-core)
		old := atomic.Load64((*uint64)(head))
		
		// make current node as head
		// this operation will NOT affect original stack until succeed
		node.next = old
		
		// use CAS instruction to replace original head
		// if failed to replace, keep trying with this endless-loop
		if atomic.Cas64((*uint64)(head), old, new) {
			break
		}
	}
}

func (head *lfstack) pop() unsafe.Pointer {
	for {
		// atomic load stack head
		old := atomic.Load64((*uint64)(head))
		if old == 0 {
			return nil
		}
		
		// unpack the serial number to get the pointer
		node := lfstackUnpack(old)
		
		// revise stack head by CAS instruction
		next := atomic.Load64(&node.next)
		if atomic.Cas64((*uint64)(head), old, next) {
			return unsafe.Pointer(node)
		}
	}
}

func (head *lfstack) empty() bool {
	return atomic.Load64((*uint64)(head)) == 0
}

If the CAS instruction operated only on the old pointer address, and that address happened to be reused, the CAS could succeed on stale data; this is the classic ABA problem. By combining the pointer address with a push counter into a unique serial number, which effectively realizes a versioned (double) CAS, the problem is avoided.

lfstack_64bit.go

func lfstackPack(node *lfnode, cnt uintptr) uint64 {
	// ...
	return uint64(uintptr(unsafe.Pointer(node)))<<(64-addrBits) | uint64(cnt&(1<<cntBits-1))
}
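
A standalone sketch of why the packed pointer+counter defeats ABA (not the runtime's code; the 16-bit counter split is only illustrative):

package main

import (
	"fmt"
	"sync/atomic"
)

const cntBits = 16 // hypothetical split: low 16 bits hold the push counter

// pack combines a node address with its push counter into one CAS-able word.
func pack(addr, cnt uint64) uint64 { return addr<<cntBits | cnt&(1<<cntBits-1) }

func main() {
	var head uint64
	atomic.StoreUint64(&head, pack(0x1000, 1))

	// A slow thread snapshots the head: address 0x1000, counter 1.
	stale := pack(0x1000, 1)

	// Meanwhile the node at 0x1000 is popped and pushed again: same
	// address, but the push counter has advanced to 2.
	atomic.StoreUint64(&head, pack(0x1000, 2))

	// The stale CAS now fails, because the packed word differs even
	// though the raw address is identical. With a bare pointer CAS it
	// would have succeeded and corrupted the stack.
	fmt.Println(atomic.CompareAndSwapUint64(&head, stale, pack(0x2000, 1))) // false
}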

gcWork with its workbufs behaves much like the memory allocator's cache: the local cache is used first, and only when a threshold is reached is work exchanged with the global lists. This preserves performance by avoiding direct operations on the global queue. And when work is fetched from the global queue, a whole batch is taken at once (a workbuf with objects inside).

mgcwork.go

// mgcwork.go

type gcWork struct {
	// wbuf1 and wbuf2 are the primary and secondary work buffers.
	//
	// This can be thought of as a stack of both work buffers'
	// pointers concatenated. When we pop the last pointer, we
	// shift the stack up by one work buffer by bringing in a new
	// full buffer and discarding an empty one. When we fill both
	// buffers, we shift the stack down by one work buffer by
	// bringing in a new empty buffer and discarding a full one.
	// This way we have one buffer's worth of hysteresis, which
	// amortizes the cost of getting or putting a work buffer over
	// at least one buffer of work and reduces contention on the
	// global work lists.
	//
	// wbuf1 is always the buffer we're currently pushing to and
	// popping from and wbuf2 is the buffer that will be discarded
	// next.
	//
	// Invariant: Both wbuf1 and wbuf2 are nil or neither are.
	wbuf1, wbuf2 *workbuf

	// Bytes marked (blackened) on this gcWork. This is aggregated
	// into work.bytesMarked by dispose.
	bytesMarked uint64

	// Scan work performed on this gcWork. This is aggregated into
	// gcController by dispose and may also be flushed by callers.
	scanWork int64

	// flushedWork indicates that a non-empty work buffer was
	// flushed to the global work list since the last gcMarkDone
	// termination check. Specifically, this indicates that this
	// gcWork may have communicated work to another gcWork.
	flushedWork bool

	// pauseGen causes put operations to spin while pauseGen ==
	// gcWorkPauseGen if debugCachedWork is true.
	pauseGen uint32

	// putGen is the pauseGen of the last putGen.
	putGen uint32

	// pauseStack is the stack at which this P was paused if
	// debugCachedWork is true.
	pauseStack [16]uintptr
}

// Most of the methods of gcWork are go:nowritebarrierrec because the
// write barrier itself can invoke gcWork methods but the methods are
// not generally re-entrant. Hence, if a gcWork method invoked the
// write barrier while the gcWork was in an inconsistent state, and
// the write barrier in turn invoked a gcWork method, it could
// permanently corrupt the gcWork.

func (w *gcWork) init() {
	w.wbuf1 = getempty()
	wbuf2 := trygetfull()
	if wbuf2 == nil {
		wbuf2 = getempty()
	}
	w.wbuf2 = wbuf2
}

// put enqueues a pointer for the garbage collector to trace.
// obj must point to the beginning of a heap object or an oblet.
//go:nowritebarrierrec
func (w *gcWork) put(obj uintptr) {
	// ...
}

// dispose returns any cached pointers to the global queue.
// The buffers are being put on the full queue so that the
// write barriers will not simply reacquire them before the
// GC can inspect them. This helps reduce the mutator's
// ability to hide pointers during the concurrent mark phase.
//
//go:nowritebarrierrec
func (w *gcWork) dispose() {
	// ...
}

// balance moves some work that's cached in this gcWork back on the
// global queue.
//go:nowritebarrierrec
func (w *gcWork) balance() {
	if w.wbuf1 == nil {
		return
	}
	if wbuf := w.wbuf2; wbuf.nobj != 0 {
		w.checkPut(0, wbuf.obj[:wbuf.nobj])
		putfull(wbuf)
		w.flushedWork = true
		w.wbuf2 = getempty()
	} else if wbuf := w.wbuf1; wbuf.nobj > 4 {
		w.checkPut(0, wbuf.obj[:wbuf.nobj])
		w.wbuf1 = handoff(wbuf)
		w.flushedWork = true // handoff did putfull
	} else {
		return
	}
	
	// We flushed a buffer to the full list, so wake a worker.
	if gcphase == _GCmark {
		gcController.enlistWorker()
	}
}

//go:nowritebarrier
func handoff(b *workbuf) *workbuf {
	// Make new buffer with half of b's pointers.
	// get empty workbuf from work.empty
	b1 := getempty()
	n := b.nobj / 2
	b.nobj -= n
	b1.nobj = n
	memmove(unsafe.Pointer(&b1.obj[0]), unsafe.Pointer(&b.obj[b.nobj]), uintptr(n)*unsafe.Sizeof(b1.obj[0]))

	// Put b on full list - let first half of b get stolen.
	putfull(b)
	return b1
}


// Internally, the GC work pool is kept in arrays in work buffers.
// The gcWork interface caches a work buffer until full (or empty) to
// avoid contending on the global work buffer lists.

type workbufhdr struct {
	node lfnode // must be first
	nobj int
}

//go:notinheap
type workbuf struct {
	workbufhdr
	// account for the above fields
	obj [(_WorkbufSize - unsafe.Sizeof(workbufhdr{})) / sys.PtrSize]uintptr
}

Sweep

mgcsweep.go mgc.go

Concurrent sweep is essentially an endless loop (for {...}). It starts sweeping when it is woken up: by traversing all spans, it triggers the memory allocator's sweep operation (mspan.sweep). After the sweep is complete, it sleeps until the next task.

// mgcsweep.go

// Garbage collector: sweeping

// The sweeper consists of two different algorithms:
//
// * The object reclaimer finds and frees unmarked slots in spans. It
//   can free a whole span if none of the objects are marked, but that
//   isn't its goal. This can be driven either synchronously by
//   mcentral.cacheSpan for mcentral spans, or asynchronously by
//   sweepone from the list of all in-use spans in mheap_.sweepSpans.
//
// * The span reclaimer looks for spans that contain no marked objects
//   and frees whole spans. This is a separate algorithm because
//   freeing whole spans is the hardest task for the object reclaimer,
//   but is critical when allocating new spans. The entry point for
//   this is mheap_.reclaim and it's driven by a sequential scan of
//   the page marks bitmap in the heap arenas.
//
// Both algorithms ultimately call mspan.sweep, which sweeps a single
// heap span.

var sweep sweepdata
// concurrent sweep
// State of background sweep.
type sweepdata struct {
	lock    mutex
	g       *g
	parked  bool
	started bool

	nbgsweep    uint32
	npausesweep uint32
}

func bgsweep(c chan int) {
	// current goroutine
	sweep.g = getg()

	lock(&sweep.lock)
	sweep.parked = true
	
	// make gcenable exit
	c <- 1
	
	// sleep, wait for gcSweep awaken it
	goparkunlock(&sweep.lock, waitReasonGCSweepWait, traceEvGoBlock, 1)

	// loop to sweep all span
	for {
		for sweepone() != ^uintptr(0) {
			sweep.nbgsweep++
			// call scheduler to avoid occupying cpu for a long time
			Gosched()
		}
		for freeSomeWbufs(true) {
			Gosched()
		}
		lock(&sweep.lock)
		if !isSweepDone() {
			// This can happen if a GC runs between
			// gosweepone returning ^0 above
			// and the lock being acquired.
			unlock(&sweep.lock)
			continue
		}
		
		// sweep complete, sleep until waken up again
		sweep.parked = true
		goparkunlock(&sweep.lock, waitReasonGCSweepWait, traceEvGoBlock, 1)
	}
}

func sweepone() uintptr {
	_g_ := getg()
	// ... increment _g_.m.locks and mheap_.sweepers; save sweepRatio for tracing (elided)
	
	// Find a span to sweep.
	var s *mspan
	sg := mheap_.sweepgen
	for {
		s = mheap_.sweepSpans[1-sg/2%2].pop()
		
		// complete when reach nil
		if s == nil {
			atomic.Store(&mheap_.sweepdone, 1)
			break
		}
		
		// skip idle span (s.sweepgen == sg)
		//if s.state != mSpanInUse {
		//	s.sweepgen = sg
		//	continue
		//}
		if s.state != mSpanInUse {
			// This can happen if direct sweeping already
			// swept this span, but in that case the sweep
			// generation should always be up-to-date.
			if !(s.sweepgen == sg || s.sweepgen == sg+3) {
				print("runtime: bad span s.state=", s.state, " s.sweepgen=", s.sweepgen, " sweepgen=", sg, "\n")
				throw("non in-use span in unswept list")
			}
			continue
		}
		
		// skip spans which already been sweep or sweeping
		//if s.sweepgen != sg-2 || !cas(&s.sweepgen, sg-2, sg-1) {
		//	continue
		//}
		// find the span which need sweep
		if s.sweepgen == sg-2 && atomic.Cas(&s.sweepgen, sg-2, sg-1) {
			break
		}
	}
	
	// Sweep the span we found.
	npages := ^uintptr(0)
	if s != nil {
		npages = s.npages
		// call memory allocator's sweep method (mspan.sweep)
		if s.sweep(false) {
			// Whole span was freed. Count it toward the
			// page reclaimer credit since these pages can
			// now be used for span allocation.
			atomic.Xadduintptr(&mheap_.reclaimCredit, npages)
		} else {
			// Span is still in-use, so this returned no
			// pages to the heap and the span needs to
			// move to the swept in-use list.
			npages = 0
		}
	}

	// Decrement the number of active sweepers and if this is the
	// last one print trace information.
	if atomic.Xadd(&mheap_.sweepers, -1) == 0 && atomic.Load(&mheap_.sweepdone) != 0 {
		if debug.gcpacertrace > 0 {
			print("pacer: sweep done at heap size ", memstats.heap_live>>20, "MB; allocated ", (memstats.heap_live-mheap_.sweepHeapLiveBasis)>>20, "MB during sweep; swept ", mheap_.pagesSwept, " pages at ", sweepRatio, " pages/byte\n")
		}
	}
	_g_.m.locks--
	return npages
}

// Sweep frees or collects finalizers for blocks not marked in the mark phase.
// It clears the mark bits in preparation for the next GC round.
// Returns true if the span was returned to heap.
// If preserve=true, don't return it to heap nor relink in mcentral lists;
// caller takes care of it.
func (s *mspan) sweep(preserve bool) bool {
	// ...
}
// mgc.go

// gcenable is called after the bulk of the runtime initialization,
// just before we're about to start letting user code run.
// It kicks off the background sweeper goroutine, the background
// scavenger goroutine, and enables GC.
func gcenable() {
	// Kick off sweeping and scavenging.
	c := make(chan int, 2)
	go bgsweep(c)
	go bgscavenge(c)
	<-c
	<-c
	memstats.enablegc = true // now that runtime is initialized, GC is okay
}

Monitor

proc.go

Mock condition: the service restarts, many clients connect, and a large number of objects is allocated, which pushes next_gc up to an extremely high value.

However, after the service returns to normal, the number of live objects is far smaller than this next_gc threshold, so GC is not triggered for a long time. The process is therefore left holding a large number of white objects that cannot be reclaimed, which amounts to an implicit memory leak.

The same situation can be caused by an algorithm that creates a large number of template objects.

=> simulated with large_mem_not_gc.go
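
The gist does not include large_mem_not_gc.go itself; a plausible sketch of what it might look like (sizes and names are assumptions) is:

package main

import (
	"fmt"
	"runtime"
	"time"
)

// test allocates a burst of objects quickly, pushing next_gc up.
//go:noinline
func test() [][]byte {
	data := make([][]byte, 0, 20)
	for i := 0; i < 20; i++ {
		data = append(data, make([]byte, 1<<20)) // 1 MB each
	}
	return data
}

func main() {
	data := test() // next_gc is now far above the live set
	data = nil     // drop the references: the objects are garbage now
	_ = data

	for {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		fmt.Println(time.Now().Format("15:04:05"), ms.HeapAlloc>>20, "MB")
		time.Sleep(30 * time.Second)
	}
}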

➜  GC git:(master) ✗ make build                                         
go build -gcflags "-l" -o test large_mem_not_gc.go
➜  GC git:(master) ✗ make trace
GODEBUG="gctrace=1" ./test
gc 1 @0.001s 5%: 0.055+0.29+0.004 ms clock, 0.44+0.26/0.075/0.13+0.033 ms cpu, 4->4->3 MB, 5 MB goal, 8 P
gc 2 @0.002s 4%: 0.002+0.57+0.006 ms clock, 0.019+0.067/0.036/0.67+0.053 ms cpu, 7->8->8 MB, 8 MB goal, 8 P
gc 3 @0.003s 4%: 0.002+0.61+0.003 ms clock, 0.020+0.59/0.054/0.16+0.025 ms cpu, 15->15->15 MB, 17 MB goal, 8 P
16:56:08 30 MB
16:56:38 30 MB
16:57:08 30 MB
16:57:38 30 MB
16:58:08 30 MB
GC forced
gc 4 @120.007s 0%: 0.024+0.24+0.008 ms clock, 0.19+0/0.17/0.30+0.066 ms cpu, 20->20->0 MB, 30 MB goal, 8 P
scvg0: inuse: 0, idle: 63, sys: 63, released: 0, consumed: 63 (MB)
16:58:38 4 MB
16:59:08 4 MB

The test function is used to simulate allocating a large number of objects rapidly.

The result shows that for quite a long time after the allocation ends, NO GC is triggered.

Only when forcegc runs is next_gc brought back to a normal value; forced GC is the last line of defense.

The monitor service sysmon checks the GC status; if no GC has been triggered for more than 2 minutes, it forces one to run.

// proc.go

// forcegcperiod is the maximum time in nanoseconds between garbage
// collections. If we go this long without a garbage collection, one
// is forced to run.
//
// This is a variable for testing purposes. It normally doesn't change.
var forcegcperiod int64 = 2 * 60 * 1e9

// start forcegc helper goroutine
func init() {
	go forcegchelper()
}

func forcegchelper() {
	forcegc.g = getg()
	for {
		lock(&forcegc.lock)
		if forcegc.idle != 0 {
			throw("forcegc: phase error")
		}
		atomic.Store(&forcegc.idle, 1)
		// sleep and wait for being awaken
		goparkunlock(&forcegc.lock, waitReasonForceGGIdle, traceEvGoBlock, 1)
		// this goroutine is explicitly resumed by sysmon
		if debug.gctrace > 0 {
			println("GC forced")
		}
		// Time-triggered, fully concurrent.
		gcStart(gcTrigger{kind: gcTriggerTime, now: nanotime()})
	}
}

// Always runs without a P, so write barriers are not allowed.
//
//go:nowritebarrierrec
func sysmon() {
	// ...
	
	for {
		// ...
		
		// check if we need to force a GC
		if t := (gcTrigger{kind: gcTriggerTime, now: now}); t.test() && atomic.Load(&forcegc.idle) != 0 {
			lock(&forcegc.lock)
			forcegc.idle = 0
			var list gList
			list.push(forcegc.g)
			// put goroutine of forcegc to ready-to-run list
			injectglist(&list)
			unlock(&forcegc.lock)
		}
	}
}

The forcegc goroutine follows the same pattern as the bgsweep goroutine shown earlier: an endless loop that sleeps and waits to be woken up.

Memory statistics

mstats and MemStats

mstats.go

// ReadMemStats populates m with memory allocator statistics.
//
// The returned memory allocator statistics are up to date as of the
// call to ReadMemStats. This is in contrast with a heap profile,
// which is a snapshot as of the most recently completed garbage
// collection cycle.
func ReadMemStats(m *MemStats) {
	stopTheWorld("read mem stats")

	systemstack(func() {
		readmemstats_m(m)
	})

	startTheWorld()
}

ReadMemStats involves a stop-the-world operation, so both how often and from where it is called should be controlled.
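
A minimal sketch of sampling the statistics on a fixed interval rather than per request (the 30-second period is only an example):

package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		var ms runtime.MemStats
		// Each call briefly stops the world, so keep the sampling rate low.
		runtime.ReadMemStats(&ms)
		log.Printf("heap_inuse=%d MB heap_idle=%d MB num_gc=%d",
			ms.HeapInuse>>20, ms.HeapIdle>>20, ms.NumGC)
	}
}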

bash output example:

                 heap_idle        heap_released
                 |                |
scvg0: inuse: 3, idle: 1, sys: 5, released: 0, consumed: 5 (MB)
       |                  |                    |
       heap_inuse         heap_sys             heap_sys - heap_released

GC Traces

GODEBUG=gctrace=1

Garbage Collection In Go : Part II - GC Traces
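
For reference, the fields of a gctrace line (per the runtime package documentation), using gc 4 from the trace above as an example:

  • gc 4: the GC number, incremented at each collection
  • @120.007s: time in seconds since program start
  • 0%: percentage of CPU time spent in GC since program start
  • 0.024+0.24+0.008 ms clock: wall-clock times for the STW sweep termination, concurrent mark and scan, and STW mark termination phases
  • 0.19+0/0.17/0.30+0.066 ms cpu: the corresponding CPU times; the mark/scan part is split into assist, background and idle GC time
  • 20->20->0 MB: heap size at GC start, at GC end, and the live heap
  • 30 MB goal: the goal heap size (next_gc)
  • 8 P: number of processors used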
