Understanding the current GC, conclusion

by Leandro Lucarella on 2009- 04- 11 19:36 (updated on 2009- 04- 11 19:36)
tagged book, conclusion, d, dgc, druntime, en, gc, mark-sweep, understanding the current gc - with 0 comment(s)

Now that I know fairly deeply the implementation details about the current GC, I can compare it to the techniques exposed in the GC Book.

Tri-colour abstraction

Since most literature speaks in terms of the tri-colour abstraction, now it's a good time to translate how this is mapped to the D GC implementation.

As we all remember, each cell (bin) in D has several bits associated to them. Only 3 are interesting in this case:

mark
scan
free (freebits)

So, how we can translate this bits into the tri-colour abstraction?

Black

Cells that were marked and scanned (there are no pointer to follow) are coloured black. In D this cells has the bits:

mark = 1
scan = 0
free = 0

Grey

Cells that has been marked, but they have pointers to follow in them are coloured grey. In D this cells has the bits:

mark = 1
scan = 1
free = 0

White

Cells that has not been visited at all are coloured white (all cells should be colored white before the marking starts). In D this cells has the bits:

mark = 0
scan = X

Or:

free = 1

The scan bit is not important in this case (but in D it should be 0 because scan bits are cleared before the mark phase starts). The free bit is used for the cells in the free list. They are marked before other cells get marked with bits mark=1 and free=1. This way the cells in the free list don't get scanned (mark=1, scan=0) and are not confused with black cells (free=1), so they can be kept in the free list after the mark phase is done. I think this is only necessary because the free list is regenerated.

Improvements

Here is a summary of improvements proposed by the GC Book, how the current GC is implemented in regards to this improvements and what optimization opportunities can be considered.

Mark stack

The simplest version of the marking algorithm is recursive:

mark(cell)
    if not cell.marked
        cell.marked = true
        for child in cell.children
            mark(child)

The problem here is, of course, stack overflow for very deep heap graphs (and the space cost).

The book proposes using a marking stack instead, and several ways to handle stack overflow, but all these are only useful for relieving the symptom, they are not a cure.

As a real cure, pointer reversal is proposed. The idea is to use the very same pointers to store the mark stack. This is constant in space, and needs only one pass through the help, so it's a very tempting approach. The bad side is increased complexity and probably worse cache behavior (writes to the heap dirties the entire heap, and this can kill the cache).

Current implementation

The D GC implementation does none of this. Instead it completes the mark phase by traversing the heap (well, not really the heap, only the bit sets) in several passes, until no more data to scan can be found (all cells are painted black or white). While the original algorithm only needs one pass through the heap, this one need several. This trades space (and the complexity of stack overflow handling) for time.

Optimization opportunities

This seems like a fair trade-off, but alternatives can be explored.

Bitmap marking

The simplest mark-sweep algorithm suggests to store marking bits in the very own cells. This can be very bad for the cache because a full traversal should be done across the entire heap. As an optimization, a bitmap can be used, because they are much small and much more likely to fit in the cache, marking can be greatly improved using them.

Current implementation

Current implementation uses bitmaps for mark, scan, free and other bits. The bitmap implementation is GCBits and is a general approach.

The bitmap stores a bit for each 16 bytes chunks, no matter what cell size (Bins, or bin size) is used. This means that 4096/16 = 256 bits (32 bytes) are used for each bitmap for every page in the GC heap. Being 5 bitmaps (mark, scan, freebits, finals and noscan), the total spaces per page is 160 bytes. This is a 4% space overhead in bits only.

This wastes some space for larger cells.

Optimization opportunities

The space overhead of bitmaps seems to be fairly small, but each byte counts for the mark phase because of the cache. A heap with 64 MiB uses 2.5 MiB in bitmaps. Modern processors come with about that much cache, and a program using 64 MiB doesn't seems very rare. So we are pushing the limits here if we want our bitmaps to fit in the cache to speed up the marking phase.

I think there is a little room for improvement here. A big object, lets say it's 8 MiB long, uses 640 KiB of memory for bitmaps it doesn't need. I think some specialized bitmaps can be used for large object, for instance, to minimize the bitmaps space overhead.

There are some overlapping bits too. mark=0 and scan=1 can never happen for instance. I think it should be possible to use that combination for freebits, and get rid of an entire bitmap.

Lazy sweep

The sweep phase is done generally right after the mark phase. Since normally the collection is triggered by an allocation, this can be a little disrupting for the thread that made that allocation, that has to absorb all the sweeping itself.

Another alternative is to do the sweeping incrementally, by doing it lazy. Instead of finding all the white cells and linking them to the free list immediately, this is done on each allocation. If there is no free cells in the free list, a little sweeping is done until new space can be found.

This can help minimize pauses for the allocating thread.

Current implementation

The current implementation does an eager sweeping.

Optimization opportunities

The sweeping phase can be made lazy. The only disadvantage I see is (well, besides extra complexity) that could make the heap more likely to be fragmented, because consecutive requests are not necessarily made on the same page (a free() call can add new cells from another page to the free list), making the heap more sparse, (which can be bad for the cache too). But I think this is only possible if free() is called explicitly, and this should be fairly rare in a garbage collected system, so I guess this could worth trying.

Lazy sweeping helps the cache too, because in the sweep phase, you might trigger cache misses when linking to the free list. When sweeping lazily, the cache miss is delayed until it's really necessary (the cache miss will happen anyway when you are allocating the free cell).

Conclusion

Even when the current GC is fairly optimized, there is plenty of room for improvements, even preserving the original global design.

Understanding the current GC, the end

by Leandro Lucarella on 2009- 04- 11 04:46 (updated on 2009- 04- 15 04:10)
tagged d, dgc, druntime, en, gc, mark, mark-sweep, sweep, understanding the current gc - with 0 comment(s)

In this post I will take a closer look at the Gcx.mark() and Gcx.fullcollect() functions.

This is a simplified version of the mark algorithm:

mark(from, to)
    changes = 0
    while from < to
        pool = findPool(from)
        offset = from - pool.baseAddr
        page_index = offset / PAGESIZE
        bin_size = pool.pagetable[page_index]
        bit_index = find_bit_index(bin_size, pool, offset)
        if not pool.mark.test(bit_index)
            pool.mark.set(bit_index)
            if not pool.noscan.test(bit_index)
                pool.scan.set(bit_index)
                changes = true
        from++
        anychanges |= changes // anychanges is global

In the original version, there are some optimizations and the find_bit_index() function doesn't exist (it does some bit masking to find the right bit index for the bit set). But everything else is pretty much the same.

So far, is evident that the algorithm don't mark the whole heap in one step, because it doesn't follow pointers. It just marks a consecutive chunk of memory, assuming that pointers can be at any place in that memory, as long as they are aligned (from increments in word-sized steps).

fullcollect() is the one in charge of following pointers, and marking chunks of memory. It does it in an iterative way (that's why mark() informs about anychanges (when new pointer should be followed to mark them, or, speaking in the tri-colour abstraction, when grey cells are found).

fullcollect() is huge, so I'll split it up in smaller pieces for the sake of clarity. Let's see what are the basic blocks (see the second part of this series):

fullcollect()
    thread_suspendAll()
    clear_mark_bits()
    mark_free_list()
    rt_scanStaticData(mark)
    thread_scanAll(mark, stackTop)
    mark_root_set()
    mark_heap()
    thread_resumeAll()
    sweep()

Generaly speaking, all the functions that have some CamelCasing are real functions and the ones that are all_lowercase and made up by me.

Let's see each function.

thread_suspendAll()

This is part of the threads runtime (found in src/common/core/thread.d). A simple peak at it shows it uses SIGUSR1 to stop the thread. When the signal is caught it pushes all the registers into the stack to be sure any pointers there are scanned in the future. The threads waits for SIGUSR2 to resume.

clear_mark_bits()

foreach pool in pooltable
    pool.mark.zero()
    pool.scan.zero()
    pool.freebits.zero()

mark_free_list()

foreach n in B_16 .. B_PAGE
    foreach node in bucket
        pool = findPool(node)
        pool.freebits.set(find_bit_index(pool, node))
        pool.mark.set(find_bit_index(pool, node))

rt_scanStaticData(mark)

This function, as the name suggests, uses the provided mark function callback to scan the program's static data.

thread_scanAll(mark, stackTop)

This is another threads runtime function, used to mark the suspended threads stacks. I does some calculation about the stack bottom and top, and calls mark(bottom, top), so at this point we have marked all reachable memory from the stack(s).

mark_root_set()

mark(roots, roots + nroots)
foreach range in ranges
    mark(range.pbot, range.ptop)

mark_heap()

This is where most of the marking work is done. The code is really ugly, very hard to read (mainly because of bad variable names) but what it does it's relatively simple, here is the simplified algorithm:

// anychanges is global and was set by the mark()ing of the
// stacks and root set
while anychanges
    anychanges = 0
    foreach pool in pooltable
        foreach bit_pos in pool.scan
            if not pool.scan.test(bit_pos)
                continue
            pool.scan.clear(bit_pos) // mark as already scanned
            bin_size = find_bin_for_bit(pool, bit_pos)
            bin_base_addr = find_base_addr_for_bit(pool, bit_pos)
            if bin_size < B_PAGE // small object
                bin_top_addr = bin_base_addr + bin_size
            else if bin_size in [B_PAGE, B_PAGEPLUS] // big object
                page_num = (bin_base_addr - pool.baseAddr) / PAGESIZE
                if bin == B_PAGEPLUS // search for the base page
                    while pool.pagetable[page_num - 1] != B_PAGE
                        page_num--
                n_pages = 1
                while page_num + n_pages < pool.ncommitted
                        and pool.pagetable[page_num + n_pages] == B_PAGEPLUS
                    n_pages++
                bin_top_addr = bin_base_addr + n_pages * PAGESIZE
            mark(bin_base_addr, bin_top_addr)

The original algorithm has some optimizations for proccessing bits in clusters (skips groups of bins without the scan bit) and some kind-of bugs too.

Again, the functions in all_lower_case don't really exist, some pointer arithmetics are done in place for finding those values.

Note that the pools are iterated over and over again until there are no unvisited bins. I guess this is a fair price to pay for not having a mark stack (but I'm not really sure =).

thread_resumeAll()

This is, again, part of the threads runtime and resume all the paused threads by signaling a SIGUSR2 to them.

sweep()

mark_unmarked_free()
rebuild_free_list()

mark_unmarked_free()

This (invented) function looks for unmarked bins and set the freebits bit on them if they are small objects (bin size smaller than B_PAGE) or mark the entire page as free (B_FREE) in case of large objects.

This step is in charge of executing destructors too (through rt_finalize() the runtime function).

rebuild_free_list()

This (also invented) function first clear the free list (bucket) and then rebuild it using the information collected in the previous step.

As usual, only bins with size smaller than B_PAGE are linked to the free list, except if the pages they belong to have all the bins freed, in which case the page is marked with the special B_FREE bin size. The same goes for big objects freed in the previous step.

I think rebuilding the whole free list is not necessary, the new free bins could be just linked to the existing free list. I guess this step exists to help reducing fragmentation, since the rebuilt free list group bins belonging to the same page together.

Luca's meaningless thoughts

Understanding the current GC, conclusion

Tri-colour abstraction

Improvements

Mark stack

Bitmap marking

Lazy sweep

Conclusion

Understanding the current GC, the end