Luca's meaningless thoughts  

Memory allocation patterns

by Leandro Lucarella on 2010- 08- 14 03:28 (updated on 2010- 08- 14 13:09)
tagged allocation, benchmark, cdgc, d, dgc, dil, en, gc, memory, pattern - with 0 comment(s)

Note

Tango 0.99.9 has a bug in its runtime, which sometimes makes the GC scan memory that should not be scanned. It only affects Dil and Voronoi programs, but in a significant way. The tests in this post are done using a patched runtime, with the bug fixed.

Update

The results for the unpublished programs are now available. You can find the graphic results, the detailed summary and the source code for all the programs (except dil, which can be downloaded from its home site).

After seeing some weird behaviours and how different benchmarks are more or less affected by changes like memory addresses returned by the OS or by different ways to store the type information pointer, I decided to gather some information about how much and what kind of memory are requested by the different benchmarks.

I used the information provided by the malloc_stats_file CDGC option, and generated some stats.

The analysis is done on the allocations requested by the program (calls to gc_malloc()) and contrasting that with the real memory allocated by the GC. Note that only the GC heap memory (that is, memory dedicated to the program, which the GC scans in the collections) is counted (internal GC memory used for bookkeeping is not).

Also note that in this post I generally refer to object meaning a block of memory, it doesn't mean they are actually instance of a class or anything. Finally bear in mind that all the figures shown here are the sum of all the allocations done in the life of a program. If the collected data says a program requested 1GiB of memory, that doesn't mean the program had a residency of 1GiB, the program could had a working set of a few KiB and recycled memory like hell.

When analyzing the real memory allocated by the GC, there are two modes being analyzed, one is the classic conservative mode and the other is the precise mode (as it is in the original patch, storing the type information pointer at the end of the blocks). So the idea here is to measure two major things:

  • The amount of memory wasted by the GC because of how it arranges memory as fixed-size blocks (bins) and large objects that uses whole pages.
  • The extra amount of memory wasted by the GC when using precise mode because it stores the type information pointer at the end of the blocks.

I've selected a few representative benchmarks. Here are the results:

bh allocation pattern

This is a translation by Leonardo Maffi from the Olden Benchmark that does a Barnes–Hut simulation. The program is CPU intensive an does a lot of allocation of about 5 different small objects.

Here is a graphic summary of the allocation requests and real allocated memory for a run with -b 4000:

https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/bh.rq.tot.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/bh.rq.bin.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/bh.ws.tot.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/bh.ws.bin.png

We can easily see here how the space wasted by the GC memory organization is significant (about 15% wasted), and how the type information pointer is adding an even more significant overhead (about 36% of the memory is wasted). This means that this program will be 15% more subject to false pointers (and will have to scan some extra memory too, but fortunately the majority of the memory doesn't need to be scanned) than it should in conservative mode and that the precise mode makes things 25% worse.

You can also see how the extra overhead in the precise mode is because some objects that should fit in a 16 bin now need a 32 bytes bin to hold the extra pointer. See how there were no waste at all in the conservative mode for objects that should fit a 16 bytes bin. 117MiB are wasted because of that.

Here is a more detailed (but textual) summary of the memory requested and allocated:

Requested
Total
15,432,462 objecs, 317,236,335 bytes [302.54MiB]
Scanned
7,757,429 (50.27%) objecs, 125,360,510 bytes [119.55MiB] (39.52%)
Not scanned
7,675,033 (49.73%) objecs, 191,875,825 bytes [182.99MiB] (60.48%)
Different object sizes
8
Objects requested with a bin size of:
16 bytes
7,675,064 (49.73%) objects, 122,801,024 bytes [117.11MiB] (38.71%)
32 bytes
7,734,214 (50.12%, 99.85% cumulative) objects, 193,609,617 bytes [184.64MiB] (61.03%, 99.74% cumulative)
64 bytes
23,181 (0.15%, 100% cumulative) objects, 824,988 bytes [805.65KiB] (0.26%, 100% cumulative)
256 bytes
2 (0%, 100% cumulative) objects, 370 bytes (0%, 100% cumulative)
512 bytes
1 (0%, 100% cumulative) objects, 336 bytes (0%, 100% cumulative)
Allocated
Conservative mode
Total allocated
371,780,480 bytes [354.56MiB]
Total wasted
54,544,145 bytes [52.02MiB], 14.67%
Wasted due to objects that should use a bin of
16 bytes
0 bytes (0%)
32 bytes
53,885,231 bytes [51.39MiB] (98.79%, 98.79% cumulative)
64 bytes
658,596 bytes [643.16KiB] (1.21%, 100% cumulative)
256 bytes
142 bytes (0%, 100% cumulative)
512 bytes
176 bytes (0%, 100% cumulative)
Precise mode
Total allocated
495,195,296 bytes [472.26MiB]
Total wasted
177,958,961 bytes [169.71MiB], 35.94%
Wasted due to objects that should use a bin of
16 bytes
122,801,024 bytes [117.11MiB] (69.01%)
32 bytes
54,499,023 bytes [51.97MiB] (30.62%, 99.63% cumulative)
64 bytes
658,596 bytes [643.16KiB] (0.37%, 100% cumulative)
256 bytes
142 bytes (0%, 100% cumulative)
512 bytes
176 bytes (0%, 100% cumulative)

bigarr allocation pattern

This is a extremely simple program that just allocate a big array of small-medium objects (all of the same size) I found in the D NG.

Here is the graphic summary:

https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/bigarr.rq.tot.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/bigarr.rq.bin.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/bigarr.ws.tot.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/bigarr.ws.bin.png

The only interesting part of this test is how many space is wasted because of the memory organization, which in this case goes up to 30% for the conservative mode (and have no change for the precise mode).

Here is the detailed summary:

Requested
Total
12,000,305 objecs, 1,104,160,974 bytes [1.03GiB]
Scanned
12,000,305 (100%) objecs, 1,104,160,974 bytes [1.03GiB] (100%)
Not scanned
0 (0%) objecs, 0 bytes (0%)
Different object sizes
5
Objects requested with a bin size of
128 bytes
12,000,000 (100%, 100% cumulative) objects, 1,056,000,000 bytes [1007.08MiB] (95.64%, 95.64% cumulative)
256 bytes
2 (0%, 100% cumulative) objects, 322 bytes (0%, 95.64% cumulative)
512 bytes
1 (0%, 100% cumulative) objects, 336 bytes (0%, 95.64% cumulative)
more than a page
302 (0%) objects, 48,160,316 bytes [45.93MiB] (4.36%)
Allocated
Conservative mode
Total allocated
1,584,242,808 bytes [1.48GiB]
Total wasted
480,081,834 bytes [457.84MiB], 30.3%
Wasted due to objects that should use a bin of
128 bytes
480,000,000 bytes [457.76MiB] (99.98%, 99.98% cumulative)
256 bytes
190 bytes (0%, 99.98% cumulative)
512 bytes
176 bytes (0%, 99.98% cumulative)
more than a page
81,468 bytes [79.56KiB] (0.02%)
Precise mode
Total allocated
1,584,242,808 bytes [1.48GiB]
Total wasted
480,081,834 bytes [457.84MiB], 30.3%
Wasted due to objects that should use a bin of:
128 bytes
480,000,000 bytes [457.76MiB] (99.98%, 99.98% cumulative)
256 bytes
190 bytes (0%, 99.98% cumulative)
512 bytes
176 bytes (0%, 99.98% cumulative)
more than a page
81,468 bytes [79.56KiB] (0.02%)

mcore allocation pattern

This is program that test the contention produced by the GC when appending to (thread-specific) arrays in several threads concurrently (again, found at the D NG). For this analysis the concurrency doesn't play any role though, is just a program that do a lot of appending to a few arrays.

Here are the graphic results:

https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/mcore.rq.tot.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/mcore.rq.bin.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/mcore.ws.tot.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/mcore.ws.bin.png

This is the most boring of the examples, as everything works as expected =)

You can clearly see how the arrays grow, passing through each bin size and finally becoming big objects which take most of the allocated space. Almost nothing need to be scanned (they are int arrays), and practically there is no waste. That's a good decision by the array allocation algorithm, which seems to exploit the bin sizes to the maximum. Since almost all the data is doesn't need to be scanned, there is no need to store the type information pointers, so there is no waste either for the precise mode (the story would be totally different if the arrays were of objects that should be scanned, as probably each array allocation would waste about 50% of the memory to store the type information pointer).

Here is the detailed summary:

Requested
Total requested
367 objecs, 320,666,378 bytes [305.81MiB]
Scanned
8 (2.18%) objecs, 2,019 bytes [1.97KiB] (0%)
Not scanned
359 (97.82%) objecs, 320,664,359 bytes [305.81MiB] (100%)
Different object sizes
278
Objects requested with a bin size of
16 bytes
4 (1.09%) objects, 20 bytes (0%)
32 bytes
5 (1.36%, 2.45% cumulative) objects, 85 bytes (0%, 0% cumulative)
64 bytes
4 (1.09%, 3.54% cumulative) objects, 132 bytes (0%, 0% cumulative)
128 bytes
4 (1.09%, 4.63% cumulative) objects, 260 bytes (0%, 0% cumulative)
256 bytes
6 (1.63%, 6.27% cumulative) objects, 838 bytes (0%, 0% cumulative)
512 bytes
9 (2.45%, 8.72% cumulative) objects, 2,708 bytes [2.64KiB] (0%, 0% cumulative)
1024 bytes
4 (1.09%, 9.81% cumulative) objects, 2,052 bytes [2KiB] (0%, 0% cumulative)
2048 bytes
4 (1.09%, 10.9% cumulative) objects, 4,100 bytes [4KiB] (0%, 0% cumulative)
4096 bytes
4 (1.09%, 11.99% cumulative) objects, 8,196 bytes [8KiB] (0%, 0.01% cumulative)
more than a page
323 (88.01%) objects, 320,647,987 bytes [305.79MiB] (99.99%)
Allocated
Conservative mode
Total allocated
321,319,494 bytes [306.43MiB]
Total wasted
653,116 bytes [637.81KiB], 0.2%
Wasted due to objects that should use a bin of
16 bytes
44 bytes (0.01%)
32 bytes
75 bytes (0.01%, 0.02% cumulative)
64 bytes
124 bytes (0.02%, 0.04% cumulative)
128 bytes
252 bytes (0.04%, 0.08% cumulative)
256 bytes
698 bytes (0.11%, 0.18% cumulative)
512 bytes
1,900 bytes [1.86KiB] (0.29%, 0.47% cumulative)
1024 bytes
2,044 bytes [2KiB] (0.31%, 0.79% cumulative)
2048 bytes
4,092 bytes [4KiB] (0.63%, 1.41% cumulative)
4096 bytes
8,188 bytes [8KiB] (1.25%, 2.67% cumulative)
more than a page
635,699 bytes [620.8KiB] (97.33%)
Precise mode
Total allocated
321,319,494 bytes [306.43MiB]
Total wasted
653,116 bytes [637.81KiB], 0.2%
Wasted due to objects that should use a bin of
16 bytes
44 bytes (0.01%)
32 bytes
75 bytes (0.01%, 0.02% cumulative)
64 bytes
124 bytes (0.02%, 0.04% cumulative)
128 bytes
252 bytes (0.04%, 0.08% cumulative)
256 bytes
698 bytes (0.11%, 0.18% cumulative)
512 bytes
1,900 bytes [1.86KiB] (0.29%, 0.47% cumulative)
1024 bytes
2,044 bytes [2KiB] (0.31%, 0.79% cumulative)
2048 bytes
4,092 bytes [4KiB] (0.63%, 1.41% cumulative)
4096 bytes
8,188 bytes [8KiB] (1.25%, 2.67% cumulative)
more than a page
635,699 bytes [620.8KiB] (97.33%)

voronoi allocation pattern

This is one of my favourites, because is always problematic. It "computes the voronoi diagram of a set of points recursively on the tree" and is also taken from the Olden Benchmark and translated by Leonardo Maffi to D.

Here are the graphic results for a run with -n 30000:

https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/voronoi.rq.tot.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/voronoi.rq.bin.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/voronoi.ws.tot.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/voronoi.ws.bin.png

This have a little from all the previous examples. Practically all the heap should be scanned (as in bigarr), it wastes a considerably portion of the heap because of the fixed-size blocks (as all but mcore), it wastes just a very little more because of type information (as all but bh) but that waste comes from objects that should fit in a 16 bytes bin but is stored in a 32 bytes bin instead (as in bh).

Maybe that's why it's problematic, it touches a little mostly all the GC flaws.

Here is the detailed summary:

Requested
Total requested
1,309,638 objecs, 33,772,881 bytes [32.21MiB]
Scanned
1,309,636 (100%) objecs, 33,772,849 bytes [32.21MiB] (100%)
Not scanned
2 (0%) objecs, 32 bytes (0%)
Different object sizes
6
Objects requested with a bin size of
16 bytes
49,152 (3.75%) objects, 786,432 bytes [768KiB] (2.33%)
32 bytes
1,227,715 (93.74%, 97.5% cumulative) objects, 31,675,047 bytes [30.21MiB] (93.79%, 96.12% cumulative)
64 bytes
32,768 (2.5%, 100% cumulative) objects, 1,310,720 bytes [1.25MiB] (3.88%, 100% cumulative)
256 bytes
2 (0%, 100% cumulative) objects, 346 bytes (0%, 100% cumulative)
512 bytes
1 (0%, 100% cumulative) objects, 336 bytes (0%, 100% cumulative)
Allocated
Conservative mode
Total allocated
42,171,488 bytes [40.22MiB]
Total wasted
8,398,607 bytes [8.01MiB], 19.92%
Wasted due to objects that should use a bin of
16 bytes
0 bytes (0%)
32 bytes
7,611,833 bytes [7.26MiB] (90.63%, 90.63% cumulative)
64 bytes
786,432 bytes [768KiB] (9.36%, 100% cumulative)
256 bytes
166 bytes (0%, 100% cumulative)
512 bytes
176 bytes (0%, 100% cumulative)
Precise mode
Total allocated
42,957,888 bytes [40.97MiB]
Total wasted
9,185,007 bytes [8.76MiB], 21.38%
Wasted due to objects that should use a bin of
16 bytes
786,400 bytes [767.97KiB] (8.56%)
32 bytes
7,611,833 bytes [7.26MiB] (82.87%, 91.43% cumulative)
64 bytes
786,432 bytes [768KiB] (8.56%, 100% cumulative)
256 bytes
166 bytes (0%, 100% cumulative)
512 bytes
176 bytes (0%, 100% cumulative)

Dil allocation pattern

Finally, this is by far my favourite, the only real-life program, and the most colorful example (literally =).

Dil is a D compiler, and as such, it works a lot with strings, a lot of big chunks of memory, a lot of small objects, it has it all! String manipulation stress the GC a lot, because it uses objects (blocks) of all possible sizes ever, specially extremely small objects (less than 8 bytes, even a lot of blocks of just one byte!).

Here are the results of a run of Dil to generate Tango documentation (around 555 source files are processed):

https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/dil.rq.tot.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/dil.rq.bin.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/dil.ws.tot.png https://llucax.com:443/blog/posts/2010/08/14-memory-allocation/dil.ws.bin.png

Didn't I say it was colorful?

This is like the voronoi but taken to the extreme, it really have it all, it allocates all types of objects in significant quantities, it wastes a lot of memory (23%) and much more when used in precise mode (33%).

Here is the detailed summary:

Requested
Total
7,307,686 objecs, 322,411,081 bytes [307.48MiB]
Scanned
6,675,124 (91.34%) objecs, 227,950,157 bytes [217.39MiB] (70.7%)
Not scanned
632,562 (8.66%) objecs, 94,460,924 bytes [90.08MiB] (29.3%)
Different object sizes
6,307
Objects requested with a bin size of
16 bytes
2,476,688 (33.89%) objects, 15,693,576 bytes [14.97MiB] (4.87%)
32 bytes
3,731,864 (51.07%, 84.96% cumulative) objects, 91,914,815 bytes [87.66MiB] (28.51%, 33.38% cumulative)
64 bytes
911,016 (12.47%, 97.43% cumulative) objects, 41,918,888 bytes [39.98MiB] (13%, 46.38% cumulative)
128 bytes
108,713 (1.49%, 98.91% cumulative) objects, 8,797,572 bytes [8.39MiB] (2.73%, 49.11% cumulative)
256 bytes
37,900 (0.52%, 99.43% cumulative) objects, 6,354,323 bytes [6.06MiB] (1.97%, 51.08% cumulative)
512 bytes
22,878 (0.31%, 99.75% cumulative) objects, 7,653,461 bytes [7.3MiB] (2.37%, 53.45% cumulative)
1024 bytes
7,585 (0.1%, 99.85% cumulative) objects, 4,963,029 bytes [4.73MiB] (1.54%, 54.99% cumulative)
2048 bytes
3,985 (0.05%, 99.9% cumulative) objects, 5,451,493 bytes [5.2MiB] (1.69%, 56.68% cumulative)
4096 bytes
2,271 (0.03%, 99.93% cumulative) objects, 6,228,433 bytes [5.94MiB] (1.93%, 58.61% cumulative)
more than a page
4,786 (0.07%) objects, 133,435,491 bytes [127.25MiB] (41.39%)
Allocated
Conservative mode
Total allocated
419,368,774 bytes [399.94MiB]
Total wasted
96,957,693 bytes [92.47MiB], 23.12%
Wasted due to objects that should use a bin of
16 bytes
23,933,432 bytes [22.82MiB] (24.68%)
32 bytes
27,504,833 bytes [26.23MiB] (28.37%, 53.05% cumulative)
64 bytes
16,386,136 bytes [15.63MiB] (16.9%, 69.95% cumulative)
128 bytes
5,117,692 bytes [4.88MiB] (5.28%, 75.23% cumulative)
256 bytes
3,348,077 bytes [3.19MiB] (3.45%, 78.68% cumulative)
512 bytes
4,060,075 bytes [3.87MiB] (4.19%, 82.87% cumulative)
1024 bytes
2,804,011 bytes [2.67MiB] (2.89%, 85.76% cumulative)
2048 bytes
2,709,787 bytes [2.58MiB] (2.79%, 88.56% cumulative)
4096 bytes
3,073,583 bytes [2.93MiB] (3.17%, 91.73% cumulative)
more than a page
8,020,067 bytes [7.65MiB] (8.27%)
Precise mode:
Total allocated
482,596,774 bytes [460.24MiB]
Total wasted
160,185,693 bytes [152.76MiB], 33.19%
Wasted due to objects that should use a bin of
16 bytes
26,820,824 bytes [25.58MiB] (16.74%)
32 bytes
85,742,913 bytes [81.77MiB] (53.53%, 70.27% cumulative)
64 bytes
18,070,872 bytes [17.23MiB] (11.28%, 81.55% cumulative)
128 bytes
5,221,884 bytes [4.98MiB] (3.26%, 84.81% cumulative)
256 bytes
3,400,557 bytes [3.24MiB] (2.12%, 86.93% cumulative)
512 bytes
4,125,611 bytes [3.93MiB] (2.58%, 89.51% cumulative)
1024 bytes
2,878,763 bytes [2.75MiB] (1.8%, 91.31% cumulative)
2048 bytes
2,760,987 bytes [2.63MiB] (1.72%, 93.03% cumulative)
4096 bytes
3,143,215 bytes [3MiB] (1.96%, 94.99% cumulative)
more than a page
8,020,067 bytes [7.65MiB] (5.01%)

Conclusion

I've analyzed other small fabricated benchmarks, but all of them had results very similar to the ones shown here.

I think the overallocation problem is more serious than what one might think at first sight. Bear in mind this is not GC overhead, is not because of internal GC data. Is memory the GC or the mutator cannot use. Is memory wasted because of fragmentation (planned fragmentation, but fragmentation at least). And I don't think this is the worse problem. The worse problem is, this memory will need to be scanned in most cases (Dil needs to scan 70% of the total memory requested), and maybe the worse of all is that is subject to false pointer. A false pointer to a memory location that is not actually being used by the program will keep the block alive! If is a large object (several pages) that could be pretty nasty.

This problems can be addressed in several ways. One is mitigate the problem by checking (when type information is available) what portions of the memory is really used and what is wasted, and don't keep things alive when they are only pointed to wasted memory. This is not free though, it will consume more CPU cycles so the solution could be worse than the problem.

I think it worth experimenting with other heap organizations, for example, I would experiment with one free list for object size instead of pre-fixed-sizes. I would even experiment with a free list for each type when type information is available, that would save a lot of space (internal GC space) when storing type information. Some specialization for strings could be useful too.

Unfortunately I don't think I'll have the time to do this, at least for the thesis, but I think is a very rich and interesting ground to experiment.

Allocations graphs

by Leandro Lucarella on 2009- 08- 26 21:54 (updated on 2009- 08- 26 21:54)
tagged allocation, benchmark, d, dgc, dgcbench, en, gc, graph, naive, statistics - with 0 comment(s)

Here are a set of improved statistics graphs, now including allocation statistics. All the data is plotted together and using the same timeline to ease the analysis and comparison.

Again, all graphs (as the graph title says), are taken using the Naive GC (stats code still not public yet :) and you can find the code for it in my D GC benchmark repository.

This time the (big) graphs are in EPS format because I could render them in PNG as big as I wanted and I didn't had the time to fix that =S

big_arrays rnd_data rnd_data_2 shootout_binarytrees split startup tree

The graphs shows the same as in the previous post with the addition of allocation time (how long it took to perform the allocation) and space (how many memory has been requested), which are rendered in the same graph, and an histogram of cell sizes. The histogram differentiates cells with and without the NO_SCAN bit, which might be useful in terms on seeing how bad the effect of false positives could be.

You can easily see how allocation time peeks match allocations that triggered a collection for example, and how bad can it be the effect of false positives, even when almost all the heap (99.99%) has the NO_SCAN bit (see rnd_data_2).

Understanding the current GC, part III

by Leandro Lucarella on 2009- 04- 10 02:28 (updated on 2009- 04- 10 02:28)
tagged allocation, d, dgc, druntime, en, gc, mark-sweep, understanding the current gc - with 0 comment(s)

In the previous post we focused on the Gcx object, the core of the GC in druntime (and Phobos and Tango, they are all are based on the same implementation). In this post we will focus on allocation, which a little more complex than it should be in my opinion.

It was not an easy task to follow how allocation works. A GC.malloc() call spawns into this function calls:

GC.malloc(size, bits)
 |
 '---> GC.mallocNoSync(size, bits)
        |
        |---> Gcx.allocPage(bin_size)
        |      |
        |      '---> Pool.allocPages(n_pages)
        |             |
        |             '---> Pool.extendPages(n_pages)
        |                    |
        |                    '---> os_mem_commit(addr, offset, size)
        |
        |---> Gcx.fullcollectshell()
        |
        |---> Gcx.newPool(n_pages)
        |      |
        |      '---> Pool.initialize(n_pages)
        |             |
        |             |---> os_mem_map(mem_size)
        |             |
        |             '---> GCBits.alloc(bits_size)
        |
        '---> Gcx.bigAlloc(size)
               |
               |---> Pool.allocPages(n_pages)
               |      '---> (...)
               |
               |---> Gcx.fullcollectshell()
               |
               |---> Gcx.minimize()
               |      |
               |      '---> Pool.Dtor()
               |             |
               |             |---> os_mem_decommit(addr, offset, size)
               |             |
               |             |---> os_mem_map(addr, size)
               |             |
               |             '---> GCBits.Dtor()
               |
               '---> Gcx.newPool(n_pages)
                      '---> (...)

Doesn't look so simple, ugh?

The map/commit differentiation of Windows doesn't exactly help simplicity. Note that Pool.initialize() maps the memory (reserve the address space) while Pool.allocPages() (through Pool.extendPages()) commit the new memory (ask the OS to actually reserve the virtual memory). I don't know how good is this for Windows (or put in another way, how bad could it be for Windows if all mapped memory gets immediately committed), but it adds a new layer of complexity (that's not even needed in Posix OSs). The whole branch starting at Gcx.allocPage(bin_size) would be gone if this distinction it's not made. Besides this, it worsen Posix OSs performance, because there are some non-trivial lookups to handle this non-existing non-committed pages, even when the os_mem_commit() and os_mem_decommit() functions are NOP and can be optimized out, the lookups are there.

Mental Note

See if getting rid of the commit()/decommit() stuff improves Linux performance.

But well, let's forget about this issue for now and live with it. Here is a summary of what all this functions do.

Note

I recommend to give another read to the (updated) previous posts of this series, specially if you are not familiar with the Pool concept and implementation.

GC.malloc(size, bits)
This is just a wrapper for multi-threaded code, it takes the GCLock if necessary and calls GC.mallocNoSync(size, bits).
GC.mallocNoSync(size, bits)

This function has 2 different algorithms for small objects (less than a page of 4KiB) and another for big objects.

It does some common work for both cases, like logging and adding a sentinel for debugging purposes (if those feature are enabled), finding the bin size (bin_size) that better fits size (and cache the result as an optimization for consecutive calls to malloc with the same size) and setting the bits (NO_SCAN, NO_MOVE, FINALIZE) to the allocated bin.

Small objects (bin_size < B_PAGE)
Looks at the free list (Gcx.bucket) trying to find a page with the minimum bin size that's equals or bigger than size. If it can't succeed, it calls Gcx.allocPage(bin_size) to find room in uncommitted pages. If there still no room for the requested amount of memory, it triggers a collection (Gcx.fullcollectshell()). If there is still no luck, Gcx.newPage(1) is called to ask the OS for more memory. Then it calls again Gcx.allocPage(bin_size) (remember the new memory is just mmap'ped but not commit'ed) and if there is no room in the free list still, an out of memory error is issued.
Big objects (B_PAGE and B_PAGEPLUS)
It simply calls Gcx.bigAlloc(size) and issue an out of memory error if that call fails to get the requested memory.
Gcx.allocPage(bin_size)
This function linearly search the pooltable for a Pool with an allocable page (i.e. a page already mapped by not yet committed). This is done through a call to Pool.allocPages(1). If a page is found, its bin size is set to bin_size via the Pool's pagetable, and all the bins of that page are linked to the free list (Gcx.bucket).
Pool.allocPages(n_pages)
Search for n_pages consecutive free pages (B_FREE) in the committed pages (pages in the pagetable with index up to ncommited). If they're not found, Pool.extendPages(n_pages) is called to commit some more mapped pages to fulfill the request.
Pool.extendPages(n_pages)
Commit n_pages already mapped pages (calling os_mem_commit()), setting them as free (B_FREE) and updating the ncommited attribute. If there are not that many uncommitted pages, it returns an error.
Gcx.newPool(n_pages)
This function adds a new Pool to the pooltable. It first adjusts the n_pages variable using various rules (for example, it duplicates the current allocated memory until 8MiB are allocated and then allocates 8MiB pools always, unless more memory is requested in the first place, of course). Then a new Pool is created with the adjusted n_pages value and it's initialized calling to Pool.initialize(n_pages), the pooltable is resized to fit the new number of pools (npools) and sorted using Pool.opCmp() (which uses the baseAddr to compare). Finally the minAddr and maxAddr attributes are updated.
Pool.initialize(n_pages)
Initializes all the Pool attributes, mapping the requested number of pages (n_pages) using os_mem_map(). All the bit sets (mark, scan, freebits, noscan) are allocated (using GCBits.alloc()) to n_pages * PAGESIZE / 16 bits and the pagetable too, setting all bins to B_UNCOMMITTED and ncommitted to 0.
Gcx.bigAlloc(size)

This is the weirdest function by far. There are very strange things, but I'll try to explain what I understand from it (what I think it's trying to do).

It first make a simple lookup in the pooltable for n_pages consecutive pages in any existing Pool (calling Pool.allocPages(n_pages) as in Gcx.allocPage()). If this fails, it runs a fullcollectshell() (if not disabled) then calls to minimize() (to prevent bloat) and then create a new pool (calling newPool() followed by Pool.allocPages()). If all that fails, it returns an error. If something succeed, the bin size for the first page is set to B_PAGE and the remaining pages are set to B_PAGEPLUS (if any). If there is any unused memory at the end, it's initialized to 0 (to prevent false positives when scanning I guess).

The weird thing about this, is that a lot of lookups into the pooltable are done in certain condition, but I think they are not needed because there are no changes that can make new room.

I don't know if this is legacy code that never got updated and have a lot of useless lookups or if I'm getting something wrong. Help is welcome!

There is not much to say about os_mem_xxx(), Gcx.minimize() and Gcx.fullcollectshell() functions, they were briefly described in the previous posts of this series. Pool.Dtor() just undo what was done in Pool.initialize().

A final word about the free list (Gcx.bucket). It's just a simple linked list. It uses the first size_t bytes of the free bin to point to the next free bin (there's always room for a pointer in a bin because their minimum size is 16 bytes). A simple structure is used to easy this:

struct List {
    List *next;
}

Then, the memory cell is casted to this structure to use the next pointer, like this:

p = gcx.bucket[bin]
gcx.bucket[bin] = (cast(List*) p).next

I really have my doubts if this is even a little less cryptic than:

p = gcx.bucket[bin]
gcx.bucket[bin] = *(cast(void**) p)

But what the hell, this is no really important =)