Luca's meaningless thoughts

Delegates and inlining

by Leandro Lucarella on 2010-06-28 15:30 (updated on 2010-06-28 15:30)
tagged d, delegate, dgc, en, gc, inline, inlining, optimization, performance - with 0 comment(s)

Sometimes performance issues matter more for a language than you might think. In this case I'm talking about the D programming language.

I'm trying to improve the GC, and I want to improve it not only in terms of performance, but in terms of code quality too. But I'm hitting some performance issues that prevent me from making the code better.

D supports high-level constructs, like delegates (a.k.a. closures). For example, to do a simple linear search I wanted to use this code:

// This lives in a templated container in the GC code, where _data
// is a T* and _size is the number of stored elements.
T* find_if(bool delegate(ref T) predicate)
{
   // Linear search: return a pointer to the first element matching
   // the predicate, or null if there is none.
   for (size_t i = 0; i < this._size; i++)
      if (predicate(this._data[i]))
         return this._data + i;
   return null;
}
...
auto p = find_if((ref T t) { return t > 5; });

But DMD doesn't inline that predicate (nor the find_if() call, for that matter), so you're basically screwed: suddenly your code is ~4x slower. Seriously, I'm not joking. Profiling the program with callgrind (DMD's profiler doesn't work for me; I get a stack overflow from a recursive call when I try to use it) shows the call takes 4x more instructions, and in a real-life example, using Dil to generate the Tango documentation, I get a 3.3x performance penalty for using this high-level construct.

I guess this is why D2's sort uses string mixins instead of delegates for this kind of thing. The only lesson I can draw from this is that delegates are failing in D, either because they have a bad syntax (compare sort(x, (ref X a, ref X b) { return a > b; }) with sort!"a < b"(x)) or because their performance sucks (mixins are inlined by definition, think of C macros). The language designer is telling you: "don't use that feature".
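To make the comparison concrete, here's a minimal sketch of the two styles side by side, using D2's std.algorithm.sort (the predicates are just illustrative):

import std.algorithm;

void main()
{
    auto x = [3, 1, 2];

    // Delegate/lambda predicate: a first-class closure, but DMD
    // doesn't inline it through the template call.
    sort!((int a, int b) { return a < b; })(x);

    // String mixin predicate: the comparison is compiled directly
    // into this sort instance, so it's "inlined" by construction.
    sort!"a < b"(x);
}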

Fortunately the latter is only a DMD issue: LDC is able to inline those predicates (they have to inhibit the DMD front-end's inlining to let LLVM do the dirty work, and it definitely does it better).

The problem is that I can't use LDC, because for some unknown reason it produces a non-working Dil executable, and Dil is the only real-life program I have to test and benchmark the GC.

I think this issue really hurts D, because if you can't write performance-critical code using higher-level D constructs, you can't showcase your own language where it matters most.

GC optimization for contiguous pointers to the same page

by Leandro Lucarella on 2009-04-01 23:41 (updated on 2009-04-01 23:41)
tagged d, dgc, en, gc, optimization, phobos - with 0 comment(s)

This optimization had a patch, written by Vladimir Panteleev, sitting in Bugzilla (issue #1923) for a little more than a year now. It was already included in both Tango (issue #982) and DMD 2.x, but DMD 1.x was missing it.

Fortunately it's now included in DMD 1.042, released yesterday.

This optimization is best seen when you do word splitting of a big text (as shown in the post that triggered the patch):

import std.file, std.string;
void main() {
    auto txt = cast(string) read("text.txt"); // 6.3 MiB of text
    auto words = txt.split();
}

Now in words we have an array of slices (a contiguous area of memory filled with pointers) about the same size as the original text, as explained by Vladimir.
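(To see why slices are just interior pointers: a dynamic array in D is a length/pointer pair, so an array of slices is literally a packed block of pointers into txt. A one-line check, assuming the usual ABI:)

// A slice is a (length, pointer) pair, so words above is one
// contiguous block of interior pointers into txt's memory.
static assert(string.sizeof == size_t.sizeof + (char*).sizeof);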

The GC heap is divided into (4 KiB) pages; each page contains cells of a fixed size, called bins. Bin sizes go from 16 (B_16) to 4096 (B_PAGE) bytes, in increasing powers of 2 (32, 64, etc.). See Understanding the current GC for more details.

For large contiguous objects (like txt in this case) multiple pages are needed, and those pages contain a single bin of size B_PAGEPLUS, indicating that the object is distributed among several pages.
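For reference, here's a sketch of these size classes as an enum, roughly as the collector's gcx.d declares them (the exact declaration in the runtime may differ slightly):

// Bin size classes: fixed-size bins from 16 bytes up to a whole
// 4 KiB page, plus a marker for the continuation pages of objects
// spanning multiple pages.
enum Bins
{
    B_16,
    B_32,
    B_64,
    B_128,
    B_256,
    B_512,
    B_1024,
    B_2048,
    B_PAGE,     // one object filling a whole 4096-byte page
    B_PAGEPLUS, // continuation page of a multi-page object
    B_FREE,     // page not currently assigned to any bin size
}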

Now, back to the words array: we have about 3 million interior pointers into txt's contiguous memory (stored in about 1600 pages of bins of size B_PAGEPLUS). Each time the GC needs to mark the heap, it has to follow these 3 million pointers and find the beginning of the block each one points into, to check its mark state (whether it's marked or not). Finding the beginning of the block is not that slow, but multiplied by 3 million it gets a little noticeable. Especially since this is done several times as the dynamic array of words grows and triggers collections, so the cost compounds quickly.

The optimization consists in remembering the last page visited when the bin size was B_PAGE or B_PAGEPLUS, so if the current pointer being followed points into the last visited (cached) page, we can skip this lookup (and all the marking too, since we know we already visited that page).
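Here's a minimal, self-contained sketch of the idea (not the actual gcx.d code; pageOf() and the elided mark-bit handling are hypothetical stand-ins):

enum PAGESIZE = 4096;

// Round a pointer down to the start of the page containing it.
void* pageOf(void* p)
{
    return cast(void*) (cast(size_t) p & ~cast(size_t) (PAGESIZE - 1));
}

void markRange(void*[] pointers)
{
    void* cached = null; // last B_PAGE/B_PAGEPLUS page visited

    foreach (ptr; pointers)
    {
        // Pointer into the page we just processed? Then we already
        // found and marked that object: skip the whole lookup.
        if (cached !is null && pageOf(ptr) is cached)
            continue;

        // (The real patch caches the page only when its bin size is
        // B_PAGE or B_PAGEPLUS; small-bin pages don't benefit.)
        cached = pageOf(ptr);

        // ... find the start of the block, check and set its mark
        // bit, and queue it for scanning, as before ...
    }
}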