[erlang-questions] Is it possible to align binary's byte array to cache line boundary?

Sat Sep 17 22:50:45 CEST 2011

On Sat, Sep 17, 2011 at 18:30, max tan <maxtqm@REDACTED> wrote:

> I saw in BEAM's sources, that when the binary's size is 64 bytes or less,
> BEAM will allocate a ErlHeapBin on local heap,

If we limit ourselves to small binaries only, everything will be
allocated on the process heap. That heap is garbage collected, and its
size governs how the garbage collector behaves here. If the heap is
small enough, all of it will be in cache and there will be little
penalty for pulling a binary in from the first level cache: one or
two cache lines are needed.

Garbage collectors can sometimes outperform a manual memory
management routine, especially if you think about how much garbage you
create in your application. Unfortunately, I only have fairly old
papers which study the impact of garbage collectors versus
malloc()/free() style implementations. I dare not think they still
hold these days with multiple cores, cache coherency algorithms,
"shared memory" and new cache sizes. But for small heaps with few
objects living and most objects dead, all GCs will be in cache and run
fast as a result. Also note that messaging your binary is a 64 byte
copy (+ its header and other things - chances are that it can't fit
into a single cache line in the first place!).
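
To make the size argument concrete, here is a rough sketch in C. The
struct is a hypothetical stand-in for a small on-heap binary (two
header words followed by the payload), not the real BEAM definition,
and the 64-byte cache line is an assumption about the target CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Hypothetical layout of a small on-heap binary: a header word, a size
 * word, then the payload. Only an illustration of the arithmetic. */
typedef struct {
    uintptr_t thing_word;   /* header/tag word */
    uintptr_t size;         /* payload size in bytes */
    unsigned char data[64]; /* largest payload kept on the process heap */
} small_heap_bin;

/* With any header at all, a full 64-byte payload no longer fits in one line. */
static_assert(sizeof(small_heap_bin) > CACHE_LINE,
              "header + 64-byte payload exceeds one cache line");

/* Number of cache lines touched by an object of `bytes` bytes that
 * starts `offset` bytes into a cache line. */
static size_t cache_lines_spanned(size_t offset, size_t bytes)
{
    if (bytes == 0)
        return 0;
    return (offset % CACHE_LINE + bytes + CACHE_LINE - 1) / CACHE_LINE;
}
```

Under these assumptions the object always spans at least two lines,
wherever it starts, and even a bare 64-byte payload spans two lines
unless it happens to start exactly on a line boundary.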

So if your binaries are small, I'd rather worry about other things
than cache alignment, since it is then in the hands of the GC. I think
that the mantra of Erlang is to "mod out" such details in the program and
let the underlying VM handle it. If it proves to be too slow in
practice, you can always outsource the heavy work to a NIF, in which
you can get more explicit control over memory layout and use. In my
experience though, the binaries of Erlang have a performance that is
not too shabby, to put it mildly. Depending on the problem you can
often frame it in ways that will lead to less binary allocation and
work - that is, optimize algorithmically what is going on.
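
As an illustration of the kind of control you get on the C side, here
is a minimal sketch of cache-line-aligned allocation using C11
aligned_alloc(). It is plain C rather than a complete NIF: the helper
names are made up for this example, the 64-byte line size is an
assumption, and how you would hand such a buffer back to Erlang is
left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Round n up to the next multiple of the cache line size. */
static size_t round_up_to_line(size_t n)
{
    return (n + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
}

/* Allocate at least n zeroed bytes starting on a cache line boundary.
 * aligned_alloc() wants the size to be a multiple of the alignment,
 * so the request is padded. Release the result with free(). */
static void *alloc_line_aligned(size_t n)
{
    size_t padded = round_up_to_line(n ? n : 1);
    assert(padded % CACHE_LINE == 0);

    void *p = aligned_alloc(CACHE_LINE, padded);
    if (p != NULL)
        memset(p, 0, padded);
    return p;
}

/* True if p starts exactly on a cache line boundary. */
static int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}
```

A caller would use alloc_line_aligned(100) to get a 128-byte buffer
whose first byte sits at the start of a cache line, and free() it when
done. Whether this buys anything is exactly what you have to measure.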

What I'd suggest is that you quantify how fast you need your
operations to be, then measure whether you reach that goal and
optimize if you don't - and work out how many cores and nodes you need
to reach it at that speed. This will tell you whether you need to worry
about cache alignment performance in the GC. Erlang is focused on being
robust first, correct second and speedy third. My experience is that it
is often _fast enough_ but it can't compete in raw CPU-bound kernel
performance with OCaml, Java, C or C++. But then you throw robustness
overboard.

-- 
J.