[erlang-bugs] Re: [erlang-bugs 10] Re: Distributed node crashes silently when initially receiving a big chunk of messages from another node
Wed Mar 30 12:26:54 CEST 2011
I do not have a reasonably small demo yet, but I managed to get some
coredumps of beam.smp. The nodes crash with a segfault at
hipe_mode_switch.c, line 244 (of R14B02). This is code that is
responsible for calling a native code closure.
My application code does indeed send a few closures via messages, that
are later called by the receiver node. I do not use hot code upgrades
however, and the crashes are timing-related, as described before. I
therefore suspect the crashes are the result of a race condition
involving whatever code is responsible for making a received fun callable.
On 03/29/2011 03:26 PM, wrote:
> This sounds really bad! A demo application that reproduces the bug
> would be really nice.
> Have you tried to enable core dumps to see if the erlang node crashes
> with a segfault? I suppose there are no erl_crash.dump files left
> after the crash that I can look at either?
> Any way to reproduce it would make it more easy to find!
> On Mon, 28 Mar 2011, Philipp Unterbrunner wrote:
>> The bug persists in r14b02.
>> If I find time, I will make a small demo application so that others can
>> reproduce the bug.
>> On 02/23/2011 04:14 PM, Philipp Unterbrunner wrote:
>>> I have run into a serious and very annoying bug.
>>> Affects (at least); R13B04, R14A, R14B, R14B01
>>> Platform: Ubuntu Linux 10.10, kernel 2.6.35-25-server (SMP)
>>> When a newly started distributed node receives a high number of
>>> messages from another node, the newly started node crashes silently.
>>> Nothing is printed to the console. No crash dump or core dump is
>>> In trying to find a work-around, I found the following curious
>>> * The bug *only* occurs for distributed nodes (but regardless of
>>> whether the nodes run on the same machine).
>>> * Waiting a few seconds (or even longer) before sending the first
>>> message to the newly started node does *not* make a difference. The
>>> node will still crash when confronted with a large number of
>>> incoming messages later.
>>> * Speed matters. When doing a debug build, the bug appears less
>>> often then when doing a release build, especially when HiPE is
>>> enabled. However, I managed to cause the bug even in debug mode, and
>>> when OTP was not compiled with native libs. The bug is simply much
>>> less likely to be observed.
>>> * The number of messages sent *initially* matters most. Slowly
>>> "ramping up" the load is a work-around. Once a node is working at
>>> high throughput, it is OK to stop sending messages for an arbitrary
>>> period and at a later point send a big chunk of messages that would
>>> have killed the node if sent initially.
>>> * Timing matters. Running the receiver node with +T 7 or higher
>>> makes the problem disappear.
>>> * Setting the sender node's distribution buffer size to the minimum
>>> (+zdbbl 1) makes the problem appear less often.
>>> I have reproduced the bug in various applications. The behavior
>>> described above also makes it fairly obvious that the application is
>>> not at fault.
>>> Rather, it appears that the receiver node is unable to buffer
>>> incoming messages and crashes. Of particular interest here is the
>>> fact that "ramping up" the load is a work-around. I suspect a
>>> low-level race condition where the receiver node does not allocate
>>> sufficient buffer space in time and crashes.
>>> Given that the existing work-arounds are not desirable ("ramp up"
>>> requires changes to the application code, +T 7 and +zdbbl 1 decrease
>>> performance), and given that the bug now persists over multiple
>>> releases, I hope someone can soon look into it.
>>> Thank you,
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 262 bytes
Desc: OpenPGP digital signature
More information about the erlang-bugs