[erlang-bugs] Re: [erlang-bugs 10] Re: Distributed node crashes silently when initially receiving a big chunk of messages from another node
Tue Mar 29 15:26:11 CEST 2011
This sounds really bad! A demo application that reproduces the bug would
be really nice.
Have you tried to enable core dumps to see if the erlang node crashes with
a segfault? I suppose there are no erl_crash.dump files left after the
crash that I can look at either?
Any way to reproduce it would make it easier to find!
On Mon, 28 Mar 2011, Philipp Unterbrunner wrote:
> The bug persists in R14B02.
> If I find time, I will make a small demo application so that others can
> reproduce the bug.
> On 02/23/2011 04:14 PM, Philipp Unterbrunner wrote:
>> I have run into a serious and very annoying bug.
>> Affects (at least): R13B04, R14A, R14B, R14B01
>> Platform: Ubuntu Linux 10.10, kernel 2.6.35-25-server (SMP)
>> When a newly started distributed node receives a high number of messages from another node, the newly started node crashes silently. Nothing is printed to the console. No crash dump or core dump is produced.
>> In trying to find a work-around, I found the following curious behavior:
>> * The bug *only* occurs for distributed nodes (but regardless of whether the nodes run on the same machine).
>> * Waiting a few seconds (or even longer) before sending the first message to the newly started node does *not* make a difference. The node will still crash when confronted with a large number of incoming messages later.
>> * Speed matters. With a debug build, the bug appears less often than with a release build, especially when HiPE is enabled. However, I managed to trigger the bug even in debug mode, and when OTP was not compiled with native libs. The bug is simply much less likely to be observed.
>> * The number of messages sent *initially* matters most. Slowly "ramping up" the load is a work-around. Once a node is working at high throughput, it is OK to stop sending messages for an arbitrary period and at a later point send a big chunk of messages that would have killed the node if sent initially.
>> * Timing matters. Running the receiver node with +T 7 or higher makes the problem disappear.
>> * Setting the sender node's distribution buffer size to the minimum (+zdbbl 1) makes the problem appear less often.
>> I have reproduced the bug in various applications. The behavior described above also makes it fairly obvious that the application is not at fault.
>> Rather, it appears that the receiver node is unable to buffer incoming messages and crashes. Of particular interest here is the fact that "ramping up" the load is a work-around. I suspect a low-level race condition where the receiver node does not allocate sufficient buffer space in time and crashes.
>> Given that the existing work-arounds are not desirable ("ramp up" requires changes to the application code, +T 7 and +zdbbl 1 decrease performance), and given that the bug now persists over multiple releases, I hope someone can soon look into it.
>> Thank you,
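
For readers who want to try the "ramp up" work-around described in the report, here is a minimal sketch of what such a sender might look like. It is not taken from the original application: the module name ramp_sender, the function send_ramped/3, the {msg, I} message shape, and the doubling schedule are all assumptions made for illustration. Whether this particular schedule avoids the crash has not been verified.

```erlang
-module(ramp_sender).
-export([send_ramped/3]).

%% Hypothetical helper (not from the original report).
%% Sends Total messages to Dest in batches. The first batch contains one
%% message and each later batch is twice as large, with a pause of PauseMs
%% milliseconds after every batch, so the receiving node sees a gradually
%% increasing load instead of one big initial burst.
%% Dest may be a pid or a {RegisteredName, Node} tuple.
send_ramped(Dest, Total, PauseMs) when is_integer(Total), Total >= 0 ->
    send_ramped(Dest, Total, 1, PauseMs).

%% All messages have been sent.
send_ramped(_Dest, 0, _Batch, _PauseMs) ->
    ok;
send_ramped(Dest, Remaining, Batch, PauseMs) ->
    %% Never send more than what is left.
    N = erlang:min(Batch, Remaining),
    lists:foreach(fun(I) -> Dest ! {msg, I} end, lists:seq(1, N)),
    timer:sleep(PauseMs),
    send_ramped(Dest, Remaining - N, Batch * 2, PauseMs).
```

A call might look like ramp_sender:send_ramped({receiver, 'receiver@somehost'}, 100000, 100), where the registered name and node name are placeholders. The other two work-arounds from the report need no code changes: start the receiver node with +T 7 (for example, erl -sname receiver +T 7) or the sender node with +zdbbl 1.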