[erlang-bugs] Re: [erlang-bugs 10] Re: Distributed node crashes silently when initially receiving a big chunk of messages from another node

Tue Mar 29 15:26:11 CEST 2011

Hi!

This sounds really bad! A demo application that reproduces the bug would 
be really nice.

Have you tried to enable core dumps to see if the Erlang node crashes with 
a segfault? I suppose there are no erl_crash.dump files left after the 
crash that I can look at either?

Any way to reproduce it would make it easier to find!

Cheers,
/Patrik

On Mon, 28 Mar 2011, Philipp Unterbrunner wrote:

> The bug persists in R14B02.
>
> If I find time, I will make a small demo application so that others can
> reproduce the bug.
>
> Philipp
>
> On 02/23/2011 04:14 PM, Philipp Unterbrunner wrote:
>> Hello,
>>
>> I have run into a serious and very annoying bug.
>>
>> Affects (at least): R13B04, R14A, R14B, R14B01
>> Platform: Ubuntu Linux 10.10, kernel 2.6.35-25-server (SMP)
>>
>> When a newly started distributed node receives a high number of messages from another node, the newly started node crashes silently. Nothing is printed to the console. No crash dump or core dump is produced.
>>
>> In trying to find a work-around, I found the following curious behavior:
>>
>> * The bug *only* occurs for distributed nodes (but regardless of whether the nodes run on the same machine).
>> * Waiting a few seconds (or even longer) before sending the first message to the newly started node does *not* make a difference. The node will still crash when confronted with a large number of incoming messages later.
>> * Speed matters. With a debug build, the bug appears less often than with a release build, especially when HiPE is enabled. However, I managed to cause the bug even in debug mode, and when OTP was not compiled with native libs. In those configurations the bug is simply much less likely to be observed.
>> * The number of messages sent *initially* matters most. Slowly "ramping up" the load is a work-around. Once a node is working at high throughput, it is OK to stop sending messages for an arbitrary period and at a later point send a big chunk of messages that would have killed the node if sent initially.
>> * Timing matters. Running the receiver node with +T 7 or higher makes the problem disappear.
>> * Setting the sender node's distribution buffer size to the minimum (+zdbbl 1) makes the problem appear less often.
>>
>> I have reproduced the bug in various applications. The behavior described above also makes it fairly obvious that the application is not at fault.
>>
>> Rather, it appears that the receiver node is unable to buffer incoming messages and crashes. Of particular interest here is the fact that "ramping up" the load is a work-around. I suspect a low-level race condition where the receiver node does not allocate sufficient buffer space in time and crashes.
>>
>> Given that the existing work-arounds are not desirable ("ramp up" requires changes to the application code, +T 7 and +zdbbl 1 decrease performance), and given that the bug now persists over multiple releases, I hope someone can soon look into it.
>>
>> Thank you,
>>
>> Philipp
>
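
For anyone who needs the "ramp up" work-around from the quoted report in the meantime, here is a minimal sketch of what it could look like on the sender side. Everything in it is an assumption for illustration: the module name ramp_send, the function send_ramped/3, the starting batch size of 1, the doubling factor, and the pause length are all hypothetical, and the report does not say which ramp profile is actually sufficient to avoid the crash.

```erlang
-module(ramp_send).
-export([send_ramped/3]).

%% Hypothetical helper, not taken from the original report.
%% Sends every message in Msgs to Dest (a pid or a {Name, Node} tuple),
%% starting with a batch of one message and doubling the batch size
%% after each batch, sleeping PauseMs milliseconds between batches.
send_ramped(Dest, Msgs, PauseMs) when is_list(Msgs), PauseMs >= 0 ->
    send_batches(Dest, Msgs, 1, PauseMs).

%% No messages left: done.
send_batches(_Dest, [], _Size, _PauseMs) ->
    ok;
send_batches(Dest, Msgs, Size, PauseMs) ->
    %% Never ask lists:split/2 for more elements than the list holds.
    N = min(Size, length(Msgs)),
    {Batch, Rest} = lists:split(N, Msgs),
    lists:foreach(fun(M) -> Dest ! M end, Batch),
    %% Give the receiver time before the next, larger batch.
    timer:sleep(PauseMs),
    send_batches(Dest, Rest, Size * 2, PauseMs).
```

A call might look like ramp_send:send_ramped({receiver, 'b@somehost'}, lists:seq(1, 100000), 10), where the registered name and node name are placeholders. The call returns ok once all messages have been handed to the distribution layer. The starting size and pause probably need tuning per application, and this only hides the problem rather than fixing it.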