[erlang-bugs 10] Re: [erlang-bugs] Distributed node crashes silently when initially receiving a big chunk of messages from another node

Philipp Unterbrunner philippu@REDACTED
Mon Mar 28 12:38:57 CEST 2011


The bug persists in R14B02.

If I find time, I will make a small demo application so that others can
reproduce the bug.

Philipp

On 02/23/2011 04:14 PM, Philipp Unterbrunner wrote:
> Hello,
>
> I have run into a serious and very annoying bug.
>
> Affects (at least): R13B04, R14A, R14B, R14B01
> Platform: Ubuntu Linux 10.10, kernel 2.6.35-25-server (SMP)
>
> When a newly started distributed node receives a high number of messages from another node, the newly started node crashes silently. Nothing is printed to the console. No crash dump or core dump is produced.
>
> In trying to find a work-around, I found the following curious behavior:
>
> * The bug *only* occurs for distributed nodes (but regardless of whether the nodes run on the same machine).
> * Waiting a few seconds (or even longer) before sending the first message to the newly started node does *not* make a difference. The node will still crash when confronted with a large number of incoming messages later.
> * Speed matters. With a debug build, the bug appears less often than with a release build, especially one with HiPE enabled. However, I managed to trigger the bug even in debug mode and when OTP was not compiled with native libs; it is simply much less likely to be observed.
> * The number of messages sent *initially* matters most. Slowly "ramping up" the load is a work-around. Once a node is working at high throughput, it is OK to stop sending messages for an arbitrary period and at a later point send a big chunk of messages that would have killed the node if sent initially.
> * Timing matters. Running the receiver node with +T 7 or higher makes the problem disappear.
> * Setting the sender node's distribution buffer size to the minimum (+zdbbl 1) makes the problem appear less often.
>
> I have reproduced the bug in various applications. The behavior described above also makes it fairly obvious that the application is not at fault.
>
> Rather, it appears that the receiver node is unable to buffer incoming messages and crashes. Of particular interest here is the fact that "ramping up" the load is a work-around. I suspect a low-level race condition where the receiver node does not allocate sufficient buffer space in time and crashes.
>
> Given that the existing work-arounds are not desirable ("ramp up" requires changes to the application code, +T 7 and +zdbbl 1 decrease performance), and given that the bug now persists over multiple releases, I hope someone can soon look into it.
>
> Thank you,
>
> Philipp
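
For anyone who wants to try the two flag-based work-arounds from the quoted report, a minimal sketch of the corresponding command lines follows. Only `+T 7` and `+zdbbl 1` come from the report; the short node names `receiver` and `sender` are placeholders assumed for illustration. The script only prints the commands and does not start any node.

```shell
#!/bin/sh
# Node names are hypothetical; only +T 7 and +zdbbl 1 come from the report.

# Receiver: +T 7 enables modified timing at level 7.
RECEIVER_CMD="erl -sname receiver +T 7"

# Sender: +zdbbl 1 sets the distribution buffer busy limit to 1 kilobyte.
SENDER_CMD="erl -sname sender +zdbbl 1"

# Print the commands; run each one in its own terminal to start the nodes.
echo "$RECEIVER_CMD"
echo "$SENDER_CMD"
```

As the report notes, both settings are expected to cost performance, so this is a diagnostic aid rather than a fix.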