Distributed node crashes silently when initially receiving a big chunk of messages from another node

Wed Feb 23 16:14:48 CET 2011

Hello,

I have run into a serious and very annoying bug.

Affects (at least): R13B04, R14A, R14B, R14B01
Platform: Ubuntu Linux 10.10, kernel 2.6.35-25-server (SMP)

When a newly started distributed node receives a high number of messages from another node, the newly started node crashes silently. Nothing is printed to the console. No crash dump or core dump is produced.

In trying to find a work-around, I found the following curious behavior:

* The bug *only* occurs for distributed nodes (but regardless of whether the nodes run on the same machine).
* Waiting a few seconds (or even longer) before sending the first message to the newly started node does *not* make a difference. The node will still crash when confronted with a large number of incoming messages later.
* Speed matters. With a debug build, the bug appears less often than with a release build, especially one with HiPE enabled. However, I managed to trigger the bug even in debug mode, and when OTP was not compiled with native libs. It is simply much less likely to be observed in those configurations.
* The number of messages sent *initially* matters most. Slowly "ramping up" the load is a work-around. Once a node is working at high throughput, it is OK to stop sending messages for an arbitrary period and at a later point send a big chunk of messages that would have killed the node if sent initially.
* Timing matters. Running the receiver node with +T 7 or higher makes the problem disappear.
* Setting the sender node's distribution buffer size to the minimum (+zdbbl 1) makes the problem appear less often.
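For illustration, the "ramping up" work-around could look roughly like the sketch below. This is not code from the affected applications: the module name ramp_up, the doubling schedule, and the pause between batches are assumptions, and I have not determined which batch sizes or pauses are actually safe.

```erlang
-module(ramp_up).
-export([send_ramped/3, batch_sizes/2]).

%% Hypothetical helper: batch sizes that start at Size and double until
%% Total messages are covered. The doubling schedule is an assumption.
%% Example: batch_sizes(10, 1) -> [1, 2, 4, 3].
batch_sizes(Total, _Size) when Total =< 0 -> [];
batch_sizes(Total, Size) when Size >= Total -> [Total];
batch_sizes(Total, Size) -> [Size | batch_sizes(Total - Size, Size * 2)].

%% Send Msgs to Dest (a pid or {Name, Node}) in growing batches,
%% sleeping PauseMs milliseconds after each batch.
send_ramped(Dest, Msgs, PauseMs) ->
    send_batches(Dest, Msgs, batch_sizes(length(Msgs), 1), PauseMs).

send_batches(_Dest, [], [], _PauseMs) ->
    ok;
send_batches(Dest, Msgs, [N | Rest], PauseMs) ->
    %% Take the next N messages and send them in order.
    {Batch, Remaining} = lists:split(N, Msgs),
    lists:foreach(fun(M) -> Dest ! M end, Batch),
    timer:sleep(PauseMs),
    send_batches(Dest, Remaining, Rest, PauseMs).
```

A sender would then call something like ramp_up:send_ramped({receiver, Node}, Msgs, 100) instead of sending all of Msgs at once, where receiver is a placeholder registered name.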

I have reproduced the bug in various applications. The behavior described above also makes it fairly obvious that the application is not at fault.

Rather, it appears that the receiver node is unable to buffer incoming messages and crashes. Of particular interest here is the fact that "ramping up" the load is a work-around. I suspect a low-level race condition where the receiver node does not allocate sufficient buffer space in time and crashes.

Given that the existing work-arounds are not desirable ("ramp up" requires changes to the application code, while +T 7 and +zdbbl 1 decrease performance), and given that the bug has now persisted across multiple releases, I hope someone can look into it soon.

Thank you,

Philipp
