[erlang-bugs] Re: [erlang-bugs 10] Re: Distributed node crashes silently when initially receiving a big chunk of messages from another node

Wed Mar 30 12:26:54 CEST 2011

I do not have a reasonably small demo yet, but I managed to get some
coredumps of beam.smp. The nodes crash with a segfault at
hipe_mode_switch.c, line 244 (of R14B02). This is code that is
responsible for calling a native code closure.

My application code does indeed send a few closures via messages, which
are later called by the receiver node. I do not use hot code upgrades
however, and the crashes are timing-related, as described before. I
therefore suspect the crashes are the result of a race condition
involving whatever code is responsible for making a received fun callable.
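
For illustration only, the pattern described above (closures sent in
messages to another node and called there) can be sketched roughly as
follows. This is not the application's code and not a confirmed
reproducer; the module name closure_burst, the registered name
fun_receiver, and both function names are hypothetical.

```erlang
%% Hypothetical sketch: send a burst of closures to a process on another
%% node, which calls each one as it arrives. All names are made up.
-module(closure_burst).
-export([start_receiver/0, flood/2]).

%% Run on the receiver node: register a process that calls every
%% zero-arity fun it is sent.
start_receiver() ->
    register(fun_receiver, spawn(fun loop/0)).

loop() ->
    receive
        {call, F} when is_function(F, 0) ->
            F(),
            loop();
        stop ->
            ok
    end.

%% Run on the sender node: send N closures to the receiver in one burst,
%% with no ramp-up.
flood(Node, N) ->
    [{fun_receiver, Node} ! {call, fun() -> I * 2 end}
     || I <- lists:seq(1, N)],
    ok.
```

The module would have to be loaded on both nodes, since the receiver
needs the code behind each fun in order to call it. Usage would be
closure_burst:start_receiver() on the receiver and
closure_burst:flood('receiver@host', 100000) on the sender, where the
node name and message count are placeholders. Whether the module has to
be HiPE-compiled to hit the segfault is only an assumption based on the
crash location in hipe_mode_switch.c.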

Philipp

On 03/29/2011 03:26 PM, pan@REDACTED wrote:
> Hi!
>
> This sounds really bad! A demo application that reproduces the bug
> would be really nice.
>
> Have you tried to enable core dumps to see if the erlang node crashes
> with a segfault? I suppose there are no erl_crash.dump files left
> after the crash that I can look at either?
>
> Any way to reproduce it would make it easier to find!
>
> Cheers,
> /Patrik
>
> On Mon, 28 Mar 2011, Philipp Unterbrunner wrote:
>
>> The bug persists in R14B02.
>>
>> If I find time, I will make a small demo application so that others can
>> reproduce the bug.
>>
>> Philipp
>>
>> On 02/23/2011 04:14 PM, Philipp Unterbrunner wrote:
>>> Hello,
>>>
>>> I have run into a serious and very annoying bug.
>>>
>>> Affects (at least): R13B04, R14A, R14B, R14B01
>>> Platform: Ubuntu Linux 10.10, kernel 2.6.35-25-server (SMP)
>>>
>>> When a newly started distributed node receives a high number of
>>> messages from another node, the newly started node crashes silently.
>>> Nothing is printed to the console. No crash dump or core dump is
>>> produced.
>>>
>>> In trying to find a work-around, I found the following curious
>>> behavior:
>>>
>>> * The bug *only* occurs for distributed nodes (but regardless of
>>> whether the nodes run on the same machine).
>>> * Waiting a few seconds (or even longer) before sending the first
>>> message to the newly started node does *not* make a difference. The
>>> node will still crash when confronted with a large number of
>>> incoming messages later.
>>> * Speed matters. When doing a debug build, the bug appears less
>>> often than when doing a release build, especially when HiPE is
>>> enabled. However, I managed to cause the bug even in debug mode, and
>>> when OTP was not compiled with native libs. The bug is simply much
>>> less likely to be observed.
>>> * The number of messages sent *initially* matters most. Slowly
>>> "ramping up" the load is a work-around. Once a node is working at
>>> high throughput, it is OK to stop sending messages for an arbitrary
>>> period and at a later point send a big chunk of messages that would
>>> have killed the node if sent initially.
>>> * Timing matters. Running the receiver node with +T 7 or higher
>>> makes the problem disappear.
>>> * Setting the sender node's distribution buffer size to the minimum
>>> (+zdbbl 1) makes the problem appear less often.
>>>
>>> I have reproduced the bug in various applications. The behavior
>>> described above also makes it fairly obvious that the application is
>>> not at fault.
>>>
>>> Rather, it appears that the receiver node is unable to buffer
>>> incoming messages and crashes. Of particular interest here is the
>>> fact that "ramping up" the load is a work-around. I suspect a
>>> low-level race condition where the receiver node does not allocate
>>> sufficient buffer space in time and crashes.
>>>
>>> Given that the existing work-arounds are not desirable ("ramp up"
>>> requires changes to the application code, +T 7 and +zdbbl 1 decrease
>>> performance), and given that the bug now persists over multiple
>>> releases, I hope someone can soon look into it.
>>>
>>> Thank you,
>>>
>>> Philipp
>>
