[erlang-bugs] Re: [erlang-bugs 10] Re: Distributed node crashes silently when initially receiving a big chunk of messages from another node

Wed Mar 30 14:59:14 CEST 2011

We have one known HiPE bug. I haven't merged the fix to dev yet, but you can
get it from

https://github.com/sverker/otp/commit/b715c077a88d5ba68e4e79b32c1c0de087234bbf

It's a "minor" heap corruption related to binary matching. Could be 
worth trying even though we haven't confirmed it as the cause of any faults.

/Sverker, Erlang/OTP

Philipp Unterbrunner wrote:
> I do not have a reasonably small demo yet, but I managed to get some
> coredumps of beam.smp. The nodes crash with a segfault at
> hipe_mode_switch.c, line 244 (of R14B02). This is code that is
> responsible for calling a native code closure.
>
> My application code does indeed send a few closures via messages, that
> are later called by the receiver node. I do not use hot code upgrades
> however, and the crashes are timing-related, as described before. I
> therefore suspect the crashes are the result of a race condition
> involving whatever code is responsible for making a received fun callable.
>
> Philipp
>
>
> On 03/29/2011 03:26 PM, pan@REDACTED wrote:
>   
>> Hi!
>>
>> This sounds really bad! A demo application that reproduces the bug
>> would be really nice.
>>
>> Have you tried to enable core dumps to see if the erlang node crashes
>> with a segfault? I suppose there are no erl_crash.dump files left
>> after the crash that I can look at either?
>>
>> Any way to reproduce it would make it much easier to find!
>>
>> Cheers,
>> /Patrik
>>
>> On Mon, 28 Mar 2011, Philipp Unterbrunner wrote:
>>
>>     
>>> The bug persists in R14B02.
>>>
>>> If I find time, I will make a small demo application so that others can
>>> reproduce the bug.
>>>
>>> Philipp
>>>
>>> On 02/23/2011 04:14 PM, Philipp Unterbrunner wrote:
>>>       
>>>> Hello,
>>>>
>>>> I have run into a serious and very annoying bug.
>>>>
>>>> Affects (at least): R13B04, R14A, R14B, R14B01
>>>> Platform: Ubuntu Linux 10.10, kernel 2.6.35-25-server (SMP)
>>>>
>>>> When a newly started distributed node receives a high number of
>>>> messages from another node, the newly started node crashes silently.
>>>> Nothing is printed to the console. No crash dump or core dump is
>>>> produced.
>>>>
>>>> In trying to find a work-around, I found the following curious
>>>> behavior:
>>>>
>>>> * The bug *only* occurs for distributed nodes (but regardless of
>>>> whether the nodes run on the same machine).
>>>> * Waiting a few seconds (or even longer) before sending the first
>>>> message to the newly started node does *not* make a difference. The
>>>> node will still crash when confronted with a large number of
>>>> incoming messages later.
>>>> * Speed matters. When doing a debug build, the bug appears less
>>>> often than when doing a release build, especially when HiPE is
>>>> enabled. However, I managed to cause the bug even in debug mode, and
>>>> when OTP was not compiled with native libs. The bug is simply much
>>>> less likely to be observed.
>>>> * The number of messages sent *initially* matters most. Slowly
>>>> "ramping up" the load is a work-around. Once a node is working at
>>>> high throughput, it is OK to stop sending messages for an arbitrary
>>>> period and at a later point send a big chunk of messages that would
>>>> have killed the node if sent initially.
>>>> * Timing matters. Running the receiver node with +T 7 or higher
>>>> makes the problem disappear.
>>>> * Setting the sender node's distribution buffer size to the minimum
>>>> (+zdbbl 1) makes the problem appear less often.
>>>>
>>>> I have reproduced the bug in various applications. The behavior
>>>> described above also makes it fairly obvious that the application is
>>>> not at fault.
>>>>
>>>> Rather, it appears that the receiver node is unable to buffer
>>>> incoming messages and crashes. Of particular interest here is the
>>>> fact that "ramping up" the load is a work-around. I suspect a
>>>> low-level race condition where the receiver node does not allocate
>>>> sufficient buffer space in time and crashes.
>>>>
>>>> Given that the existing work-arounds are not desirable ("ramp up"
>>>> requires changes to the application code, +T 7 and +zdbbl 1 decrease
>>>> performance), and given that the bug now persists over multiple
>>>> releases, I hope someone can soon look into it.
>>>>
>>>> Thank you,
>>>>
>>>> Philipp
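
For anyone trying to reproduce this, the two emulator-flag workarounds from the original report (+T 7 on the receiver, +zdbbl 1 on the sender) and the core dump setup asked about earlier in the thread could be set up roughly as below. This is only a sketch: the node names "receiver" and "sender" are placeholders that do not come from the thread, and where a core file ends up depends on the system's core_pattern setting.

```shell
# Sketch only: the node names "receiver" and "sender" are placeholders.

# Allow processes started from this shell to write core files, so that a
# segfault in beam.smp leaves a core dump to inspect.
ulimit -c unlimited

# Receiver node with modified timing level 7 (the +T workaround).
erl -sname receiver +T 7

# Sender node, started from a separate terminal, with the distribution
# buffer busy limit set to its minimum of 1 kilobyte (the +zdbbl workaround).
erl -sname sender +zdbbl 1
```

Both flags only change timing and buffering, so they hide the crash rather than fix it, and both cost performance as noted in the report.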