[erlang-questions] heart restarting erlang node

Fri Jul 30 16:55:25 CEST 2010

Hi List,

Sorry if this is noise, just finishing off this thread in case it comes up
in anyone's Google search in the future.

This problem appears to have fixed itself when we convinced our customer to
go for a hardware reboot. We also changed the +A flag (async thread pool
size) from 256 to 128, but we can't say that's what made it go away.
Previously the node was restarting approximately every 1.5 days, but it has
now been up for 22 days.

Since we couldn't reproduce this on our test site, we had no opportunity to
chase it down and find a definitive cause and solution. For anyone reading
this who has a node inexplicably restarting on a customer's site, I
sympathise! Ask them to reboot the HW, if it doesn't make you look like
you're clutching at straws! ;-)

//Tom.

On Sun, Jun 27, 2010 at 5:14 PM, tom kelly <ttom.kelly@REDACTED> wrote:

> Hi Scott,
> Thanks for your very useful answers!
> We found some segmentation errors reported by the OS so we were starting to
> think that heart wasn't the problem after all.
> This is proving difficult to pin down as it's on a customer's site and
> happens at very irregular intervals.
> For anyone else experiencing similar problems we'll inform the list if we
> find a definitive solution.
> //Tom.
>
>
>
> On Sat, Jun 26, 2010 at 12:37 AM, Scott Lystig Fritchie <
> fritchie@REDACTED> wrote:
>
>> tom kelly <ttom.kelly@REDACTED> wrote:
>>
>> tk> We've found this post from Serge Aleynikov which we're
>> tk> investigating:
>> tk>
>> http://www.erlang.org/pipermail/erlang-questions/2006-December/024365.html
>>
>> tk> But I'm not yet sure it's the same issue. This can cause heart to
>> tk> restart our system but only after memory usage was sustained around
>> tk> 90% for 5-10 minutes which wasn't the case for all of our restarts.
>>
>> Tom, if your Erlang process is causing your OS to page VM to/from disk,
>> then all expectations of soft realtime performance will be thrown out
>> the window.  If the VM tries to do something simple like "char foo =
>> *(some_pointer)", and if some_pointer points to a page that isn't
>> resident in RAM, that thread will wait a *long* time before progress can
>> be made again.  Typically you've got 1 scheduler thread per CPU, but if
>> your working set isn't resident in RAM, you'll quickly block all
>> scheduler threads...
>>
>> ... and then when it comes time to answer a heartbeat, you won't do it
>> in time, and you'll be killed because you're too !@#$! slow.
>>
>> If you're using Linux, crank vm.swappiness (/proc/sys/vm/swappiness)
>> down to 0.  Many kernels (RedHat comes to mind) default to 60, which is
>> not what you want on a snappy server.
>>
>> If you can't blame your OS for moving your VM's pages out of RAM, you'll
>> have to blame yourself: use less data or buy more RAM.  :-)
>>
>> -Scott
>>
>
>
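
For anyone who lands here from a search and wants to check Scott's swapping
theory on their own Linux box, here is a rough sketch. It assumes the usual
procfs locations /proc/sys/vm/swappiness and /proc/vmstat; the function names
are made up for illustration and nothing here comes from heart or from OTP.

```python
from pathlib import Path

# Assumed standard Linux procfs locations (not taken from the thread).
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")
VMSTAT_PATH = Path("/proc/vmstat")


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the current vm.swappiness value, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def swap_counters(vmstat_text):
    """Pick the cumulative swap-in/swap-out page counts out of vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


if __name__ == "__main__":
    print("vm.swappiness:", read_swappiness())
    try:
        print("swap activity:", swap_counters(VMSTAT_PATH.read_text()))
    except OSError:
        print("no /proc/vmstat on this system")
```

If pswpin and pswpout keep climbing between two runs while the node is under
load, the box is swapping, and missed heartbeats become a plausible
explanation for the restarts.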