[erlang-questions] Erlang VM hanging on node death

Thu Jul 13 01:10:22 CEST 2017

Juan,
Here's the sequence of events:
1. One of our machines was inadvertently shut off, killing all of the
processes on it
2. We immediately saw a drop in CPU across the board on the sessions
cluster. CPU on the sessions cluster eventually went to zero.
3. We were completely unable to use remote console on any of the machines
in the cluster, and they all needed to be restarted.

So, to answer your question, we don't know how long it took for the DOWN
messages to be processed, since we didn't have visibility at the time. We
suspected a problem with net_ticktime, but what's confusing to us is that
the host went down hard, so the DOWN events should have been generated
locally on the other nodes, not sent across distribution (correct me if
I'm wrong here). Also, my intuition is that processing DOWN messages
would cause CPU usage on the cluster to go up, but we saw the exact
opposite.
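
For reference, here is a minimal sketch of how a monitoring process can
tell a locally generated DOWN apart from an ordinary exit. When the
connection to the monitored process's node is lost, the monitoring node
delivers the DOWN message itself, with the reason noconnection. After a
hard host failure this presumably happens only once the connection is
declared lost, for example when net_ticktime expires. The module name
down_probe and the function await_down/2 are hypothetical and not taken
from the system discussed here.

```erlang
-module(down_probe).
-export([await_down/2]).

%% Monitor Pid and wait up to TimeoutMs milliseconds for its 'DOWN' message.
%% Returns {down, Reason} or timeout.
await_down(Pid, TimeoutMs) ->
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, noconnection} ->
            %% Generated on the monitoring node when the connection to
            %% Pid's node is lost; nothing was sent by the dead node.
            {down, noconnection};
        {'DOWN', Ref, process, Pid, Reason} ->
            %% Ordinary exit, or noproc if Pid was already dead.
            {down, Reason}
    after TimeoutMs ->
        %% Stop monitoring and drop any 'DOWN' that raced with the timeout.
        erlang:demonitor(Ref, [flush]),
        timeout
    end.
```

Calling down_probe:await_down(Pid, 5000) against a pid on a remote node
and then cutting that node off would be one way to observe how long the
noconnection DOWN takes to arrive.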

Since we couldn't connect to the machines via remote console, we couldn't
call connect_node. It was my understanding that the connect call would
happen when the node in question reestablished itself.
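
A minimal sketch of the manual reconnect Juan suggested, assuming a
remote shell is available and the expected node names are known. The
module name reconnect and the node names in the usage note are
hypothetical. net_kernel:connect_node/1 returns true, false, or ignored
(the last when the local node is not distributed).

```erlang
-module(reconnect).
-export([reconnect/1]).

%% Try to (re)connect to every node in Nodes.
%% Returns the list of nodes that could not be reached.
reconnect(Nodes) ->
    [N || N <- Nodes, net_kernel:connect_node(N) =/= true].
```

For example, reconnect:reconnect(['sessions1@host1', 'guilds1@host2'])
returns [] when both nodes are reachable again.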

On Tue, Jul 11, 2017 at 8:34 PM, Juan Jose Comellas <juanjo@REDACTED>
wrote:

> How long does it take for all the DOWN messages to be sent/processed?
>
> These messages might be preventing the net tick messages (see
> net_ticktime in http://erlang.org/doc/man/kernel_app.html) from being
> answered in time. If this happens, a node that isn't able to respond
> before the net_ticktime expires will be assumed to be disconnected.
>
> What happens if after processing all the DOWN messages you issue a call to
> net_kernel:connect_node/1 for each of the nodes that seems to be down?
>
> On Mon, Jul 10, 2017 at 4:14 PM, Steve Cohen <scohen@REDACTED>
> wrote:
>
>> Hi all,
>>
>> We have 12 nodes in our guilds cluster, and on each, 500,000
>> processes. We have another cluster that has 15 nodes with roughly four
>> million processes on it, called sessions. Both clusters are in the same
>> Erlang distribution since our guilds monitor sessions and vice versa.
>>
>> Now, when one of our guild servers dies, as expected it generates a large
>> number of DOWN messages to the sessions cluster. These messages bog down
>> the sessions servers (obviously) while they process them, but when they're
>> done processing, distribution appears to be completely broken.
>>
>> By broken, I mean that the nodes are disconnected from one another,
>> they're not exchanging messages, CPU usage was 0 and we couldn't even
>> launch the remote console.
>>
>> I can't imagine this is expected behavior, and was wondering if someone
>> can shed some light on it.
>> We're open to the idea that we're doing something very, very wrong.
>>
>>
>> Thanks in advance for the help
>>
>> --
>> Steve Cohen
>>
>

-- 
-Steve