[erlang-questions] Erlang VM hanging on node death

Wed Jul 12 05:34:04 CEST 2017

How long does it take for all the DOWN messages to be sent/processed?

These messages might be preventing the net tick messages (see net_ticktime
in http://erlang.org/doc/man/kernel_app.html) from being answered in time. If
that happens, a node that cannot respond before net_ticktime expires will be
assumed to be disconnected.
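
If the tick timeout does turn out to be the culprit, one thing to experiment
with is a larger net_ticktime. A minimal sketch of a sys.config fragment,
assuming the default of 60 seconds is in effect today; the value 120 is only
an illustration, not a recommendation, and the same value should be used on
every node in the distribution:

```erlang
%% sys.config (illustrative fragment, not a tuned value)
%% net_ticktime is a kernel application parameter, given in seconds.
[
 {kernel, [
   {net_ticktime, 120}   %% assumption: doubled from the default of 60
 ]}
].
```

The value currently in effect can be checked from a shell on a running node
with net_kernel:get_net_ticktime().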

What happens if, after processing all the DOWN messages, you call
net_kernel:connect_node/1 for each of the nodes that seem to be down?
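
A rough sketch of that experiment, assuming you have a list of the node names
you expect to be connected. The module name reconnect_sketch and the Expected
argument are made up for illustration; net_kernel:connect_node/1 itself
returns true, false, or ignored (the latter when the local node is not alive):

```erlang
-module(reconnect_sketch).
-export([reconnect/1]).

%% Expected is a hypothetical list of node names that should be connected,
%% e.g. ['sessions1@host1', 'guilds1@host2'].
%% Returns one {Node, Result} pair for every node that was missing, where
%% Result is whatever net_kernel:connect_node/1 returned for that node.
reconnect(Expected) ->
    %% Nodes we expect but that are neither ourselves nor currently visible.
    Missing = Expected -- [node() | nodes()],
    [{N, net_kernel:connect_node(N)} || N <- Missing].
```

Calling reconnect_sketch:reconnect(Expected) on one of the sessions nodes once
the DOWN backlog has drained should show whether the connections can be
re-established at all.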

On Mon, Jul 10, 2017 at 4:14 PM, Steve Cohen <scohen@REDACTED> wrote:

> Hi all,
>
> We have 12 nodes in our guilds cluster, and on each, 500,000 processes.
> We have another cluster that has 15 nodes with roughly four million
> processes on it, called sessions. Both clusters are in the same erlang
> distribution since our guilds monitor sessions and vice-versa.
>
> Now, when one of our guild servers dies, as expected it generates a large
> number of DOWN messages to the sessions cluster. These messages bog down
> the sessions servers (obviously) while they process them, but when they're
> done processing, distribution appears to be completely broken.
>
> By broken, I mean that the nodes are disconnected from one another,
> they're not exchanging messages, CPU usage was 0 and we couldn't even
> launch the remote console.
>
> I can't imagine this is expected behavior, and was wondering if someone
> can shed some light on it.
> We're open to the idea that we're doing something very, very wrong.
>
>
> Thanks in advance for the help
>
> --
> Steve Cohen