[erlang-questions] Distributed application and netsplit

Tue Nov 18 20:06:24 CET 2014

Hello,

I am trying to implement an application that should be active on a
single node in a cluster. Distributed applications look like the proper
tool for such behaviour, so I implemented my app that way. It works as
expected (failover and takeover) when the appropriate nodes are turned
on and off.
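
For context, a minimal sketch of the kind of kernel configuration a
distributed application relies on. The application name (myapp), the
node names and the timeouts are placeholders, not the actual values
from my setup:

```erlang
%% sys.config on the first node (hypothetical names throughout).
%% myapp runs on 'a@host1' by preference and fails over to 'b@host2';
%% 5000 is the time in ms to wait for 'a@host1' to come back before
%% the application is restarted on another node.
[{kernel,
  [{distributed, [{myapp, 5000, ['a@host1', 'b@host2']}]},
   %% Nodes to wait for at boot; on 'b@host2' this would list 'a@host1'.
   {sync_nodes_optional, ['b@host2']},
   {sync_nodes_timeout, 30000}]}].
```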

I tried to check what happens on a netsplit and on recovery from it. I
simulated the netsplit with erlang:disconnect_node/1. The application
was started on both nodes (as expected), but both instances were left
running after the netsplit was healed (using net_adm:ping/1).
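
The simulation can be sketched roughly as below. The module and
function names (netsplit_sim, split/1, heal/1) are made up for
illustration; only erlang:disconnect_node/1 and net_adm:ping/1 are the
calls mentioned above:

```erlang
-module(netsplit_sim).
-export([split/1, heal/1]).

%% Drop the distribution connection to Node.
%% Returns true on success, false if the disconnection fails, and
%% ignored if the local node is not running as a distributed node.
split(Node) when is_atom(Node) ->
    erlang:disconnect_node(Node).

%% Re-establish the connection by pinging Node.
%% Returns pong if the node answered, pang otherwise.
heal(Node) when is_atom(Node) ->
    net_adm:ping(Node).
```

Calling netsplit_sim:split('b@host2') on 'a@host1' and later
netsplit_sim:heal('b@host2') would correspond to the two steps
described above (node names are again placeholders).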

So the application is not stopped automatically on the second node.
How can I stop it manually? I use the global process registry in that
application, and the resolve function (passed to
global:register_name/3) can be used to detect such a situation. I
tried the following:

  * invoke application:stop/1 on the secondary node. The application
was stopped on the secondary node, but the failover mechanism was
broken: if I shut down the primary node after such a recovery (from
the netsplit), the application is no longer started on the secondary
node.
  * invoke application:permit/2 with Permission=false and then with
Permission=true on the secondary node (looks like a workaround). The
application was started again as soon as the permission was set to
true, so this does not work either.
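
As for detecting the situation with the resolve function mentioned
above, this is only a sketch of the idea. The module name, the
{duplicate_instance, Name, Pid} message and the rule for choosing the
survivor are assumptions for illustration, and it deliberately leaves
open what the losing process should do next:

```erlang
-module(split_resolver).
-export([register/2, resolve/3]).

%% Register Pid under Name with a custom resolve function.
register(Name, Pid) ->
    global:register_name(Name, Pid, fun ?MODULE:resolve/3).

%% Called by global when the same name turns out to be registered on
%% both sides after the network is reconnected. It must return the pid
%% that keeps the name (or none).
resolve(Name, Pid1, Pid2) ->
    {Keep, Other} = pick(Pid1, Pid2),
    %% Hypothetical notification: tell the losing process that a
    %% duplicate instance exists, so it can react.
    Other ! {duplicate_instance, Name, Keep},
    Keep.

%% Arbitrary but deterministic rule: the pid on the node whose name
%% sorts first keeps the registration.
pick(Pid1, Pid2) when node(Pid1) =< node(Pid2) -> {Pid1, Pid2};
pick(Pid1, Pid2) -> {Pid2, Pid1}.
```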

How can I stop the application on the secondary node so that it would
still be started automatically in the case of failover?

Karolis