Partitioned network solving

Ulf Wiger etxuwig@REDACTED
Wed Mar 20 09:07:43 CET 2002


On Wed, 20 Mar 2002, Vladimir Sekissov wrote:

>Good day,
>
>I'm having trouble handling network partitions in my distributed
>application. It seems that global resolves name conflicts by
>unregistering the process that was registered later. So I register a
>global process in my application and use a modified version of Ulf
>Wiger's nodemon.erl (# Message-Id:
><200003091440.PAA02811@REDACTED>) together with the code below on all
>application nodes, but I'm not sure that this solution is fully
>correct. Can somebody point out possible problems?

>Another question: I don't use dist_auto_connect = once in the config file,
>but after I suspend one node with Ctrl-G from the shell, the nodes don't
>try to reconnect afterwards and I need to ping one of them explicitly.

Your nodes will reconnect in wait_name_resolve/1, if not before.
Auto connect in Erlang works like this: two nodes will try to
connect as soon as some process tries to send a message from one
node to the other.
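
As a rough illustration of the auto connect behaviour described above,
here is a minimal sketch. The module name, the function names and the
registered name autoconnect_probe_sink are hypothetical and not part of
nodemon.erl; it only uses nodes/0, the send operator and net_adm:ping/1.

```erlang
-module(autoconnect_probe).
-export([connected/1, poke/1, reconnect/1]).

%% True if Node is currently among the nodes we are connected to.
connected(Node) ->
    lists:member(Node, nodes()).

%% Send a throwaway message to a (hypothetical) registered name on Node.
%% Sending to a {Name, Node} tuple does not raise even if Node is
%% unreachable; with auto connect enabled, the send is what makes the
%% runtime attempt a connection. The result may still be false if the
%% connection attempt failed or has not completed yet.
poke(Node) ->
    {autoconnect_probe_sink, Node} ! ping,
    connected(Node).

%% Explicit connection attempt, the same thing as pinging by hand.
reconnect(Node) ->
    net_adm:ping(Node) =:= pong.
```

Calling autoconnect_probe:poke(Node) from either side is enough to
trigger a connection attempt; reconnect/1 is the explicit alternative.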

In my experience (and the reason why dist_auto_connect = once was
invented), it can be dangerous to reconnect once the nodes have
been partitioned. If nothing else, name clashes in global have to
be dealt with, and the risk of permanent inconsistency in the
database increases.
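
For reference, dist_auto_connect is a kernel application parameter. A
minimal config fragment, assuming it lives in a file loaded with
erl -config (the file name sys.config is only an assumption):

```erlang
%% sys.config (assumed file name)
%% 'once': connections are set up automatically, but a node that has
%% gone down is not reconnected automatically afterwards.
[{kernel, [{dist_auto_connect, once}]}].
```

The same value can be given on the command line as
erl -kernel dist_auto_connect once. After a partition, the nodes then
stay apart until one of them connects explicitly, for example with
net_kernel:connect_node/1.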

You can either restart one of the nodes, based on the best
information you've got, and then try to set master nodes in
mnesia afterwards (this requires some information to survive the
restart -- possibly of both nodes at the same time), or you could
open a gen_tcp dialogue between the nodes (basically a home-brewed
rpc) and negotiate what should be done through a "back door".
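
A minimal sketch of what such a gen_tcp "back door" could look like,
under stated assumptions: the module name backdoor, the functions
start/2 and ask/3, and the shape of the exchanged terms are all
hypothetical. It handles one request per connection and leaves the
actual negotiation policy to the Decide fun supplied by the caller.

```erlang
-module(backdoor).
-export([start/2, ask/3]).

%% Start a listener on Port (0 lets the OS pick a free port).
%% Decide is a fun mapping the peer's report to a reply term.
%% Returns {ok, ActualPort} or {error, Reason}.
start(Port, Decide) ->
    Parent = self(),
    Pid = spawn(fun() -> init(Parent, Port, Decide) end),
    receive
        {Pid, Result} -> Result
    after 5000 -> {error, timeout}
    end.

init(Parent, Port, Decide) ->
    Opts = [binary, {packet, 4}, {active, false}, {reuseaddr, true}],
    case gen_tcp:listen(Port, Opts) of
        {ok, LSock} ->
            {ok, ActualPort} = inet:port(LSock),
            Parent ! {self(), {ok, ActualPort}},
            accept_loop(LSock, Decide);
        {error, _} = Error ->
            Parent ! {self(), Error}
    end.

accept_loop(LSock, Decide) ->
    case gen_tcp:accept(LSock) of
        {ok, Sock} ->
            serve(Sock, Decide),
            accept_loop(LSock, Decide);
        {error, _} ->
            ok
    end.

%% Read one report, answer it, close the connection.
%% 'safe' makes binary_to_term refuse to create new atoms, so a report
%% containing atoms unknown to this node crashes the listener; a real
%% implementation would want to handle that case.
serve(Sock, Decide) ->
    case gen_tcp:recv(Sock, 0, 5000) of
        {ok, Bin} ->
            Reply = Decide(binary_to_term(Bin, [safe])),
            gen_tcp:send(Sock, term_to_binary(Reply));
        {error, _} ->
            ok
    end,
    gen_tcp:close(Sock).

%% Client side: send Report to the peer and wait for its reply.
ask(Host, Port, Report) ->
    Opts = [binary, {packet, 4}, {active, false}],
    case gen_tcp:connect(Host, Port, Opts, 5000) of
        {ok, Sock} ->
            ok = gen_tcp:send(Sock, term_to_binary(Report)),
            Result = case gen_tcp:recv(Sock, 0, 5000) of
                         {ok, Bin} -> {ok, binary_to_term(Bin, [safe])};
                         {error, _} = Error -> Error
                     end,
            gen_tcp:close(Sock),
            Result;
        {error, _} = Error ->
            Error
    end.
```

Because this channel does not go through Erlang distribution, the two
nodes can compare notes and agree on which one restarts without
reconnecting first.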

/Uffe

>
>%%
>%% nodemon_handler(FromNode, State) -> {Action, NewState}
>%%    called by nodemon when network partitioning is detected
>%%       FromNode - node detected as partitioned
>%%       State    - handler state
>%%       Action   - stop | restart | reboot | nothing
>%%       NewState - new handler state
>%%
>nodemon_handler(FromNode, State) ->
>  %% tac_master - application globally registered process
>  case global:whereis_name(tac_master) of
>    undefined ->
>      ?LOG_ERROR("Partitioned network (~p,~p), no tac_master",
>                [node(), FromNode]),
>      {nothing, State};
>    _ ->
>
>      MasterNode = wait_name_resolve(FromNode),
>      if MasterNode == node() ->
>          %% we are the master;
>          %% global has already unregistered the process on the backup node
>          {nothing, State};
>         %% we sync with the master only
>         MasterNode == FromNode ->
>          mnesia:set_master_nodes([MasterNode]),
>          {restart, State};
>         %% FromNode is not the master, skip it
>         true ->
>          {nothing, State}
>      end
>  end.
>
>wait_name_resolve(RemNode) ->
>  OurVsn = node(global:whereis_name(tac_master)),
>  RemVsn =
>    case rpc:call(RemNode, global, whereis_name, [tac_master]) of
>      {badrpc, _} ->
>        OurVsn;
>      RemPid ->
>        rpc:call(RemNode, erlang, node, [RemPid])
>    end,
>  if OurVsn == RemVsn -> OurVsn;
>     true -> wait_name_resolve(RemNode)
>  end.
>
>
>---
>Best Regards,
>
>Vladimir Sekissov
>



