[erlang-questions] Takeover failure
Szoboszlay Dániel
dszoboszlay@REDACTED
Sun Dec 1 22:41:42 CET 2013
Hi,
During a takeover your application is first started on the new node, then
stopped on the old one. In between, it runs simultaneously on both of
them. The idea is that you can take over the runtime state from the old
instance this way. The downside is that you need to be more careful with
global name registrations, because the old instance still holds its names
while the new one starts. One way to handle this is to use start phases.
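
A rough sketch of the start-phase idea, not taken from the original example. It
assumes the supervisor and the server are registered locally instead of
globally, and that the .app file declares a phase with
{start_phases, [{register_names, []}]}. The phase name register_names and the
local name m8ball_server are assumptions for illustration only. The global name
is moved to the new instance only after its process tree is up:

```erlang
-module(m8ball).
-behaviour(application).
-export([start/2, start_phase/3, stop/1]).

%% All start types (normal, {takeover, Node}, {failover, Node}) start the
%% same tree. No global name is claimed here, so the new instance can come
%% up while the old one is still running on the other node.
start(_StartType, []) ->
    m8ball_sup:start_link().

%% Called after start/2 returns, once for each phase listed under
%% start_phases in the .app file (register_names is a hypothetical phase).
start_phase(register_names, _StartType, []) ->
    case whereis(m8ball_server) of
        Pid when is_pid(Pid) ->
            %% re_register_name/2 replaces an existing registration, e.g.
            %% the one still held by the old node during a takeover.
            yes = global:re_register_name(m8ball_server, Pid),
            ok;
        undefined ->
            {error, server_not_registered}
    end.

stop(_State) ->
    ok.
```

For this to work, m8ball_sup and m8ball_server would have to pass
{local, ?MODULE} to their start_link calls; clients could still reach the
server through {global, m8ball_server}.
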
See this post for a much better, detailed explanation: Re: Distributed
application takeover
BR,
Daniel
PS: I would avoid using distributed applications in production. The
dist_ac module in the kernel application that takes care of deciding where
to run which distributed application is a terrible spaghetti of gen_server
callbacks and ad-hoc message passing with tons of race conditions that can
block your entire cluster from starting up any distributed apps. I ran
into about 3-4 different bugs of this kind before abandoning the idea of
using this feature.
On Sun, 01 Dec 2013 16:19:25 -0000, Tyron Zerafa <tyron.zerafa@REDACTED>
wrote:
> Hi all,
> I am trying to understand how to implement takeover in Erlang by
> following the example presented here. Basically, I am creating the
> application's supervisor as follows:
> start(normal, []) ->
>     m8ball_sup:start_link();
> start({takeover, _OtherNode}, []) ->
>     m8ball_sup:start_link().
>
>
> Supervisor init code:
> start_link() ->
>     supervisor:start_link({global, ?MODULE}, ?MODULE, []).
>
> Supervisor child Specification:
> {
>   {one_for_one, 1, 10},
>   [
>     {m8ball,
>      {m8ball_server, start_link, []},
>      permanent,
>      5000,
>      worker,
>      [m8ball_server]
>     }]
> }
>
> Child (m8ball_server) Initialization
> start_link() ->
>     gen_server:start_link({global, ?MODULE}, ?MODULE, [], []).
>
>
> Consider the following scenario: an Erlang cluster is composed of two
> nodes, A and B, with application m8ball running on A. Failover works
> perfectly: I can kill node A and see the application running on
> the next node, B. However, when I bring node A back up (it has a
> higher priority than B) and start the app, I get the following
> error. I assume this occurs because node B already has a
> supervisor globally registered with that name.
>
> Log on Node A
> {error,{{already_started,<2832.61.0>},
> {m8ball,start,[{takeover,'b@REDACTED'},[]]}}}
>
> =INFO REPORT==== 1-Dec-2013::16:17:32 ===
> application: m8ball
> exited: {{already_started,<2832.61.0>},
> {m8ball,start,[{takeover,'b@REDACTED'},[]]}}
>
>
> Log on Node B
> =INFO REPORT==== 1-Dec-2013::16:24:55 ===
> application: m8ball
> exited: stopped
> type: temporary
>
> When I tried registering the supervisor locally, I got a similar
> error, this time when initializing the worker process. However, if I also
> register the worker locally, I would not be able to call it from any node
> using the app name (since it would not be globally registered).
>
> Log on Node A (Supervisor Registered Locally)
> {error,
> {{shutdown,
> {failed_to_start_child,m8ball,
> {already_started,<2832.67.0>}}},
> {m8ball,start,[{takeover,'b@REDACTED'},[]]}}}
>
> Any pointers?
>
> --
> Best Regards,
> Tyron Zerafa