[erlang-questions] Takeover failure

Szoboszlay Dániel dszoboszlay@REDACTED
Sun Dec 1 22:41:42 CET 2013


Hi,

During a takeover your application is first started on the new node, then
stopped on the old one. In between, it runs simultaneously on both of
them. The idea is that you can take over the runtime state from the old
instance this way. The downside is that you need to be more careful with
global name registrations: the old instance still holds the global name
while the new one starts, so registering the same name again fails with
already_started. Start phases are one way to deal with this.
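
A minimal sketch of the start-phase approach, under some assumptions that are not in the original code: the supervisor and the server are registered locally ({local, ?MODULE}) instead of globally, the .app file contains {start_phases, [{claim_name, []}]}, and the phase name claim_name is made up for illustration. The global name is only claimed in the start phase, where the takeover case can be told apart from a normal start:

```erlang
-module(m8ball).
-behaviour(application).
-export([start/2, start_phase/3, stop/1]).

%% Start the supervision tree the same way for every start type.
%% Assumption: m8ball_sup and m8ball_server use {local, ?MODULE} names,
%% so starting here cannot clash with the instance on the old node.
start(_StartType, _StartArgs) ->
    m8ball_sup:start_link().

%% Called once per phase listed under start_phases in the .app file.
%% claim_name is a hypothetical phase name.
start_phase(claim_name, {takeover, _OldNode}, _PhaseArgs) ->
    %% The old instance is still running and still owns the global name.
    %% Any runtime state could be fetched from it here, before the name
    %% is moved over to the local server.
    yes = global:re_register_name(m8ball_server, whereis(m8ball_server)),
    ok;
start_phase(claim_name, _StartType, _PhaseArgs) ->
    %% Normal start or failover: nobody should own the name at this point.
    yes = global:register_name(m8ball_server, whereis(m8ball_server)),
    ok.

stop(_State) ->
    ok.
```

Callers would then reach the server as {global, m8ball_server} from any node. This is only meant to show the shape of the idea; it does nothing about the races mentioned in the PS below.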

See this post for a much better, detailed explanation: Re: Distributed  
application takeover

BR,
Daniel

PS: I would avoid using distributed applications in production. The  
dist_ac module in the kernel application that takes care of deciding where  
to run which distributed application is a terrible spaghetti of gen_server  
callbacks and ad-hoc message passing with tons of race conditions that can  
block your entire cluster from starting up any distributed apps. I ran
into about 3-4 different bugs of this kind before abandoning the idea of
using this feature.

On Sun, 01 Dec 2013 16:19:25 -0000, Tyron Zerafa <tyron.zerafa@REDACTED>  
wrote:

> Hi all,
>   I am trying to understand how to implement takeover in Erlang by  
> following the example presented here. Basically, I am creating the  
> application's supervisor as follows:
> start(normal, []) ->
> 	m8ball_sup:start_link();
> start({takeover, _OtherNode}, []) ->
> 	m8ball_sup:start_link().
>
>
> Supervisor init code:
> start_link() ->
> 	supervisor:start_link({global,?MODULE}, ?MODULE, []).
>
> Supervisor child Specification:
> {
>     {one_for_one, 1, 10},
>     [
>         {m8ball,
>          {m8ball_server, start_link, []},
>          permanent,
>          5000,
>          worker,
>          [m8ball_server]}
>     ]
> }
>
> Child (m8ball_server) Initialization
> start_link() ->
> 	gen_server:start_link({global, ?MODULE}, ?MODULE, [], []).
>
>
> Consider the following scenario: an Erlang cluster is composed of two
> nodes, A and B, with application m8ball running on A. Failover works
> perfectly: I can kill node A and see the application running on
> the next node, B. However, when I bring node A back up (it has a
> higher priority than B) and start the app, I get the following
> error. I'm assuming that this occurs because node B already contains a
> supervisor globally registered with that name.
>
> Log on Node A
> {error,{{already_started,<2832.61.0>},
>        {m8ball,start,[{takeover,'b@REDACTED'},[]]}}}
>
> =INFO REPORT==== 1-Dec-2013::16:17:32 ===
>    application: m8ball
>    exited: {{already_started,<2832.61.0>},
>             {m8ball,start,[{takeover,'b@REDACTED'},[]]}}
>
>
> Log on Node B
> =INFO REPORT==== 1-Dec-2013::16:24:55 ===
>    application: m8ball
>    exited: stopped
>    type: temporary
>
> When I tried registering the supervisor locally, I got a similar
> error, this time failing to start the worker process. However, if I also
> register the worker locally, I would not be able to call it from any node
> using the app name (since it would not be globally registered).
>
> Log on Node A (Supervisor Registered Locally)
> {error,
>    {{shutdown,
>         {failed_to_start_child,m8ball,
>             {already_started,<2832.67.0>}}},
>     {m8ball,start,[{takeover,'b@REDACTED'},[]]}}}
>
> Any pointers?
>
> --Best Regards,
> Tyron Zerafa