[erlang-questions] Takeover failure

Sun Dec 1 23:45:31 CET 2013

...
> 
> PS: I would avoid using distributed applications in production. The dist_ac module in the kernel application, which decides where each distributed application runs, is a terrible spaghetti of gen_server callbacks and ad-hoc message passing, with tons of race conditions that can block your entire cluster from starting up any distributed apps. I ran into about 3-4 different bugs of this kind before abandoning the idea of using this feature.
> 

Is this really true? Tons of race conditions, meaning over 1000? 3-4 different bugs?
This raises some serious questions, such as: did you try to fix this and send a patch, and if not, why not?
If distributed applications are not usable, does the OTP team know about this?
If so, why is this feature still there, where it could fool people into trying to use it?

/Tony

> On Sun, 01 Dec 2013 16:19:25 -0000, Tyron Zerafa <tyron.zerafa@REDACTED> wrote:
> 
> Hi all,
>  
>     I am trying to understand how to implement takeover in Erlang by following the example presented here. Basically, I am creating the application's supervisor as follows:
>  
> start(normal, []) ->
> 	m8ball_sup:start_link();
> start({takeover, _OtherNode}, []) ->
> 	m8ball_sup:start_link().
> 
> 
> Supervisor init code:
> start_link() ->
> 	supervisor:start_link({global,?MODULE}, ?MODULE, []).
> 
> Supervisor child Specification:
> {
> 	{one_for_one, 1, 10},
> 	[
> 		{m8ball,
> 		 {m8ball_server, start_link, []},
> 		 permanent,
> 		 5000,
> 		 worker,
> 		 [m8ball_server]}
> 	]
> }
> 
> Child (m8ball_server) initialization:
> start_link() ->
> 	gen_server:start_link({global, ?MODULE}, ?MODULE, [], []).
> 
> 
> Consider the following scenario: an Erlang cluster is composed of two nodes, A and B, with application m8ball running on A.
> Failover works perfectly: I can kill node A and see the application running on the next node, B.
> However, when I bring node A back up (which has a higher priority than B) and start the app, I get the following error. I assume this occurs because node B already has a supervisor globally registered under that name.
> Log on Node A:
> {error,{{already_started,<2832.61.0>},
>         {m8ball,start,[{takeover,'b@REDACTED'},[]]}}}
> 
> =INFO REPORT==== 1-Dec-2013::16:17:32 ===
>     application: m8ball
>     exited: {{already_started,<2832.61.0>},
>              {m8ball,start,[{takeover,'b@REDACTED'},[]]}}
> 
> 
> Log on Node B
> =INFO REPORT==== 1-Dec-2013::16:24:55 ===
>     application: m8ball
>     exited: stopped
>     type: temporary
> 
> When I tried registering the supervisor locally, I got a similar error, this time a failure to start the worker process. However, if I also registered the worker locally, I would not be able to call it from any node by name (since it would not be globally registered).
> 
> Log on Node A (Supervisor Registered Locally)
> {error,
>     {{shutdown,
>          {failed_to_start_child,m8ball,
>              {already_started,<2832.67.0>}}},
>      {m8ball,start,[{takeover,'b@REDACTED'},[]]}}}
> 
>  
> Any pointers?
> 
> -- 
> Best Regards,
> Tyron Zerafa
> 

"Installing applications can lead to corruption over time. Applications gradually write over each other's libraries, partial upgrades occur, user and system errors happen, and minute changes may be unnoticeable and difficult to fix"
