[erlang-questions] Takeover failure

Mon Dec 2 09:15:09 CET 2013

Hi,

OK, "tons" might be a bit dramatic. There are 3 or 4 receive statements in  
the dist_ac gen_server code without a timeout. These are the potential  
hang up points. The root cause of all the problems is that dist_ac assumes  
all applications are started on all nodes in the same sequence. Now  
imagine a typical setup: the boot script starts app A then B, but  
unfortunately A has a restart timeout of 5000 and B has 3000. If the node  
running these apps crashes, the rest of the cluster will attempt to start  
B then A. But if the crashed node is restarted by heart within 3 seconds,  
it will rejoin the cluster before the takeover and attempt to start A then  
B. Result: neither app fails over anywhere, and the restarted
node won't even finish its init sequence.
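
To illustrate the kind of hang point I mean, here is a minimal sketch. It is not the actual dist_ac code; the module, function, and message names are made up for the example. A receive with an `after` clause gives up once the timeout expires instead of blocking forever:

```erlang
-module(recv_timeout_sketch).
-export([wait_for_reply/2]).

%% Hypothetical helper, not taken from dist_ac.
%% Waits for a reply tagged with Ref. Without the `after` clause this
%% receive would block forever if the reply never arrives (for example,
%% because the peer node is waiting for a different application).
wait_for_reply(Ref, TimeoutMs) ->
    receive
        {reply, Ref, Result} ->
            {ok, Result}
    after TimeoutMs ->
        %% Give up and let the caller decide how to recover.
        {error, timeout}
    end.
```

A timeout alone would not fix the ordering assumption described above, but it would at least turn a silent hang into an error the caller can handle.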

Regarding a patch: I wrote a fix for the first couple of bugs I  
discovered. The problem is that I did it on work time at a big company,
where an entire security and legal department has been thinking hard ever
since about whether it is OK to release code to the public...
To be honest, I'm not pushing them hard right now either, because my fixes
do not cover the scenario described above. That would need a complete
rewrite of the dist_ac code to allow multiple apps to start concurrently.  
I have some ideas how to do it, but I won't have time to write a fix until  
January, I'm afraid (this time I'd do it from home).
And I don't think this feature is widely used, by the way. The dist_ac
module hasn't been modified for as long as the erlang/otp git repo has
existed. Furthermore, I believe it is safe to use as long as you have only
one distributed application. So I guess I'm the first one to run into these
problems, using 5-6 distributed apps and 5 nodes with equal priorities.
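
For reference, a setup of that shape could be configured roughly as follows. This is only a sketch of the kernel `distributed` parameter as it might appear in the config file of one node; the node and application names are hypothetical, and the two timeouts simply mirror the 5000/3000 example above. Nodes grouped in one tuple have equal priority.

```erlang
%% Sketch of a sys.config for node 'n1@host' (all names are hypothetical).
[{kernel,
  [%% {Application, RestartTimeoutMs, Nodes}; nodes inside one tuple
   %% have equal priority, so their relative order is undefined.
   {distributed,
    [{app_a, 5000, [{'n1@host', 'n2@host', 'n3@host', 'n4@host', 'n5@host'}]},
     {app_b, 3000, [{'n1@host', 'n2@host', 'n3@host', 'n4@host', 'n5@host'}]}]},
   %% Each node lists the other nodes it should try to sync with at boot.
   {sync_nodes_optional, ['n2@host', 'n3@host', 'n4@host', 'n5@host']},
   {sync_nodes_timeout, 5000}]}].
```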

BR,
Daniel

On Sun, 01 Dec 2013 22:45:31 -0000, Tony Rogvall <tony@REDACTED> wrote:

> ...
>>
>> PS: I would avoid using distributed applications in production. The
>> dist_ac module in the kernel application that takes care of deciding
>> where to run which distributed application is a terrible spaghetti of
>> gen_server callbacks and ad-hoc message passing with tons of race
>> conditions that can block your entire cluster from starting up any
>> distributed apps. I ran into about 3-4 different bugs of this kind
>> before abandoning the idea of using this feature.
>>
>
> Is this really true? Tons of race conditions, meaning over 1000? 3-4
> different bugs?
> This raises some serious questions, like: did you try to correct this
> and send a patch, and if not, why not?
> If distributed applications are not usable, does the OTP team know about
> this? If so, why is this feature still there, where it could fool people
> into trying to use it?
>
> /Tony
>
>
>> On Sun, 01 Dec 2013 16:19:25 -0000, Tyron Zerafa  
>> <tyron.zerafa@REDACTED> wrote:
>>
>>> Hi all,
>>>   I am trying to understand how to implement takeover in Erlang by
>>> following the example presented here. Basically, I am creating the
>>> application's supervisor as follows:
>>> start(normal, []) ->
>>> 	m8ball_sup:start_link();
>>> start({takeover, _OtherNode}, []) ->
>>> 	m8ball_sup:start_link().
>>>
>>>
>>> Supervisor init code:
>>> start_link() ->
>>> 	supervisor:start_link({global,?MODULE}, ?MODULE, []).
>>>
>>> Supervisor child Specification:
>>> {
>>>     {one_for_one, 1, 10},
>>>     [
>>>         {m8ball,
>>>          {m8ball_server, start_link, []},
>>>          permanent,
>>>          5000,
>>>          worker,
>>>          [m8ball_server]}
>>>     ]
>>> }
>>>
>>> Child (m8ball_server) Initialization
>>> start_link() ->
>>> 	gen_server:start_link({global, ?MODULE}, ?MODULE, [], []).
>>>
>>>
>>> Consider the following scenario: an Erlang cluster is composed of two
>>> nodes, A and B, with application m8ball running on A. Failover works
>>> perfectly: I can kill node A and see the application running
>>> on the next node, B. However, when I try to bring node A (which
>>> has a higher priority than B) back up and init the app, I get the
>>> following error. I'm assuming that this occurs because node B
>>> already contains a supervisor globally registered with that name.
>>>
>>> Log on Node A
>>> {error,{{already_started,<2832.61.0>},
>>>         {m8ball,start,[{takeover,'b@REDACTED'},[]]}}}
>>>
>>> =INFO REPORT==== 1-Dec-2013::16:17:32 ===
>>>    application: m8ball
>>>    exited: {{already_started,<2832.61.0>},
>>>             {m8ball,start,[{takeover,'b@REDACTED'},[]]}}
>>>
>>>
>>> Log on Node B
>>> =INFO REPORT==== 1-Dec-2013::16:24:55 ===
>>>    application: m8ball
>>>    exited: stopped
>>>    type: temporary
>>>
>>> When I tried registering the supervisor locally, I got a similar
>>> exception, this time failing to initialize the worker process. However,
>>> if I also register this locally, I would not be able to call it from
>>> any node using the app name (since it would not be globally
>>> registered).
>>>
>>> Log on Node A (Supervisor Registered Locally)
>>> {error,
>>>    {{shutdown,
>>>         {failed_to_start_child,m8ball,
>>>             {already_started,<2832.67.0>}}},
>>>     {m8ball,start,[{takeover,'b@REDACTED'},[]]}}}
>>>
>>> Any pointers?
>>>
>>> --Best Regards,
>>> Tyron Zerafa
>>
>>
>>
>
>> "Installing applications can lead to corruption over time. Applications  
>> gradually write over each other's libraries, partial upgrades occur,  
>> user and system errors happen, and minute changes may be unnoticeable
>> and difficult to fix"