[erlang-questions] Takeover failure

Tony Rogvall tony@REDACTED
Mon Dec 2 09:32:04 CET 2013


This sounds fair enough :-)

Regards

/Tony

On 2 dec 2013, at 09:15, Szoboszlay Dániel <dszoboszlay@REDACTED> wrote:

> Hi,
> 
> OK, "tons" might be a bit dramatic. There are 3 or 4 receive statements in the dist_ac gen_server code without a timeout. These are the potential hang up points. The root cause of all the problems is that dist_ac assumes all applications are started on all nodes in the same sequence. Now imagine a typical setup: the boot script starts app A then B, but unfortunately A has a restart timeout of 5000 and B has 3000. If the node running these apps crashes, the rest of the cluster will attempt to start B then A. But if the crashed node is restarted by heart within 3 seconds, it will rejoin the cluster before the takeover and attempt to start A then B. Result: neither of the apps fails over to anywhere and the restarted node won't even finish it's init sequence.
> 
> Regarding a patch: I wrote a fix for the first couple of bugs I discovered. The problem is that I did it on my work time at a big company, where an entire security and legal department has been thinking hard ever since about whether it is OK to release code to the public...
> To be honest, I'm not pushing them hard right now either, because my fixes don't cover the scenario described above. That would need a complete rewrite of the dist_ac code to allow multiple apps to start concurrently. I have some ideas about how to do it, but I won't have time to write a fix until January, I'm afraid (this time I'd do it from home).
> And I don't think this feature is widely used, btw. The dist_ac module hasn't been modified for as long as the erlang/otp git repo has existed. Furthermore, I believe you are also safe using it as long as you have only one distributed application. So I guess I'm the first one to run into these problems, using 5-6 distributed apps and 5 nodes with equal priorities.
> 
> BR,
> Daniel
> 
> On Sun, 01 Dec 2013 22:45:31 -0000, Tony Rogvall <tony@REDACTED> wrote:
> 
> ...
>> 
>> PS: I would avoid using distributed applications in production. The dist_ac module in the kernel application that takes care of deciding where to run which distributed application is a terrible spaghetti of gen_server callbacks and ad-hoc message passing with tons of race conditions that can block your entire cluster from starting up any distributed apps. I ran into about 3-4 different bugs of this kind before abandoning the idea of using this feature.
>> 
> 
> Is this really true? Tons of race conditions, meaning over 1000? 3-4 different bugs?
> This raises some serious questions, like: did you try to correct this and send a patch? If not, why not?
> If distributed applications are not usable, does the OTP team know about this?
> If so, why is this feature still there, where it could fool people into trying to use it?
> 
> /Tony
> 
> 
>> On Sun, 01 Dec 2013 16:19:25 -0000, Tyron Zerafa <tyron.zerafa@REDACTED> wrote:
>> 
>> Hi all,
>>  
>>     I am trying to understand how to implement takeover in Erlang by following the example presented here. Basically, I am creating the application's supervisor as follows:
>>  
>> start(normal, []) ->
>> 	m8ball_sup:start_link();
>> start({takeover, _OtherNode}, []) ->
>> 	m8ball_sup:start_link().
>> 
>> 
>> Supervisor init code:
>> start_link() ->
>> 	supervisor:start_link({global,?MODULE}, ?MODULE, []).
>> 
>> Supervisor child Specification:
>> {{one_for_one, 1, 10},
>>  [{m8ball,
>>    {m8ball_server, start_link, []},
>>    permanent,
>>    5000,
>>    worker,
>>    [m8ball_server]}]}
>> 
>> Child (m8ball_server) Initialization
>> start_link() ->
>> 	gen_server:start_link({global, ?MODULE}, ?MODULE, [], []).
>> 
>> 
>> Consider the following scenario: an Erlang cluster composed of two nodes, A and B, with application m8ball running on A.
>> Failover works perfectly: I can kill node A and see the application running on the next node, B.
>> However, when I bring node A back up (it has a higher priority than B) and start the app, I get the following error. I'm assuming that this occurs because node B already contains a supervisor globally registered with that name.
>> Log on Node A 
>> {error,{{already_started,<2832.61.0>},
>>         {m8ball,start,[{takeover,'b@REDACTED'},[]]}}}
>> 
>> =INFO REPORT==== 1-Dec-2013::16:17:32 ===
>>     application: m8ball
>>     exited: {{already_started,<2832.61.0>},
>>              {m8ball,start,[{takeover,'b@REDACTED'},[]]}}
>> 
>> 
>> Log on Node B
>> =INFO REPORT==== 1-Dec-2013::16:24:55 ===
>>     application: m8ball
>>     exited: stopped
>>     type: temporary
>> 
>> When I tried registering the supervisor locally, I got a similar exception, failing to initialize the worker process. However, if I also registered the worker locally, I would not be able to call it from any node using the app name (since it would not be globally registered).
>> 
>> Log on Node A (Supervisor Registered Locally)
>> {error,
>>     {{shutdown,
>>          {failed_to_start_child,m8ball,
>>              {already_started,<2832.67.0>}}},
>>      {m8ball,start,[{takeover,'b@REDACTED'},[]]}}}
>> 
>>  
>> Any pointers?
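>> 
>> For what it's worth, one direction I was considering (an untested sketch; start_link_unregistered/0 would be a hypothetical variant of the supervisor start that skips the {global, ...} registration):
>> 
>> start({takeover, _OtherNode}, []) ->
>> 	%% During takeover the new instance is started before the old one
>> 	%% is stopped, so the old pid may still hold the global name.
>> 	%% Start unregistered, then take the name over explicitly.
>> 	{ok, Pid} = m8ball_sup:start_link_unregistered(),
>> 	yes = global:re_register_name(m8ball_sup, Pid),
>> 	{ok, Pid}.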
>> 
>> -- 
>> Best Regards,
>> Tyron Zerafa
>> 
>> 
>> 
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
> 
> "Installing applications can lead to corruption over time. Applications gradually write over each other's libraries, partial upgrades occur, user and system errors happen, and minute changes may be unnoticeable and difficult to fix"
> 
> 
> 
> 
> 
> 
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions

"Installing applications can lead to corruption over time. Applications gradually write over each other's libraries, partial upgrades occur, user and system errors happen, and minute changes may be unnoticeable and difficult to fix"


