[erlang-questions] Takeover failure
Szoboszlay Dániel
dszoboszlay@REDACTED
Mon Dec 2 09:15:09 CET 2013
Hi,
OK, "tons" might be a bit dramatic. There are 3 or 4 receive statements in
the dist_ac gen_server code without a timeout. These are the potential
hang-up points. The root cause of all the problems is that dist_ac assumes
all applications are started on all nodes in the same sequence. Now
imagine a typical setup: the boot script starts app A then B, but
unfortunately A has a restart timeout of 5000 ms and B has 3000 ms. If the node
running these apps crashes, the rest of the cluster will attempt to start
B then A. But if the crashed node is restarted by heart within 3 seconds,
it will rejoin the cluster before the takeover and attempt to start A then
B. Result: neither app fails over anywhere, and the restarted node won't
even finish its init sequence.
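For reference, the kernel config for such a setup would look roughly like
this (just a sketch; 'a@host' and 'b@host' are made-up node names, the
timeouts are the 5000 ms and 3000 ms from above):

  %% sys.config on both nodes (sketch only). The second element of each
  %% tuple is the time in ms dist_ac waits for the original node to come
  %% back before starting the app elsewhere - the restart timeout above.
  [{kernel,
    [{distributed, [{a, 5000, ['a@host', 'b@host']},
                    {b, 3000, ['a@host', 'b@host']}]}
     %% sync_nodes_mandatory/optional and sync_nodes_timeout also have
     %% to be set for distributed application control to work
    ]}].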
Regarding a patch: I wrote a fix for the first couple of bugs I
discovered. The problem is that I did it in my work time at a big company,
where an entire security and legal department has been pondering ever since
whether it is OK to release the code to the public...
To be honest, I'm not pushing them hard right now either, because my fixes
don't cover the scenario described above. That would need a complete
rewrite of the dist_ac code to allow multiple apps to start concurrently.
I have some ideas how to do it, but I won't have time to write a fix until
January, I'm afraid (this time I'd do it from home).
And I don't think this feature is widely used anyway: the dist_ac module
hasn't been modified since the erlang/otp git repo was created.
Furthermore, I believe you are safe using it as long as you have only one
distributed application. So I guess I'm the first one to run into these
problems, using 5-6 distributed apps and 5 nodes with equal priorities.
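By "equal priorities" I mean the nodes are grouped in a tuple in the
distributed config, so dist_ac may move an app to any of them. Roughly like
this sketch (app and node names are made up):

  {distributed,
   [{some_app, 5000,
     [{'n1@host', 'n2@host', 'n3@host', 'n4@host', 'n5@host'}]}]}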
BR,
Daniel
On Sun, 01 Dec 2013 22:45:31 -0000, Tony Rogvall <tony@REDACTED> wrote:
> ...
>>
>> PS: I would avoid using distributed applications in production. The
>> dist_ac module in the kernel application that takes care of deciding
>> where to run which distributed application is a terrible spaghetti of
>> gen_server callbacks and ad-hoc message passing with tons of race
>> conditions that can block your entire cluster from starting up any
>> distributed apps. I ran into about 3-4 different bugs of this kind
>> before abandoning the idea of using this feature.
>>
>
> Is this really true? Tons of race conditions, meaning over 1000? 3-4
> different bugs?
> This raises some serious questions, like: Did you try to correct this
> and send a patch, or why not?
> If distributed applications are not usable, does the OTP team know about
> this? If so, why is this feature still there, where it could fool people
> into trying to use it?
>
> /Tony
>
>
>> On Sun, 01 Dec 2013 16:19:25 -0000, Tyron Zerafa
>> <tyron.zerafa@REDACTED> wrote:
>>
>>> Hi all,
>>> I am trying to understand how to implement takeover in Erlang by
>>> following the example presented here. Basically, I am creating the
>>> application's supervisor as follows:
>>> start(normal, []) ->
>>>     m8ball_sup:start_link();
>>> start({takeover, _OtherNode}, []) ->
>>>     m8ball_sup:start_link().
>>>
>>>
>>> Supervisor init code:
>>> start_link() ->
>>>     supervisor:start_link({global, ?MODULE}, ?MODULE, []).
>>>
>>> Supervisor child specification:
>>> {
>>>     {one_for_one, 1, 10},
>>>     [{m8ball,
>>>       {m8ball_server, start_link, []},
>>>       permanent,
>>>       5000,
>>>       worker,
>>>       [m8ball_server]}]
>>> }
>>>
>>> Child (m8ball_server) Initialization
>>> start_link() ->
>>>     gen_server:start_link({global, ?MODULE}, ?MODULE, [], []).
>>>
>>>
>>> Consider the following scenario: an Erlang cluster is composed of two
>>> nodes, A and B, with application m8ball running on A. Failover works
>>> perfectly; I manage to kill node A and see the application running
>>> on the next node, B. However, when I try to bring node A back up (which
>>> has a higher priority than B) and init the app, I get the
>>> following error. I'm assuming that this occurs because node B
>>> already contains a supervisor globally registered with that name.
>>>
>>> Log on Node A
>>> {error,{{already_started,<2832.61.0>},
>>>         {m8ball,start,[{takeover,'b@REDACTED'},[]]}}}
>>>
>>> =INFO REPORT==== 1-Dec-2013::16:17:32 ===
>>> application: m8ball
>>> exited: {{already_started,<2832.61.0>},
>>> {m8ball,start,[{takeover,'b@REDACTED'},[]]}}
>>>
>>>
>>> Log on Node B
>>> =INFO REPORT==== 1-Dec-2013::16:24:55 ===
>>> application: m8ball
>>> exited: stopped
>>> type: temporary
>>>
>>> When I tried registering the supervisor locally, I got a similar
>>> exception, failing to initialize the worker process. However, if I
>>> also register this locally, I would not be able to call it from
>>> any node using the app name (since it would not be globally
>>> registered).
>>>
>>> Log on Node A (Supervisor Registered Locally)
>>> {error,
>>>  {{shutdown,
>>>    {failed_to_start_child,m8ball,
>>>     {already_started,<2832.67.0>}}},
>>>   {m8ball,start,[{takeover,'b@REDACTED'},[]]}}}
>>>
>>> Any pointers?
>>>
>>> -- Best Regards,
>>> Tyron Zerafa
>>
>
>> "Installing applications can lead to corruption over time. Applications
>> gradually write over each other's libraries, partial upgrades occur,
>> user and system errors happen, and minute changes may be unnoticeable
>> and difficult to fix"