[erlang-questions] Several gen_leader questions
Vasily Sulatskov
vasily@REDACTED
Fri May 18 11:34:30 CEST 2012
Hi,
I am using gen_leader from https://github.com/abecciu/gen_leader_revival and I
want to use it with a fixed list of candidate nodes - as far as I understand
that's the easiest way.
I tried several variations of starting gen_leader but none was satisfactory.
I start it as part of a supervision tree, so all examples are taken from the
init/1 functions of supervisor modules:
Here's what I tried so far:
init(_Args) ->
    %% The same value is used on all machines in the cluster
    Leader_nodes = ['foobar@REDACTED', 'foobar@REDACTED', 'foobar@REDACTED'],
    Home = os:getenv("HOME"),
    Gen_leader_config = {gen_leader_module, {gen_leader_module, start_link,
                                             [Leader_nodes,
                                              [{vardir, Home}]]},
                         permanent, 2000, worker, [gen_leader_module]},
    {ok, {{one_for_one, 10000, 1},
          [Gen_leader_config]}}.
If I do this, then nodes specified in Leader_nodes work just fine, they all
participate in elections, leaders are elected properly, they are able to do
gen_leader:leader_call() to the actual leader etc.
The problem is that on all other nodes (which are not specified in
Leader_nodes) gen_leader is not started at all. Gen_leader checks if the node
it's running on is one of "candidate nodes" or "worker nodes" and if that's not
the case - it simply doesn't start. All further attempts at
gen_leader:leader_call from that node fail.
I tried to run every node in the cluster except for "candidate nodes" as a
"worker node", so I changed supervisor to something like:
init(_Args) ->
    %% The same value is used on all machines in the cluster
    Leader_nodes = ['foobar@REDACTED', 'foobar@REDACTED', 'foobar@REDACTED'],
    Workers =
        case lists:member(node(), Leader_nodes) of
            true ->
                [];
            false ->
                [node()]
        end,
    Home = os:getenv("HOME"),
    {ok, {{one_for_one, 10000, 1},
          [{scheduler, {scheduler, start_link,
                        [Leader_nodes,
                         [{vardir, Home},
                          {workers, Workers}]]},
            permanent, 2000, worker, [scheduler_leader]}]}}.
As far as I understand, when gen_leader runs in a worker configuration, it
doesn't participate in elections, but still keeps track of where an actual
leader is running, so gen_leader:leader_call is still possible.
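The candidate/worker decision in the init/1 above can be pulled out into a small pure function, which makes it easy to check in the shell what a given node would pass as the workers option. This is only a sketch: the module name node_roles and the function workers_for/2 are hypothetical and not part of gen_leader.

```erlang
-module(node_roles).
-export([workers_for/2]).

%% Hypothetical helper, not part of gen_leader.
%% Returns the value to pass in the {workers, ...} option on Node:
%% [] when Node is one of the candidates, [Node] otherwise.
workers_for(Node, CandidateNodes) ->
    case lists:member(Node, CandidateNodes) of
        true  -> [];
        false -> [Node]
    end.
```

In the supervisor this would be called as node_roles:workers_for(node(), Leader_nodes).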
This setup mostly works, but the gen_leader process on "worker" nodes keeps
growing in memory, to at least several GB, and eventually crashes the whole
VM.
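One way to narrow this down is to look at whether the growth comes from a mailbox that is never drained or from the process heap itself. The standard erlang:process_info/2 reports both. A minimal sketch follows; the module name proc_probe is hypothetical, and it assumes the process of interest runs on the local node and is registered locally under a known name (or that its pid is at hand).

```erlang
-module(proc_probe).
-export([snapshot/1]).

%% Hypothetical diagnostic helper.
%% Accepts a locally registered name or a local pid and returns
%% [{memory, Bytes}, {message_queue_len, N}, {current_function, MFA}],
%% or 'undefined' if the name is not registered or the process is dead.
snapshot(Name) when is_atom(Name) ->
    case whereis(Name) of
        undefined -> undefined;
        Pid       -> snapshot(Pid)
    end;
snapshot(Pid) when is_pid(Pid) ->
    erlang:process_info(Pid, [memory, message_queue_len, current_function]).
```

Calling this a few times on a worker node, for example proc_probe:snapshot(scheduler) if the process is registered under that name, should show whether message_queue_len climbs together with memory.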
Am I running gen_leader correctly?
What is the correct way of running gen_leader with a fixed set of "candidate"
nodes, such that every other node knows where the leader is running and
gen_leader:leader_call() is possible from it?
Which version of gen_leader is recommended? This one,
https://github.com/abecciu/gen_leader_revival, or maybe the version from
gproc? By the way, can someone explain the difference between them?
And I have another, most likely unrelated, issue with gen_leader. On one
deployment, sometimes I find a cluster in a state with two leaders - most of
the nodes think that the leader is one node, but some other node thinks
that the leader is on the other node. I am not sure if the other leader is the
node that diverges from the consensus - I don't have a cluster in this state
right now to check.
It seems to happen after a gen_leader process crashes somehow (some internal
work, not related to gen_leader magic).
The other thing that I think might be important here is that the gen_leader
process in that setup can get stuck in handle_leader_call for quite a long
time. Can that cause problems with leader elections? Should gen_leader
processes avoid blocking in their handle_* callbacks so that they are always
able to handle election callbacks?
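To be concrete about what "not blocking" would mean, here is a minimal sketch of the usual deferred-reply pattern, written against plain gen_server rather than gen_leader. The module slow_server and its functions are hypothetical, the long work is simulated with a sleep, and whether gen_leader's handle_leader_call can be handled the same way is an assumption that would need checking against the gen_leader source.

```erlang
-module(slow_server).
-behaviour(gen_server).

-export([start_link/0, slow_call/2, ping/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Ask the server for Ms milliseconds of simulated work.
slow_call(Pid, Ms) ->
    gen_server:call(Pid, {slow, Ms}, infinity).

%% A cheap call that should be answered even while slow work is in flight.
ping(Pid) ->
    gen_server:call(Pid, ping).

init([]) ->
    {ok, no_state}.

handle_call({slow, Ms}, From, State) ->
    %% Hand the long-running work to a helper process and return at once,
    %% so the server loop stays free to process other messages.
    spawn_link(fun() ->
                       timer:sleep(Ms),
                       gen_server:reply(From, {done, Ms})
               end),
    {noreply, State};
handle_call(ping, _From, State) ->
    {reply, pong, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(_Info, State) ->
    {noreply, State}.
```

With this shape, a ping sent while a slow call is outstanding is answered immediately, because the server process itself never sleeps.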
Thanks in advance.
--
Vasily Sulatskov