[erlang-questions] gen_leader usage/pointers

Ulf Wiger ulf.wiger@REDACTED
Thu Sep 23 10:45:25 CEST 2010

On 23/09/2010 00:52, Piotr Kaleta wrote:
> However, you should remember that current versions of gen_leader work
> in such a way that you have to statically define the list of nodes that
> can become master in case other nodes fail. Current implementations,
> however, are incapable of adding nodes at runtime that might claim the
> master role in future. This makes gen_leader unsuitable for unstable
> environments where all of the master nodes might go down for some
> reason.

There is a good reason for this limitation, and while I will not
venture to say that all attempts to fix it are necessarily broken,
the challenge is that the original gen_leader has been
subjected to unusually rigorous analysis:


The first version of gen_leader used a leader-election algorithm that
wasn't a perfect fit for Erlang's semantics, but even that version
successfully went through model checking. Once more advanced methods
were developed, including a formal semantics for Distributed Erlang,
that algorithm was found to be broken, and a new one was selected. The
new algorithm was tested with model checking, abstract trace analysis
and QuickCheck in order to verify the _core behaviour_ (note! this
is not the same thing as shaking out all bugs, and although quite a
few bugs have been fixed since then, they were not defects in the
core implementation).

Now the problem: In order to support the dynamic addition and
removal of nodes, you either have to show how this is compatible
with Stoller's algorithm (which is not trivial, but may well be
possible), or invent or select a different leader-election
algorithm. Ideally, one should then embark on a fairly ambitious
project (although it probably doesn't have to amount to an
entire PhD thesis) in order to show that the added feature didn't
in fact break the core function of the leader election behaviour.

An alternative (YMMV) would be to accept that not everything
has to be unbounded in terms of flexibility and failure modes.
I would go as far as saying that one of the most important parts
of designing for high availability is to figure out ways to
_simplify_ the design so that you have as few different failure
modes as possible.

Since the AXD 301 is still often used as a reference (the famous
"99.9999999% availability"), I could point out that it had two
master nodes running in an active-standby configuration. There
were also "expansion processors" for scalability up to 32 nodes,
but if both master nodes went down, the whole system was
considered down.

From many discussions I sense the assumption that you have to
have redundancy to the nth degree in order to achieve high
availability, but the fact is that the AXD 301 scored better
than 99.999% service availability based on field reports,
each month for years (as long as I was keeping track).
With that sort of design, you would configure gen_leader to
have both master nodes as leader candidates and the rest as
worker nodes (which can more easily be handled dynamically).
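To make that concrete: a minimal sketch of such a configuration, with
two fixed master candidates and the rest as workers. The exact
start_link signature varies between gen_leader forks; this assumes the
variant that takes the candidate and worker node lists as separate
arguments, and all node names and the callback module are illustrative:

```erlang
%% Sketch only -- node names and callback module are made up,
%% and the start_link arity differs between gen_leader forks.
-module(my_elected).
-behaviour(gen_leader).

start() ->
    %% Only these two nodes may ever become leader
    %% (the active/standby pair in an AXD 301-style design).
    Candidates = ['master1@host', 'master2@host'],
    %% Everything else participates as a worker; workers never
    %% claim leadership, so this set is easier to manage dynamically.
    Workers    = ['worker1@host', 'worker2@host'],
    gen_leader:start_link(?MODULE, Candidates, Workers,
                          ?MODULE, [], []).
```

The point of the split is exactly the simplification argued for above:
leader election only ever has to reason about two nodes.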

This is also one reason why the limitation exists in gen_leader
in the first place, as the original source of inspiration was
the rcmLocker - a distributed read-write locker component
which was part of the AXD 301 cluster controller[1]. It had a
fairly primitive leader election implementation, based on the
global name server. Thomas Arts observed that the locker was
difficult to model-check since it combined several difficult
patterns in the same implementation, and we started writing
a behaviour for leader election that would allow you to
separate that logic. Problems that were difficult to solve
and not strictly needed in the AXD 301 were naturally
pushed into some unspecified future.

Ulf W

[1] http://forum.trapexit.org/viewtopic.php?p=30186#30186
Ulf Wiger
CTO, Erlang Solutions Ltd, formerly Erlang Training & Consulting Ltd
