Dynamic Node Additions

Wed Dec 11 11:26:45 CET 2002

On Wed, 11 Dec 2002, Per Bergqvist wrote:

>My personal view is: use non distributed applications and
>roll your own interlocking and failover mechanism.

Whoa! Heavy advice. I would suggest that you can get pretty
far with dist_ac, though. I'm not at all sure that people in
general will be more successful rolling their own than using
dist_ac as it is.

Having said that, we (AXD 301) don't use it. We have rolled
our own distributed application controller (based on a
prototype by Martin Björklund). It is not for the faint of
heart, but ours works well and is extremely well tested.

I will look into making it available. The agreement with OTP
was that it will eventually become part of OTP, but I will
not promise that whatever I may publish will be compatible
with whatever they may include in OTP in the future. (:

Daniesc, what you can do to get started is to use dist_ac,
and replicate state data in mnesia. This way, your
application can get started quickly on the other node.
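
For reference, a minimal sketch of the kernel configuration that dist_ac reads. The application name (myapp) and the node names (n1@host1, n2@host2) are made up for illustration, and the timeouts are arbitrary. Each node is started with a config file along these lines (erl -config <file>):

```erlang
%% sys.config sketch for node n1@host1 (hypothetical names throughout).
[{kernel,
  [%% myapp normally runs on n1@host1; if that node goes down,
   %% dist_ac waits 5000 ms and then starts it on n2@host2.
   {distributed, [{myapp, 5000, ['n1@host1', 'n2@host2']}]},
   %% Wait for the other node at boot, but start anyway if it
   %% does not show up (optional rather than mandatory, so that
   %% one node can be restarted while the other keeps running).
   {sync_nodes_optional, ['n2@host2']},
   {sync_nodes_timeout, 30000}
  ]}
].
```

On n2@host2 the file would look the same except that sync_nodes_optional lists 'n1@host1'. The application is then started with application:start(myapp) on both nodes; dist_ac decides where it actually runs.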

Things to consider when you upgrade one node at a time like
that:

- The mnesia schema cannot be upgraded one node at a time.
  You could work around this by using one "registry" table
  (using only key+value attributes) for starters. This
  doesn't "solve" the problem, but gives you a chance to
  address it manually during your upgrade.

- Make sure that you handle all interaction across node
  boundaries with extra care. If a protocol between
  processes on different Erlang nodes changes, you will
  have a harder time (it can be handled, but complicates
  things). Adding a version field in messages going between
  nodes could help you a little down the road.
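
The two points above could be sketched roughly as follows. This is only an illustration under stated assumptions: the module name, the table name (registry), the message shape {msg, Vsn, Payload}, and the two payload layouts are all invented here, not taken from any existing system.

```erlang
-module(upgrade_sketch).
-export([create_registry/1, encode/1, decode/1]).

%% Protocol version spoken by this release (hypothetical).
-define(PROTO_VSN, 2).

%% A generic key+value table replicated on all given nodes.
%% Since values are opaque terms, their layout can change
%% without touching the mnesia schema. Assumes mnesia is
%% already running on Nodes.
create_registry(Nodes) ->
    mnesia:create_table(registry,
                        [{attributes, [key, value]},
                         {ram_copies, Nodes}]).

%% Tag every inter-node message with the sender's version.
encode(Payload) ->
    {msg, ?PROTO_VSN, Payload}.

%% Accept the current version and the previous one, so that
%% upgraded and not-yet-upgraded nodes can talk to each other.
decode({msg, 2, {Key, Value, Ttl}}) ->
    {ok, {Key, Value, Ttl}};
decode({msg, 1, {Key, Value}}) ->
    %% Version 1 had no Ttl field; fill in a default.
    {ok, {Key, Value, infinity}};
decode({msg, Vsn, _}) when is_integer(Vsn), Vsn > 2 ->
    {error, {unsupported_version, Vsn}};
decode(_) ->
    {error, badmsg}.
```

The receiver converts old messages to the new layout at the boundary, so the rest of the code only ever sees one format.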

I'm sure you will eventually also learn to perform smooth
upgrades synchronized across multiple nodes. OTP supports
this, but a good tutorial is needed...

/Uffe

On Wed, 11 Dec 2002, Per Bergqvist wrote:

>Hi,
>
>... [snip] ...
>
>>
>> (We were looking at downing 1 node, loading the relevant boot
>> scripts etc and then bringing it up again, then downing node 2 doing
>> the same, the caveat however is that every application must be run on
>> at least two nodes, and both those nodes must not go down
>> simultaneously).
>>
>
>(If all nodes providing a service are down there is not much to do, is
>it ?).
>
>Is this a SASL distributed application?
>I experienced severe problems with the distributed application at a
>customer site earlier this spring.
>My analysis was that the distributed application controller and its
>underlying protocol are broken.
>It is really easy to get the distributed application controller into
>deadlock states when two nodes start at the same time (e.g. reboot
>after a power failure on two identical hosts).
>
>Another bizarro side effect is that dist_ac always stops the active
>running instance of the application in a distributed cluster of nodes
>and starts it on the last started node.
>
>My personal view is: use non distributed applications and roll your
>own interlocking and failover mechanism.
>
>/Per
>
>=========================================================
>Per Bergqvist
>Synapse Systems AB
>Phone: +46 709 686 685
>Email: per@REDACTED
>

-- 
Ulf Wiger, Senior Specialist,
   / / /   Architecture & Design of Carrier-Class Software
  / / /    Strategic Product & System Management
 / / /     Ericsson Telecom AB, ATM Multiservice Networks