[erlang-patches] Non-overlapping Application Distribution Node Sets

Thu Jun 13 11:16:38 CEST 2013

On Wed, Jun 12, 2013 at 05:07:38PM +0200, Siri Hansen wrote:
}  We do think that this type of functionality is interesting. We would,
}  however, like to look at it in a bigger picture and possibly investigate
}  the functionality for distributed applications a bit more before jumping to
}  any conclusion. Would it be possible for you to provide us with some more
}  information about the exact problem that you want to solve and maybe other
}  use cases?

The use case is simply that we run the same application on each node in
a distributed Erlang cluster and want to designate a standby node for
each as depicted below:

   +-----------+          +-----------+
   |serverA    |          |serverB    |
   | +-------+ |          | +-------+ |
   | | node1 | |          | | node2 | |
   | +-------+ |          | +-------+ |
   | +-------+ |          | +-------+ |
   | | node3 | |          | | node4 | |
   | +-------+ |          | +-------+ |
   +-----------+          +-----------+

   node1:
      [{kernel,
        [{distributed, [{app1, [node1@REDACTED, node2@REDACTED]}]},
         {sync_nodes_optional, [node2@REDACTED]},
         {sync_nodes_timeout, 5000}]}].
   node2:
      [{kernel,
        [{distributed, [{app1, [node1@REDACTED, node2@REDACTED]}]},
         {sync_nodes_optional, [node1@REDACTED]},
         {sync_nodes_timeout, 5000}]}].
   node3:
      [{kernel,
        [{distributed, [{app1, [node4@REDACTED, node3@REDACTED]}]},
         {sync_nodes_optional, [node4@REDACTED]},
         {sync_nodes_timeout, 5000}]}].
   node4:
      [{kernel,
        [{distributed, [{app1, [node4@REDACTED, node3@REDACTED]}]},
         {sync_nodes_optional, [node3@REDACTED]},
         {sync_nodes_timeout, 5000}]}].

}  While reviewing, I also found that your patch is a bit incomplete as it
}  does not introduce any new handling of #state.remote_started. This causes
}  (at least) a hanging in the following scenario:
}  1. Start all nodes in first group
}  2. On first node in first group: application:start(MyDistApp).
}  3. Start all nodes in second group
}  4. On first node in second group: application:start(MyDistApp).
}  5. On other node in first group: application:start(MyDistApp).
}  => hangs, since #state.remote_started contains two elements and only the
}  first (which by chance is not the correct one) is considered.

Yes, I have since discovered this issue.  I did intend to update the patch
with a permanent solution ...

}  - This is just for information, and I don't suggest that you spend a lot of
}  time solving this, since we don't yet know if we will accept the patch or
}  not.

If it can be done by simply anticipating that applications may run on
non-overlapping nodes and handling it, then I sincerely hope that you
would.  For us it was not obvious at all that there was any reason that
we couldn't configure as above.  The documentation only says:

   "All involved nodes must have the same value for distributed and
    sync_nodes_timeout, or the behaviour of the system is undefined."

... which we read to say that the {App, [Node, ...]} tuple should be
consistent within those nodes.
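
To make that reading concrete, here is a minimal sketch of the check we
had in mind.  It is only an illustration, not part of the patch and not
how dist_ac works internally.  The module name dist_check and the
function consistent/2 are hypothetical, and it assumes only the
two-element {App, [Node, ...]} form of a distributed entry (no timeout
element, no nested node tuples):

```erlang
-module(dist_check).
-export([consistent/2]).

%% Configs is a list of {Node, Distributed}, where Distributed is the
%% value of the kernel 'distributed' parameter configured on Node.
%% Returns true if every node named in a node's list for App carries
%% exactly the same {App, Nodes} entry, i.e. the entry is consistent
%% within each group of involved nodes.
consistent(App, Configs) ->
    lists:all(
      fun({_Node, Dist}) ->
              case lists:keyfind(App, 1, Dist) of
                  false ->
                      %% App is not distributed on this node.
                      true;
                  {App, Nodes} ->
                      lists:all(
                        fun(Peer) ->
                                peer_agrees(App, Nodes, Peer, Configs)
                        end, Nodes)
              end
      end, Configs).

%% A peer agrees if it is configured and holds the identical entry.
peer_agrees(App, Nodes, Peer, Configs) ->
    case lists:keyfind(Peer, 1, Configs) of
        {Peer, PeerDist} ->
            lists:keyfind(App, 1, PeerDist) =:= {App, Nodes};
        false ->
            false
    end.
```

Fed the four distributed values shown above, consistent(app1, Configs)
returns true, because node1/node2 agree with each other and node3/node4
agree with each other, even though the two groups do not overlap.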

Beyond this (hopefully simple) enhancement, I plan to implement a new
distributed application controller to accomplish N+M redundancy as well.
That seems to be something which we can do without much change to OTP.
Ulf has shared his previous design work with me; he seems to have
recognized the same requirement.  I wonder if anyone else on the list
has done any work on this sort of thing or has thoughts on requirements
or design?

-- 
	-Vance