[erlang-patches] Non-overlapping Application Distribution Node Sets

Fri May 3 10:07:52 CEST 2013

Fetch here:

   git fetch git://github.com/vances/otp.git non_overlap_application_distribution

Browse here:

   https://github.com/vances/otp/commit/61f4da70e32bf745d96455b6d2f2ca42c4e4a3a7

Commit message:

Support non-overlapping application distribution nodes

Currently all known nodes should have the same value for the
kernel application's 'distributed' environment variable.  It
is not expected that any application will be distributed on
more than one set of nodes.

It should be possible to distribute an application between
multiple non-overlapping sets of nodes.  For example, with this
system configuration file on nodes a@REDACTED and b@REDACTED:

   [{kernel,
      [{distributed, [{app_no, [a@REDACTED, b@REDACTED]}]},
       {sync_nodes_optional, [a@REDACTED, b@REDACTED]},
       {sync_nodes_timeout, 5000}]}].

... and this system configuration file on nodes c@REDACTED and d@REDACTED:

   [{kernel,
      [{distributed, [{app_no, [c@REDACTED, d@REDACTED]}]},
       {sync_nodes_optional, [c@REDACTED, d@REDACTED]},
       {sync_nodes_timeout, 5000}]}].

Other applications may be distributed involving some other
combination of these nodes without interference.

This patch adds checks in dist_ac to ignore DAC protocol
messages of an application from nodes not included in that
application's distribution specification locally.
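The kind of check described above can be sketched as follows.  This is
only an illustration and is not the code from the patch: the module name
dist_filter_sketch and both function names are hypothetical, and the
value of 'distributed' is passed in as an argument rather than read from
dist_ac's state.

```erlang
-module(dist_filter_sketch).
-export([involved_nodes/2, is_involved/3]).

%% Hypothetical helper, not part of dist_ac.
%% Return the flat list of nodes that App is distributed on according
%% to a 'distributed' value.  Entries are {App, Nodes} or
%% {App, Timeout, Nodes}; Nodes may contain tuples of nodes with equal
%% priority, such as [{n1, n2}, n3].
involved_nodes(App, Distributed) ->
    case lists:keyfind(App, 1, Distributed) of
        {App, Nodes} -> flatten(Nodes);
        {App, _Timeout, Nodes} -> flatten(Nodes);
        false -> []
    end.

%% Expand equal-priority tuples into their member nodes.
flatten(Nodes) ->
    lists:flatmap(fun(N) when is_tuple(N) -> tuple_to_list(N);
                     (N) when is_atom(N) -> [N]
                  end, Nodes).

%% True if Node appears in the local specification for App.  A message
%% about App from a node for which this is false would be ignored.
is_involved(Node, App, Distributed) ->
    lists:member(Node, involved_nodes(App, Distributed)).
```

With the a/b configuration above, is_involved(c@REDACTED, app_no,
Distributed) is false, so in this sketch a message about app_no
arriving from c@REDACTED would be dropped on a@REDACTED and b@REDACTED.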

Rationale:

We often want to have active/standby pairs for applications
while also having multiple instances of the application running
on different nodes.  When nodes within the cluster are communicating
in order to, for example, distribute mnesia tables, suddenly there
is a potential conflict between these otherwise unrelated node pairs.

Currently there is a window of time during node (re)starts where a
conflict may occur.  This patch simply corrects this error case.

The documentation is somewhat unclear as to whether the configuration
above is legal or not.  The fact that, other than in the race condition
noted above, this distribution does work as expected allows one to use
the more liberal interpretation that when it says:

   "All involved nodes must have the same value for distributed and 
    sync_nodes_timeout, or the behaviour of the system is undefined."

"Involved nodes" refers to nodes involved in this application's distribution.
With this patch that interpretation holds true.
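One way to make that interpretation concrete is to compare, per
application, the node sets that two nodes configure.  The sketch below
is an assumption about how such a comparison could be written; it is
not taken from the patch or from the kernel application, and the module
name dist_sets_sketch and the function relation/3 are hypothetical.

```erlang
-module(dist_sets_sketch).
-export([relation/3]).

%% Hypothetical helper, not part of OTP.
%% Compare how two 'distributed' values describe App and return
%% same | disjoint | overlapping.
relation(App, DistA, DistB) ->
    A = ordsets:from_list(nodes_of(App, DistA)),
    B = ordsets:from_list(nodes_of(App, DistB)),
    case {A =:= B, ordsets:is_disjoint(A, B)} of
        {true, _} -> same;
        {false, true} -> disjoint;
        {false, false} -> overlapping
    end.

%% Flat list of nodes for App; entries are {App, Nodes} or
%% {App, Timeout, Nodes}, and Nodes may contain tuples of nodes.
nodes_of(App, Distributed) ->
    case lists:keyfind(App, 1, Distributed) of
        {App, Nodes} -> flatten(Nodes);
        {App, _Timeout, Nodes} -> flatten(Nodes);
        false -> []
    end.

flatten(Nodes) ->
    lists:flatmap(fun(N) when is_tuple(N) -> tuple_to_list(N);
                     (N) when is_atom(N) -> [N]
                  end, Nodes).
```

Under the liberal reading, 'same' and 'disjoint' are the expected
results for any pair of nodes, while 'overlapping' would indicate a
configuration that the documentation's warning still applies to.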

Tests:

The existing tests are extended to support the following application
distribution configuration:

   cp1@REDACTED:
      [{kernel,
         [{sync_nodes_optional, [cp2@REDACTED, cp3@REDACTED]},
          {distributed,
             [{app1, [cp1@REDACTED, cp2@REDACTED, cp3@REDACTED]},
              {app2, 2000, [cp1@REDACTED, cp2@REDACTED, cp3@REDACTED]},
              {app_sp, 1000, [{cp1@REDACTED, cp2@REDACTED}, cp3@REDACTED]},
              {app_no, 1000, [cp1@REDACTED, cp2@REDACTED]}]}]}].

   cp2@REDACTED:
      [{kernel,
         [{sync_nodes_optional, [cp1@REDACTED, cp3@REDACTED]},
          {distributed,
             [{app1, [cp1@REDACTED, cp2@REDACTED, cp3@REDACTED]},
              {app2, 2000, [cp1@REDACTED, cp2@REDACTED, cp3@REDACTED]},
              {app_sp, 1000, [{cp1@REDACTED, cp2@REDACTED}, cp3@REDACTED]},
              {app_no, 1000, [cp1@REDACTED, cp2@REDACTED]}]}]}].

   cp3@REDACTED:
      [{kernel,
         [{sync_nodes_optional, [cp1@REDACTED, cp2@REDACTED, cp4@REDACTED]},
          {distributed,
             [{app1, [cp1@REDACTED, cp2@REDACTED, cp3@REDACTED]},
              {app2, 2000, [cp1@REDACTED, cp2@REDACTED, cp3@REDACTED]},
              {app_sp, 1000, [{cp1@REDACTED, cp2@REDACTED}, cp3@REDACTED]},
              {app_no, 1000, [cp3@REDACTED, cp4@REDACTED]}]}]}].

   cp4@REDACTED:
      [{kernel,
         [{sync_nodes_optional, [cp2@REDACTED]},
          {distributed,
             [{app_no, 1000, [cp3@REDACTED, cp4@REDACTED]}]}]}].

The Cp4 node is added along with the app_no application, which is
distributed in two active/standby pairs on Cp1/Cp2 and Cp3/Cp4.
The tests check that these pairs are unaffected by the other applications'
starts, stops, failovers and takeovers, and that the two pairs do not
affect each other.

-- 
	-Vance