[erlang-questions] Delivery of messages

P.H.Welch <>
Wed Feb 29 18:37:46 CET 2012


Hi,

This seems a very busy mailing list!  Please accept my apologies if this
is not the right place for questions about Erlang fundamentals - but, if
so, please tell me where the right place is?

Anyway, many thanks to all who replied to my question and/or sent me links
to relevant publications.  I had asked:

> If process P sends a message to process Q and:
> 
>   * neither process crashes;
>   * Q commits to receive P's message;
>   * the cores/processors/computers on which P and Q are running
>     do not crash or get turned off;
>   * no part of the communication fabric delivering the message
>     crashes or gets turned off;
> 
> is Q guaranteed to receive the message?                       (1)

The answer seems to be no (or, maybe, yes) ... but ... :(

> I want to be sure that, under these conditions (in which nothing is
> breaking), no message is ever lost in the system between P and Q.

Svensson and Fredlund report in:

  "Programming Distributed Erlang Applications: Pitfalls and Recipes"
  Erlang'07
  http://scholar.google.com/scholar_url?hl=en&q=http://citeseerx.ist.psu.edu/viewdoc/download%3Fdoi%3D10.1.1.116.9929%26rep%3Drep1%26type%3Dpdf&sa=X&scisig=AAGBfm1DRFuGBWTA_7CNYwqep87ZwEzrWg&oi=scholarr

that messages can indeed be lost - i.e. P can send m1, m2 and m3 (in that
order) and Q can receive m1 and m3.  This was provoked by a temporary
failure in the comms fabric between the machines on which P and Q were
running (e.g. pull out the ethernet cable, wait long enough for the
inter-node heartbeat ("net tick") to be missed, plug the cable back in).

They give an algorithm to overcome such temporary comms failure.  P must
monitor Q, messages must contain sequence numbers, Q must remember
the sequence number for the last message received and P must cache
(or be able to regenerate) everything it has sent.  Between sends,
P polls for a monitor 'DOWN' message and, if received, goes into a
timeout-mitigated busy-loop to sync with Q (re-monitor Q, send 'sync'
to Q, wait-a-bit then loop if another 'DOWN', otherwise Q replies with
its last received sequence number and P can resume from there).  Phew!
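As a minimal sketch of the sending side of that recipe (all message
shapes here - {msg, Seq, M}, {sync, Pid}, {last_seq, N} - are names I
have invented, not anything from the paper or from Erlang itself):

```erlang
-module(seq_sender).
-export([start/1]).

%% Start the sending side: monitor Q, number messages from 1,
%% cache everything sent (newest first).
start(Dest) ->
    Mon = erlang:monitor(process, Dest),
    loop(Dest, Mon, 1, []).

loop(Dest, Mon, Seq, Cache) ->
    receive
        {'DOWN', Mon, process, Dest, _Reason} ->
            sync(Dest, Cache);
        {send, Msg} ->
            Dest ! {msg, Seq, Msg},
            loop(Dest, Mon, Seq + 1, [{Seq, Msg} | Cache])
    end.

%% After a 'DOWN': re-monitor, ask Q for the last sequence number it
%% received, and resend everything later than that.  If the connection
%% is still down we get another 'DOWN', wait a bit, and try again.
sync(Dest, Cache) ->
    Mon = erlang:monitor(process, Dest),
    Dest ! {sync, self()},
    receive
        {'DOWN', Mon, process, Dest, _Reason} ->
            timer:sleep(100),
            sync(Dest, Cache);
        {last_seq, Last} ->
            [Dest ! {msg, S, M} || {S, M} <- lists:reverse(Cache), S > Last],
            Next = case Cache of
                       []             -> Last + 1;
                       [{Top, _} | _] -> Top + 1
                   end,
            loop(Dest, Mon, Next, Cache)
    end.
```

The receiving side (not shown) has to remember the highest sequence
number it has handled and answer a {sync, Pid} message with
{last_seq, N} - and, of course, ignore any duplicates the resend
produces.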

The paper goes on to consider possible Erlang extensions to help here
(e.g. explicitly report reconnection following a temporary comms failure -
which would, at least, avoid the busy-loop).

That paper was 2007.  Is this still the current state - i.e. can messages
still be lost this (or any other) way?

I was also pointed to Svensson, Fredlund, and Earle:

  "A Unified Semantics for Future Erlang"
  Erlang'10
  http://dl.acm.org/citation.cfm?id=1863509.1863514&coll=DL&dl=GUIDE&CFID=86243344&CFTOKEN=74478234

This sets out an operational semantics for an Erlang with some nice
rationalisations - making links and monitors operate in one direction
(only the process making the link or monitor being notified of failure in
the linked/monitored process), removing the 'trap exit' option from links
(which is what monitoring is for).

Section 2.4 (Message Passing Guarantees) says that a stream of messages
from A to B will be received in the order of sending, except that some
at the end of the sequence may be dropped in case of a node disconnect.
This explicitly rules out the loss of messages in the middle of a
sequence, if I have interpreted the text correctly?
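If that reading is right, mid-stream loss is at least detectable at the
receiving end with nothing more than a counter (the {msg, Seq, Payload}
tagging is again my own invention):

```erlang
%% Receiver that insists on the in-order, no-gaps reading of 2.4:
%% messages arrive tagged {msg, Seq, Payload} with Seq = 1, 2, 3, ...
recv(Expected) ->
    receive
        {msg, Expected, Payload} ->
            io:format("got ~p~n", [Payload]),
            recv(Expected + 1);
        {msg, Seq, _Payload} when Seq > Expected ->
            %% e.g. m1 and m3 arrived but m2 did not
            exit({gap, Expected, Seq})
    end.
```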

However, it also says in Section 5 (Fairness), that the formal semantic
rules (specifically Table 4) also mean that an implementation would still
conform to the semantics even if "some messages are never delivered",
which seems to contradict the Section 2.4 statement?  It then proposes a
"Fairness" rule for execution sequences to overcome this, which would
mean that for every message sent:

  * it is either delivered or ...
  * the receiving process dies or ...
  * the node running the receiving process dies.

This is, at least, a statement of intent ... but it seems to be outside
the proposed formal semantics.  Will (future) Erlang systems be able to
rely on this?

The paper does not mention the problem raised in their earlier 2007 paper
(where messages are lost by a temporary break in contact with the node
holding the receiving process).  But that may be covered by some of the
semantic rules in the Tables that I've not yet got my head around?

Some of the replies pointed out that Erlang makes no promises about how
long it will take to deliver a message ("heat death of the universe"
etc.).  So, if it did promise to deliver messages, how could you hold it
to account?  Just because your process gets nothing and the Andromeda and
Milky Way galaxies have collided, how do you know it's not going to keep
its promise sometime?  The argument implies that an individual message
delivery guarantee is not worth having.  If you were in charge of (say)
Amazon, how would that argument work for your customers?

Erlang (I guess) also makes no promises about how long it will take to
do anything.  For instance, a simple pattern match:

  N = 42

may also take an indeterminate time to complete - but, for reasoning
about the system, I'm going to assume that it *will* happen!  For similar
reasoning about a system in which a process sends 42 to another process
and that other process commits to receive it and bind it to N, why can
I not make the same assumption?  Modern computer systems are all about
computation and communication.  Both are essential - why is the latter
second class?

For reasoning about system behaviour, it seems like we are computing in
treacle.  Processes launch stuff into the treacle and listen, hopefully,
for stuff to arrive.  If a process needs to know its message has been
received, it must wait for an acknowledgement.  But the message may be
received and the process sending it may never get the ack ... and the
receiver won't know if the sender got its ack ... unless it waits for
an ack of the ack ... (no happy ending here).
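This is the classic two-generals situation: the most a sender can get
out of an acknowledgement scheme is a three-way distinction, never
certainty.  A sketch (names invented):

```erlang
%% Send Msg and wait for an ack.  A timeout means "unknown", not
%% "not delivered": the message, or the ack, may still be in flight.
send_and_wait(Dest, Msg, Timeout) ->
    Ref = make_ref(),
    Dest ! {Msg, self(), Ref},
    receive
        {ack, Ref} -> delivered
    after Timeout ->
        unknown
    end.
```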

Now, the implementation may, of course, not be made of treacle and may
be very lightweight and effective.  But our reasoning cannot assume that.
How do we know that the run-time support is not implemented as follows:

  * all message sends are compiled to no-ops;

  * all message receives are compiled to STOP (i.e. freeze/block,
    not terminate/exit) ... unless they have an "after" clause,
    in which case they are compiled to a delay (the "after" value)
    followed by the "after" response.

?  The above certainly honours the no-out-of-order guarantees, :).
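For concreteness, this is the "after" form the second bullet refers to.
Under the pathological implementation above it would simply wait
5000 ms and return timed_out - and nothing observable from inside the
language distinguishes that from a very slow network:

```erlang
wait_for_answer() ->
    receive
        {answer, N} -> {ok, N}
    after 5000 ->
        timed_out
    end.
```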

But it doesn't honour the "Fairness" rule - so I would feel much more
comfortable if that were mandated.  Is it / will it / can it be mandated?

Thanks,

Peter.
