[erlang-questions] question re. message delivery

Joe Armstrong erlang@REDACTED
Wed Sep 27 13:08:02 CEST 2017


On Wed, Sep 27, 2017 at 2:16 AM, Miles Fidelman
<mfidelman@REDACTED> wrote:
> Hi Joe,
>
> Hmmm....
>
> Joe Armstrong wrote:
>
> What I said was "message passing is assumed to be reliable"
>
>
> The key word here is *assumed* my assumption is that if I open a TCP socket
> and send it five messages numbered 1 to 5 then If I successfully read
> message
> 5 and have seen no error indicators then I can *assume* that messages 1 to
> 4 also arrived in order.
>
>
> Well yes, but with TCP one has sequence numbers, buffering, and
> retransmission - and GUARANTEES, by design, that if you (say a socket
> connection) receive packet 5, then you've also received packets 1-4, in
> order.
>
> My understanding is that Erlang does NOT make that guarantee.  As stated:
>
> - message delivery is assumed to be UNRELIABLE

That was the short version - Here's the long version.

Actually there is no such thing as "delivering a message" in Erlang.

What "delivering a message" means is "putting the message in the
mailbox of the receiver and scheduling the process for execution".

So all sorts of things can go wrong - the message is put in the mailbox
but an earlier message in the mailbox causes the process to crash
before the process reaches your message.

There *is* a guarantee that if you create a link to a process and the
process dies you get sent a message.

So we can make the statement "in the absence of errors message passing
order is preserved".

What does this mean? If you are linked to a message receiver and see no
error message then it is alive. If you send a sequence of messages
to the receiver and see no error messages then the messages have been
placed in the mailbox in order. This is guaranteed (if the code is correct).

Note that there is no guarantee that the process will ever read the mailbox.

It's like the postal service - the letters get put in the mailbox but
there's no guarantee they get taken out, but you get to know if the
owner of the mailbox dies.
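
A minimal sketch of the point above, assuming two processes on the same
node. The module and function names (mailbox_order, demo, collect) are
hypothetical and not from the original discussion. The sender links to the
receiver, sends 1 to 5, and either gets back the order in which the
messages were taken out of the mailbox or is told that the receiver died:

```erlang
-module(mailbox_order).
-export([demo/0]).

%% Hypothetical demo: send 1..5 to a linked receiver, then ask it to
%% report the order in which the messages came out of its mailbox.
demo() ->
    Old = process_flag(trap_exit, true),   % exit signals become messages
    Parent = self(),
    Pid = spawn_link(fun() -> collect(Parent, []) end),
    lists:foreach(fun(N) -> Pid ! {msg, N} end, lists:seq(1, 5)),
    Pid ! done,
    Result =
        receive
            {Pid, Seen} ->
                %% The receiver exits normally after replying; drain that.
                receive {'EXIT', Pid, normal} -> ok after 1000 -> ok end,
                Seen;
            {'EXIT', Pid, Reason} ->
                %% The "owner of the mailbox died" notification.
                {receiver_died, Reason}
        after 5000 ->
            %% Nothing says the receiver ever has to read its mailbox.
            no_answer
        end,
    process_flag(trap_exit, Old),
    Result.

%% Receiver: accumulate messages in arrival order until told to report.
collect(Parent, Acc) ->
    receive
        {msg, N} -> collect(Parent, [N | Acc]);
        done     -> Parent ! {self(), lists:reverse(Acc)}
    end.
```

Calling mailbox_order:demo() in the shell should return [1,2,3,4,5]. The
sketch only shows the mailbox order and the link notification; it says
nothing about whether a real receiver would process each message correctly.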

The bit that should be reliable is putting the messages in the mailbox in order
if nothing has crashed (we assume this to be correctly coded)

The bit that is unreliable is the guarantee that the message is removed
from the mailbox and correctly processed.

I'm not sure where you quoted me from - but there should be some small
print nearby with an extended explanation.

The word "unreliable" means different things to different people.
TCP might well be reliable by design - but is it correctly implemented?
I have seen many good designs with bad implementations.

I've helped design fault-tolerant systems for years - so I'm a "trust
as little as possible" sort of person. Assume things will crash and
clean up later.

I was told years ago not to trust processes. A wise man said "if you
want to know if a process has done something, get it to send you a reply
message; if you don't get the reply message then you can't assume
anything about the receiving process". So generating unique tags which
we send in round trips becomes important ...
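
As a rough illustration of a tagged round trip (the names tagged_call,
call and echo_loop are hypothetical, and the message shapes are an
assumption, not a fixed protocol): the caller creates a unique tag with
make_ref(), sends it along with the request, and only accepts a reply
carrying the same tag.

```erlang
-module(tagged_call).
-export([call/2, demo/0]).

%% Send Request with a fresh unique tag and wait for the matching reply.
%% Note: this waits forever if no reply arrives; a timeout is the usual
%% remedy.
call(Pid, Request) ->
    Tag = make_ref(),
    Pid ! {self(), Tag, Request},
    receive
        {Tag, Reply} -> {ok, Reply}
    end.

%% Hypothetical server that confirms each request by echoing it back
%% with the caller's tag.
echo_loop() ->
    receive
        {From, Tag, Request} ->
            From ! {Tag, Request},
            echo_loop()
    end.

demo() ->
    Server = spawn(fun echo_loop/0),
    {ok, hello} = call(Server, hello),
    exit(Server, kill),                 % tidy up the demo server
    ok.
```

Because the tag is unique to this request, a stale or unrelated message
sitting in the caller's mailbox cannot be mistaken for the reply.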

Aside: Telecoms protocols make great use of tags and timeouts:
you send a request with a tag, wait a relatively long time (the timeout) -
much longer than the operation should take. Then on a timeout
assume the worst - crash everything and restart.
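
A sketch of the tag-plus-timeout pattern in Erlang, under the assumption
that "crash" here means the calling process exits and leaves recovery to
whoever supervises it. The module name tagged_timeout and the message
shapes are hypothetical:

```erlang
-module(tagged_timeout).
-export([call/3, demo/0]).

%% Send a tagged request; if no matching reply arrives within Timeout
%% milliseconds, assume the worst and crash the caller.
call(Pid, Request, Timeout) ->
    Tag = make_ref(),
    Pid ! {self(), Tag, Request},
    receive
        {Tag, Reply} -> Reply
    after Timeout ->
        exit({timeout, Request})
    end.

%% Demo against a process that never replies. The catch is only here so
%% the demo can return the exit reason instead of killing the shell.
demo() ->
    Silent = spawn(fun() -> receive stop -> ok end end),
    Result = (catch call(Silent, ping, 100)),
    Silent ! stop,
    Result.
```

Here tagged_timeout:demo() returns {'EXIT',{timeout,ping}}. In a real
system the timeout would be chosen to be much longer than the operation
should ever take, as described above.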

Works very well in practice - theory-wise it's very dodgy - millions of lines
of code doing this stuff is way too complex to prove anything about.

Since I basically don't trust any of the underlying layers - you have
to ask what do I trust.

Well nothing really - but I have higher levels of trust for some
things than others.

Round trip confirmations including SHA1 checksums seem pretty good
to me.

If I say to a server "get me a file called 'foo'" and get something back
it may or may not be correct.

If I say "get me some data that has the sha1 checksum 34ad34..."
and get some data back I can check the data and see if it has the correct
checksum. I don't even need secure sockets. I do need a secure way to
know the checksum - but that is an entirely different problem.

This boils down to system design - in the latter case I need to place no
trust in the layers (I need to trust SHA1 so it's not absolute)
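
The checksum check itself is small. A sketch, assuming the expected SHA1
is already known as a 20-byte binary obtained by some trusted route, and
that the crypto application is available. The module name sha_check and
the function verify/2 are hypothetical:

```erlang
-module(sha_check).
-export([verify/2, demo/0]).

%% Accept Data only if its SHA1 digest equals the checksum we asked for.
%% How Data got here (plain socket, untrusted server) does not matter.
verify(ExpectedSha1, Data) when is_binary(ExpectedSha1) ->
    case crypto:hash(sha, Data) of
        ExpectedSha1 -> {ok, Data};
        _Other       -> {error, checksum_mismatch}
    end.

demo() ->
    Data = <<"hello">>,
    Sum = crypto:hash(sha, Data),       % stands in for the known checksum
    {ok, Data} = verify(Sum, Data),
    %% Corrupted or substituted data is rejected.
    {error, checksum_mismatch} = verify(Sum, <<"hellp">>),
    ok.
```

The only trust left is in SHA1 itself and in the way the checksum was
learned, which matches the point made above.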


>
> - ordering is guaranteed to be maintained
>
> The implication being that one might well receive packets 1, 2, 3, 5 - and
> not know that 4 is missing.
>
> Actually I have no idea if this is true - but it does seem to be a
> reasonable
> assumption.
>
> Messages 1 to 4 might have arrived got put in a buffer prior to my reading
> them and accidentally reordered due to a software bug. An alpha particle
> might have hit the data in message 3 and changed it -- who knows?
>
>
> More likely, a TCP connection has dropped, taking a message or two with it,
> and once the connection is re-established, stuff starts flowing after a gap.
>
> With UDP, packets could arrive out of order as well as get dropped.
>
> There are ways to extend TCP, or write a higher level protocol that will
> detect dropped connections, and packets, reconnect, request retransmission -
> with the result that both the sender & receiver are guaranteed both delivery
> & order.
>
> Which brings us back to implementation.
>
>
> Having assumed that message passing is reliable I build code based on
> this assumption.
>
> But, for Erlang, we can't make this assumption - the documentation
> specifically says so.
>
>
> I'm not, of course, saying that the assumption is true, just that I trust
> the
> implementers of the system have done a good job to try and make it true.
> Certainly any repeatable counter examples should have been investigated
> to see if there were any errors in the system.
>
> All this builds on layers of trust. I trust that erlang message passing is
> ordered and reliable in the absence of errors.
>
> The Erlang implementers trust that TCP is reliable.
>
>
> Well, that is the question, isn't it.  Lots of things cause TCP to drop
> connections.  So the question remains - how are dropped connections handled?
> And, if after a connection is dropped and restored, how are dropped messages
> and/or messages received out of order handled?
>
> Actually, there's another design question in there - in a multi-node Erlang
> system, maintaining n2 TCP connections seems just a tad unwieldy.
> Personally, I'd be more likely to use a connectionless protocol, maybe even
> broadcast.
>
>
>
> The TCP implementors trust that the OS is reliable.
>
> The OS implementors trust that the processor is reliable.
>
> The processor implementors trust that the VLSI compilers are correct.
>
> Software runs on physical machines - so really the laws of physics apply not
> maths. Physics takes into account space and time, and the concept of
> simultaneity does not exist, no so in maths.
>
> It seems to me that software is built upon chains of trust, not upon
> mathematical chains of proof.
>
> I've just been saying "what we want to achieve" and not "how we can achieve
> it".
>
> Which brings us back to:
>
> stated goals:  unreliable delivery, ordered delivery
>
> The BEAM Book details how this works within a node, but is silent on how
> distributed Erlang is implemented.  I'm really interested in some details.



>
> The statements that people make about the system should be in terms
> of belief rather than proof.
>
> I'd say "I believe we have reliable message passing"
> It would be plain daft to say "we have reliable message passing" or
> "we can prove it be correct" since there is no way of validating this.
>
> Sure there is.  The state machine model of TCP is very clearly defined,
> including its various error conditions.  And one can test an implementation
> for adherence to the state machine model.  (In some cases, one can also
> demonstrate that software is provably correct - but let's not go there).
>
>
>
> Call me old fashioned but I think that claims that, for example,
> "we have unlimited storage" and so on are just nuts ...
>
> Agreed.  But claims like "when allocated storage reaches 80% use, additional
> storage is allocated by <mechanism>" are not just reasonable, but mandatory
> when designing systems that have to scale under uncertain load.
>
> Which brings us back to - how is message passing implemented between Erlang
> nodes?
>
> Cheers,
>
> Miles
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is.  .... Yogi Berra
>
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>


