UBF and asynchronous communication

Wed Mar 24 16:20:24 CET 2004

Joe Armstrong <joe@REDACTED> writes:

> This is very much work in progress - the non-generic encoding rule language is
> incomplete, as is the semantics language.
>
> Spontaneous events occurring in the middle of a synchronous RPC cannot be
> expressed nicely and the state machine can become messy.
>
> On the bright side - most protocols fall into the trivial stateless pure client-server
> category and are *very* easy to describe.

Me and some hacker-pals have recently been working a lot with a
protocol that at first glance looks orderly and BNF-describable, but
in reality was riddled with race conditions due to asynchronous
communication (over a socket) and asynchronously-triggered
events. Here is a long description of our trials and tribulations, in
case somebody wants to read it and tell us a beautiful solution :-)

Our protocol is used by Emacs to talk to a Lisp server. The basic
operation is an RPC from Emacs to Lisp, and the complication is that
if Lisp encounters an error then it will enter a debugger which it
asks Emacs to control. While inside the debugger Emacs can make
further RPCs, which may recursively enter the debugger.

The original protocol, which is almost-okay, is (pardon my ad-hoc
BNF-ish notation):

  IDLE state:
    Emacs->Lisp: RPC <what> (push into EVALUATING state)

  EVALUATING state:
    Lisp->Emacs: RETURN <foo> (pop back to previous state)
    Lisp->Emacs: DEBUG <error> (push into DEBUGGING state)

  DEBUGGING state:
    Lisp->Emacs: DEBUG-FINISHED (pop back to previous state)
    Emacs->Lisp: RPC <what> (push into EVALUATING state)

Both sides of the connection essentially maintain the same state
machine. The stack in the state machine is used to match up a return
value with the code that is supposed to process it, to keep track of
whether an RPC is currently being served, and so on.

Now, any state in which both sides can initiate an event are potential
race conditions, since both sides could do it at once and move off
into different states. The only state like that here is DEBUGGING, so
if Emacs makes a fresh RPC at the same time the Lisp debugger returns
then the protocol will be violated. In the IDLE and EVALUATING states
there is no problem because the protocol only allows one party to
initiate an event.

We implemented this (several times) as a state machine (PDA) and
included a UBF-style contract checker(*). This was okay in practice,
race condition aside, but started to break down as the protocol
expanded.

The protocol expanded because the one above is too abstract. In
reality Lisp can enter the debugger at any time (e.g. if it receives a
SIGINT). To cope with this we had to add DEBUG events (with
transitions into DEBUGGING) to every state, and then we had the same
race condition in every state. We also had other similar events
(e.g. Lisp could ask for user-input at any time), which introduced
more states, more events that all states must handle, and more of the
same type of race condition.

We got to thinking this wasn't very good. The root of our problem was
that the protocol only worked properly synchronously (both sides must
see events in the same order), but we only had asynchronous
communication available (a socket). It was possible for things to go
awry and for RPCs to be sent when they weren't allowed, and so on.

We considered two solutions, but only implemented and tried one. The
one we didn't go with was keeping the race conditions but adding
'undo' transitions to roll back inconsistent changes. We would
designate the server the "master": if it receives an inconsistent
request it ignores it, and when Emacs discovers its request was
ignored it "rolls back" to the previous state. This could perhaps be
implemented e.g. with TCP-style acknowledgements, but it all sounds
bloody complicated.

Instead we stripped as much state as possible out of Emacs, because
you can't lose synchronisation of state you don't have. So when Emacs
makes an RPC it just tags it with an ID number and tucks away the {ID,
Function} saying how to handle the reply when it arrives.

I suppose this really means we made the whole protocol
asynchronous. It seems to have worked out pretty well. One special
advantage is that since Emacs no longer cares in what order it gets
replies it can issue parallel requests, and those can be executed
concurrently and complete out-of-order (using threads or signal-driven
I/O). One disadvantage is that Emacs no longer knows whether Lisp is
"busy" doing something for it or not.

So far so good, overall.

There were some more interesting issues in mixing synchronous- and
asynchronous- RPC interfaces on this using a recursive event loop, but
that is probably pretty Emacs-specific.

*Anyway*. I'm interested in hearing new ideas about how to do this
stuff, and to help shoot down ideas that don't go the whole mile :-)

Cheers,
Luke

(*): The UBF-ish contract checker was really nice because it reports
protocol errors very clearly, i.e. we can see what the bug is just
from an automatically-generated report sent by a user. This way we get
to find out whether any of our race conditions being triggered in
practical use. Here is an example of a bug report we got:

  http://article.gmane.org/gmane.lisp.slime.devel/579