[erlang-questions] OTP 22.1 socket.erl somehow breaks message delivery or scheduler

Björn Gustavsson bjorn@REDACTED
Tue Oct 29 12:40:07 CET 2019


On Tue, Oct 29, 2019 at 9:14 AM Andreas Schultz
<andreas.schultz@REDACTED> wrote:
>
> I'm sure that over time all the benefits of the new compiler architecture are well worth the price.
> But for OTP 22.x it has led to a few, but highly frustrating, problems. Even the few code generation bugs have led to problems that are very hard for users to understand.
>
> I'm not sure that the scope and impact of this bug are even fully understood.
>
> My demonstration code used a bare receive, but the code that actually triggered it used a plain gen_server. The result of the bug was that gen_server calls seemed to arrive extremely late (multiple seconds).
> It would therefore seem that the incorrect code was present in the main receive loop of gen_server (and probably also gen_statem and gen_event). This would mean almost all Erlang applications on OTP 22.x could be affected.
>
> The effects might go unnoticed in many test cases in other projects, until they cause unexplainable failures.
>
> Since all this was introduced in OTP 22, the sensible suggestion for everyone seems to be to test OTP 22.x as well as they can, but to stay away from it for production use.
>

We have done some more investigating.

The issues you saw are caused by TWO bugs,
the compiler bug and a bug in the run-time
system.

The bug in the compiler caused a position in
the message queue to be saved even when it
was not guaranteed that a receive would be
executed. The only module in OTP that was
hit by this bug was the socket module, and
in this case it would have been harmless
without the other bug.
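
To make "saving a position in the message queue" concrete, here is
a minimal sketch of the kind of code the optimization targets. The
module name, echo_server/0, maybe_wait/1 and demo/0 are hypothetical
and made up for illustration; maybe_wait/1 is NOT the actual socket
code, only an assumed shape in which the receive is reached on just
one branch. In OTP 22 the compiler expresses the saved position with
the recv_mark and recv_set BEAM instructions.

```erlang
-module(recv_mark_sketch).
-export([call/2, maybe_wait/1, echo_server/0, demo/0]).

%% Hypothetical helper: replies to every {From, Ref, Request} message.
echo_server() ->
    receive
        {From, Ref, Request} ->
            From ! {Ref, {echo, Request}},
            echo_server()
    end.

%% The reference is created right before the receive and every clause
%% matches on it, so no message already in the queue can match. The
%% queue position can be saved when Ref is created and the receive can
%% start scanning from there. A receive is always executed here.
call(Pid, Request) ->
    Ref = erlang:monitor(process, Pid),
    Pid ! {self(), Ref, Request},
    receive
        {Ref, Reply} ->
            erlang:demonitor(Ref, [flush]),
            {ok, Reply};
        {'DOWN', Ref, process, Pid, Reason} ->
            {error, Reason}
    end.

%% Assumed problematic shape: the reference is created up front, but
%% the receive is only reached when TryNow returns 'wait'. Saving the
%% position at make_ref/0 is not justified on the {ok, Value} path.
maybe_wait(TryNow) ->
    Ref = make_ref(),
    case TryNow(Ref) of
        {ok, Value} ->
            {ok, Value};
        wait ->
            receive
                {Ref, Value} -> {ok, Value}
            after 1000 ->
                {error, timeout}
            end
    end.

demo() ->
    Pid = spawn(fun echo_server/0),
    %% Put unrelated messages in our own queue first.
    lists:foreach(fun(N) -> self() ! {noise, N} end, lists:seq(1, 1000)),
    Reply = call(Pid, hello),
    Immediate = maybe_wait(fun(_Ref) -> {ok, now} end),
    Self = self(),
    Deferred = maybe_wait(fun(Ref) -> Self ! {Ref, later}, wait end),
    exit(Pid, kill),
    {Reply, Immediate, Deferred}.
```

On a correct system both functions behave the same with or without
the optimization; it only affects how much of the queue is scanned.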

The bug in the run-time system would cause
the saved position to be used for a receive that
was not supposed to use the saved position.
This bug was introduced in OTP 21, but it is
unlikely that it could ever be triggered by
code emitted by the compiler in OTP 21.

So what does that mean for production use
of OTP 22.x?

We know that OTP 22 is already used in
production. The problems you saw were
caused by the use of the socket module that
was hit by the compiler bug. Unless one has
another module that uses receive in a similar
way to how socket uses it, using gen_server
and the other modules is perfectly safe.

Since the socket module was introduced
in OTP 22 and is still experimental, it
would be advisable to avoid it in production
use for the moment.

The fix for the compiler should be enough
to fix the kind of problems you saw, but we
will of course fix the bug in the run-time
system as well.

/Björn

-- 
Björn Gustavsson, Erlang/OTP, Ericsson AB


