[erlang-questions] [erlang-bugs] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]

Chris Newcombe chris.newcombe@REDACTED
Thu Dec 28 17:09:18 CET 2006


Hi Rickard,

First of all, many thanks indeed for the very fast response time on
investigating and fixing issues like this!  That level of
responsiveness really helps reassure new adopters of Erlang.

How risky is this patch?  i.e. Should everyone apply it?

Is the patch ...

a) An experimental fix that needs testing by Serge and Dmitriy before
others consider it.

b) A definite fix for a definite problem, and has been tested.   But
it may or may not be the problem that Serge and Dmitriy found.

regards,

Chris

On 12/27/06, Rickard Green <rickard.s.green@REDACTED> wrote:
> The process lock plays an important role here. Unfortunately a faulty
> optimization (blush) prevented the process lock from playing that role.
> ptimer_timeout() has to acquire the process lock before looking at the
> ptimer flags. I've attached a patch that should fix the problem.
>
>      $ tar -zxf otp_src_R11B-2.tar.gz
>      $ patch -p0 < ptimer.patch
>      patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'
>
> Please, report to us whether or not the problem went away.
>
> Great work Dmitriy and Serge! Many thanks!
>
> BR,
> Rickard Green, Erlang/OTP
>
> Rickard Green wrote:
> > Thanks for your detailed bug report. I'll look at this as soon as possible.
> >
> > BR,
> > Rickard Green, Erlang/OTP
> >
> > Serge Aleynikov wrote:
> >> Additionally, I should say that we've been able to reproduce this bug on
> >> several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and 64bit) in
> >> R11B-0, R11B-1 and R11B-2.  The bug (what appears to be a race
> >> condition) is seen only if the emulator is started in the SMP mode and
> >> results in the following construct blocking infinitely in the context of
> >> some Erlang process handing a message dispatching function:
> >>
> >> receive
> >> after N ->    % Where N is between 1 and 999
> >>      ok
> >> end.
> >>
> >> It happens when all the CPUs in SMP mode are over 75% loaded.  The bug
> >> doesn't happen immediately after starting a release, but after a period
> >> of 5 min to 3 hours, which makes it pretty hard to diagnose.  The
> >> tracing method that we initially tried to use was to include printf
> >> statements in the emulator to stderr.  However, this prevented the bug
> >> from showing up.  Further it was changed to using SysV message queue to
> >> communicate trace to an external process that dumped the trace to a
> >> file.  This allowed to gain further understanding of the problem, but as
> >> Dmitry indicated any attempt to reduce the code to a minimal example
> >> made the problem disappear.
> >>
> >> The emulator code is quite involved, but hopefully someone in the OTP
> >> team could come up with a recommendation of how/where to put a missing
> >> synchronization.  If needed we can arrange for a remote SSH login to the
> >> system(s) where the problem is reproducible.
> >>
> >> Regards,
> >>
> >> Serge
> >>
> >> Dmitriy Kargapolov wrote:
> >>> Unfortunately I can not create standalone test for this bug, even when I
> >>> became much more close to understanding the effect.
> >>> This bug appears only in highly loaded system.
> >>>
> >>> Recently I did manage to trace some points in the code and see at least
> >>> one scenario for the race condition bug.
> >>>
> >>>  1. Thread A    erl_set_timer (time.c)            Lock Timing Wheel
> >>>  2. Thread A    insert_timer (time.c)             Insert Timer T1
> >>>  3. Thread A    erl_set_timer (time.c)            Unlock Timing Wheel
> >>>  4. Thread B    bump_timer_internal (time.c)      Lock Timing Wheel
> >>>  5. Thread A    cancel_timer (erl_process.c)      Cancel timer T1
> >>>  6. Thread B    bump_timer_internal (time.c)      Build list of Expired
> >>> Timers
> >>>  7. Thread A    erl_cancel_timer (time.c)         Cancel timer T1:
> >>> Waiting for Timing Wheel Lock
> >>>  8. Thread B    bump_timer_internal (time.c)      Unlock Timing Wheel
> >>>  9. Thread C    set_timer (erl_process.c)         New Timeout Request (T2)
> >>> 10. Thread B    bump_timer_internal (time.c)      Call Expired Timers
> >>> Callbacks
> >>> 11. Thread B    free_ptimer (utils.c)             Timer T1 callback
> >>> invokes free_ptimer()
> >>> 12. Thread C    erts_create_smp_ptimer (utils.c)  Create Timer
> >>> ErtsSmpPTimer for T2
> >>> 13. Thread B    free_ptimer (utils.c)             Free ErtsSmpPTimer
> >>> memory block
> >>> 14. Thread C    erts_create_smp_ptimer (utils.c)  Allocate ErtsSmpPTimer
> >>> for T2, block reused!
> >>> 15. Thread C    erl_set_timer (time.c)            erl_set_timer invoked
> >>> for T2
> >>> 16. Thread C    erl_set_timer (time.c)            Lock Timing Wheel
> >>> 17. Thread C    insert_timer (time.c)             Insert Timer T2
> >>> 18. Thread C    erl_set_timer (time.c)            Unlock Timing Wheel
> >>> 19. Thread A    erl_cancel_timer (time.c)         Lock Timing Wheel
> >>> 20. Thread A    erl_cancel_timer (time.c)         Remove ex-T1 == T2
> >>> from the timing wheel
> >>> 21. Thread A    erl_cancel_timer (time.c)         Unlock Timing Wheel
> >>>
> >>> See also attached diagram.
> >>>
> >>> Looks like one more mutex required, excluding release of ErtsSmpPTimer
> >>> memory block by timeout callback if cancel request was issued for the
> >>> timer and vise versa. The two point of control - cancel timer and timer
> >>> expiration should not interfere.
> >>> This bug happens only in SMP mode since there additional timer control
> >>> structure ErtsSmpPTimer is used between emulator and timing wheel.
> >>>
> >>> Mikael Pettersson wrote:
> >>>> Dmitriy Kargapolov writes:
> >>>>  >  > When running erl with -smp +S 2 option, sometimes process gets
> >>>> stuck in  > timer:sleep/1.
> >>>>  > Process code looks like:
> >>>>  >  > some_receiver(State) ->
> >>>>  >      NewState = receive
> >>>>  >          % legal packet
> >>>>  >          {some_keyword, Address, Port, Packet} ->
> >>>>  >              State1 = handle_packet(Address, Port, Packet, State),
> >>>>  >              timer:sleep(get_loop_delay()),
> >>>>  >              State1;
> >>>>  >          % unknown message
> >>>>  >          _ ->
> >>>>  >              State
> >>>>  >      end,
> >>>>  >      some_receiver(NewState).
> >>>>  >  > Delay value varies in range 1..999
> >>>>  >  > Since timer:sleep/1 implemented as:
> >>>>  > sleep(T) ->
> >>>>  >      receive
> >>>>  >      after T -> ok
> >>>>  >      end.
> >>>>  > it seems to be problem with "after" in smp implementation in R11B-0
> >>>>  >  > I don't have more details yet but will continue testing.
> >>>>  > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
> >>>>
> >>>> Interesting. Please send us a small standalone module that exhibits
> >>>> the bug, and I'll see if I can reproduce it.
> >>>>
> >>>> /Mikael
> >>>>
> >>> ------------------------------------------------------------------------
> >>>
> >>> _______________________________________________
> >>> erlang-questions mailing list
> >>> erlang-questions@REDACTED
> >>> http://www.erlang.org/mailman/listinfo/erlang-questions
> > _______________________________________________
> > erlang-bugs mailing list
> > erlang-bugs@REDACTED
> > http://www.erlang.org/mailman/listinfo/erlang-bugs
> >
>
>
>
>
>
> --- otp_src_R11B-2/erts/emulator/beam/utils.c   2006-11-06 14:51:50.000000000 +0100
> +++ otp_src_R11B-2.ptimer_patch/erts/emulator/beam/utils.c      2006-12-27 18:11:44.772758000 +0100
> @@ -2999,15 +2999,16 @@
>  static void
>  ptimer_timeout(ErtsSmpPTimer *ptimer)
>  {
> -    if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>        if (is_internal_pid(ptimer->timer.id)) {
>            Process *p;
> -           p = erts_pid2proc(NULL,
> -                             0,
> -                             ptimer->timer.id,
> -                             ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS);
> +           p = erts_pid2proc_opt(NULL,
> +                                 0,
> +                                 ptimer->timer.id,
> +                                 ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS,
> +                                 ERTS_P2P_FLG_ALLOW_OTHER_X);
>            if (p) {
> -               if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
> +               if (!p->is_exiting
> +                   && !(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>                    ASSERT(*ptimer->timer.timer_ref == ptimer);
>                    *ptimer->timer.timer_ref = NULL;
>                    (*ptimer->timer.timeout_func)(p);
> @@ -3028,7 +3029,6 @@
>                erts_smp_io_unlock();
>            }
>        }
> -    }
>     free_ptimer(ptimer);
>  }
>
>
>
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-bugs
>
>
>



More information about the erlang-questions mailing list