[erlang-questions] [erlang-bugs] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]

Chris Newcombe chris.newcombe@REDACTED
Fri Dec 29 17:50:30 CET 2006


Excellent -- many thanks again for fixing it so quickly.

Chris

On 12/29/06, Rickard Green <rickard.s.green@REDACTED> wrote:
> The scenario described by Serge and Dmitriy can happen due to this bug.
> The fix has been tested and I am quite sure it will fix the described
> problem. There could of course exist yet another bug causing the same
> problem, but I don't think so. The results of Serge's and Dmitriy's
> tests are of course interesting, but regardless of their outcome the
> patch fixes a real bug. If you use the smp emulator, apply the patch.
>
> BR,
> Rickard Green, Erlang/OTP
>
> Chris Newcombe wrote:
> > Hi Rickard,
> >
> > First of all, many thanks indeed for the very fast response time on
> > investigating and fixing issues like this!  That level of
> > responsiveness really helps reassure new adopters of Erlang.
> >
> > How risky is this patch?  i.e. Should everyone apply it?
> >
> > Is the patch ...
> >
> > a) An experimental fix that needs testing by Serge and Dmitriy before
> > others consider it.
> >
> > b) A definite fix for a definite problem, and has been tested.   But
> > it may or may not be the problem that Serge and Dmitriy found.
> >
> > regards,
> >
> > Chris
> >
> > On 12/27/06, Rickard Green <rickard.s.green@REDACTED> wrote:
> >> The process lock plays an important role here. Unfortunately a faulty
> >> optimization (blush) prevented the process lock from playing that role.
> >> ptimer_timeout() has to acquire the process lock before looking at the
> >> ptimer flags. I've attached a patch that should fix the problem.
> >>
> >>      $ tar -zxf otp_src_R11B-2.tar.gz
> >>      $ patch -p0 < ptimer.patch
> >>      patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'
> >>
> >> Please, report to us whether or not the problem went away.
> >>
> >> Great work Dmitriy and Serge! Many thanks!
> >>
> >> BR,
> >> Rickard Green, Erlang/OTP
> >>
> >> Rickard Green wrote:
> >> > Thanks for your detailed bug report. I'll look at this as soon as
> >> > possible.
> >> >
> >> > BR,
> >> > Rickard Green, Erlang/OTP
> >> >
> >> > Serge Aleynikov wrote:
> >> >> Additionally, I should say that we've been able to reproduce this bug
> >> >> on several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and
> >> >> 64bit) in R11B-0, R11B-1 and R11B-2.  The bug (what appears to be a
> >> >> race condition) is seen only if the emulator is started in the SMP
> >> >> mode and results in the following construct blocking indefinitely in
> >> >> the context of some Erlang process handling a message dispatching
> >> >> function:
> >> >>
> >> >> receive
> >> >> after N ->    % Where N is between 1 and 999
> >> >>      ok
> >> >> end.
> >> >>
> >> >> It happens when all the CPUs in SMP mode are over 75% loaded.  The bug
> >> >> doesn't happen immediately after starting a release, but after a
> >> >> period of 5 min to 3 hours, which makes it pretty hard to diagnose.
> >> >> The tracing method that we initially tried was to add printf
> >> >> statements in the emulator writing to stderr.  However, this prevented
> >> >> the bug from showing up.  We then switched to using a SysV message
> >> >> queue to communicate the trace to an external process that dumped it
> >> >> to a file.  This allowed us to gain further understanding of the
> >> >> problem, but as Dmitriy indicated, any attempt to reduce the code to a
> >> >> minimal example made the problem disappear.
> >> >>
> >> >> The emulator code is quite involved, but hopefully someone in the OTP
> >> >> team could come up with a recommendation of how/where to add the
> >> >> missing synchronization.  If needed we can arrange for a remote SSH
> >> >> login to the system(s) where the problem is reproducible.
> >> >>
> >> >> Regards,
> >> >>
> >> >> Serge
> >> >>
> >> >> Dmitriy Kargapolov wrote:
> >> >>> Unfortunately I cannot create a standalone test for this bug, even
> >> >>> though I have come much closer to understanding the effect.
> >> >>> This bug appears only in a highly loaded system.
> >> >>>
> >> >>> Recently I did manage to trace some points in the code and see at
> >> >>> least one scenario for the race condition bug.
> >> >>>
> >> >>>  1. Thread A    erl_set_timer (time.c)            Lock Timing Wheel
> >> >>>  2. Thread A    insert_timer (time.c)             Insert Timer T1
> >> >>>  3. Thread A    erl_set_timer (time.c)            Unlock Timing Wheel
> >> >>>  4. Thread B    bump_timer_internal (time.c)      Lock Timing Wheel
> >> >>>  5. Thread A    cancel_timer (erl_process.c)      Cancel timer T1
> >> >>>  6. Thread B    bump_timer_internal (time.c)      Build list of Expired Timers
> >> >>>  7. Thread A    erl_cancel_timer (time.c)         Cancel timer T1: Waiting for Timing Wheel Lock
> >> >>>  8. Thread B    bump_timer_internal (time.c)      Unlock Timing Wheel
> >> >>>  9. Thread C    set_timer (erl_process.c)         New Timeout Request (T2)
> >> >>> 10. Thread B    bump_timer_internal (time.c)      Call Expired Timers Callbacks
> >> >>> 11. Thread B    free_ptimer (utils.c)             Timer T1 callback invokes free_ptimer()
> >> >>> 12. Thread C    erts_create_smp_ptimer (utils.c)  Create Timer ErtsSmpPTimer for T2
> >> >>> 13. Thread B    free_ptimer (utils.c)             Free ErtsSmpPTimer memory block
> >> >>> 14. Thread C    erts_create_smp_ptimer (utils.c)  Allocate ErtsSmpPTimer for T2, block reused!
> >> >>> 15. Thread C    erl_set_timer (time.c)            erl_set_timer invoked for T2
> >> >>> 16. Thread C    erl_set_timer (time.c)            Lock Timing Wheel
> >> >>> 17. Thread C    insert_timer (time.c)             Insert Timer T2
> >> >>> 18. Thread C    erl_set_timer (time.c)            Unlock Timing Wheel
> >> >>> 19. Thread A    erl_cancel_timer (time.c)         Lock Timing Wheel
> >> >>> 20. Thread A    erl_cancel_timer (time.c)         Remove ex-T1 == T2 from the timing wheel
> >> >>> 21. Thread A    erl_cancel_timer (time.c)         Unlock Timing Wheel
> >> >>>
> >> >>> See also attached diagram.
> >> >>>
> >> >>> It looks like one more mutex is required, to prevent the timeout
> >> >>> callback from releasing the ErtsSmpPTimer memory block if a cancel
> >> >>> request was issued for the timer, and vice versa. The two points of
> >> >>> control, cancel timer and timer expiration, should not interfere.
> >> >>> This bug happens only in SMP mode since there an additional timer
> >> >>> control structure, ErtsSmpPTimer, is used between the emulator and
> >> >>> the timing wheel.
> >> >>>
> >> >>> Mikael Pettersson wrote:
> >> >>>> Dmitriy Kargapolov writes:
> >> >>>>  > When running erl with -smp +S 2 option, sometimes process gets
> >> >>>>  > stuck in timer:sleep/1.
> >> >>>>  > Process code looks like:
> >> >>>>  >
> >> >>>>  > some_receiver(State) ->
> >> >>>>  >      NewState = receive
> >> >>>>  >          % legal packet
> >> >>>>  >          {some_keyword, Address, Port, Packet} ->
> >> >>>>  >              State1 = handle_packet(Address, Port, Packet, State),
> >> >>>>  >              timer:sleep(get_loop_delay()),
> >> >>>>  >              State1;
> >> >>>>  >          % unknown message
> >> >>>>  >          _ ->
> >> >>>>  >              State
> >> >>>>  >      end,
> >> >>>>  >      some_receiver(NewState).
> >> >>>>  >
> >> >>>>  > Delay value varies in range 1..999
> >> >>>>  >
> >> >>>>  > Since timer:sleep/1 implemented as:
> >> >>>>  > sleep(T) ->
> >> >>>>  >      receive
> >> >>>>  >      after T -> ok
> >> >>>>  >      end.
> >> >>>>  > it seems to be problem with "after" in smp implementation in R11B-0
> >> >>>>  >
> >> >>>>  > I don't have more details yet but will continue testing.
> >> >>>>  > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
> >> >>>>
> >> >>>> Interesting. Please send us a small standalone module that exhibits
> >> >>>> the bug, and I'll see if I can reproduce it.
> >> >>>>
> >> >>>> /Mikael
> >> >>>>
> >> >>>
> >> ------------------------------------------------------------------------
> >> >>>
> >> >>> _______________________________________________
> >> >>> erlang-questions mailing list
> >> >>> erlang-questions@REDACTED
> >> >>> http://www.erlang.org/mailman/listinfo/erlang-questions
> >> > _______________________________________________
> >> > erlang-bugs mailing list
> >> > erlang-bugs@REDACTED
> >> > http://www.erlang.org/mailman/listinfo/erlang-bugs
> >> >
> >>
> >>
> >>
> >>
> >>
> >> --- otp_src_R11B-2/erts/emulator/beam/utils.c   2006-11-06 14:51:50.000000000 +0100
> >> +++ otp_src_R11B-2.ptimer_patch/erts/emulator/beam/utils.c   2006-12-27 18:11:44.772758000 +0100
> >> @@ -2999,15 +2999,16 @@
> >>  static void
> >>  ptimer_timeout(ErtsSmpPTimer *ptimer)
> >>  {
> >> -    if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
> >>        if (is_internal_pid(ptimer->timer.id)) {
> >>            Process *p;
> >> -           p = erts_pid2proc(NULL,
> >> -                             0,
> >> -                             ptimer->timer.id,
> >> -                             ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS);
> >> +           p = erts_pid2proc_opt(NULL,
> >> +                                 0,
> >> +                                 ptimer->timer.id,
> >> +                                 ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS,
> >> +                                 ERTS_P2P_FLG_ALLOW_OTHER_X);
> >>            if (p) {
> >> -               if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
> >> +               if (!p->is_exiting
> >> +                   && !(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
> >>                    ASSERT(*ptimer->timer.timer_ref == ptimer);
> >>                    *ptimer->timer.timer_ref = NULL;
> >>                    (*ptimer->timer.timeout_func)(p);
> >> @@ -3028,7 +3029,6 @@
> >>                erts_smp_io_unlock();
> >>            }
> >>        }
> >> -    }
> >>     free_ptimer(ptimer);
> >>  }
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
>
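
For readers following the trace in Dmitriy's message, the 21-step interleaving can be replayed deterministically in a single thread. The sketch below is only an illustration under stated assumptions: ptimer_t, the one-slot pool, and every function name are hypothetical stand-ins rather than ERTS code, and the pool simply models an allocator that hands a freed block straight back to the next caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for ErtsSmpPTimer: only what the replay needs. */
typedef struct ptimer {
    int  owner;     /* logical timer that owns the block (1 = T1, 2 = T2) */
    bool in_wheel;  /* true while the timer is linked into the timing wheel */
} ptimer_t;

/* One-slot pool (an assumption): a freed block is reused immediately. */
static ptimer_t pool_block;
static bool pool_in_use = false;

ptimer_t *ptimer_alloc(int owner)
{
    if (pool_in_use)
        return NULL;
    pool_in_use = true;
    pool_block.owner = owner;
    pool_block.in_wheel = true;     /* allocate and insert in one step */
    return &pool_block;
}

void ptimer_free(ptimer_t *t)
{
    assert(t == &pool_block);
    t->in_wheel = false;
    pool_in_use = false;
}

/* Replays the interleaving from the trace in program order.
 * Returns true if T2 is still armed at the end, false if it was lost. */
bool replay_stale_cancel(void)
{
    pool_in_use = false;                /* start from an empty pool */

    ptimer_t *t1 = ptimer_alloc(1);     /* steps 1-3: thread A sets T1 */
    assert(t1 != NULL);
    ptimer_t *stale = t1;               /* steps 5, 7: A starts cancelling T1
                                           and waits for the wheel lock */
    ptimer_free(t1);                    /* steps 6, 10-11, 13: B expires T1,
                                           its callback frees the block */
    ptimer_t *t2 = ptimer_alloc(2);     /* steps 9, 12, 14-18: C sets T2,
                                           getting the same block back */
    assert(t2 == stale);                /* "block reused!" */

    stale->in_wheel = false;            /* steps 19-21: A's delayed cancel
                                           removes ex-T1, which is now T2 */
    return t2->in_wheel;
}
```

In this model replay_stale_cancel() returns false: T2 has been taken out of the wheel by a cancel that was meant for T1, so its timeout never fires, which is consistent with the receive ... after N construct blocking forever.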

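The patch quoted in this thread removes the test of ERTS_PTMR_FLG_CANCELLED that ptimer_timeout() made before taking the process lock, and keeps only the test made once the lock is held. Stripped of all ERTS details, that pattern might look like the sketch below. Everything in it is an assumption made for illustration: demo_timer_t and the demo_timer_* functions are hypothetical, and a plain pthread mutex stands in for the process lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of a per-process timer. */
typedef struct {
    pthread_mutex_t lock;   /* stands in for the process lock */
    bool cancelled;         /* stands in for ERTS_PTMR_FLG_CANCELLED */
    int  fired;             /* number of timeouts actually delivered */
} demo_timer_t;

void demo_timer_init(demo_timer_t *t)
{
    assert(t != NULL);
    pthread_mutex_init(&t->lock, NULL);
    t->cancelled = false;
    t->fired = 0;
}

/* Cancel path: the flag is only ever written with the lock held. */
void demo_timer_cancel(demo_timer_t *t)
{
    pthread_mutex_lock(&t->lock);
    t->cancelled = true;
    pthread_mutex_unlock(&t->lock);
}

/* Timeout path: acquire the lock first, then look at the flag.
 * There is deliberately no unlocked "fast path" check before the lock;
 * that kind of shortcut is what the patch removes.
 * Returns true if the timeout was delivered. */
bool demo_timer_timeout(demo_timer_t *t)
{
    bool deliver;

    pthread_mutex_lock(&t->lock);
    deliver = !t->cancelled;
    if (deliver)
        t->fired++;
    pthread_mutex_unlock(&t->lock);
    return deliver;
}
```

Calling demo_timer_timeout() before demo_timer_cancel() delivers the timeout, and calling it afterwards does not. Because both paths take the same lock, the decision to deliver and a concurrent cancel cannot interleave. On most systems the sketch needs to be built with -pthread.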

