[erlang-questions] [erlang-bugs] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]
Chris Newcombe
chris.newcombe@REDACTED
Fri Dec 29 17:50:30 CET 2006
Excellent -- many thanks again for fixing it so quickly.
Chris
On 12/29/06, Rickard Green <rickard.s.green@REDACTED> wrote:
> The scenario described by Serge and Dmitriy can happen due to this bug.
> The fix has been tested and I am quite sure it will fix the described
> problem. There could of course exist yet another bug causing the same
> problem, but I don't think so. The results of Serge's and Dmitriy's
> tests are of course interesting, but regardless that the patch fixes a
> real bug. If you use the smp emulator, apply the patch.
>
> BR,
> Rickard Green, Erlang/OTP
>
> Chris Newcombe wrote:
> > Hi Rickard,
> >
> > First of all, many thanks indeed for the very fast response time on
> > investigating and fixing issues like this! That level of
> > responsiveness really helps reassure new adopters of Erlang.
> >
> > How risky is this patch? i.e. Should everyone apply it?
> >
> > Is the patch ...
> >
> > a) An experimental fix that needs testing by Serge and Dmitriy before
> > others consider it.
> >
> > b) A definite fix for a definite problem, and has been tested. But
> > it may or may not be the problem that Serge and Dmitriy found.
> >
> > regards,
> >
> > Chris
> >
> > On 12/27/06, Rickard Green <rickard.s.green@REDACTED> wrote:
> >> The process lock plays an important role here. Unfortunately a faulty
> >> optimization (blush) prevented the process lock from playing that role.
> >> ptimer_timeout() has to acquire the process lock before looking at the
> >> ptimer flags. I've attached a patch that should fix the problem.
> >>
> >> $ tar -zxf otp_src_R11B-2.tar.gz
> >> $ patch -p0 < ptimer.patch
> >> patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'
> >>
> >> Please, report to us whether or not the problem went away.
> >>
> >> Great work Dmitriy and Serge! Many thanks!
> >>
> >> BR,
> >> Rickard Green, Erlang/OTP
> >>
> >> Rickard Green wrote:
> >> > Thanks for your detailed bug report. I'll look at this as soon as
> >> possible.
> >> >
> >> > BR,
> >> > Rickard Green, Erlang/OTP
> >> >
> >> > Serge Aleynikov wrote:
> >> >> Additionally, I should say that we've been able to reproduce this
> >> bug on
> >> >> several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and
> >> 64bit) in
> >> >> R11B-0, R11B-1 and R11B-2. The bug (what appears to be a race
> >> >> condition) is seen only if the emulator is started in the SMP mode and
> >> >> results in the following construct blocking infinitely in the
> >> context of
> >> >> some Erlang process handing a message dispatching function:
> >> >>
> >> >> receive
> >> >> after N -> % Where N is between 1 and 999
> >> >> ok
> >> >> end.
> >> >>
> >> >> It happens when all the CPUs in SMP mode are over 75% loaded. The bug
> >> >> doesn't happen immediately after starting a release, but after a
> >> period
> >> >> of 5 min to 3 hours, which makes it pretty hard to diagnose. The
> >> >> tracing method that we initially tried to use was to include printf
> >> >> statements in the emulator to stderr. However, this prevented the bug
> >> >> from showing up. Further it was changed to using SysV message
> >> queue to
> >> >> communicate trace to an external process that dumped the trace to a
> >> >> file. This allowed to gain further understanding of the problem,
> >> but as
> >> >> Dmitry indicated any attempt to reduce the code to a minimal example
> >> >> made the problem disappear.
> >> >>
> >> >> The emulator code is quite involved, but hopefully someone in the OTP
> >> >> team could come up with a recommendation of how/where to put a missing
> >> >> synchronization. If needed we can arrange for a remote SSH login
> >> to the
> >> >> system(s) where the problem is reproducible.
> >> >>
> >> >> Regards,
> >> >>
> >> >> Serge
> >> >>
> >> >> Dmitriy Kargapolov wrote:
> >> >>> Unfortunately I can not create standalone test for this bug, even
> >> when I
> >> >>> became much more close to understanding the effect.
> >> >>> This bug appears only in highly loaded system.
> >> >>>
> >> >>> Recently I did manage to trace some points in the code and see at
> >> least
> >> >>> one scenario for the race condition bug.
> >> >>>
> >> >>> 1. Thread A erl_set_timer (time.c) Lock Timing Wheel
> >> >>> 2. Thread A insert_timer (time.c) Insert Timer T1
> >> >>> 3. Thread A erl_set_timer (time.c) Unlock Timing Wheel
> >> >>> 4. Thread B bump_timer_internal (time.c) Lock Timing Wheel
> >> >>> 5. Thread A cancel_timer (erl_process.c) Cancel timer T1
> >> >>> 6. Thread B bump_timer_internal (time.c) Build list of
> >> Expired
> >> >>> Timers
> >> >>> 7. Thread A erl_cancel_timer (time.c) Cancel timer T1:
> >> >>> Waiting for Timing Wheel Lock
> >> >>> 8. Thread B bump_timer_internal (time.c) Unlock Timing Wheel
> >> >>> 9. Thread C set_timer (erl_process.c) New Timeout
> >> Request (T2)
> >> >>> 10. Thread B bump_timer_internal (time.c) Call Expired Timers
> >> >>> Callbacks
> >> >>> 11. Thread B free_ptimer (utils.c) Timer T1 callback
> >> >>> invokes free_ptimer()
> >> >>> 12. Thread C erts_create_smp_ptimer (utils.c) Create Timer
> >> >>> ErtsSmpPTimer for T2
> >> >>> 13. Thread B free_ptimer (utils.c) Free ErtsSmpPTimer
> >> >>> memory block
> >> >>> 14. Thread C erts_create_smp_ptimer (utils.c) Allocate
> >> ErtsSmpPTimer
> >> >>> for T2, block reused!
> >> >>> 15. Thread C erl_set_timer (time.c) erl_set_timer
> >> invoked
> >> >>> for T2
> >> >>> 16. Thread C erl_set_timer (time.c) Lock Timing Wheel
> >> >>> 17. Thread C insert_timer (time.c) Insert Timer T2
> >> >>> 18. Thread C erl_set_timer (time.c) Unlock Timing Wheel
> >> >>> 19. Thread A erl_cancel_timer (time.c) Lock Timing Wheel
> >> >>> 20. Thread A erl_cancel_timer (time.c) Remove ex-T1 == T2
> >> >>> from the timing wheel
> >> >>> 21. Thread A erl_cancel_timer (time.c) Unlock Timing Wheel
> >> >>>
> >> >>> See also attached diagram.
> >> >>>
> >> >>> Looks like one more mutex required, excluding release of
> >> ErtsSmpPTimer
> >> >>> memory block by timeout callback if cancel request was issued for the
> >> >>> timer and vise versa. The two point of control - cancel timer and
> >> timer
> >> >>> expiration should not interfere.
> >> >>> This bug happens only in SMP mode since there additional timer
> >> control
> >> >>> structure ErtsSmpPTimer is used between emulator and timing wheel.
> >> >>>
> >> >>> Mikael Pettersson wrote:
> >> >>>> Dmitriy Kargapolov writes:
> >> >>>> > > When running erl with -smp +S 2 option, sometimes process gets
> >> >>>> stuck in > timer:sleep/1.
> >> >>>> > Process code looks like:
> >> >>>> > > some_receiver(State) ->
> >> >>>> > NewState = receive
> >> >>>> > % legal packet
> >> >>>> > {some_keyword, Address, Port, Packet} ->
> >> >>>> > State1 = handle_packet(Address, Port, Packet,
> >> State),
> >> >>>> > timer:sleep(get_loop_delay()),
> >> >>>> > State1;
> >> >>>> > % unknown message
> >> >>>> > _ ->
> >> >>>> > State
> >> >>>> > end,
> >> >>>> > some_receiver(NewState).
> >> >>>> > > Delay value varies in range 1..999
> >> >>>> > > Since timer:sleep/1 implemented as:
> >> >>>> > sleep(T) ->
> >> >>>> > receive
> >> >>>> > after T -> ok
> >> >>>> > end.
> >> >>>> > it seems to be problem with "after" in smp implementation in
> >> R11B-0
> >> >>>> > > I don't have more details yet but will continue testing.
> >> >>>> > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
> >> >>>>
> >> >>>> Interesting. Please send us a small standalone module that exhibits
> >> >>>> the bug, and I'll see if I can reproduce it.
> >> >>>>
> >> >>>> /Mikael
> >> >>>>
> >> >>>
> >> ------------------------------------------------------------------------
> >> >>>
> >> >>> _______________________________________________
> >> >>> erlang-questions mailing list
> >> >>> erlang-questions@REDACTED
> >> >>> http://www.erlang.org/mailman/listinfo/erlang-questions
> >> > _______________________________________________
> >> > erlang-bugs mailing list
> >> > erlang-bugs@REDACTED
> >> > http://www.erlang.org/mailman/listinfo/erlang-bugs
> >> >
> >>
> >>
> >>
> >>
> >>
> >> --- otp_src_R11B-2/erts/emulator/beam/utils.c 2006-11-06
> >> 14:51:50.000000000 +0100
> >> +++ otp_src_R11B-2.ptimer_patch/erts/emulator/beam/utils.c
> >> 2006-12-27 18:11:44.772758000 +0100
> >> @@ -2999,15 +2999,16 @@
> >> static void
> >> ptimer_timeout(ErtsSmpPTimer *ptimer)
> >> {
> >> - if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
> >> if (is_internal_pid(ptimer->timer.id)) {
> >> Process *p;
> >> - p = erts_pid2proc(NULL,
> >> - 0,
> >> - ptimer->timer.id,
> >> - ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS);
> >> + p = erts_pid2proc_opt(NULL,
> >> + 0,
> >> + ptimer->timer.id,
> >> +
> >> ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS,
> >> + ERTS_P2P_FLG_ALLOW_OTHER_X);
> >> if (p) {
> >> - if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
> >> + if (!p->is_exiting
> >> + && !(ptimer->timer.flags &
> >> ERTS_PTMR_FLG_CANCELLED)) {
> >> ASSERT(*ptimer->timer.timer_ref == ptimer);
> >> *ptimer->timer.timer_ref = NULL;
> >> (*ptimer->timer.timeout_func)(p);
> >> @@ -3028,7 +3029,6 @@
> >> erts_smp_io_unlock();
> >> }
> >> }
> >> - }
> >> free_ptimer(ptimer);
> >> }
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> erlang-bugs mailing list
> >> erlang-bugs@REDACTED
> >> http://www.erlang.org/mailman/listinfo/erlang-bugs
> >>
> >>
> >>
> >
>
More information about the erlang-questions
mailing list