[erlang-questions] [erlang-bugs] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]

Dmitriy Kargapolov dmitry.kargapolov@REDACTED
Fri Dec 29 17:15:50 CET 2006


First test of the patch has been successfully completed. We will
continue testing after the holiday, but so far patch works fine.
Thank you very much for fixing the problem in so short time!

BR and Happy New Year!


Rickard Green wrote:
> The scenario described by Serge and Dmitriy can happen due to this bug. 
> The fix has been tested and I am quite sure it will fix the described 
> problem. There could of course exist yet another bug causing the same 
> problem, but I don't think so. The results of Serge's and Dmitriy's 
> tests are of course interesting, but regardless that the patch fixes a 
> real bug. If you use the smp emulator, apply the patch.
> 
> BR,
> Rickard Green, Erlang/OTP
> 
> Chris Newcombe wrote:
>> Hi Rickard,
>>
>> First of all, many thanks indeed for the very fast response time on
>> investigating and fixing issues like this!  That level of
>> responsiveness really helps reassure new adopters of Erlang.
>>
>> How risky is this patch?  i.e. Should everyone apply it?
>>
>> Is the patch ...
>>
>> a) An experimental fix that needs testing by Serge and Dmitriy before
>> others consider it.
>>
>> b) A definite fix for a definite problem, and has been tested.   But
>> it may or may not be the problem that Serge and Dmitriy found.
>>
>> regards,
>>
>> Chris
>>
>> On 12/27/06, Rickard Green <rickard.s.green@REDACTED> wrote:
>>> The process lock plays an important role here. Unfortunately a faulty
>>> optimization (blush) prevented the process lock from playing that role.
>>> ptimer_timeout() has to acquire the process lock before looking at the
>>> ptimer flags. I've attached a patch that should fix the problem.
>>>
>>>      $ tar -zxf otp_src_R11B-2.tar.gz
>>>      $ patch -p0 < ptimer.patch
>>>      patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'
>>>
>>> Please, report to us whether or not the problem went away.
>>>
>>> Great work Dmitriy and Serge! Many thanks!
>>>
>>> BR,
>>> Rickard Green, Erlang/OTP
>>>
>>> Rickard Green wrote:
>>> > Thanks for your detailed bug report. I'll look at this as soon as 
>>> possible.
>>> >
>>> > BR,
>>> > Rickard Green, Erlang/OTP
>>> >
>>> > Serge Aleynikov wrote:
>>> >> Additionally, I should say that we've been able to reproduce this 
>>> bug on
>>> >> several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and 
>>> 64bit) in
>>> >> R11B-0, R11B-1 and R11B-2.  The bug (what appears to be a race
>>> >> condition) is seen only if the emulator is started in the SMP mode 
>>> and
>>> >> results in the following construct blocking infinitely in the 
>>> context of
>>> >> some Erlang process handing a message dispatching function:
>>> >>
>>> >> receive
>>> >> after N ->    % Where N is between 1 and 999
>>> >>      ok
>>> >> end.
>>> >>
>>> >> It happens when all the CPUs in SMP mode are over 75% loaded.  The 
>>> bug
>>> >> doesn't happen immediately after starting a release, but after a 
>>> period
>>> >> of 5 min to 3 hours, which makes it pretty hard to diagnose.  The
>>> >> tracing method that we initially tried to use was to include printf
>>> >> statements in the emulator to stderr.  However, this prevented the 
>>> bug
>>> >> from showing up.  Further it was changed to using SysV message 
>>> queue to
>>> >> communicate trace to an external process that dumped the trace to a
>>> >> file.  This allowed to gain further understanding of the problem, 
>>> but as
>>> >> Dmitry indicated any attempt to reduce the code to a minimal example
>>> >> made the problem disappear.
>>> >>
>>> >> The emulator code is quite involved, but hopefully someone in the OTP
>>> >> team could come up with a recommendation of how/where to put a 
>>> missing
>>> >> synchronization.  If needed we can arrange for a remote SSH login 
>>> to the
>>> >> system(s) where the problem is reproducible.
>>> >>
>>> >> Regards,
>>> >>
>>> >> Serge
>>> >>
>>> >> Dmitriy Kargapolov wrote:
>>> >>> Unfortunately I can not create standalone test for this bug, even 
>>> when I
>>> >>> became much more close to understanding the effect.
>>> >>> This bug appears only in highly loaded system.
>>> >>>
>>> >>> Recently I did manage to trace some points in the code and see at 
>>> least
>>> >>> one scenario for the race condition bug.
>>> >>>
>>> >>>  1. Thread A    erl_set_timer (time.c)            Lock Timing Wheel
>>> >>>  2. Thread A    insert_timer (time.c)             Insert Timer T1
>>> >>>  3. Thread A    erl_set_timer (time.c)            Unlock Timing 
>>> Wheel
>>> >>>  4. Thread B    bump_timer_internal (time.c)      Lock Timing Wheel
>>> >>>  5. Thread A    cancel_timer (erl_process.c)      Cancel timer T1
>>> >>>  6. Thread B    bump_timer_internal (time.c)      Build list of 
>>> Expired
>>> >>> Timers
>>> >>>  7. Thread A    erl_cancel_timer (time.c)         Cancel timer T1:
>>> >>> Waiting for Timing Wheel Lock
>>> >>>  8. Thread B    bump_timer_internal (time.c)      Unlock Timing 
>>> Wheel
>>> >>>  9. Thread C    set_timer (erl_process.c)         New Timeout 
>>> Request (T2)
>>> >>> 10. Thread B    bump_timer_internal (time.c)      Call Expired 
>>> Timers
>>> >>> Callbacks
>>> >>> 11. Thread B    free_ptimer (utils.c)             Timer T1 callback
>>> >>> invokes free_ptimer()
>>> >>> 12. Thread C    erts_create_smp_ptimer (utils.c)  Create Timer
>>> >>> ErtsSmpPTimer for T2
>>> >>> 13. Thread B    free_ptimer (utils.c)             Free ErtsSmpPTimer
>>> >>> memory block
>>> >>> 14. Thread C    erts_create_smp_ptimer (utils.c)  Allocate 
>>> ErtsSmpPTimer
>>> >>> for T2, block reused!
>>> >>> 15. Thread C    erl_set_timer (time.c)            erl_set_timer 
>>> invoked
>>> >>> for T2
>>> >>> 16. Thread C    erl_set_timer (time.c)            Lock Timing Wheel
>>> >>> 17. Thread C    insert_timer (time.c)             Insert Timer T2
>>> >>> 18. Thread C    erl_set_timer (time.c)            Unlock Timing 
>>> Wheel
>>> >>> 19. Thread A    erl_cancel_timer (time.c)         Lock Timing Wheel
>>> >>> 20. Thread A    erl_cancel_timer (time.c)         Remove ex-T1 == T2
>>> >>> from the timing wheel
>>> >>> 21. Thread A    erl_cancel_timer (time.c)         Unlock Timing 
>>> Wheel
>>> >>>
>>> >>> See also attached diagram.
>>> >>>
>>> >>> Looks like one more mutex required, excluding release of 
>>> ErtsSmpPTimer
>>> >>> memory block by timeout callback if cancel request was issued for 
>>> the
>>> >>> timer and vise versa. The two point of control - cancel timer and 
>>> timer
>>> >>> expiration should not interfere.
>>> >>> This bug happens only in SMP mode since there additional timer 
>>> control
>>> >>> structure ErtsSmpPTimer is used between emulator and timing wheel.
>>> >>>
>>> >>> Mikael Pettersson wrote:
>>> >>>> Dmitriy Kargapolov writes:
>>> >>>>  >  > When running erl with -smp +S 2 option, sometimes process 
>>> gets
>>> >>>> stuck in  > timer:sleep/1.
>>> >>>>  > Process code looks like:
>>> >>>>  >  > some_receiver(State) ->
>>> >>>>  >      NewState = receive
>>> >>>>  >          % legal packet
>>> >>>>  >          {some_keyword, Address, Port, Packet} ->
>>> >>>>  >              State1 = handle_packet(Address, Port, Packet, 
>>> State),
>>> >>>>  >              timer:sleep(get_loop_delay()),
>>> >>>>  >              State1;
>>> >>>>  >          % unknown message
>>> >>>>  >          _ ->
>>> >>>>  >              State
>>> >>>>  >      end,
>>> >>>>  >      some_receiver(NewState).
>>> >>>>  >  > Delay value varies in range 1..999
>>> >>>>  >  > Since timer:sleep/1 implemented as:
>>> >>>>  > sleep(T) ->
>>> >>>>  >      receive
>>> >>>>  >      after T -> ok
>>> >>>>  >      end.
>>> >>>>  > it seems to be problem with "after" in smp implementation in 
>>> R11B-0
>>> >>>>  >  > I don't have more details yet but will continue testing.
>>> >>>>  > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
>>> >>>>
>>> >>>> Interesting. Please send us a small standalone module that exhibits
>>> >>>> the bug, and I'll see if I can reproduce it.
>>> >>>>
>>> >>>> /Mikael
>>> >>>>
>>> >>> 
>>> ------------------------------------------------------------------------
>>> >>>
>>> >>> _______________________________________________
>>> >>> erlang-questions mailing list
>>> >>> erlang-questions@REDACTED
>>> >>> http://www.erlang.org/mailman/listinfo/erlang-questions
>>> > _______________________________________________
>>> > erlang-bugs mailing list
>>> > erlang-bugs@REDACTED
>>> > http://www.erlang.org/mailman/listinfo/erlang-bugs
>>> >
>>>
>>>
>>>
>>>
>>>
>>> --- otp_src_R11B-2/erts/emulator/beam/utils.c   2006-11-06 
>>> 14:51:50.000000000 +0100
>>> +++ otp_src_R11B-2.ptimer_patch/erts/emulator/beam/utils.c      
>>> 2006-12-27 18:11:44.772758000 +0100
>>> @@ -2999,15 +2999,16 @@
>>>  static void
>>>  ptimer_timeout(ErtsSmpPTimer *ptimer)
>>>  {
>>> -    if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>>>        if (is_internal_pid(ptimer->timer.id)) {
>>>            Process *p;
>>> -           p = erts_pid2proc(NULL,
>>> -                             0,
>>> -                             ptimer->timer.id,
>>> -                             
>>> ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS);
>>> +           p = erts_pid2proc_opt(NULL,
>>> +                                 0,
>>> +                                 ptimer->timer.id,
>>> +                                 
>>> ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS,
>>> +                                 ERTS_P2P_FLG_ALLOW_OTHER_X);
>>>            if (p) {
>>> -               if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>>> +               if (!p->is_exiting
>>> +                   && !(ptimer->timer.flags & 
>>> ERTS_PTMR_FLG_CANCELLED)) {
>>>                    ASSERT(*ptimer->timer.timer_ref == ptimer);
>>>                    *ptimer->timer.timer_ref = NULL;
>>>                    (*ptimer->timer.timeout_func)(p);
>>> @@ -3028,7 +3029,6 @@
>>>                erts_smp_io_unlock();
>>>            }
>>>        }
>>> -    }
>>>     free_ptimer(ptimer);
>>>  }
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> erlang-bugs mailing list
>>> erlang-bugs@REDACTED
>>> http://www.erlang.org/mailman/listinfo/erlang-bugs
>>>
>>>
>>>
>>
> 




More information about the erlang-questions mailing list