[erlang-questions] [erlang-bugs] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]

Rickard Green rickard.s.green@REDACTED
Fri Dec 29 14:33:46 CET 2006


The scenario described by Serge and Dmitriy can happen due to this bug. 
The fix has been tested and I am quite sure it will fix the described 
problem. There could of course exist yet another bug causing the same 
problem, but I don't think so. The results of Serge's and Dmitriy's 
tests are of course interesting, but regardless that the patch fixes a 
real bug. If you use the smp emulator, apply the patch.

BR,
Rickard Green, Erlang/OTP

Chris Newcombe wrote:
> Hi Rickard,
> 
> First of all, many thanks indeed for the very fast response time on
> investigating and fixing issues like this!  That level of
> responsiveness really helps reassure new adopters of Erlang.
> 
> How risky is this patch?  i.e. Should everyone apply it?
> 
> Is the patch ...
> 
> a) An experimental fix that needs testing by Serge and Dmitriy before
> others consider it.
> 
> b) A definite fix for a definite problem, and has been tested.   But
> it may or may not be the problem that Serge and Dmitriy found.
> 
> regards,
> 
> Chris
> 
> On 12/27/06, Rickard Green <rickard.s.green@REDACTED> wrote:
>> The process lock plays an important role here. Unfortunately a faulty
>> optimization (blush) prevented the process lock from playing that role.
>> ptimer_timeout() has to acquire the process lock before looking at the
>> ptimer flags. I've attached a patch that should fix the problem.
>>
>>      $ tar -zxf otp_src_R11B-2.tar.gz
>>      $ patch -p0 < ptimer.patch
>>      patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'
>>
>> Please, report to us whether or not the problem went away.
>>
>> Great work Dmitriy and Serge! Many thanks!
>>
>> BR,
>> Rickard Green, Erlang/OTP
>>
>> Rickard Green wrote:
>> > Thanks for your detailed bug report. I'll look at this as soon as 
>> possible.
>> >
>> > BR,
>> > Rickard Green, Erlang/OTP
>> >
>> > Serge Aleynikov wrote:
>> >> Additionally, I should say that we've been able to reproduce this 
>> bug on
>> >> several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and 
>> 64bit) in
>> >> R11B-0, R11B-1 and R11B-2.  The bug (what appears to be a race
>> >> condition) is seen only if the emulator is started in the SMP mode and
>> >> results in the following construct blocking infinitely in the 
>> context of
>> >> some Erlang process handing a message dispatching function:
>> >>
>> >> receive
>> >> after N ->    % Where N is between 1 and 999
>> >>      ok
>> >> end.
>> >>
>> >> It happens when all the CPUs in SMP mode are over 75% loaded.  The bug
>> >> doesn't happen immediately after starting a release, but after a 
>> period
>> >> of 5 min to 3 hours, which makes it pretty hard to diagnose.  The
>> >> tracing method that we initially tried to use was to include printf
>> >> statements in the emulator to stderr.  However, this prevented the bug
>> >> from showing up.  Further it was changed to using SysV message 
>> queue to
>> >> communicate trace to an external process that dumped the trace to a
>> >> file.  This allowed to gain further understanding of the problem, 
>> but as
>> >> Dmitry indicated any attempt to reduce the code to a minimal example
>> >> made the problem disappear.
>> >>
>> >> The emulator code is quite involved, but hopefully someone in the OTP
>> >> team could come up with a recommendation of how/where to put a missing
>> >> synchronization.  If needed we can arrange for a remote SSH login 
>> to the
>> >> system(s) where the problem is reproducible.
>> >>
>> >> Regards,
>> >>
>> >> Serge
>> >>
>> >> Dmitriy Kargapolov wrote:
>> >>> Unfortunately I can not create standalone test for this bug, even 
>> when I
>> >>> became much more close to understanding the effect.
>> >>> This bug appears only in highly loaded system.
>> >>>
>> >>> Recently I did manage to trace some points in the code and see at 
>> least
>> >>> one scenario for the race condition bug.
>> >>>
>> >>>  1. Thread A    erl_set_timer (time.c)            Lock Timing Wheel
>> >>>  2. Thread A    insert_timer (time.c)             Insert Timer T1
>> >>>  3. Thread A    erl_set_timer (time.c)            Unlock Timing Wheel
>> >>>  4. Thread B    bump_timer_internal (time.c)      Lock Timing Wheel
>> >>>  5. Thread A    cancel_timer (erl_process.c)      Cancel timer T1
>> >>>  6. Thread B    bump_timer_internal (time.c)      Build list of 
>> Expired
>> >>> Timers
>> >>>  7. Thread A    erl_cancel_timer (time.c)         Cancel timer T1:
>> >>> Waiting for Timing Wheel Lock
>> >>>  8. Thread B    bump_timer_internal (time.c)      Unlock Timing Wheel
>> >>>  9. Thread C    set_timer (erl_process.c)         New Timeout 
>> Request (T2)
>> >>> 10. Thread B    bump_timer_internal (time.c)      Call Expired Timers
>> >>> Callbacks
>> >>> 11. Thread B    free_ptimer (utils.c)             Timer T1 callback
>> >>> invokes free_ptimer()
>> >>> 12. Thread C    erts_create_smp_ptimer (utils.c)  Create Timer
>> >>> ErtsSmpPTimer for T2
>> >>> 13. Thread B    free_ptimer (utils.c)             Free ErtsSmpPTimer
>> >>> memory block
>> >>> 14. Thread C    erts_create_smp_ptimer (utils.c)  Allocate 
>> ErtsSmpPTimer
>> >>> for T2, block reused!
>> >>> 15. Thread C    erl_set_timer (time.c)            erl_set_timer 
>> invoked
>> >>> for T2
>> >>> 16. Thread C    erl_set_timer (time.c)            Lock Timing Wheel
>> >>> 17. Thread C    insert_timer (time.c)             Insert Timer T2
>> >>> 18. Thread C    erl_set_timer (time.c)            Unlock Timing Wheel
>> >>> 19. Thread A    erl_cancel_timer (time.c)         Lock Timing Wheel
>> >>> 20. Thread A    erl_cancel_timer (time.c)         Remove ex-T1 == T2
>> >>> from the timing wheel
>> >>> 21. Thread A    erl_cancel_timer (time.c)         Unlock Timing Wheel
>> >>>
>> >>> See also attached diagram.
>> >>>
>> >>> Looks like one more mutex required, excluding release of 
>> ErtsSmpPTimer
>> >>> memory block by timeout callback if cancel request was issued for the
>> >>> timer and vise versa. The two point of control - cancel timer and 
>> timer
>> >>> expiration should not interfere.
>> >>> This bug happens only in SMP mode since there additional timer 
>> control
>> >>> structure ErtsSmpPTimer is used between emulator and timing wheel.
>> >>>
>> >>> Mikael Pettersson wrote:
>> >>>> Dmitriy Kargapolov writes:
>> >>>>  >  > When running erl with -smp +S 2 option, sometimes process gets
>> >>>> stuck in  > timer:sleep/1.
>> >>>>  > Process code looks like:
>> >>>>  >  > some_receiver(State) ->
>> >>>>  >      NewState = receive
>> >>>>  >          % legal packet
>> >>>>  >          {some_keyword, Address, Port, Packet} ->
>> >>>>  >              State1 = handle_packet(Address, Port, Packet, 
>> State),
>> >>>>  >              timer:sleep(get_loop_delay()),
>> >>>>  >              State1;
>> >>>>  >          % unknown message
>> >>>>  >          _ ->
>> >>>>  >              State
>> >>>>  >      end,
>> >>>>  >      some_receiver(NewState).
>> >>>>  >  > Delay value varies in range 1..999
>> >>>>  >  > Since timer:sleep/1 implemented as:
>> >>>>  > sleep(T) ->
>> >>>>  >      receive
>> >>>>  >      after T -> ok
>> >>>>  >      end.
>> >>>>  > it seems to be problem with "after" in smp implementation in 
>> R11B-0
>> >>>>  >  > I don't have more details yet but will continue testing.
>> >>>>  > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
>> >>>>
>> >>>> Interesting. Please send us a small standalone module that exhibits
>> >>>> the bug, and I'll see if I can reproduce it.
>> >>>>
>> >>>> /Mikael
>> >>>>
>> >>> 
>> ------------------------------------------------------------------------
>> >>>
>> >>> _______________________________________________
>> >>> erlang-questions mailing list
>> >>> erlang-questions@REDACTED
>> >>> http://www.erlang.org/mailman/listinfo/erlang-questions
>> > _______________________________________________
>> > erlang-bugs mailing list
>> > erlang-bugs@REDACTED
>> > http://www.erlang.org/mailman/listinfo/erlang-bugs
>> >
>>
>>
>>
>>
>>
>> --- otp_src_R11B-2/erts/emulator/beam/utils.c   2006-11-06 
>> 14:51:50.000000000 +0100
>> +++ otp_src_R11B-2.ptimer_patch/erts/emulator/beam/utils.c      
>> 2006-12-27 18:11:44.772758000 +0100
>> @@ -2999,15 +2999,16 @@
>>  static void
>>  ptimer_timeout(ErtsSmpPTimer *ptimer)
>>  {
>> -    if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>>        if (is_internal_pid(ptimer->timer.id)) {
>>            Process *p;
>> -           p = erts_pid2proc(NULL,
>> -                             0,
>> -                             ptimer->timer.id,
>> -                             ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS);
>> +           p = erts_pid2proc_opt(NULL,
>> +                                 0,
>> +                                 ptimer->timer.id,
>> +                                 
>> ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS,
>> +                                 ERTS_P2P_FLG_ALLOW_OTHER_X);
>>            if (p) {
>> -               if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>> +               if (!p->is_exiting
>> +                   && !(ptimer->timer.flags & 
>> ERTS_PTMR_FLG_CANCELLED)) {
>>                    ASSERT(*ptimer->timer.timer_ref == ptimer);
>>                    *ptimer->timer.timer_ref = NULL;
>>                    (*ptimer->timer.timeout_func)(p);
>> @@ -3028,7 +3029,6 @@
>>                erts_smp_io_unlock();
>>            }
>>        }
>> -    }
>>     free_ptimer(ptimer);
>>  }
>>
>>
>>
>>
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-bugs
>>
>>
>>
> 



More information about the erlang-questions mailing list