[erlang-questions] [erlang-bugs] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]
Rickard Green
rickard.s.green@REDACTED
Fri Dec 29 14:33:46 CET 2006
The scenario described by Serge and Dmitriy can happen due to this bug.
The fix has been tested and I am quite sure it will fix the described
problem. There could of course exist yet another bug causing the same
problem, but I don't think so. The results of Serge's and Dmitriy's
tests are of course interesting, but regardless that the patch fixes a
real bug. If you use the smp emulator, apply the patch.
BR,
Rickard Green, Erlang/OTP
Chris Newcombe wrote:
> Hi Rickard,
>
> First of all, many thanks indeed for the very fast response time on
> investigating and fixing issues like this! That level of
> responsiveness really helps reassure new adopters of Erlang.
>
> How risky is this patch? i.e. Should everyone apply it?
>
> Is the patch ...
>
> a) An experimental fix that needs testing by Serge and Dmitriy before
> others consider it.
>
> b) A definite fix for a definite problem, and has been tested. But
> it may or may not be the problem that Serge and Dmitriy found.
>
> regards,
>
> Chris
>
> On 12/27/06, Rickard Green <rickard.s.green@REDACTED> wrote:
>> The process lock plays an important role here. Unfortunately a faulty
>> optimization (blush) prevented the process lock from playing that role.
>> ptimer_timeout() has to acquire the process lock before looking at the
>> ptimer flags. I've attached a patch that should fix the problem.
>>
>> $ tar -zxf otp_src_R11B-2.tar.gz
>> $ patch -p0 < ptimer.patch
>> patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'
>>
>> Please, report to us whether or not the problem went away.
>>
>> Great work Dmitriy and Serge! Many thanks!
>>
>> BR,
>> Rickard Green, Erlang/OTP
>>
>> Rickard Green wrote:
>> > Thanks for your detailed bug report. I'll look at this as soon as
>> possible.
>> >
>> > BR,
>> > Rickard Green, Erlang/OTP
>> >
>> > Serge Aleynikov wrote:
>> >> Additionally, I should say that we've been able to reproduce this
>> bug on
>> >> several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and
>> 64bit) in
>> >> R11B-0, R11B-1 and R11B-2. The bug (what appears to be a race
>> >> condition) is seen only if the emulator is started in the SMP mode and
>> >> results in the following construct blocking infinitely in the
>> context of
>> >> some Erlang process handing a message dispatching function:
>> >>
>> >> receive
>> >> after N -> % Where N is between 1 and 999
>> >> ok
>> >> end.
>> >>
>> >> It happens when all the CPUs in SMP mode are over 75% loaded. The bug
>> >> doesn't happen immediately after starting a release, but after a
>> period
>> >> of 5 min to 3 hours, which makes it pretty hard to diagnose. The
>> >> tracing method that we initially tried to use was to include printf
>> >> statements in the emulator to stderr. However, this prevented the bug
>> >> from showing up. Further it was changed to using SysV message
>> queue to
>> >> communicate trace to an external process that dumped the trace to a
>> >> file. This allowed to gain further understanding of the problem,
>> but as
>> >> Dmitry indicated any attempt to reduce the code to a minimal example
>> >> made the problem disappear.
>> >>
>> >> The emulator code is quite involved, but hopefully someone in the OTP
>> >> team could come up with a recommendation of how/where to put a missing
>> >> synchronization. If needed we can arrange for a remote SSH login
>> to the
>> >> system(s) where the problem is reproducible.
>> >>
>> >> Regards,
>> >>
>> >> Serge
>> >>
>> >> Dmitriy Kargapolov wrote:
>> >>> Unfortunately I can not create standalone test for this bug, even
>> when I
>> >>> became much more close to understanding the effect.
>> >>> This bug appears only in highly loaded system.
>> >>>
>> >>> Recently I did manage to trace some points in the code and see at
>> least
>> >>> one scenario for the race condition bug.
>> >>>
>> >>> 1. Thread A erl_set_timer (time.c) Lock Timing Wheel
>> >>> 2. Thread A insert_timer (time.c) Insert Timer T1
>> >>> 3. Thread A erl_set_timer (time.c) Unlock Timing Wheel
>> >>> 4. Thread B bump_timer_internal (time.c) Lock Timing Wheel
>> >>> 5. Thread A cancel_timer (erl_process.c) Cancel timer T1
>> >>> 6. Thread B bump_timer_internal (time.c) Build list of
>> Expired
>> >>> Timers
>> >>> 7. Thread A erl_cancel_timer (time.c) Cancel timer T1:
>> >>> Waiting for Timing Wheel Lock
>> >>> 8. Thread B bump_timer_internal (time.c) Unlock Timing Wheel
>> >>> 9. Thread C set_timer (erl_process.c) New Timeout
>> Request (T2)
>> >>> 10. Thread B bump_timer_internal (time.c) Call Expired Timers
>> >>> Callbacks
>> >>> 11. Thread B free_ptimer (utils.c) Timer T1 callback
>> >>> invokes free_ptimer()
>> >>> 12. Thread C erts_create_smp_ptimer (utils.c) Create Timer
>> >>> ErtsSmpPTimer for T2
>> >>> 13. Thread B free_ptimer (utils.c) Free ErtsSmpPTimer
>> >>> memory block
>> >>> 14. Thread C erts_create_smp_ptimer (utils.c) Allocate
>> ErtsSmpPTimer
>> >>> for T2, block reused!
>> >>> 15. Thread C erl_set_timer (time.c) erl_set_timer
>> invoked
>> >>> for T2
>> >>> 16. Thread C erl_set_timer (time.c) Lock Timing Wheel
>> >>> 17. Thread C insert_timer (time.c) Insert Timer T2
>> >>> 18. Thread C erl_set_timer (time.c) Unlock Timing Wheel
>> >>> 19. Thread A erl_cancel_timer (time.c) Lock Timing Wheel
>> >>> 20. Thread A erl_cancel_timer (time.c) Remove ex-T1 == T2
>> >>> from the timing wheel
>> >>> 21. Thread A erl_cancel_timer (time.c) Unlock Timing Wheel
>> >>>
>> >>> See also attached diagram.
>> >>>
>> >>> Looks like one more mutex required, excluding release of
>> ErtsSmpPTimer
>> >>> memory block by timeout callback if cancel request was issued for the
>> >>> timer and vise versa. The two point of control - cancel timer and
>> timer
>> >>> expiration should not interfere.
>> >>> This bug happens only in SMP mode since there additional timer
>> control
>> >>> structure ErtsSmpPTimer is used between emulator and timing wheel.
>> >>>
>> >>> Mikael Pettersson wrote:
>> >>>> Dmitriy Kargapolov writes:
>> >>>> > > When running erl with -smp +S 2 option, sometimes process gets
>> >>>> stuck in > timer:sleep/1.
>> >>>> > Process code looks like:
>> >>>> > > some_receiver(State) ->
>> >>>> > NewState = receive
>> >>>> > % legal packet
>> >>>> > {some_keyword, Address, Port, Packet} ->
>> >>>> > State1 = handle_packet(Address, Port, Packet,
>> State),
>> >>>> > timer:sleep(get_loop_delay()),
>> >>>> > State1;
>> >>>> > % unknown message
>> >>>> > _ ->
>> >>>> > State
>> >>>> > end,
>> >>>> > some_receiver(NewState).
>> >>>> > > Delay value varies in range 1..999
>> >>>> > > Since timer:sleep/1 implemented as:
>> >>>> > sleep(T) ->
>> >>>> > receive
>> >>>> > after T -> ok
>> >>>> > end.
>> >>>> > it seems to be problem with "after" in smp implementation in
>> R11B-0
>> >>>> > > I don't have more details yet but will continue testing.
>> >>>> > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
>> >>>>
>> >>>> Interesting. Please send us a small standalone module that exhibits
>> >>>> the bug, and I'll see if I can reproduce it.
>> >>>>
>> >>>> /Mikael
>> >>>>
>> >>>
>> ------------------------------------------------------------------------
>> >>>
>> >>> _______________________________________________
>> >>> erlang-questions mailing list
>> >>> erlang-questions@REDACTED
>> >>> http://www.erlang.org/mailman/listinfo/erlang-questions
>> > _______________________________________________
>> > erlang-bugs mailing list
>> > erlang-bugs@REDACTED
>> > http://www.erlang.org/mailman/listinfo/erlang-bugs
>> >
>>
>>
>>
>>
>>
>> --- otp_src_R11B-2/erts/emulator/beam/utils.c 2006-11-06
>> 14:51:50.000000000 +0100
>> +++ otp_src_R11B-2.ptimer_patch/erts/emulator/beam/utils.c
>> 2006-12-27 18:11:44.772758000 +0100
>> @@ -2999,15 +2999,16 @@
>> static void
>> ptimer_timeout(ErtsSmpPTimer *ptimer)
>> {
>> - if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>> if (is_internal_pid(ptimer->timer.id)) {
>> Process *p;
>> - p = erts_pid2proc(NULL,
>> - 0,
>> - ptimer->timer.id,
>> - ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS);
>> + p = erts_pid2proc_opt(NULL,
>> + 0,
>> + ptimer->timer.id,
>> +
>> ERTS_PROC_LOCK_MAIN|ERTS_PROC_LOCK_STATUS,
>> + ERTS_P2P_FLG_ALLOW_OTHER_X);
>> if (p) {
>> - if (!(ptimer->timer.flags & ERTS_PTMR_FLG_CANCELLED)) {
>> + if (!p->is_exiting
>> + && !(ptimer->timer.flags &
>> ERTS_PTMR_FLG_CANCELLED)) {
>> ASSERT(*ptimer->timer.timer_ref == ptimer);
>> *ptimer->timer.timer_ref = NULL;
>> (*ptimer->timer.timeout_func)(p);
>> @@ -3028,7 +3029,6 @@
>> erts_smp_io_unlock();
>> }
>> }
>> - }
>> free_ptimer(ptimer);
>> }
>>
>>
>>
>>
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-bugs
>>
>>
>>
>
More information about the erlang-questions
mailing list