[erlang-questions] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]
Rickard Green
rickard.s.green@REDACTED
Sun Dec 24 01:36:25 CET 2006
Thanks for your detailed bug report. I'll look at this as soon as possible.
BR,
Rickard Green, Erlang/OTP
Serge Aleynikov wrote:
> Additionally, I should say that we've been able to reproduce this bug on
> several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and 64bit) in
> R11B-0, R11B-1 and R11B-2. The bug (what appears to be a race
> condition) is seen only if the emulator is started in the SMP mode and
> results in the following construct blocking infinitely in the context of
> some Erlang process handing a message dispatching function:
>
> receive
> after N -> % Where N is between 1 and 999
> ok
> end.
>
> It happens when all the CPUs in SMP mode are over 75% loaded. The bug
> doesn't happen immediately after starting a release, but after a period
> of 5 min to 3 hours, which makes it pretty hard to diagnose. The
> tracing method that we initially tried to use was to include printf
> statements in the emulator to stderr. However, this prevented the bug
> from showing up. Further it was changed to using SysV message queue to
> communicate trace to an external process that dumped the trace to a
> file. This allowed to gain further understanding of the problem, but as
> Dmitry indicated any attempt to reduce the code to a minimal example
> made the problem disappear.
>
> The emulator code is quite involved, but hopefully someone in the OTP
> team could come up with a recommendation of how/where to put a missing
> synchronization. If needed we can arrange for a remote SSH login to the
> system(s) where the problem is reproducible.
>
> Regards,
>
> Serge
>
> Dmitriy Kargapolov wrote:
>> Unfortunately I can not create standalone test for this bug, even when I
>> became much more close to understanding the effect.
>> This bug appears only in highly loaded system.
>>
>> Recently I did manage to trace some points in the code and see at least
>> one scenario for the race condition bug.
>>
>> 1. Thread A erl_set_timer (time.c) Lock Timing Wheel
>> 2. Thread A insert_timer (time.c) Insert Timer T1
>> 3. Thread A erl_set_timer (time.c) Unlock Timing Wheel
>> 4. Thread B bump_timer_internal (time.c) Lock Timing Wheel
>> 5. Thread A cancel_timer (erl_process.c) Cancel timer T1
>> 6. Thread B bump_timer_internal (time.c) Build list of Expired
>> Timers
>> 7. Thread A erl_cancel_timer (time.c) Cancel timer T1:
>> Waiting for Timing Wheel Lock
>> 8. Thread B bump_timer_internal (time.c) Unlock Timing Wheel
>> 9. Thread C set_timer (erl_process.c) New Timeout Request (T2)
>> 10. Thread B bump_timer_internal (time.c) Call Expired Timers
>> Callbacks
>> 11. Thread B free_ptimer (utils.c) Timer T1 callback
>> invokes free_ptimer()
>> 12. Thread C erts_create_smp_ptimer (utils.c) Create Timer
>> ErtsSmpPTimer for T2
>> 13. Thread B free_ptimer (utils.c) Free ErtsSmpPTimer
>> memory block
>> 14. Thread C erts_create_smp_ptimer (utils.c) Allocate ErtsSmpPTimer
>> for T2, block reused!
>> 15. Thread C erl_set_timer (time.c) erl_set_timer invoked
>> for T2
>> 16. Thread C erl_set_timer (time.c) Lock Timing Wheel
>> 17. Thread C insert_timer (time.c) Insert Timer T2
>> 18. Thread C erl_set_timer (time.c) Unlock Timing Wheel
>> 19. Thread A erl_cancel_timer (time.c) Lock Timing Wheel
>> 20. Thread A erl_cancel_timer (time.c) Remove ex-T1 == T2
>> from the timing wheel
>> 21. Thread A erl_cancel_timer (time.c) Unlock Timing Wheel
>>
>> See also attached diagram.
>>
>> Looks like one more mutex required, excluding release of ErtsSmpPTimer
>> memory block by timeout callback if cancel request was issued for the
>> timer and vise versa. The two point of control - cancel timer and timer
>> expiration should not interfere.
>> This bug happens only in SMP mode since there additional timer control
>> structure ErtsSmpPTimer is used between emulator and timing wheel.
>>
>> Mikael Pettersson wrote:
>>> Dmitriy Kargapolov writes:
>>> > > When running erl with -smp +S 2 option, sometimes process gets
>>> stuck in > timer:sleep/1.
>>> > Process code looks like:
>>> > > some_receiver(State) ->
>>> > NewState = receive
>>> > % legal packet
>>> > {some_keyword, Address, Port, Packet} ->
>>> > State1 = handle_packet(Address, Port, Packet, State),
>>> > timer:sleep(get_loop_delay()),
>>> > State1;
>>> > % unknown message
>>> > _ ->
>>> > State
>>> > end,
>>> > some_receiver(NewState).
>>> > > Delay value varies in range 1..999
>>> > > Since timer:sleep/1 implemented as:
>>> > sleep(T) ->
>>> > receive
>>> > after T -> ok
>>> > end.
>>> > it seems to be problem with "after" in smp implementation in R11B-0
>>> > > I don't have more details yet but will continue testing.
>>> > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
>>>
>>> Interesting. Please send us a small standalone module that exhibits
>>> the bug, and I'll see if I can reproduce it.
>>>
>>> /Mikael
>>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-questions
>
More information about the erlang-questions
mailing list