[erlang-questions] [erlang-bugs] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]
Rickard Green
rickard.s.green@REDACTED
Wed Dec 27 18:54:29 CET 2006
The process lock plays an important role here. Unfortunately a faulty
optimization (blush) prevented the process lock from playing that role.
ptimer_timeout() has to acquire the process lock before looking at the
ptimer flags. I've attached a patch that should fix the problem.
$ tar -zxf otp_src_R11B-2.tar.gz
$ patch -p0 < ptimer.patch
patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'
Please, report to us whether or not the problem went away.
Great work Dmitriy and Serge! Many thanks!
BR,
Rickard Green, Erlang/OTP
Rickard Green wrote:
> Thanks for your detailed bug report. I'll look at this as soon as possible.
>
> BR,
> Rickard Green, Erlang/OTP
>
> Serge Aleynikov wrote:
>> Additionally, I should say that we've been able to reproduce this bug on
>> several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and 64bit) in
>> R11B-0, R11B-1 and R11B-2. The bug (what appears to be a race
>> condition) is seen only if the emulator is started in the SMP mode and
>> results in the following construct blocking infinitely in the context of
>> some Erlang process handing a message dispatching function:
>>
>> receive
>> after N -> % Where N is between 1 and 999
>> ok
>> end.
>>
>> It happens when all the CPUs in SMP mode are over 75% loaded. The bug
>> doesn't happen immediately after starting a release, but after a period
>> of 5 min to 3 hours, which makes it pretty hard to diagnose. The
>> tracing method that we initially tried to use was to include printf
>> statements in the emulator to stderr. However, this prevented the bug
>> from showing up. Further it was changed to using SysV message queue to
>> communicate trace to an external process that dumped the trace to a
>> file. This allowed to gain further understanding of the problem, but as
>> Dmitry indicated any attempt to reduce the code to a minimal example
>> made the problem disappear.
>>
>> The emulator code is quite involved, but hopefully someone in the OTP
>> team could come up with a recommendation of how/where to put a missing
>> synchronization. If needed we can arrange for a remote SSH login to the
>> system(s) where the problem is reproducible.
>>
>> Regards,
>>
>> Serge
>>
>> Dmitriy Kargapolov wrote:
>>> Unfortunately I can not create standalone test for this bug, even when I
>>> became much more close to understanding the effect.
>>> This bug appears only in highly loaded system.
>>>
>>> Recently I did manage to trace some points in the code and see at least
>>> one scenario for the race condition bug.
>>>
>>> 1. Thread A erl_set_timer (time.c) Lock Timing Wheel
>>> 2. Thread A insert_timer (time.c) Insert Timer T1
>>> 3. Thread A erl_set_timer (time.c) Unlock Timing Wheel
>>> 4. Thread B bump_timer_internal (time.c) Lock Timing Wheel
>>> 5. Thread A cancel_timer (erl_process.c) Cancel timer T1
>>> 6. Thread B bump_timer_internal (time.c) Build list of Expired
>>> Timers
>>> 7. Thread A erl_cancel_timer (time.c) Cancel timer T1:
>>> Waiting for Timing Wheel Lock
>>> 8. Thread B bump_timer_internal (time.c) Unlock Timing Wheel
>>> 9. Thread C set_timer (erl_process.c) New Timeout Request (T2)
>>> 10. Thread B bump_timer_internal (time.c) Call Expired Timers
>>> Callbacks
>>> 11. Thread B free_ptimer (utils.c) Timer T1 callback
>>> invokes free_ptimer()
>>> 12. Thread C erts_create_smp_ptimer (utils.c) Create Timer
>>> ErtsSmpPTimer for T2
>>> 13. Thread B free_ptimer (utils.c) Free ErtsSmpPTimer
>>> memory block
>>> 14. Thread C erts_create_smp_ptimer (utils.c) Allocate ErtsSmpPTimer
>>> for T2, block reused!
>>> 15. Thread C erl_set_timer (time.c) erl_set_timer invoked
>>> for T2
>>> 16. Thread C erl_set_timer (time.c) Lock Timing Wheel
>>> 17. Thread C insert_timer (time.c) Insert Timer T2
>>> 18. Thread C erl_set_timer (time.c) Unlock Timing Wheel
>>> 19. Thread A erl_cancel_timer (time.c) Lock Timing Wheel
>>> 20. Thread A erl_cancel_timer (time.c) Remove ex-T1 == T2
>>> from the timing wheel
>>> 21. Thread A erl_cancel_timer (time.c) Unlock Timing Wheel
>>>
>>> See also attached diagram.
>>>
>>> Looks like one more mutex required, excluding release of ErtsSmpPTimer
>>> memory block by timeout callback if cancel request was issued for the
>>> timer and vise versa. The two point of control - cancel timer and timer
>>> expiration should not interfere.
>>> This bug happens only in SMP mode since there additional timer control
>>> structure ErtsSmpPTimer is used between emulator and timing wheel.
>>>
>>> Mikael Pettersson wrote:
>>>> Dmitriy Kargapolov writes:
>>>> > > When running erl with -smp +S 2 option, sometimes process gets
>>>> stuck in > timer:sleep/1.
>>>> > Process code looks like:
>>>> > > some_receiver(State) ->
>>>> > NewState = receive
>>>> > % legal packet
>>>> > {some_keyword, Address, Port, Packet} ->
>>>> > State1 = handle_packet(Address, Port, Packet, State),
>>>> > timer:sleep(get_loop_delay()),
>>>> > State1;
>>>> > % unknown message
>>>> > _ ->
>>>> > State
>>>> > end,
>>>> > some_receiver(NewState).
>>>> > > Delay value varies in range 1..999
>>>> > > Since timer:sleep/1 implemented as:
>>>> > sleep(T) ->
>>>> > receive
>>>> > after T -> ok
>>>> > end.
>>>> > it seems to be problem with "after" in smp implementation in R11B-0
>>>> > > I don't have more details yet but will continue testing.
>>>> > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
>>>>
>>>> Interesting. Please send us a small standalone module that exhibits
>>>> the bug, and I'll see if I can reproduce it.
>>>>
>>>> /Mikael
>>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://www.erlang.org/mailman/listinfo/erlang-questions
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-bugs
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ptimer.patch
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20061227/1fd5a234/attachment.ksh>
More information about the erlang-questions
mailing list