[erlang-questions] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]

Dmitriy Kargapolov dmitry.kargapolov@REDACTED
Fri Dec 22 20:52:27 CET 2006

Unfortunately I can not create standalone test for this bug, even when I 
became much more close to understanding the effect.
This bug appears only in highly loaded system.

Recently I did manage to trace some points in the code and see at least 
one scenario for the race condition bug.

  1. Thread A    erl_set_timer (time.c)            Lock Timing Wheel
  2. Thread A    insert_timer (time.c)             Insert Timer T1
  3. Thread A    erl_set_timer (time.c)            Unlock Timing Wheel
  4. Thread B    bump_timer_internal (time.c)      Lock Timing Wheel
  5. Thread A    cancel_timer (erl_process.c)      Cancel timer T1
  6. Thread B    bump_timer_internal (time.c)      Build list of Expired 
  7. Thread A    erl_cancel_timer (time.c)         Cancel timer T1: 
Waiting for Timing Wheel Lock
  8. Thread B    bump_timer_internal (time.c)      Unlock Timing Wheel
  9. Thread C    set_timer (erl_process.c)         New Timeout Request (T2)
10. Thread B    bump_timer_internal (time.c)      Call Expired Timers 
11. Thread B    free_ptimer (utils.c)             Timer T1 callback 
invokes free_ptimer()
12. Thread C    erts_create_smp_ptimer (utils.c)  Create Timer 
ErtsSmpPTimer for T2
13. Thread B    free_ptimer (utils.c)             Free ErtsSmpPTimer 
memory block
14. Thread C    erts_create_smp_ptimer (utils.c)  Allocate ErtsSmpPTimer 
for T2, block reused!
15. Thread C    erl_set_timer (time.c)            erl_set_timer invoked 
for T2
16. Thread C    erl_set_timer (time.c)            Lock Timing Wheel
17. Thread C    insert_timer (time.c)             Insert Timer T2
18. Thread C    erl_set_timer (time.c)            Unlock Timing Wheel
19. Thread A    erl_cancel_timer (time.c)         Lock Timing Wheel
20. Thread A    erl_cancel_timer (time.c)         Remove ex-T1 == T2 
from the timing wheel
21. Thread A    erl_cancel_timer (time.c)         Unlock Timing Wheel

See also attached diagram.

Looks like one more mutex required, excluding release of ErtsSmpPTimer 
memory block by timeout callback if cancel request was issued for the 
timer and vise versa. The two point of control - cancel timer and timer 
expiration should not interfere.
This bug happens only in SMP mode since there additional timer control 
structure ErtsSmpPTimer is used between emulator and timing wheel.

Mikael Pettersson wrote:
> Dmitriy Kargapolov writes:
>  > 
>  > When running erl with -smp +S 2 option, sometimes process gets stuck in 
>  > timer:sleep/1.
>  > Process code looks like:
>  > 
>  > some_receiver(State) ->
>  >      NewState = receive
>  >          % legal packet
>  >          {some_keyword, Address, Port, Packet} ->
>  >              State1 = handle_packet(Address, Port, Packet, State),
>  >              timer:sleep(get_loop_delay()),
>  >              State1;
>  >          % unknown message
>  >          _ ->
>  >              State
>  >      end,
>  >      some_receiver(NewState).
>  > 
>  > Delay value varies in range 1..999
>  > 
>  > Since timer:sleep/1 implemented as:
>  > sleep(T) ->
>  >      receive
>  >      after T -> ok
>  >      end.
>  > it seems to be problem with "after" in smp implementation in R11B-0
>  > 
>  > I don't have more details yet but will continue testing.
>  > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
> Interesting. Please send us a small standalone module that exhibits
> the bug, and I'll see if I can reproduce it.
> /Mikael
-------------- next part --------------
A non-text attachment was scrubbed...
Name: RaceCond.pdf
Type: application/pdf
Size: 16115 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20061222/b1790c40/attachment.pdf>

More information about the erlang-questions mailing list