[erlang-bugs] R16B01's monitor delivery is broken?

Björn-Egil Dahlberg wallentin.dahlberg@REDACTED
Tue Jul 23 23:19:26 CEST 2013


True, that seems suspicious.

The vacation for Rickard is going great I think. Last I heard from him, he
was diving round Öland (literally "island-land") in south-eastern sweden.
It will be a few weeks before he's back.

In the meanwhile it is fairly lonely here at OTP, today we were two persons
at the office, and there is a lot of stuff to do. I will have a quick look
at it and verify but will probably let Rickard deal with it when he comes
back.

Thanks for a great summary and drill down of the problem!

Regards,
Björn-Egil


2013/7/23 Scott Lystig Fritchie <fritchie@REDACTED>

> Hi, everyone.  Hope your summer vacations are going well.  I have some
> bad news for Rickard, at least.
>
>     SHA:         e794251f8e54d6697e1bcc360471fd76b20c7748
>     Author:      Rickard Green <rickard@REDACTED>
>     Date:        Thu May 30 2013 07:56:31 GMT-0500 (CDT)
>     Subject: Merge branch 'rickard/ptab-id-alloc/OTP-11077' into maint
>     Parent:      22685099ace9802016bf6203c525702084717d72
>     Parent:      5c039a1fb4979314912dc3af6626d8d7a1c73993
>     Merge branch 'rickard/ptab-id-alloc/OTP-11077' into maint
>
>     * rickard/ptab-id-alloc/OTP-11077:
>       Introduce a better id allocation algorithm for PTabs
>
> This commit appears to break monitor delivery?  And it may or may not be
> causing processes to die for reasons that we cannot see or understand.
>
> Run with R15B03-1, the example code in test6.erl is merely slow:
>
>
> https://gist.github.com/jtuple/aa4830a0ff0a94f69484/raw/02adc518e225f263a7e25d339ec7200ef2dda491/test6.erl
>
> On my 4 core/8 HT core MacBook Pro, R15B03-1 cannot go above 200% CPU
> utilization, and the execution time is correspondingly slooow.  But it
> appears to work correctly.
>
>     erl -eval '[begin io:format("Iteration ~p at ~p\n", [X,time()]),
> test6:go() end || X <- lists:seq(1, 240)].'
>
> When run with R16B, it's *much* faster.  CPU utilization above 750%
> confirms that it's going faster.  And it appears to work correctly.
>
> However, when run with R16B01, we see non-deterministic hangs on both OS
> X and various Linux platforms.  CPU consumption by the "beam.smp"
> process drops to 0, and the next cycle of the list comprehension never
> starts.
>
> Thanks to the magic of Git, it's pretty clear that the commit above is
> broken.  The commit before it appears to work well (i.e., does not
> hang).
>
>     SHA:         22685099ace9802016bf6203c525702084717d72
>     Author:      Anders Svensson <anders@REDACTED>
>     Date:        Wed May 29 2013 11:46:10 GMT-0500 (CDT)
>     Subject: Merge branch
> 'anders/diameter/watchdog_function_clause/OTP-11115' into maint
>
> Using R16B01 together with the "+P legacy" flag does not hang.  But this
> problem has given us at Basho enough ... caution ... that we will be
> much more cautious about moving our app packaging from R15B* to R16B*.
>
> Several seconds after CPU consumption drops to 0%, then I trigger the
> creation of a "erl_crash.dump" file using erlang:halt("bummer").  If I
> look at that file, then the process "Spawned as: test6:do_work2/0" says
> that there are active unidirectional links (i.e., monitors), but there
> is one process on that list that does not have a corresponding
> "=proc:<pid.number.here>" entry in the dump ... which strongly suggests
> to me that the process is dead.  Using DTrace, I've been able to
> establish that the dead process is indeed alive at one time and has been
> scheduled & descheduled at least once.  So there are really two
> mysteries:
>
> 1. Why is one of the test6:indirect_proxy/1 processes dying
> unexpectedly?  (The monitor doesn't fire, SASL isn't logging any errors,
> etc.)
>
> 2. Why isn't a monitor message being delivered?
>
> Many thanks to Joe Blomstedt, Evan Vigil-McClanahan, Andrew Thompson,
> Steve Vinoski, and Sean Cribbs for their sleuthing work.
>
> -Scott
>
> --- snip --- snip --- snip --- snip --- snip ---
>
> R15B03 lock count analysis, FWIW:
>
>         lock     id  #tries  #collisions  collisions [%]  time [us]
>  duration [%]
>        -----    --- ------- ------------ --------------- ----------
> -------------
>     proc_tab      1 1280032      1266133         98.9142   60642804
>  557.0583
>    run_queue      8 3617608        12874          0.3559     261722
>  2.4042
>  sys_tracers      1 1280042         6445          0.5035      19365
>  0.1779
>     pix_lock    256 4480284         1213          0.0271       9777
>  0.0898
>    timeofday      1  709955         1187          0.1672       3216
>  0.0295
> [......]
>
> --- snip --- snip --- snip --- snip --- snip ---
>
> =proc:<0.29950.154>
> State: Waiting
> Spawned as: test6:do_work2/0
> Spawned by: <0.48.0>
> Started: Tue Jul 23 04:50:54 2013
> Message queue length: 0
> Number of heap fragments: 0
> Heap fragment data: 0
> Link list: [{from,<0.48.0>,#Ref<0.0.19.96773>},
> {to,<0.32497.154>,#Ref<0.0.19.96797>},
> {to,<0.1184.155>,#Ref<0.0.19.96796>},
> {to,<0.31361.154>,#Ref<0.0.19.96799>},
> {to,<0.32019.154>,#Ref<0.0.19.96801>},
> {to,<0.32501.154>,#Ref<0.0.19.96800>},
> {to,<0.1352.155>,#Ref<0.0.19.96803>},
> {to,<0.32415.154>,#Ref<0.0.19.96805>}, {to,<0.504.155>,#Ref<0.0.19.96804>},
> {to,<0.87.155>,#Ref<0.0.19.96802>}, {to,<0.776.155>,#Ref<0.0.19.96798>}]
> Reductions: 45
> Stack+heap: 233
> OldHeap: 0
> Heap unused: 155
> OldHeap unused: 0
> Memory: 3472
> Program counter: 0x000000001e1504d0 (test6:do_work2/0 + 184)
> CP: 0x0000000000000000 (invalid)
> arity = 0
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20130723/ef8714e7/attachment.htm>


More information about the erlang-bugs mailing list