[erlang-bugs] R16B01's monitor delivery is broken?
Björn-Egil Dahlberg
wallentin.dahlberg@REDACTED
Tue Jul 23 23:19:26 CEST 2013
True, that seems suspicious.
The vacation for Rickard is going great I think. Last I heard from him, he
was diving round Öland (literally "island-land") in south-eastern sweden.
It will be a few weeks before he's back.
In the meanwhile it is fairly lonely here at OTP, today we were two persons
at the office, and there is a lot of stuff to do. I will have a quick look
at it and verify but will probably let Rickard deal with it when he comes
back.
Thanks for a great summary and drill down of the problem!
Regards,
Björn-Egil
2013/7/23 Scott Lystig Fritchie <fritchie@REDACTED>
> Hi, everyone. Hope your summer vacations are going well. I have some
> bad news for Rickard, at least.
>
> SHA: e794251f8e54d6697e1bcc360471fd76b20c7748
> Author: Rickard Green <rickard@REDACTED>
> Date: Thu May 30 2013 07:56:31 GMT-0500 (CDT)
> Subject: Merge branch 'rickard/ptab-id-alloc/OTP-11077' into maint
> Parent: 22685099ace9802016bf6203c525702084717d72
> Parent: 5c039a1fb4979314912dc3af6626d8d7a1c73993
> Merge branch 'rickard/ptab-id-alloc/OTP-11077' into maint
>
> * rickard/ptab-id-alloc/OTP-11077:
> Introduce a better id allocation algorithm for PTabs
>
> This commit appears to break monitor delivery? And it may or may not be
> causing processes to die for reasons that we cannot see or understand.
>
> Run with R15B03-1, the example code in test6.erl is merely slow:
>
>
> https://gist.github.com/jtuple/aa4830a0ff0a94f69484/raw/02adc518e225f263a7e25d339ec7200ef2dda491/test6.erl
>
> On my 4 core/8 HT core MacBook Pro, R15B03-1 cannot go above 200% CPU
> utilization, and the execution time is correspondingly slooow. But it
> appears to work correctly.
>
> erl -eval '[begin io:format("Iteration ~p at ~p\n", [X,time()]),
> test6:go() end || X <- lists:seq(1, 240)].'
>
> When run with R16B, it's *much* faster. CPU utilization above 750%
> confirms that it's going faster. And it appears to work correctly.
>
> However, when run with R16B01, we see non-deterministic hangs on both OS
> X and various Linux platforms. CPU consumption by the "beam.smp"
> process drops to 0, and the next cycle of the list comprehension never
> starts.
>
> Thanks to the magic of Git, it's pretty clear that the commit above is
> broken. The commit before it appears to work well (i.e., does not
> hang).
>
> SHA: 22685099ace9802016bf6203c525702084717d72
> Author: Anders Svensson <anders@REDACTED>
> Date: Wed May 29 2013 11:46:10 GMT-0500 (CDT)
> Subject: Merge branch
> 'anders/diameter/watchdog_function_clause/OTP-11115' into maint
>
> Using R16B01 together with the "+P legacy" flag does not hang. But this
> problem has given us at Basho enough ... caution ... that we will be
> much more cautious about moving our app packaging from R15B* to R16B*.
>
> Several seconds after CPU consumption drops to 0%, then I trigger the
> creation of a "erl_crash.dump" file using erlang:halt("bummer"). If I
> look at that file, then the process "Spawned as: test6:do_work2/0" says
> that there are active unidirectional links (i.e., monitors), but there
> is one process on that list that does not have a corresponding
> "=proc:<pid.number.here>" entry in the dump ... which strongly suggests
> to me that the process is dead. Using DTrace, I've been able to
> establish that the dead process is indeed alive at one time and has been
> scheduled & descheduled at least once. So there are really two
> mysteries:
>
> 1. Why is one of the test6:indirect_proxy/1 processes dying
> unexpectedly? (The monitor doesn't fire, SASL isn't logging any errors,
> etc.)
>
> 2. Why isn't a monitor message being delivered?
>
> Many thanks to Joe Blomstedt, Evan Vigil-McClanahan, Andrew Thompson,
> Steve Vinoski, and Sean Cribbs for their sleuthing work.
>
> -Scott
>
> --- snip --- snip --- snip --- snip --- snip ---
>
> R15B03 lock count analysis, FWIW:
>
> lock id #tries #collisions collisions [%] time [us]
> duration [%]
> ----- --- ------- ------------ --------------- ----------
> -------------
> proc_tab 1 1280032 1266133 98.9142 60642804
> 557.0583
> run_queue 8 3617608 12874 0.3559 261722
> 2.4042
> sys_tracers 1 1280042 6445 0.5035 19365
> 0.1779
> pix_lock 256 4480284 1213 0.0271 9777
> 0.0898
> timeofday 1 709955 1187 0.1672 3216
> 0.0295
> [......]
>
> --- snip --- snip --- snip --- snip --- snip ---
>
> =proc:<0.29950.154>
> State: Waiting
> Spawned as: test6:do_work2/0
> Spawned by: <0.48.0>
> Started: Tue Jul 23 04:50:54 2013
> Message queue length: 0
> Number of heap fragments: 0
> Heap fragment data: 0
> Link list: [{from,<0.48.0>,#Ref<0.0.19.96773>},
> {to,<0.32497.154>,#Ref<0.0.19.96797>},
> {to,<0.1184.155>,#Ref<0.0.19.96796>},
> {to,<0.31361.154>,#Ref<0.0.19.96799>},
> {to,<0.32019.154>,#Ref<0.0.19.96801>},
> {to,<0.32501.154>,#Ref<0.0.19.96800>},
> {to,<0.1352.155>,#Ref<0.0.19.96803>},
> {to,<0.32415.154>,#Ref<0.0.19.96805>}, {to,<0.504.155>,#Ref<0.0.19.96804>},
> {to,<0.87.155>,#Ref<0.0.19.96802>}, {to,<0.776.155>,#Ref<0.0.19.96798>}]
> Reductions: 45
> Stack+heap: 233
> OldHeap: 0
> Heap unused: 155
> OldHeap unused: 0
> Memory: 3472
> Program counter: 0x000000001e1504d0 (test6:do_work2/0 + 184)
> CP: 0x0000000000000000 (invalid)
> arity = 0
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20130723/ef8714e7/attachment.htm>
More information about the erlang-bugs
mailing list