[erlang-bugs] R16B01's monitor delivery is broken?
Scott Lystig Fritchie
fritchie@REDACTED
Tue Jul 23 22:44:33 CEST 2013
Hi, everyone. Hope your summer vacations are going well. I have some
bad news for Rickard, at least.
SHA: e794251f8e54d6697e1bcc360471fd76b20c7748
Author: Rickard Green <rickard@REDACTED>
Date: Thu May 30 2013 07:56:31 GMT-0500 (CDT)
Subject: Merge branch 'rickard/ptab-id-alloc/OTP-11077' into maint
Parent: 22685099ace9802016bf6203c525702084717d72
Parent: 5c039a1fb4979314912dc3af6626d8d7a1c73993
Merge branch 'rickard/ptab-id-alloc/OTP-11077' into maint
* rickard/ptab-id-alloc/OTP-11077:
Introduce a better id allocation algorithm for PTabs
This commit appears to break monitor delivery? And it may or may not be
causing processes to die for reasons that we cannot see or understand.
Run with R15B03-1, the example code in test6.erl is merely slow:
https://gist.github.com/jtuple/aa4830a0ff0a94f69484/raw/02adc518e225f263a7e25d339ec7200ef2dda491/test6.erl
On my 4 core/8 HT core MacBook Pro, R15B03-1 cannot go above 200% CPU
utilization, and the execution time is correspondingly slooow. But it
appears to work correctly.
erl -eval '[begin io:format("Iteration ~p at ~p\n", [X,time()]), test6:go() end || X <- lists:seq(1, 240)].'
When run with R16B, it's *much* faster. CPU utilization above 750%
confirms that it's going faster. And it appears to work correctly.
However, when run with R16B01, we see non-deterministic hangs on both OS
X and various Linux platforms. CPU consumption by the "beam.smp"
process drops to 0, and the next cycle of the list comprehension never
starts.
Thanks to the magic of Git, it's pretty clear that the commit above is
broken. The commit before it appears to work well (i.e., does not
hang).
SHA: 22685099ace9802016bf6203c525702084717d72
Author: Anders Svensson <anders@REDACTED>
Date: Wed May 29 2013 11:46:10 GMT-0500 (CDT)
Subject: Merge branch 'anders/diameter/watchdog_function_clause/OTP-11115' into maint
Using R16B01 together with the "+P legacy" flag does not hang. But this
problem has given us at Basho enough ... caution ... that we will be
much more cautious about moving our app packaging from R15B* to R16B*.
Several seconds after CPU consumption drops to 0%, then I trigger the
creation of a "erl_crash.dump" file using erlang:halt("bummer"). If I
look at that file, then the process "Spawned as: test6:do_work2/0" says
that there are active unidirectional links (i.e., monitors), but there
is one process on that list that does not have a corresponding
"=proc:<pid.number.here>" entry in the dump ... which strongly suggests
to me that the process is dead. Using DTrace, I've been able to
establish that the dead process is indeed alive at one time and has been
scheduled & descheduled at least once. So there are really two
mysteries:
1. Why is one of the test6:indirect_proxy/1 processes dying
unexpectedly? (The monitor doesn't fire, SASL isn't logging any errors,
etc.)
2. Why isn't a monitor message being delivered?
Many thanks to Joe Blomstedt, Evan Vigil-McClanahan, Andrew Thompson,
Steve Vinoski, and Sean Cribbs for their sleuthing work.
-Scott
--- snip --- snip --- snip --- snip --- snip ---
R15B03 lock count analysis, FWIW:
lock id #tries #collisions collisions [%] time [us] duration [%]
----- --- ------- ------------ --------------- ---------- -------------
proc_tab 1 1280032 1266133 98.9142 60642804 557.0583
run_queue 8 3617608 12874 0.3559 261722 2.4042
sys_tracers 1 1280042 6445 0.5035 19365 0.1779
pix_lock 256 4480284 1213 0.0271 9777 0.0898
timeofday 1 709955 1187 0.1672 3216 0.0295
[......]
--- snip --- snip --- snip --- snip --- snip ---
=proc:<0.29950.154>
State: Waiting
Spawned as: test6:do_work2/0
Spawned by: <0.48.0>
Started: Tue Jul 23 04:50:54 2013
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [{from,<0.48.0>,#Ref<0.0.19.96773>}, {to,<0.32497.154>,#Ref<0.0.19.96797>}, {to,<0.1184.155>,#Ref<0.0.19.96796>}, {to,<0.31361.154>,#Ref<0.0.19.96799>}, {to,<0.32019.154>,#Ref<0.0.19.96801>}, {to,<0.32501.154>,#Ref<0.0.19.96800>}, {to,<0.1352.155>,#Ref<0.0.19.96803>}, {to,<0.32415.154>,#Ref<0.0.19.96805>}, {to,<0.504.155>,#Ref<0.0.19.96804>}, {to,<0.87.155>,#Ref<0.0.19.96802>}, {to,<0.776.155>,#Ref<0.0.19.96798>}]
Reductions: 45
Stack+heap: 233
OldHeap: 0
Heap unused: 155
OldHeap unused: 0
Memory: 3472
Program counter: 0x000000001e1504d0 (test6:do_work2/0 + 184)
CP: 0x0000000000000000 (invalid)
arity = 0
More information about the erlang-bugs
mailing list