<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 2013-08-01 19:59, Björn-Egil
Dahlberg wrote:<br>
</div>
<blockquote cite="mid:51FAA1F7.8000704@erlang.org" type="cite">
<div class="moz-cite-prefix"><br>
We have confirmed that this problem indeed exists and we think
we understand what is happening.<br>
<br>
The problem has a low probability of occurring, though it is
obviously reproducible, and it is pretty serious when it does occur.<br>
<br>
I won't go into too much detail, but the uniqueness of the
process identifier can be compromised, i.e. it will not be
unique. In essence, a process might get the identifier of an
already terminated process (or of a living one, though I
haven't confirmed that). The mapping is then overwritten, and
inspecting this identifier makes the process look dead. Signals or
messages will either not be sent, since the target is "dead", or be
sent to an unsuspecting (wrong) process. The mapping between ids and
process pointers has become inconsistent. It's a bit more complicated
than that, but in a nutshell that's what's happening.<br>
<br>
What is needed for this to occur? A wrapping of the entire
"free-list-ring" of identifiers (its size is the max number of
processes) while one thread is in the middle of doing an atomic read,
some shifting and masking, and then a write to create an identifier.
*Highly unlikely*, but definitely a race. I.e. while one thread
is doing a read, shift/mask, and write to memory, the other
threads have to create and terminate 262144 processes (or
whatever the limit is set to, but that is the default).<br>
</div>
</blockquote>
I think I simplified this explanation too much. The race
occurs when a process is deleted and writes to the free-list, and a
new process, created 262144 "generations/spawns" after
the deleted process, reads from the free-list in the middle of the
terminating process's read, shift/mask, write sequence. Anyway,
details... it's a race.<br>
<br>
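To make the interleaving concrete, here is a deliberately simplified
toy model in Python. It is not the ERTS ptab code: the IdRing class,
its method names, the ring size of 8 (standing in for 262144) and the
"add the ring size to bump the generation" rule are all assumptions
made up for illustration. The terminating process claims its free-list
slot, stalls before writing, and the other threads spawn and terminate
a full ring's worth of processes in between.<br>
<br>

```python
RING_SIZE = 8  # hypothetical stand-in for the max-process limit (262144)


class IdRing:
    """Toy free-list ring; ids are handed out and returned in FIFO order."""

    def __init__(self, size):
        self.size = size
        self.slots = list(range(size))  # every id starts out free
        self.head = 0      # next position to allocate from
        self.tail = size   # next position to free into

    def alloc(self):
        """Spawn: take whatever id sits in the next slot of the ring."""
        pos = self.head
        self.head += 1
        return self.slots[pos % self.size]

    def reserve_free_slot(self):
        """Step 1 of a release: claim a position in the ring."""
        pos = self.tail
        self.tail += 1
        return pos % self.size

    def publish_free_id(self, slot, new_id):
        """Step 2 of a release: write the recycled id into the claimed slot."""
        self.slots[slot] = new_id

    def free(self, new_id):
        """An uninterrupted release: both steps back to back."""
        self.publish_free_id(self.reserve_free_slot(), new_id)


def simulate(stall):
    """Return the ids handed out; stall=True delays the write of step 2."""
    ring = IdRing(RING_SIZE)
    first = ring.alloc()
    handed_out = [first]

    # The terminating process claims its slot and computes the recycled id.
    slot = ring.reserve_free_slot()
    recycled = first + RING_SIZE  # same slot index, next generation
    if not stall:
        ring.publish_free_id(slot, recycled)

    # Meanwhile other threads spawn and terminate a full ring of processes.
    for _ in range(RING_SIZE):
        pid = ring.alloc()
        handed_out.append(pid)
        ring.free(pid + RING_SIZE)

    if stall:
        ring.publish_free_id(slot, recycled)  # the write lands too late
    return handed_out


def has_duplicates(ids):
    return len(set(ids)) != len(ids)
```

With stall=False all nine ids are distinct. With stall=True the last
spawn reads the stale slot and gets id 0 again, i.e. the id of the
process that is still in the middle of terminating.<br>
<br>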
<blockquote cite="mid:51FAA1F7.8000704@erlang.org" type="cite">
<div class="moz-cite-prefix"> <br>
If the thread is scheduled out by the OS, or a hyperthread
switch occurs because of a mem-stall (we're dealing with
membarriers here after all, so it might be a thing), between the
read and the write, the likelihood of an incident increases. Lowering
the max-process-limit in the system also increases the
likelihood.<br>
<br>
We think we have a solution for this, and initial tests show no
evidence of the uniqueness problem after the fix. I think we will
have a fix out in maint next week.<br>
<br>
Using R16B01 together with the "+P legacy" flag is a workaround for
this issue. The legacy option uses the old id allocation and does not
suffer from this problem.<br>
<br>
Thank you, Scott, and the rest of you at Basho, for reporting
this.<br>
<br>
Regards,<br>
Björn-Egil<br>
<br>
<br>
On 2013-07-23 23:19, Björn-Egil Dahlberg wrote:<br>
</div>
<blockquote
cite="mid:CAMjYFoOeWYOoO1w4Wx11td5FrOMp5xssO-nZHihCvphix6z_Aw@mail.gmail.com"
type="cite">
<div dir="ltr">True, that seems suspicious.
<div><br>
</div>
<div>Rickard's vacation is going great, I think. Last I
heard from him, he was diving around Öland (literally
"island-land") in south-eastern Sweden. It will be a few
weeks before he's back.</div>
<div><br>
</div>
<div>In the meantime it is fairly lonely here at OTP; today
there were two of us at the office, and there is a lot of
stuff to do. I will have a quick look at it and verify, but
will probably let Rickard deal with it when he comes back.</div>
<div><br>
</div>
<div>Thanks for a great summary and drill down of the problem!</div>
<div><br>
</div>
<div>Regards,</div>
<div>Björn-Egil</div>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">2013/7/23 Scott Lystig Fritchie <span
dir="ltr">&lt;<a moz-do-not-send="true"
href="mailto:fritchie@snookles.com" target="_blank">fritchie@snookles.com</a>&gt;</span><br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,
everyone. Hope your summer vacations are going well. I
have some<br>
bad news for Rickard, at least.<br>
<br>
SHA: e794251f8e54d6697e1bcc360471fd76b20c7748<br>
Author: Rickard Green &lt;<a
moz-do-not-send="true" href="mailto:rickard@erlang.org">rickard@erlang.org</a>&gt;<br>
Date: Thu May 30 2013 07:56:31 GMT-0500 (CDT)<br>
Subject: Merge branch
'rickard/ptab-id-alloc/OTP-11077' into maint<br>
Parent: 22685099ace9802016bf6203c525702084717d72<br>
Parent: 5c039a1fb4979314912dc3af6626d8d7a1c73993<br>
Merge branch 'rickard/ptab-id-alloc/OTP-11077' into
maint<br>
<br>
* rickard/ptab-id-alloc/OTP-11077:<br>
Introduce a better id allocation algorithm for PTabs<br>
<br>
This commit appears to break monitor delivery? And it may
or may not be<br>
causing processes to die for reasons that we cannot see or
understand.<br>
<br>
Run with R15B03-1, the example code in test6.erl is merely
slow:<br>
<br>
<a moz-do-not-send="true"
href="https://gist.github.com/jtuple/aa4830a0ff0a94f69484/raw/02adc518e225f263a7e25d339ec7200ef2dda491/test6.erl"
target="_blank">https://gist.github.com/jtuple/aa4830a0ff0a94f69484/raw/02adc518e225f263a7e25d339ec7200ef2dda491/test6.erl</a><br>
<br>
On my 4 core/8 HT core MacBook Pro, R15B03-1 cannot go
above 200% CPU<br>
utilization, and the execution time is correspondingly
slooow. But it<br>
appears to work correctly.<br>
<br>
erl -eval '[begin io:format("Iteration ~p at ~p\n",
[X,time()]), test6:go() end || X &lt;- lists:seq(1,
240)].'<br>
<br>
When run with R16B, it's *much* faster. CPU utilization
above 750%<br>
confirms that it's going faster. And it appears to work
correctly.<br>
<br>
However, when run with R16B01, we see non-deterministic
hangs on both OS<br>
X and various Linux platforms. CPU consumption by the
"beam.smp"<br>
process drops to 0, and the next cycle of the list
comprehension never<br>
starts.<br>
<br>
Thanks to the magic of Git, it's pretty clear that the
commit above is<br>
broken. The commit before it appears to work well (i.e.,
does not<br>
hang).<br>
<br>
SHA: 22685099ace9802016bf6203c525702084717d72<br>
Author: Anders Svensson &lt;<a
moz-do-not-send="true" href="mailto:anders@erlang.org">anders@erlang.org</a>&gt;<br>
Date: Wed May 29 2013 11:46:10 GMT-0500 (CDT)<br>
Subject: Merge branch
'anders/diameter/watchdog_function_clause/OTP-11115' into
maint<br>
<br>
Using R16B01 together with the "+P legacy" flag does not
hang. But this<br>
problem has given us at Basho enough ... caution ... that
we will be<br>
much more cautious about moving our app packaging from
R15B* to R16B*.<br>
<br>
Several seconds after CPU consumption drops to 0%, I
trigger the<br>
creation of an "erl_crash.dump" file using
erlang:halt("bummer"). If I<br>
look at that file, the process "Spawned as:
test6:do_work2/0" says<br>
that there are active unidirectional links (i.e.,
monitors), but there<br>
is one process on that list that does not have a
corresponding<br>
"=proc:&lt;pid.number.here&gt;" entry in the dump ...
which strongly suggests<br>
to me that the process is dead. Using DTrace, I've been
able to<br>
establish that the dead process was indeed alive at one
time and has been<br>
scheduled &amp; descheduled at least once. So there are
really two<br>
mysteries:<br>
<br>
1. Why is one of the test6:indirect_proxy/1 processes
dying<br>
unexpectedly? (The monitor doesn't fire, SASL isn't
logging any errors,<br>
etc.)<br>
<br>
2. Why isn't a monitor message being delivered?<br>
<br>
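One way to spot such dangling monitor targets in a dump is sketched
below in Python. This is only a rough sketch: it assumes that each
process section starts with a "=proc:" header line and that the
"Link list:" entry sits on a single line with monitor entries of the
form {to,Pid,Ref}, as in the excerpt in this mail. The function name
find_missing_monitor_targets is made up for illustration.<br>
<br>

```python
import re
from collections import defaultdict

# Assumed formats, taken from the dump excerpt in this thread:
#   section header:  =proc:PID
#   monitor entry:   {to,PID,#Ref...} on the "Link list:" line
PROC_HEADER = re.compile(r"^=proc:(\S+)")
MONITOR_TO = re.compile(r"\{to,([^,]+),")


def find_missing_monitor_targets(dump_text):
    """Map each process to the monitored pids that have no =proc: section."""
    known = set()
    monitors = defaultdict(list)
    current = None
    for line in dump_text.splitlines():
        header = PROC_HEADER.match(line)
        if header:
            current = header.group(1)
            known.add(current)
        elif line.startswith("="):
            current = None  # some other section of the dump
        elif current and line.startswith("Link list:"):
            monitors[current].extend(MONITOR_TO.findall(line))

    missing = {}
    for pid, targets in monitors.items():
        gone = [t for t in targets if t not in known]
        if gone:
            missing[pid] = gone
    return missing
```

Calling it on the text of an "erl_crash.dump" file returns a dict
from each monitoring process to the targets that have no section of
their own; an empty dict means every monitored pid is still present.<br>
<br>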
Many thanks to Joe Blomstedt, Evan Vigil-McClanahan,
Andrew Thompson,<br>
Steve Vinoski, and Sean Cribbs for their sleuthing work.<br>
<br>
-Scott<br>
<br>
--- snip --- snip --- snip --- snip --- snip ---<br>
<br>
R15B03 lock count analysis, FWIW:<br>
<br>
<pre>lock          id   #tries  #collisions  collisions [%]  time [us]  duration [%]
-----------  ---  -------  -----------  --------------  ---------  ------------
proc_tab       1  1280032      1266133         98.9142   60642804      557.0583
run_queue      8  3617608        12874          0.3559     261722        2.4042
sys_tracers    1  1280042         6445          0.5035      19365        0.1779
pix_lock     256  4480284         1213          0.0271       9777        0.0898
timeofday      1   709955         1187          0.1672       3216        0.0295
[......]
</pre>
<br>
--- snip --- snip --- snip --- snip --- snip ---<br>
<br>
=proc:&lt;0.29950.154&gt;<br>
State: Waiting<br>
Spawned as: test6:do_work2/0<br>
Spawned by: &lt;0.48.0&gt;<br>
Started: Tue Jul 23 04:50:54 2013<br>
Message queue length: 0<br>
Number of heap fragments: 0<br>
Heap fragment data: 0<br>
Link list:
[{from,&lt;0.48.0&gt;,#Ref&lt;0.0.19.96773&gt;},
{to,&lt;0.32497.154&gt;,#Ref&lt;0.0.19.96797&gt;},
{to,&lt;0.1184.155&gt;,#Ref&lt;0.0.19.96796&gt;},
{to,&lt;0.31361.154&gt;,#Ref&lt;0.0.19.96799&gt;},
{to,&lt;0.32019.154&gt;,#Ref&lt;0.0.19.96801&gt;},
{to,&lt;0.32501.154&gt;,#Ref&lt;0.0.19.96800&gt;},
{to,&lt;0.1352.155&gt;,#Ref&lt;0.0.19.96803&gt;},
{to,&lt;0.32415.154&gt;,#Ref&lt;0.0.19.96805&gt;},
{to,&lt;0.504.155&gt;,#Ref&lt;0.0.19.96804&gt;},
{to,&lt;0.87.155&gt;,#Ref&lt;0.0.19.96802&gt;},
{to,&lt;0.776.155&gt;,#Ref&lt;0.0.19.96798&gt;}]<br>
Reductions: 45<br>
Stack+heap: 233<br>
OldHeap: 0<br>
Heap unused: 155<br>
OldHeap unused: 0<br>
Memory: 3472<br>
Program counter: 0x000000001e1504d0 (test6:do_work2/0 +
184)<br>
CP: 0x0000000000000000 (invalid)<br>
arity = 0<br>
_______________________________________________<br>
erlang-bugs mailing list<br>
<a moz-do-not-send="true"
href="mailto:erlang-bugs@erlang.org">erlang-bugs@erlang.org</a><br>
<a moz-do-not-send="true"
href="http://erlang.org/mailman/listinfo/erlang-bugs"
target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>
</blockquote>
</div>
<br>
</div>
<br>
</blockquote>
<br>
<br>
</blockquote>
<br>
</body>
</html>