[erlang-bugs] R16B01's monitor delivery is broken?
Björn-Egil Dahlberg
egil@REDACTED
Thu Aug 1 20:22:12 CEST 2013
On 2013-08-01 19:59, Björn-Egil Dahlberg wrote:
>
> We have confirmed that this problem indeed exists and we think we
> understand what is happening.
>
> The problem has a low probability of occurring, though it is obviously
> reproducible, and it is quite serious when it does occur.
>
> I won't go into too much detail, but the uniqueness of the process
> identifier can be compromised, i.e. it will not be unique. In essence,
> a process might get the identifier of an already terminated process (or
> of a still-living one, though I haven't confirmed that). The mapping is
> then overwritten, and by inspecting this identifier the process will
> look dead. Signals or messages will either not be sent, since the
> process appears "dead", or will be sent to an unsuspecting (wrong)
> process. The mappings between identifiers and process pointers have
> become inconsistent. It's a bit more complicated than that, but in a
> nutshell that's what's happening.
>
> What is needed for this to occur? A wrap of the entire
> "free-list-ring" of identifiers (its size is the max number of
> processes) while one thread is in the middle of doing an atomic read,
> some shifting and masking, and then a write to create an identifier.
> *Highly unlikely*, but definitely a race. I.e. while one thread is
> doing a read, shift/mask, and write to memory, the other threads have
> to create and terminate 262144 processes (or whatever the limit is set
> to; that is the default).
I think I simplified this explanation too much. The race occurs when a
terminating process writes to the free list, and a newly created process
262144 "generations/spawns" after the terminated one reads from the free
list in between the terminating process's read-shift/mask-write
sequence. Anyway, details... it's a race.
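To make the interleaving concrete, here is a deterministic toy model of
such a free-list ring, written in Python. It is purely illustrative: the
real allocator lives in the emulator's C code, and the ring layout, the
names (`alloc`, `dealloc_read`, `dealloc_write`), and the tiny ring size
are all invented for this sketch.

```python
# Toy model of the race: a free-list ring of process ids with a
# non-atomic read-then-write on termination. Invented names; not
# the actual emulator C code.
N = 4                       # ring size; the real default is 262144
ring = list(range(N))       # free ids, used as a circular buffer
head = 0                    # spawn reads here
tail = N                    # termination writes here

def alloc():
    """Spawn: take the next free id off the ring."""
    global head
    pid = ring[head % N]
    head += 1
    return pid

def dealloc_read():
    """Termination, first half: read and bump the tail index."""
    global tail
    pos = tail
    tail += 1
    return pos

def dealloc_write(pos, pid):
    """Termination, second half: write the freed id into the ring.
    If the ring wrapped between the two halves, this stale write
    lands in a cell that a later termination already refilled."""
    ring[pos % N] = pid

a = alloc()                             # a == 0
pos = dealloc_read()                    # `a` exits but stalls before
                                        # its write lands ...
b, c, d = alloc(), alloc(), alloc()     # ... while other schedulers
for p in (b, c, d):                     # run a full ring's worth of
    dealloc_write(dealloc_read(), p)    # spawn/exit cycles
e, f = alloc(), alloc()                 # e is handed dead a's id 0
dealloc_write(dealloc_read(), f)        # f exits, putting id 1 back
dealloc_write(pos, a)                   # stale write clobbers that 1

last = [alloc() for _ in range(3)][-1]
print(e, last)                          # 0 0: two live processes now
                                        # share the "unique" id 0
```

With the tiny ring, the final spawn receives id 0 while the process that
got id 0 a few spawns earlier is still alive; scaled up, that is the
262144-spawn wrap described above. The sketch also shows the other
symptom: `e` was handed a dead process's id before its termination had
fully committed.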
>
> If the thread is scheduled out by the OS, or a hyperthread switch
> occurs because of a memory stall (we're dealing with memory barriers
> here after all, so it might be a factor) between the read and the
> write, the likelihood of an incident increases. Lowering the
> max-process limit in the system also increases the likelihood.
>
> We think we have a solution for this, and initial tests show no
> evidence of the uniqueness problem after the fix. I think we will have
> a fix out in maint next week.
>
> Using R16B01 together with the "+P legacy" option is a workaround for
> this issue. The legacy option uses the old allocation scheme and does
> not suffer from this problem.
>
> Thank you Scott, and to the rest of you at Basho for reporting this.
>
> Regards,
> Björn-Egil
>
>
> On 2013-07-23 23:19, Björn-Egil Dahlberg wrote:
>> True, that seems suspicious.
>>
>> The vacation for Rickard is going great, I think. Last I heard from
>> him, he was diving around Öland (literally "island-land") in
>> south-eastern Sweden. It will be a few weeks before he's back.
>>
>> In the meantime it is fairly lonely here at OTP; today we were two
>> people at the office, and there is a lot of stuff to do. I will have
>> a quick look at it and verify, but will probably let Rickard deal with
>> it when he comes back.
>>
>> Thanks for a great summary and drill down of the problem!
>>
>> Regards,
>> Björn-Egil
>>
>>
>> 2013/7/23 Scott Lystig Fritchie <fritchie@REDACTED>
>>
>> Hi, everyone. Hope your summer vacations are going well. I have
>> some
>> bad news for Rickard, at least.
>>
>> SHA: e794251f8e54d6697e1bcc360471fd76b20c7748
>> Author: Rickard Green <rickard@REDACTED>
>> Date: Thu May 30 2013 07:56:31 GMT-0500 (CDT)
>> Subject: Merge branch 'rickard/ptab-id-alloc/OTP-11077' into
>> maint
>> Parent: 22685099ace9802016bf6203c525702084717d72
>> Parent: 5c039a1fb4979314912dc3af6626d8d7a1c73993
>> Merge branch 'rickard/ptab-id-alloc/OTP-11077' into maint
>>
>> * rickard/ptab-id-alloc/OTP-11077:
>> Introduce a better id allocation algorithm for PTabs
>>
>> This commit appears to break monitor delivery? And it may or may not
>> be causing processes to die for reasons that we cannot see or
>> understand.
>>
>> When run with R15B03-1, the example code in test6.erl is merely slow:
>>
>> https://gist.github.com/jtuple/aa4830a0ff0a94f69484/raw/02adc518e225f263a7e25d339ec7200ef2dda491/test6.erl
>>
>> On my 4 core/8 HT core MacBook Pro, R15B03-1 cannot go above 200% CPU
>> utilization, and the execution time is correspondingly slooow. But it
>> appears to work correctly.
>>
>> erl -eval '[begin io:format("Iteration ~p at ~p\n",
>> [X,time()]), test6:go() end || X <- lists:seq(1, 240)].'
>>
>> When run with R16B, it's *much* faster. CPU utilization above 750%
>> confirms that it's going faster. And it appears to work correctly.
>>
>> However, when run with R16B01, we see non-deterministic hangs on both
>> OS X and various Linux platforms. CPU consumption by the "beam.smp"
>> process drops to 0, and the next cycle of the list comprehension
>> never starts.
>>
>> Thanks to the magic of Git, it's pretty clear that the commit
>> above is
>> broken. The commit before it appears to work well (i.e., does not
>> hang).
>>
>> SHA: 22685099ace9802016bf6203c525702084717d72
>> Author: Anders Svensson <anders@REDACTED>
>> Date: Wed May 29 2013 11:46:10 GMT-0500 (CDT)
>> Subject: Merge branch
>> 'anders/diameter/watchdog_function_clause/OTP-11115' into maint
>>
>> Using R16B01 together with the "+P legacy" flag does not hang. But
>> this problem has given us at Basho enough ... caution ... that we
>> will be much more cautious about moving our app packaging from R15B*
>> to R16B*.
>>
>> Several seconds after CPU consumption drops to 0%, I trigger the
>> creation of an "erl_crash.dump" file using erlang:halt("bummer"). If
>> I look at that file, the process "Spawned as: test6:do_work2/0" says
>> that there are active unidirectional links (i.e., monitors), but
>> there is one process on that list that does not have a corresponding
>> "=proc:<pid.number.here>" entry in the dump ... which strongly
>> suggests to me that the process is dead. Using DTrace, I've been able
>> to establish that the dead process was indeed alive at one time and
>> was scheduled & descheduled at least once. So there are really two
>> mysteries:
>>
>> 1. Why is one of the test6:indirect_proxy/1 processes dying
>> unexpectedly? (The monitor doesn't fire, SASL isn't logging any
>> errors, etc.)
>>
>> 2. Why isn't a monitor message being delivered?
>>
>> Many thanks to Joe Blomstedt, Evan Vigil-McClanahan, Andrew Thompson,
>> Steve Vinoski, and Sean Cribbs for their sleuthing work.
>>
>> -Scott
>>
>> --- snip --- snip --- snip --- snip --- snip ---
>>
>> R15B03 lock count analysis, FWIW:
>>
>> lock          id   #tries   #collisions  collisions [%]  time [us]  duration [%]
>> -----------  ---  --------  -----------  --------------  ---------  ------------
>> proc_tab       1   1280032      1266133         98.9142   60642804      557.0583
>> run_queue      8   3617608        12874          0.3559     261722        2.4042
>> sys_tracers    1   1280042         6445          0.5035      19365        0.1779
>> pix_lock     256   4480284         1213          0.0271       9777        0.0898
>> timeofday      1    709955         1187          0.1672       3216        0.0295
>> [......]
>>
>> --- snip --- snip --- snip --- snip --- snip ---
>>
>> =proc:<0.29950.154>
>> State: Waiting
>> Spawned as: test6:do_work2/0
>> Spawned by: <0.48.0>
>> Started: Tue Jul 23 04:50:54 2013
>> Message queue length: 0
>> Number of heap fragments: 0
>> Heap fragment data: 0
>> Link list: [{from,<0.48.0>,#Ref<0.0.19.96773>},
>> {to,<0.32497.154>,#Ref<0.0.19.96797>},
>> {to,<0.1184.155>,#Ref<0.0.19.96796>},
>> {to,<0.31361.154>,#Ref<0.0.19.96799>},
>> {to,<0.32019.154>,#Ref<0.0.19.96801>},
>> {to,<0.32501.154>,#Ref<0.0.19.96800>},
>> {to,<0.1352.155>,#Ref<0.0.19.96803>},
>> {to,<0.32415.154>,#Ref<0.0.19.96805>},
>> {to,<0.504.155>,#Ref<0.0.19.96804>},
>> {to,<0.87.155>,#Ref<0.0.19.96802>},
>> {to,<0.776.155>,#Ref<0.0.19.96798>}]
>> Reductions: 45
>> Stack+heap: 233
>> OldHeap: 0
>> Heap unused: 155
>> OldHeap unused: 0
>> Memory: 3472
>> Program counter: 0x000000001e1504d0 (test6:do_work2/0 + 184)
>> CP: 0x0000000000000000 (invalid)
>> arity = 0
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-bugs
>>