<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix"><br>
We have confirmed that this problem indeed exists and we think we
understand what is happening.<br>
<br>
The problem has a low probability of occurring, though it is
clearly reproducible, and it is quite serious when it does occur.<br>
<br>
I won't go into too much detail, but the uniqueness of the
process identifier can be compromised, i.e. it will not be unique.
In essence, a process might get the identifier of an already
terminated process (or of a still-living one, though I haven't
confirmed that). The mapping is then overwritten, so inspecting
this identifier makes the process look dead. Signals or messages
will either not be sent, since the process appears "dead", or be
sent to an unsuspecting (wrong) process. The mappings of ids and
process pointers have become inconsistent. It's a bit more
complicated than that, but in a nutshell that's what's happening.<br>
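To make the failure mode concrete, here is a minimal sketch (in Python, with invented names; the real erts data structures are different) of how a stale overwrite of the id-to-process mapping makes a live process look dead:

```python
# Toy model of an id -> process-table mapping; "ptab", "alive" and
# "send_signal" are illustrative names, not the erts implementation.
ptab = {}

ptab[101] = {"name": "proc_a", "alive": True}   # proc_a gets id 101
ptab[101] = None                                # proc_a terminates

# The bug: a new process is handed the same id, and a stale writer
# then overwrites the mapping, marking the slot free again.
ptab[101] = {"name": "proc_b", "alive": True}   # proc_b reuses id 101
ptab[101] = None                                # stale overwrite

def send_signal(ident):
    entry = ptab.get(ident)
    if entry is None or not entry["alive"]:
        return "dropped"        # id looks dead: signal silently lost
    return "delivered"

print(send_signal(101))         # "dropped", although proc_b is alive
```

Once the slot is overwritten, any signal or monitor message addressed by id is quietly discarded, which matches the symptoms Scott describes below.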
<br>
What is needed for this to occur? A wrap of the entire
"free-list ring" of identifiers (the size of max processes) while
one thread is in the middle of creating an identifier: an atomic
read, some shifting and masking, and then a write. *Highly
unlikely*, but definitely a race. That is, while one thread does
the read, shift/mask, and write to memory, the other threads have
to create and terminate 262144 processes (or whatever the limit is
set to; that is the default).<br>
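A single-threaded simulation can make the interleaving deterministic. This is a sketch under the assumption of a simple counter-based ring cursor; the names and the exact read/shift-mask/write split are illustrative, not the actual erts allocator:

```python
# Toy single-threaded model of the wrap-around race; RING, counter and
# alloc() are illustrative, not the actual erts free-list-ring code.
RING = 262144            # default maximum number of processes

counter = 0              # shared cursor; real code advances it atomically

def alloc():
    """One complete create: read, shift/mask, write."""
    global counter
    ident = counter % RING
    counter += 1
    return ident

alloc()                  # some process takes id 0; counter is now 1

# Thread A begins an allocation: the atomic read...
snapshot = counter       # snapshot == 1

# ...then stalls. Meanwhile other threads create and terminate a full
# ring's worth of processes (262144 with the default limit):
wrapped = {alloc() for _ in range(RING)}

# Thread A resumes: shift/mask and write back the stale cursor.
ident = snapshot % RING  # == 1
counter = snapshot + 1   # lost update: the whole wrap is forgotten

print(ident in wrapped)  # True: the "fresh" id was already handed out
```

The write-back of the stale snapshot both duplicates an id that was handed out during the wrap and rewinds the cursor, which is why the mappings end up inconsistent.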
<br>
If the thread is scheduled out by the OS, or a hyperthread switch
occurs because of a memory stall (we're dealing with memory
barriers here, after all, so it might be a factor) between the
read and the write, the likelihood of an incident increases.
Lowering the max-process limit in the system also increases the
likelihood.<br>
<br>
We think we have a solution for this, and initial tests show no
evidence of the uniqueness problem after the fix. I think we will
have a fix out in maint next week.<br>
<br>
Using R16B01 together with the "+P legacy" flag is a workaround
for this issue. The legacy option uses the old allocation scheme
and does not suffer from this problem.<br>
<br>
Thank you, Scott, and the rest of you at Basho, for reporting
this.<br>
<br>
Regards,<br>
Björn-Egil<br>
<br>
<br>
On 2013-07-23 23:19, Björn-Egil Dahlberg wrote:<br>
</div>
<blockquote
cite="mid:CAMjYFoOeWYOoO1w4Wx11td5FrOMp5xssO-nZHihCvphix6z_Aw@mail.gmail.com"
type="cite">
<div dir="ltr">True, that seems suspicious.
<div><br>
</div>
<div>The vacation for Rickard is going great, I think. Last I
heard from him, he was diving around Öland (literally
"island land") in south-eastern Sweden. It will be a few weeks
before he's back.</div>
<div><br>
</div>
<div>In the meantime it is fairly lonely here at OTP; today we
were two people at the office, and there is a lot of stuff to
do. I will have a quick look at it and verify, but I will
probably let Rickard deal with it when he comes back.</div>
<div><br>
</div>
<div>Thanks for a great summary and drill-down of the problem!</div>
<div><br>
</div>
<div>Regards,</div>
<div>Björn-Egil</div>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">2013/7/23 Scott Lystig Fritchie <span
dir="ltr"><<a moz-do-not-send="true"
href="mailto:fritchie@snookles.com" target="_blank">fritchie@snookles.com</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,
everyone. Hope your summer vacations are going well. I
have some<br>
bad news for Rickard, at least.<br>
<br>
SHA: e794251f8e54d6697e1bcc360471fd76b20c7748<br>
Author: Rickard Green <<a moz-do-not-send="true"
href="mailto:rickard@erlang.org">rickard@erlang.org</a>><br>
Date: Thu May 30 2013 07:56:31 GMT-0500 (CDT)<br>
Subject: Merge branch 'rickard/ptab-id-alloc/OTP-11077'
into maint<br>
Parent: 22685099ace9802016bf6203c525702084717d72<br>
Parent: 5c039a1fb4979314912dc3af6626d8d7a1c73993<br>
Merge branch 'rickard/ptab-id-alloc/OTP-11077' into
maint<br>
<br>
* rickard/ptab-id-alloc/OTP-11077:<br>
Introduce a better id allocation algorithm for PTabs<br>
<br>
This commit appears to break monitor delivery? And it may
or may not be<br>
causing processes to die for reasons that we cannot see or
understand.<br>
<br>
Run with R15B03-1, the example code in test6.erl is merely
slow:<br>
<br>
<a moz-do-not-send="true"
href="https://gist.github.com/jtuple/aa4830a0ff0a94f69484/raw/02adc518e225f263a7e25d339ec7200ef2dda491/test6.erl"
target="_blank">https://gist.github.com/jtuple/aa4830a0ff0a94f69484/raw/02adc518e225f263a7e25d339ec7200ef2dda491/test6.erl</a><br>
<br>
On my 4 core/8 HT core MacBook Pro, R15B03-1 cannot go above
200% CPU<br>
utilization, and the execution time is correspondingly
slooow. But it<br>
appears to work correctly.<br>
<br>
erl -eval '[begin io:format("Iteration ~p at ~p\n",
[X,time()]), test6:go() end || X <- lists:seq(1, 240)].'<br>
<br>
When run with R16B, it's *much* faster. CPU utilization
above 750%<br>
confirms that it's going faster. And it appears to work
correctly.<br>
<br>
However, when run with R16B01, we see non-deterministic
hangs on both OS<br>
X and various Linux platforms. CPU consumption by the
"beam.smp"<br>
process drops to 0, and the next cycle of the list
comprehension never<br>
starts.<br>
<br>
Thanks to the magic of Git, it's pretty clear that the
commit above is<br>
broken. The commit before it appears to work well (i.e.,
does not<br>
hang).<br>
<br>
SHA: 22685099ace9802016bf6203c525702084717d72<br>
Author: Anders Svensson <<a
moz-do-not-send="true" href="mailto:anders@erlang.org">anders@erlang.org</a>><br>
Date: Wed May 29 2013 11:46:10 GMT-0500 (CDT)<br>
Subject: Merge branch
'anders/diameter/watchdog_function_clause/OTP-11115' into
maint<br>
<br>
Using R16B01 together with the "+P legacy" flag does not
hang. But this<br>
problem has given us at Basho enough ... caution ... that we
will be<br>
much more cautious about moving our app packaging from R15B*
to R16B*.<br>
<br>
Several seconds after CPU consumption drops to 0%, I
trigger the<br>
creation of an "erl_crash.dump" file using
erlang:halt("bummer"). If I<br>
look at that file, then the process "Spawned as:
test6:do_work2/0" says<br>
that there are active unidirectional links (i.e., monitors),
but there<br>
is one process on that list that does not have a
corresponding<br>
"=proc:<pid.number.here>" entry in the dump ... which
strongly suggests<br>
to me that the process is dead. Using DTrace, I've been
able to<br>
establish that the dead process was indeed alive at one time
and was<br>
scheduled &amp; descheduled at least once. So there are
really two<br>
mysteries:<br>
<br>
1. Why is one of the test6:indirect_proxy/1 processes dying<br>
unexpectedly? (The monitor doesn't fire, SASL isn't logging
any errors,<br>
etc.)<br>
<br>
2. Why isn't a monitor message being delivered?<br>
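The dump analysis above can be mechanized. Here is a small sketch that flags pids appearing in a link list without a "=proc:" entry of their own; the sample dump text is invented, but it follows the "=proc:" / "Link list:" layout shown further down in this message:

```python
import re

# Sketch of a crash-dump cross-check; the sample dump text is invented
# but follows the "=proc:" / "Link list:" layout of an erl_crash.dump.
dump = """\
=proc:<0.48.0>
Link list: [{to,<0.29950.154>,#Ref<0.0.19.96773>}]
=proc:<0.29950.154>
Link list: [{to,<0.32497.154>,#Ref<0.0.19.96797>},
            {to,<0.776.155>,#Ref<0.0.19.96798>}]
"""

declared = set(re.findall(r"^=proc:(<[\d.]+>)", dump, re.MULTILINE))
linked = set(re.findall(r"\{(?:to|from),(<[\d.]+>),", dump))

# Pids that appear in a link list but have no "=proc:" entry of their
# own are, as argued above, almost certainly dead:
missing = linked - declared
print(sorted(missing))   # ['<0.32497.154>', '<0.776.155>']
```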
<br>
Many thanks to Joe Blomstedt, Evan Vigil-McClanahan, Andrew
Thompson,<br>
Steve Vinoski, and Sean Cribbs for their sleuthing work.<br>
<br>
-Scott<br>
<br>
--- snip --- snip --- snip --- snip --- snip ---<br>
<br>
R15B03 lock count analysis, FWIW:<br>
<br>
lock id #tries #collisions collisions [%]
time [us] duration [%]<br>
----- --- ------- ------------ ---------------
---------- -------------<br>
proc_tab 1 1280032 1266133 98.9142
60642804 557.0583<br>
run_queue 8 3617608 12874 0.3559
261722 2.4042<br>
sys_tracers 1 1280042 6445 0.5035
19365 0.1779<br>
pix_lock 256 4480284 1213 0.0271
9777 0.0898<br>
timeofday 1 709955 1187 0.1672
3216 0.0295<br>
[......]<br>
<br>
--- snip --- snip --- snip --- snip --- snip ---<br>
<br>
=proc:<0.29950.154><br>
State: Waiting<br>
Spawned as: test6:do_work2/0<br>
Spawned by: <0.48.0><br>
Started: Tue Jul 23 04:50:54 2013<br>
Message queue length: 0<br>
Number of heap fragments: 0<br>
Heap fragment data: 0<br>
Link list: [{from,<0.48.0>,#Ref<0.0.19.96773>},
{to,<0.32497.154>,#Ref<0.0.19.96797>},
{to,<0.1184.155>,#Ref<0.0.19.96796>},
{to,<0.31361.154>,#Ref<0.0.19.96799>},
{to,<0.32019.154>,#Ref<0.0.19.96801>},
{to,<0.32501.154>,#Ref<0.0.19.96800>},
{to,<0.1352.155>,#Ref<0.0.19.96803>},
{to,<0.32415.154>,#Ref<0.0.19.96805>},
{to,<0.504.155>,#Ref<0.0.19.96804>},
{to,<0.87.155>,#Ref<0.0.19.96802>},
{to,<0.776.155>,#Ref<0.0.19.96798>}]<br>
Reductions: 45<br>
Stack+heap: 233<br>
OldHeap: 0<br>
Heap unused: 155<br>
OldHeap unused: 0<br>
Memory: 3472<br>
Program counter: 0x000000001e1504d0 (test6:do_work2/0 + 184)<br>
CP: 0x0000000000000000 (invalid)<br>
arity = 0<br>
_______________________________________________<br>
erlang-bugs mailing list<br>
<a moz-do-not-send="true"
href="mailto:erlang-bugs@erlang.org">erlang-bugs@erlang.org</a><br>
<a moz-do-not-send="true"
href="http://erlang.org/mailman/listinfo/erlang-bugs"
target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>
</blockquote>
</div>
<br>
</div>
<br>
</blockquote>
<br>
</body>
</html>