<div dir="ltr">Been running the test program all day in the same scenario as before.  No time jumps!  Looking good...<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Mar 11, 2013 at 9:34 AM, Garret Smith <span dir="ltr"><<a href="mailto:garret.smith@gmail.com" target="_blank">garret.smith@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div>Patrik,<br></div><div><br></div>Our production systems are on R15B1/2, so I won't be able to verify against that, but I'll let you know what I see running my test program against R16B.<br>

<br></div>Will you be able to generate a patched R15x version?  If not, I'll try to set up a build system and apply the patch locally.<span class="HOEnZb"><font color="#888888"><br><br></font></span></div><span class="HOEnZb"><font color="#888888">-Garret<br>
</font></span></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><br><div class="gmail_quote">
On Mon, Mar 11, 2013 at 9:26 AM, Patrik Nyblom <span dir="ltr"><<a href="mailto:pan@erlang.org" target="_blank">pan@erlang.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    <div>Hi again!<br>
      <br>
      I think I've found it. At least I've found one error, hopefully
      that's the one you've also found :)<br>
      <br>
      The sys_gethrtime function has gon new uses in R15 and on, uses
      where it is no longer protected by the  erts_timeofday_mtx. So -
      it simply needs a lock of it's own. This gives a slight
      performance loss, but that could be fixed by using GetTickCount64
      on win7 and win2008 at least. <br>
      <br>
      Can you try a version of beam.smp.dll with a lock and see if the
      error is gone on your machines? If that works, I would also like
      you to try an optimized version, but let's first make sure we have
      the bug nailed down :)<br>
      <br>
      In my dropbox, there's a beam.smp.dll. If you replace
      $ERL_ROOT/erts-5.10.1/bin/beam.smp.dll with that one and then
      start werl, the slogan should contain [source-be0da3e]. It is for
      64bit windows. The public dropbox URL is:<br>
      <a href="http://dl.dropbox.com/u/17212223/beam.smp.dll" target="_blank">http://dl.dropbox.com/u/17212223/beam.smp.dll</a><br>
      <br>
      This should work without any special messages or such, giving a
      working erlang:now/0. If it starts sending strange ERROR REPORT's
      about ticks moving slightly backwards, we have a more complicated
      bug, but I haven't seen any such messages since i added proper
      locking.<br>
      <br>
      If it's possible for you to test this, I would be immensely
      grateful!<br>
      <br>
      Cheers,<br>
      /Patrik<div><div><br>
      On 03/07/2013 04:37 PM, Patrik Nyblom wrote:<br>
    </div></div></div><div><div>
    <blockquote type="cite">
      
      <div>Hi Garret!<br>
        <br>
        I've been able to reproduce it on my freshly installed Win2008
        machine! Great, now I only need to debug it and find the error
        :)<br>
        <br>
        I'll get back to you as soon as I feel I have a fix - it might
        take a few days, given the relatively long turn around time, but
        we'll get there!<br>
        <br>
        Thank you for all the help and information!<br>
        <br>
        Cheers,<br>
        /Patrik<br>
        <br>
        On 03/05/2013 09:10 PM, Garret Smith wrote:<br>
      </div>
      <blockquote type="cite">
        <div dir="ltr">
          <div>On the same machine with the same steps as previous, I
            reproduced the time jump on R16B.<br>
            This time the jump happened with a <5 sec delta btw now()
            and os:timestamp().<br>
            Still jumping ~2126000 seconds.<br>
            <br>
          </div>
          -Garret<br>
        </div>
        <div class="gmail_extra"><br>
          <br>
          <div class="gmail_quote">On Tue, Mar 5, 2013 at 11:20 AM,
            Garret Smith <span dir="ltr"><<a href="mailto:garret.smith@gmail.com" target="_blank">garret.smith@gmail.com</a>></span>
            wrote:<br>
            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div dir="ltr">
                <div>
                  <div>
                    <div>
                      <div>The gist <a href="https://gist.github.com/garret-smith/5087169" target="_blank">https://gist.github.com/garret-smith/5087169</a>
                        is updated with a slightly better version.  I
                        was able to reproduce the jump in less than an
                        hour.  I also did some more things to perturb
                        the timing code while the test program was
                        running.<br>
                        <br>
                      </div>
                      <div>Here is the latest info, everything I can
                        think of that may have the slightest effect:<br>
                      </div>
                      <div> * R15B01 64-bit build<br>
                      </div>
                      <div> * Pacific time zone (GMT -8)<br>
                      </div>
                      <div> * Xeon E5405 in an HP DL160<br>
                      </div>
                      <div> * no arguments to erl.exe<br>
                      </div>
                      <div> * bursty, high CPU load, >75% memory used
                        by other software<br>
                      </div>
                      <div> * running Observer on the test VM displaying
                        the "Load Charts" tab<br>
                      </div>
                      <div>  * made some small adjustments (~ 60
                        seconds) to the system clock while running the
                        tests - now() and os:timestamp() behaved as
                        expected, initially showing a delta and slowly
                        converging<br>
                      </div>
                      <div> * w32tm /resync to fix the system clock some
                        time after perturbing it<br>
                      </div>
                      <br>
                    </div>
                    The time jump in now() occurred when now() was ~9
                    seconds behind os:timestamp() as reported by the new
                    test program.<br>
                    <br>
                  </div>
                  I'm starting to look at R16B now.<span><font color="#888888"><br>
                      <br>
                    </font></span></div>
                <span><font color="#888888">-Garret Smith<br>
                  </font></span></div>
              <div>
                <div>
                  <div class="gmail_extra"> <br>
                    <br>
                    <div class="gmail_quote">On Tue, Mar 5, 2013 at 8:37
                      AM, Garret Smith <span dir="ltr"><<a href="mailto:garret.smith@gmail.com" target="_blank">garret.smith@gmail.com</a>></span>
                      wrote:<br>
                      <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                        <div dir="ltr">
                          <div>
                            <div>
                              <div>
                                <div>I haven't seen anything unexpected
                                  in os:timestamp().  No jumps at all.<br>
                                  <br>
                                </div>
                                <div>CPU is an Intel Xeon X3430.<br>
                                  <br>
                                  I have reproduced it in the
                                  LosAngeles/Pacific Time (GMT -8) and
                                  US East coast time zone (GMT -5).<br>
                                </div>
                                <br>
                              </div>
                              I have not yet tried R16B.  I'll be
                              starting that today.  I'm also trying to
                              improve the test program, since it's
                              taking quite a long time between jumps for
                              me as well.  I'll let you know as soon as
                              I have a better one.<br>
                              <br>
                            </div>
                            You have no idea how relieved I am that you
                            are looking into this!<br>
                            <br>
                          </div>
                          <div>Thanks,<br>
                          </div>
                          Garret Smith<br>
                        </div>
                        <div>
                          <div>
                            <div class="gmail_extra"><br>
                              <br>
                              <div class="gmail_quote"> On Tue, Mar 5,
                                2013 at 3:06 AM, Patrik Nyblom <span dir="ltr"><<a href="mailto:pan@erlang.org" target="_blank">pan@erlang.org</a>></span>
                                wrote:<br>
                                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                  <div bgcolor="#FFFFFF" text="#000000">
                                    <div>Hi again...<br>
                                      <br>
                                      I'm not sure about one thing. What
                                      happens to os:timestamp() during
                                      these jumps? Does it stay on track
                                      or does it also jump around?<br>
                                      <br>
                                      I've tried to reproduce it with
                                      your program, but has not yet
                                      succeeded. Have you seen this on
                                      the R16B release as well?<br>
                                      <br>
                                      Is the hardware in any way fancy
                                      (like a lot of cores, some new
                                      processor I don't have or
                                      something else?) or is there
                                      anything else special about the
                                      machine? Also the time zone you're
                                      running in would be interesting,
                                      as there is some time zone
                                      specific code there...<br>
                                      <br>
                                      I would really like to be able to
                                      reproduce it so you don't have to
                                      do all the tests at your site, it
                                      might end up being really time
                                      consuming for you if I make to
                                      many mistakes :)  <br>
                                      <br>
                                      Cheers,<br>
                                      /Patrik
                                      <div>
                                        <div><br>
                                          <br>
                                          <br>
                                          On 03/05/2013 08:50 AM, Patrik
                                          Nyblom wrote:<br>
                                        </div>
                                      </div>
                                    </div>
                                    <div>
                                      <div>
                                        <blockquote type="cite">
                                          <div>Hi!<br>
                                            <br>
                                            On 03/05/2013 02:26 AM,
                                            Garret Smith wrote:<br>
                                          </div>
                                          <blockquote type="cite">
                                            <div dir="ltr">
                                              <div>
                                                <div>
                                                  <div>
                                                    <div>
                                                      <div>
                                                        <div>
                                                          <div>I have
                                                          been beating
                                                          my head
                                                          against a wall
                                                          for weeks
                                                          tracking down
                                                          spooky
                                                          behaviour[sic]
                                                          in one of our
                                                          production
                                                          systems.  I
                                                          finally
                                                          tracked it
                                                          down to
                                                          "jumps" in the
                                                          times returned
                                                          by
                                                          erlang:now(),
                                                          causing all
                                                          timers in the
                                                          system to
                                                          expire at
                                                          once.  I have
                                                          witnessed this
                                                          bug on R15B01,
                                                          both 64 and
                                                          32-bit
                                                          versions
                                                          running on
                                                          Windows Server
                                                          2008 R2, both
                                                          on bare metal
                                                          and VirtualBox
                                                          VM.<br>
                                                          <br>
                                                          </div>
                                                          The time jump
                                                          is always
                                                          around 2126000
                                                          seconds, or a
                                                          little over 24
                                                          days.  The
                                                          now() time
                                                          does not try
                                                          to converge
                                                          with
                                                          os:timestamp()
                                                          as the
                                                          documentation
                                                          suggests, and
                                                          as I confirmed
                                                          it does if you
                                                          just change
                                                          the system
                                                          clock.<br>
                                                          <br>
                                                        </div>
                                                        Another VM
                                                        running
                                                        concurrently on
                                                        the same machine
                                                        but with little
                                                        load (diagnostic
                                                        node &
                                                        production node)
                                                        did not time
                                                        jump.<br>
                                                        <br>
                                                      </div>
                                                      <div>Higher load
                                                        seems to make
                                                        the time jumps
                                                        happen more
                                                        often.<br>
                                                      </div>
                                                      <div><br>
                                                      </div>
                                                      Frequency between
                                                      time jumps varies
                                                      between seconds
                                                      and hours, but
                                                      when a jump
                                                      occurs, it is
                                                      always 2126000 +
                                                      (9 to 26) seconds.<br>
                                                      <br>
                                                    </div>
                                                    I never see the jump
                                                    in logfile
                                                    timestamps that use
                                                    os:timestamp() for
                                                    tagging log
                                                    messages.  I had to
                                                    start tracing a
                                                    production node
                                                    before I caught the
                                                    jump.  Here are some
                                                    lines from a trace,
                                                    where the timestamp
                                                    in trace_ts is
                                                    printed using
                                                    calendar:now_to_local_time()
                                                    and then in raw
                                                    tuple format:<br>
                                                    <br>
                                                    2013-4-16
                                                    21:40:1.993399|{1366,173601,993399}<br>
                                                    2013-4-16
                                                    21:40:1.993400|{1366,173601,993400}<br>
                                                    2013-5-11
                                                    12:13:41.986961|{1368,299621,986961}<br>
                                                    2013-5-11
                                                    12:13:41.986962|{1368,299621,986962}<br>
                                                    <br>
                                                  </div>
                                                  then a bit later...<br>
                                                  <br>
                                                  2013-5-11
                                                  12:36:19.955129|{1368,300979,955129}<br>
                                                  2013-5-11
                                                  12:36:19.955130|{1368,300979,955130}<br>
                                                  2013-6-5
                                                  3:9:49.538830|{1370,426989,538830}<br>
                                                  2013-6-5
                                                  3:9:49.538833|{1370,426989,538833}<br>
                                                  <br>
                                                </div>
                                              </div>
                                            </div>
                                          </blockquote>
                                          Gah! That's obviously not
                                          supposed to happen... <br>
                                          <blockquote type="cite">
                                            <div dir="ltr">
                                              <div>I captured many such
                                                jumps over the course of
                                                a day or so.  Obviously
                                                from the dates, 2 jumps
                                                happened before I
                                                started tracing.<br>
                                                <br>
                                              </div>
                                              <div>I was able to
                                                reproduce the bug,
                                                though not as
                                                efficiently as my
                                                production system, with
                                                the following sample
                                                program: <a href="https://gist.github.com/garret-smith/5087169" target="_blank">https://gist.github.com/garret-smith/5087169</a><br>
                                                <br>
                                                It took over an hour of
                                                runtime before the first
                                                time jump.  I am working
                                                on a better way to
                                                reproduce it at the
                                                moment, but it's hard to
                                                test the test with a bug
                                                so intermittent.<br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div>I am also testing
                                                various other VM
                                                versions.  My first hope
                                                was that this was
                                                limited to the 64-bit
                                                version where we first
                                                encountered the problem,
                                                but a change to the
                                                32-bit version has only
                                                made the problem happen
                                                less often, not
                                                eliminated it.<br>
                                                <br>
                                              </div>
                                              <div>We never saw this bug
                                                with R14B03 which we
                                                were running previously
                                                to R15B01.  However,
                                                system load is different
                                                so I can't make a direct
                                                comparison.  I did
                                                notice a few significant
                                                updates to the Windows
                                                time related code
                                                between R14B03 and R15:<br>
                                                <br>
                                              </div>
                                              <div>git log sys_time.c<br>
                                              </div>
                                              <div><br>
                                              </div>
                                              <div>commit
                                                46eb4359b05b220861453a869dc734480ec045a6<br>
                                                Author: Patrik Nyblom
                                                <<a href="mailto:pan@erlang.org" target="_blank">pan@erlang.org</a>><br>
                                                Date:   Tue Dec 6
                                                19:07:16 2011 +0100<br>
                                                <br>
                                                    Emulate localtime,
                                                gmtime and mktime to
                                                enable negative time_t<br>
                                                <br>
                                                commit
                                                913f05af100e98a8665bbb6168e89fbcfe4ece75<br>
                                                Author:
                                                Bj<C3><B6>rn-Egil
                                                Dahlberg <<a href="mailto:egil@erlang.org" target="_blank">egil@erlang.org</a>><br>
                                                Date:   Fri Dec 2
                                                15:25:06 2011 +0100<br>
                                                <br>
                                                    Teach windows
                                                sys_localtime_r<br>
                                                <br>
                                                <br>
                                              </div>
                                            </div>
                                          </blockquote>
                                          Yep, that's me... But even if
                                          I gave a totally weird time
                                          back from those, the
                                          erlang:now logic should have
                                          stopped this from happening.
                                          I'll try to reproduce using
                                          your example program. If
                                          nothing else helps, I'll
                                          instrument a VM that gives som
                                          traces in the time code...<br>
                                          <blockquote type="cite">
                                            <div dir="ltr">
                                              <div>I am completely
                                                stumped.  What can I do
                                                next to help track down
                                                the source of the bug?<br>
                                                <br>
                                              </div>
                                            </div>
                                          </blockquote>
                                          Unfortunately, so am I.
                                          Especially weird that it's
                                          load related... Maybe
                                          something is not locked as it
                                          should be...<br>
                                          <blockquote type="cite">
                                            <div dir="ltr">
                                              <div>Thanks,<br>
                                              </div>
                                              <div>Garret Smith<br>
                                              </div>
                                            </div>
                                          </blockquote>
                                          Thanks for reporting, I'll get
                                          back to you!<br>
                                          <br>
                                          Cheers,<br>
                                          /Patrik<br>
                                          <blockquote type="cite"> <br>
                                            <fieldset></fieldset>
                                            <br>
                                            <pre>_______________________________________________
erlang-bugs mailing list
<a href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
                                          </blockquote>
                                          <br>
                                          <br>
                                          <fieldset></fieldset>
                                          <br>
                                          <pre>_______________________________________________
erlang-bugs mailing list
<a href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
                                        </blockquote>
                                        <br>
                                      </div>
                                    </div>
                                  </div>
                                  <br>
_______________________________________________<br>
                                  erlang-bugs mailing list<br>
                                  <a href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a><br>
                                  <a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>
                                  <br>
                                </blockquote>
                              </div>
                              <br>
                            </div>
                          </div>
                        </div>
                      </blockquote>
                    </div>
                    <br>
                  </div>
                </div>
              </div>
            </blockquote>
          </div>
          <br>
        </div>
      </blockquote>
      <br>
      <br>
      <fieldset></fieldset>
      <br>
      <pre>_______________________________________________
erlang-bugs mailing list
<a href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
    </blockquote>
    <br>
  </div></div></div>

<br>_______________________________________________<br>
erlang-bugs mailing list<br>
<a href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a><br>
<a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>
<br></blockquote></div><br></div>
</div></div></blockquote></div><br></div>