<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">Hi Garret!<br>
      <br>
      I've been able to reproduce it on my freshly installed Win2008
      machine! Great, now I only need to debug it and find the error :)<br>
      <br>
      I'll get back to you as soon as I feel I have a fix - it might
      take a few days, given the relatively long turn around time, but
      we'll get there!<br>
      <br>
      Thank you for all the help and information!<br>
      <br>
      Cheers,<br>
      /Patrik<br>
      <br>
      On 03/05/2013 09:10 PM, Garret Smith wrote:<br>
    </div>
    <blockquote
cite="mid:CAHmviK9G9kMfo_DAQrH+uXt+NouqgmXBLbtVF7+UQ2WP4fNBOA@mail.gmail.com"
      type="cite">
      <meta http-equiv="Content-Type" content="text/html;
        charset=ISO-8859-1">
      <div dir="ltr">
        <div>On the same machine with the same steps as previous, I
          reproduced the time jump on R16B.<br>
          This time the jump happened with a <5 sec delta btw now()
          and os:timestamp().<br>
          Still jumping ~2126000 seconds.<br>
          <br>
        </div>
        -Garret<br>
      </div>
      <div class="gmail_extra"><br>
        <br>
        <div class="gmail_quote">On Tue, Mar 5, 2013 at 11:20 AM, Garret
          Smith <span dir="ltr"><<a moz-do-not-send="true"
              href="mailto:garret.smith@gmail.com" target="_blank">garret.smith@gmail.com</a>></span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div dir="ltr">
              <div>
                <div>
                  <div>
                    <div>The gist <a moz-do-not-send="true"
                        href="https://gist.github.com/garret-smith/5087169"
                        target="_blank">https://gist.github.com/garret-smith/5087169</a>
                      is updated with a slightly better version.  I was
                      able to reproduce the jump in less than an hour. 
                      I also did some more things to perturb the timing
                      code while the test program was running.<br>
                      <br>
                    </div>
                    <div>Here is the latest info, everything I can think
                      of that may have the slightest effect:<br>
                    </div>
                    <div> * R15B01 64-bit build<br>
                    </div>
                    <div> * Pacific time zone (GMT -8)<br>
                    </div>
                    <div> * Xeon E5405 in an HP DL160<br>
                    </div>
                    <div> * no arguments to erl.exe<br>
                    </div>
                    <div> * bursty, high CPU load, >75% memory used
                      by other software<br>
                    </div>
                    <div> * running Observer on the test VM displaying
                      the "Load Charts" tab<br>
                    </div>
                    <div>
                       * made some small adjustments (~ 60 seconds) to
                      the system clock while running the tests - now()
                      and os:timestamp() behaved as expected, initially
                      showing a delta and slowly converging<br>
                    </div>
                    <div> * w32tm /resync to fix the system clock some
                      time after perturbing it<br>
                    </div>
                    <br>
                  </div>
                  The time jump in now() occurred when now() was ~9
                  seconds behind os:timestamp() as reported by the new
                  test program.<br>
                  <br>
                </div>
                I'm starting to look at R16B now.<span class="HOEnZb"><font
                    color="#888888"><br>
                    <br>
                  </font></span></div>
              <span class="HOEnZb"><font color="#888888">-Garret Smith<br>
                </font></span></div>
            <div class="HOEnZb">
              <div class="h5">
                <div class="gmail_extra">
                  <br>
                  <br>
                  <div class="gmail_quote">On Tue, Mar 5, 2013 at 8:37
                    AM, Garret Smith <span dir="ltr"><<a
                        moz-do-not-send="true"
                        href="mailto:garret.smith@gmail.com"
                        target="_blank">garret.smith@gmail.com</a>></span>
                    wrote:<br>
                    <blockquote class="gmail_quote" style="margin:0 0 0
                      .8ex;border-left:1px #ccc solid;padding-left:1ex">
                      <div dir="ltr">
                        <div>
                          <div>
                            <div>
                              <div>I haven't seen anything unexpected in
                                os:timestamp().  No jumps at all.<br>
                                <br>
                              </div>
                              <div>CPU is an Intel Xeon X3430.<br>
                                <br>
                                I have reproduced it in the
                                LosAngeles/Pacific Time (GMT -8) and US
                                East coast time zone (GMT -5).<br>
                              </div>
                              <br>
                            </div>
                            I have not yet tried R16B.  I'll be starting
                            that today.  I'm also trying to improve the
                            test program, since it's taking quite a long
                            time between jumps for me as well.  I'll let
                            you know as soon as I have a better one.<br>
                            <br>
                          </div>
                          You have no idea how relieved I am that you
                          are looking into this!<br>
                          <br>
                        </div>
                        <div>Thanks,<br>
                        </div>
                        Garret Smith<br>
                      </div>
                      <div>
                        <div>
                          <div class="gmail_extra"><br>
                            <br>
                            <div class="gmail_quote">
                              On Tue, Mar 5, 2013 at 3:06 AM, Patrik
                              Nyblom <span dir="ltr"><<a
                                  moz-do-not-send="true"
                                  href="mailto:pan@erlang.org"
                                  target="_blank">pan@erlang.org</a>></span>
                              wrote:<br>
                              <blockquote class="gmail_quote"
                                style="margin:0 0 0 .8ex;border-left:1px
                                #ccc solid;padding-left:1ex">
                                <div bgcolor="#FFFFFF" text="#000000">
                                  <div>Hi again...<br>
                                    <br>
                                    I'm not sure about one thing. What
                                    happens to os:timestamp() during
                                    these jumps? Does it stay on track
                                    or does it also jump around?<br>
                                    <br>
                                    I've tried to reproduce it with your
                                    program, but has not yet succeeded.
                                    Have you seen this on the R16B
                                    release as well?<br>
                                    <br>
                                    Is the hardware in any way fancy
                                    (like a lot of cores, some new
                                    processor I don't have or something
                                    else?) or is there anything else
                                    special about the machine? Also the
                                    time zone you're running in would be
                                    interesting, as there is some time
                                    zone specific code there...<br>
                                    <br>
                                    I would really like to be able to
                                    reproduce it so you don't have to do
                                    all the tests at your site, it might
                                    end up being really time consuming
                                    for you if I make to many mistakes
                                    :)  <br>
                                    <br>
                                    Cheers,<br>
                                    /Patrik
                                    <div>
                                      <div><br>
                                        <br>
                                        <br>
                                        On 03/05/2013 08:50 AM, Patrik
                                        Nyblom wrote:<br>
                                      </div>
                                    </div>
                                  </div>
                                  <div>
                                    <div>
                                      <blockquote type="cite">
                                        <div>Hi!<br>
                                          <br>
                                          On 03/05/2013 02:26 AM, Garret
                                          Smith wrote:<br>
                                        </div>
                                        <blockquote type="cite">
                                          <div dir="ltr">
                                            <div>
                                              <div>
                                                <div>
                                                  <div>
                                                    <div>
                                                      <div>
                                                        <div>I have been
                                                          beating my
                                                          head against a
                                                          wall for weeks
                                                          tracking down
                                                          spooky
                                                          behaviour[sic]
                                                          in one of our
                                                          production
                                                          systems.  I
                                                          finally
                                                          tracked it
                                                          down to
                                                          "jumps" in the
                                                          times returned
                                                          by
                                                          erlang:now(),
                                                          causing all
                                                          timers in the
                                                          system to
                                                          expire at
                                                          once.  I have
                                                          witnessed this
                                                          bug on R15B01,
                                                          both 64 and
                                                          32-bit
                                                          versions
                                                          running on
                                                          Windows Server
                                                          2008 R2, both
                                                          on bare metal
                                                          and VirtualBox
                                                          VM.<br>
                                                          <br>
                                                        </div>
                                                        The time jump is
                                                        always around
                                                        2126000 seconds,
                                                        or a little over
                                                        24 days.  The
                                                        now() time does
                                                        not try to
                                                        converge with
                                                        os:timestamp()
                                                        as the
                                                        documentation
                                                        suggests, and as
                                                        I confirmed it
                                                        does if you just
                                                        change the
                                                        system clock.<br>
                                                        <br>
                                                      </div>
                                                      Another VM running
                                                      concurrently on
                                                      the same machine
                                                      but with little
                                                      load (diagnostic
                                                      node &
                                                      production node)
                                                      did not time jump.<br>
                                                      <br>
                                                    </div>
                                                    <div>Higher load
                                                      seems to make the
                                                      time jumps happen
                                                      more often.<br>
                                                    </div>
                                                    <div><br>
                                                    </div>
                                                    Frequency between
                                                    time jumps varies
                                                    between seconds and
                                                    hours, but when a
                                                    jump occurs, it is
                                                    always 2126000 + (9
                                                    to 26) seconds.<br>
                                                    <br>
                                                  </div>
                                                  I never see the jump
                                                  in logfile timestamps
                                                  that use
                                                  os:timestamp() for
                                                  tagging log messages. 
                                                  I had to start tracing
                                                  a production node
                                                  before I caught the
                                                  jump.  Here are some
                                                  lines from a trace,
                                                  where the timestamp in
                                                  trace_ts is printed
                                                  using
                                                  calendar:now_to_local_time()
                                                  and then in raw tuple
                                                  format:<br>
                                                  <br>
                                                  2013-4-16
                                                  21:40:1.993399|{1366,173601,993399}<br>
                                                  2013-4-16
                                                  21:40:1.993400|{1366,173601,993400}<br>
                                                  2013-5-11
                                                  12:13:41.986961|{1368,299621,986961}<br>
                                                  2013-5-11
                                                  12:13:41.986962|{1368,299621,986962}<br>
                                                  <br>
                                                </div>
                                                then a bit later...<br>
                                                <br>
                                                2013-5-11
                                                12:36:19.955129|{1368,300979,955129}<br>
                                                2013-5-11
                                                12:36:19.955130|{1368,300979,955130}<br>
                                                2013-6-5
                                                3:9:49.538830|{1370,426989,538830}<br>
                                                2013-6-5
                                                3:9:49.538833|{1370,426989,538833}<br>
                                                <br>
                                              </div>
                                            </div>
                                          </div>
                                        </blockquote>
                                        Gah! That's obviously not
                                        supposed to happen... <br>
                                        <blockquote type="cite">
                                          <div dir="ltr">
                                            <div>I captured many such
                                              jumps over the course of a
                                              day or so.  Obviously from
                                              the dates, 2 jumps
                                              happened before I started
                                              tracing.<br>
                                              <br>
                                            </div>
                                            <div>I was able to reproduce
                                              the bug, though not as
                                              efficiently as my
                                              production system, with
                                              the following sample
                                              program: <a
                                                moz-do-not-send="true"
                                                href="https://gist.github.com/garret-smith/5087169"
                                                target="_blank">https://gist.github.com/garret-smith/5087169</a><br>
                                              <br>
                                              It took over an hour of
                                              runtime before the first
                                              time jump.  I am working
                                              on a better way to
                                              reproduce it at the
                                              moment, but it's hard to
                                              test the test with a bug
                                              so intermittent.<br>
                                            </div>
                                            <div><br>
                                            </div>
                                            <div>I am also testing
                                              various other VM
                                              versions.  My first hope
                                              was that this was limited
                                              to the 64-bit version
                                              where we first encountered
                                              the problem, but a change
                                              to the 32-bit version has
                                              only made the problem
                                              happen less often, not
                                              eliminated it.<br>
                                              <br>
                                            </div>
                                            <div>We never saw this bug
                                              with R14B03 which we were
                                              running previously to
                                              R15B01.  However, system
                                              load is different so I
                                              can't make a direct
                                              comparison.  I did notice
                                              a few significant updates
                                              to the Windows time
                                              related code between
                                              R14B03 and R15:<br>
                                              <br>
                                            </div>
                                            <div>git log sys_time.c<br>
                                            </div>
                                            <div><br>
                                            </div>
                                            <div>commit
                                              46eb4359b05b220861453a869dc734480ec045a6<br>
                                              Author: Patrik Nyblom <<a
                                                moz-do-not-send="true"
                                                href="mailto:pan@erlang.org"
                                                target="_blank">pan@erlang.org</a>><br>
                                              Date:   Tue Dec 6 19:07:16
                                              2011 +0100<br>
                                              <br>
                                                  Emulate localtime,
                                              gmtime and mktime to
                                              enable negative time_t<br>
                                              <br>
                                              commit
                                              913f05af100e98a8665bbb6168e89fbcfe4ece75<br>
                                              Author:
                                              Bj<C3><B6>rn-Egil
                                              Dahlberg <<a
                                                moz-do-not-send="true"
                                                href="mailto:egil@erlang.org"
                                                target="_blank">egil@erlang.org</a>><br>
                                              Date:   Fri Dec 2 15:25:06
                                              2011 +0100<br>
                                              <br>
                                                  Teach windows
                                              sys_localtime_r<br>
                                              <br>
                                              <br>
                                            </div>
                                          </div>
                                        </blockquote>
                                        Yep, that's me... But even if I
                                        gave a totally weird time back
                                        from those, the erlang:now logic
                                        should have stopped this from
                                        happening. I'll try to reproduce
                                        using your example program. If
                                        nothing else helps, I'll
                                        instrument a VM that gives som
                                        traces in the time code...<br>
                                        <blockquote type="cite">
                                          <div dir="ltr">
                                            <div>I am completely
                                              stumped.  What can I do
                                              next to help track down
                                              the source of the bug?<br>
                                              <br>
                                            </div>
                                          </div>
                                        </blockquote>
                                        Unfortunately, so am I.
                                        Especially weird that it's load
                                        related... Maybe something is
                                        not locked as it should be...<br>
                                        <blockquote type="cite">
                                          <div dir="ltr">
                                            <div>Thanks,<br>
                                            </div>
                                            <div>Garret Smith<br>
                                            </div>
                                          </div>
                                        </blockquote>
                                        Thanks for reporting, I'll get
                                        back to you!<br>
                                        <br>
                                        Cheers,<br>
                                        /Patrik<br>
                                        <blockquote type="cite"> <br>
                                          <fieldset></fieldset>
                                          <br>
                                          <pre>_______________________________________________
erlang-bugs mailing list
<a moz-do-not-send="true" href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a moz-do-not-send="true" href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
                                        </blockquote>
                                        <br>
                                        <br>
                                        <fieldset></fieldset>
                                        <br>
                                        <pre>_______________________________________________
erlang-bugs mailing list
<a moz-do-not-send="true" href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a moz-do-not-send="true" href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
                                      </blockquote>
                                      <br>
                                    </div>
                                  </div>
                                </div>
                                <br>
_______________________________________________<br>
                                erlang-bugs mailing list<br>
                                <a moz-do-not-send="true"
                                  href="mailto:erlang-bugs@erlang.org"
                                  target="_blank">erlang-bugs@erlang.org</a><br>
                                <a moz-do-not-send="true"
                                  href="http://erlang.org/mailman/listinfo/erlang-bugs"
                                  target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>
                                <br>
                              </blockquote>
                            </div>
                            <br>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                  <br>
                </div>
              </div>
            </div>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
  </body>
</html>