<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Hi!<br>
<br>
On 03/05/2013 02:26 AM, Garret Smith wrote:<br>
</div>
<blockquote
cite="mid:CAHmviK9hbY9peOE25=SDEpN1=G==GUPzAmmYq0YX-c5yL4KhTQ@mail.gmail.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=ISO-8859-1">
<div dir="ltr">
<div>
<div>
<div>
<div>
<div>
<div>
<div>I have been beating my head against a wall for
weeks tracking down spooky behaviour[sic] in one
of our production systems. I finally tracked it
down to "jumps" in the times returned by
erlang:now(), causing all timers in the system to
expire at once. I have witnessed this bug on
R15B01, both 64 and 32-bit versions running on
Windows Server 2008 R2, both on bare metal and
VirtualBox VM.<br>
<br>
</div>
The time jump is always around 2126000 seconds, or a
little over 24 days. The now() time does not try to
converge with os:timestamp() as the documentation
suggests, and as I confirmed it does if you just
change the system clock.<br>
<br>
</div>
Another VM running concurrently on the same machine
but with little load (diagnostic node & production
node) did not time jump.<br>
<br>
</div>
<div>Higher load seems to make the time jumps happen
more often.<br>
</div>
<div><br>
</div>
Frequency between time jumps varies between seconds and
hours, but when a jump occurs, it is always 2126000 + (9
to 26) seconds.<br>
<br>
</div>
I never see the jump in logfile timestamps that use
os:timestamp() for tagging log messages. I had to start
tracing a production node before I caught the jump. Here
are some lines from a trace, where the timestamp in
trace_ts is printed using calendar:now_to_local_time() and
then in raw tuple format:<br>
<br>
2013-4-16 21:40:1.993399|{1366,173601,993399}<br>
2013-4-16 21:40:1.993400|{1366,173601,993400}<br>
2013-5-11 12:13:41.986961|{1368,299621,986961}<br>
2013-5-11 12:13:41.986962|{1368,299621,986962}<br>
<br>
</div>
then a bit later...<br>
<br>
2013-5-11 12:36:19.955129|{1368,300979,955129}<br>
2013-5-11 12:36:19.955130|{1368,300979,955130}<br>
2013-6-5 3:9:49.538830|{1370,426989,538830}<br>
2013-6-5 3:9:49.538833|{1370,426989,538833}<br>
<br>
</div>
</div>
</div>
</blockquote>
Gah! That's obviously not supposed to happen... <br>
<blockquote
cite="mid:CAHmviK9hbY9peOE25=SDEpN1=G==GUPzAmmYq0YX-c5yL4KhTQ@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>I captured many such jumps over the course of a day or so.
Obviously from the dates, 2 jumps happened before I started
tracing.<br>
<br>
</div>
<div>I was able to reproduce the bug, though not as efficiently
as my production system, with the following sample program: <a
moz-do-not-send="true"
href="https://gist.github.com/garret-smith/5087169">https://gist.github.com/garret-smith/5087169</a><br>
<br>
It took over an hour of runtime before the first time jump. I
am working on a better way to reproduce it at the moment, but
it's hard to test the test with a bug so intermittent.<br>
</div>
<div><br>
</div>
<div>I am also testing various other VM versions. My first hope
was that this was limited to the 64-bit version where we first
encountered the problem, but a change to the 32-bit version
has only made the problem happen less often, not eliminated
it.<br>
<br>
</div>
<div>We never saw this bug with R14B03 which we were running
previously to R15B01. However, system load is different so I
can't make a direct comparison. I did notice a few
significant updates to the Windows time related code between
R14B03 and R15:<br>
<br>
</div>
<div>git log sys_time.c<br>
</div>
<div><br>
</div>
<div>commit 46eb4359b05b220861453a869dc734480ec045a6<br>
Author: Patrik Nyblom <<a moz-do-not-send="true"
href="mailto:pan@erlang.org">pan@erlang.org</a>><br>
Date: Tue Dec 6 19:07:16 2011 +0100<br>
<br>
Emulate localtime, gmtime and mktime to enable negative
time_t<br>
<br>
commit 913f05af100e98a8665bbb6168e89fbcfe4ece75<br>
Author: Bj<C3><B6>rn-Egil Dahlberg <<a
moz-do-not-send="true" href="mailto:egil@erlang.org">egil@erlang.org</a>><br>
Date: Fri Dec 2 15:25:06 2011 +0100<br>
<br>
Teach windows sys_localtime_r<br>
<br>
<br>
</div>
</div>
</blockquote>
Yep, that's me... But even if I gave a totally weird time back from
those, the erlang:now logic should have stopped this from happening.
I'll try to reproduce using your example program. If nothing else
helps, I'll instrument a VM that gives som traces in the time
code...<br>
<blockquote
cite="mid:CAHmviK9hbY9peOE25=SDEpN1=G==GUPzAmmYq0YX-c5yL4KhTQ@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>I am completely stumped. What can I do next to help track
down the source of the bug?<br>
<br>
</div>
</div>
</blockquote>
Unfortunately, so am I. Especially weird that it's load related...
Maybe something is not locked as it should be...<br>
<blockquote
cite="mid:CAHmviK9hbY9peOE25=SDEpN1=G==GUPzAmmYq0YX-c5yL4KhTQ@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>Thanks,<br>
</div>
<div>Garret Smith<br>
</div>
</div>
</blockquote>
Thanks for reporting, I'll get back to you!<br>
<br>
Cheers,<br>
/Patrik<br>
<blockquote
cite="mid:CAHmviK9hbY9peOE25=SDEpN1=G==GUPzAmmYq0YX-c5yL4KhTQ@mail.gmail.com"
type="cite">
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
erlang-bugs mailing list
<a class="moz-txt-link-abbreviated" href="mailto:erlang-bugs@erlang.org">erlang-bugs@erlang.org</a>
<a class="moz-txt-link-freetext" href="http://erlang.org/mailman/listinfo/erlang-bugs">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
</blockquote>
<br>
</body>
</html>