<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Hi Garret!<br>
<br>
I've been able to reproduce it on my freshly installed Win2008
machine! Great, now I only need to debug it and find the error :)<br>
<br>
I'll get back to you as soon as I feel I have a fix - it might
take a few days, given the relatively long turn around time, but
we'll get there!<br>
<br>
Thank you for all the help and information!<br>
<br>
Cheers,<br>
/Patrik<br>
<br>
On 03/05/2013 09:10 PM, Garret Smith wrote:<br>
</div>
<blockquote
cite="mid:CAHmviK9G9kMfo_DAQrH+uXt+NouqgmXBLbtVF7+UQ2WP4fNBOA@mail.gmail.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=ISO-8859-1">
<div dir="ltr">
<div>On the same machine with the same steps as previous, I
reproduced the time jump on R16B.<br>
This time the jump happened with a <5 sec delta btw now()
and os:timestamp().<br>
Still jumping ~2126000 seconds.<br>
<br>
</div>
-Garret<br>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Tue, Mar 5, 2013 at 11:20 AM, Garret
Smith <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:garret.smith@gmail.com" target="_blank">garret.smith@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>
<div>The gist <a moz-do-not-send="true"
href="https://gist.github.com/garret-smith/5087169"
target="_blank">https://gist.github.com/garret-smith/5087169</a>
is updated with a slightly better version. I was
able to reproduce the jump in less than an hour.
I also did some more things to perturb the timing
code while the test program was running.<br>
<br>
</div>
<div>Here is the latest info, everything I can think
of that may have the slightest effect:<br>
</div>
<div> * R15B01 64-bit build<br>
</div>
<div> * Pacific time zone (GMT -8)<br>
</div>
<div> * Xeon E5405 in an HP DL160<br>
</div>
<div> * no arguments to erl.exe<br>
</div>
<div> * bursty, high CPU load, >75% memory used
by other software<br>
</div>
<div> * running Observer on the test VM displaying
the "Load Charts" tab<br>
</div>
<div>
* made some small adjustments (~ 60 seconds) to
the system clock while running the tests - now()
and os:timestamp() behaved as expected, initially
showing a delta and slowly converging<br>
</div>
<div> * w32tm /resync to fix the system clock some
time after perturbing it<br>
</div>
<br>
</div>
The time jump in now() occurred when now() was ~9
seconds behind os:timestamp() as reported by the new
test program.<br>
<br>
</div>
I'm starting to look at R16B now.<span class="HOEnZb"><font
color="#888888"><br>
<br>
</font></span></div>
<span class="HOEnZb"><font color="#888888">-Garret Smith<br>
</font></span></div>
<div class="HOEnZb">
<div class="h5">
<div class="gmail_extra">
<br>
<br>
<div class="gmail_quote">On Tue, Mar 5, 2013 at 8:37
AM, Garret Smith <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:garret.smith@gmail.com"
target="_blank">garret.smith@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>
<div>I haven't seen anything unexpected in
os:timestamp(). No jumps at all.<br>
<br>
</div>
<div>CPU is an Intel Xeon X3430.<br>
<br>
I have reproduced it in the
LosAngeles/Pacific Time (GMT -8) and US
East coast time zone (GMT -5).<br>
</div>
<br>
</div>
I have not yet tried R16B. I'll be starting
that today. I'm also trying to improve the
test program, since it's taking quite a long
time between jumps for me as well. I'll let
you know as soon as I have a better one.<br>
<br>
</div>
You have no idea how relieved I am that you
are looking into this!<br>
<br>
</div>
<div>Thanks,<br>
</div>
Garret Smith<br>
</div>
<div>
<div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">
On Tue, Mar 5, 2013 at 3:06 AM, Patrik
Nyblom <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:pan@erlang.org"
target="_blank">pan@erlang.org</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<div>Hi again...<br>
<br>
I'm not sure about one thing. What
happens to os:timestamp() during
these jumps? Does it stay on track
or does it also jump around?<br>
<br>
I've tried to reproduce it with your
program, but has not yet succeeded.
Have you seen this on the R16B
release as well?<br>
<br>
Is the hardware in any way fancy
(like a lot of cores, some new
processor I don't have or something
else?) or is there anything else
special about the machine? Also the
time zone you're running in would be
interesting, as there is some time
zone specific code there...<br>
<br>
I would really like to be able to
reproduce it so you don't have to do
all the tests at your site, it might
end up being really time consuming
for you if I make to many mistakes
:) <br>
<br>
Cheers,<br>
/Patrik
<div>
<div><br>
<br>
<br>
On 03/05/2013 08:50 AM, Patrik
Nyblom wrote:<br>
</div>
</div>
</div>
<div>
<div>
<blockquote type="cite">
<div>Hi!<br>
<br>
On 03/05/2013 02:26 AM, Garret
Smith wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>
<div>
<div>
<div>
<div>
<div>I have been
beating my
head against a
wall for weeks
tracking down
spooky
behaviour[sic]
in one of our
production
systems. I
finally
tracked it
down to
"jumps" in the
times returned
by
erlang:now(),
causing all
timers in the
system to
expire at
once. I have
witnessed this
bug on R15B01,
both 64 and
32-bit
versions
running on
Windows Server
2008 R2, both
on bare metal
and VirtualBox
VM.<br>
<br>
</div>
The time jump is
always around
2126000 seconds,
or a little over
24 days. The
now() time does
not try to
converge with
os:timestamp()
as the
documentation
suggests, and as
I confirmed it
does if you just
change the
system clock.<br>
<br>
</div>
Another VM running
concurrently on
the same machine
but with little
load (diagnostic
node &
production node)
did not time jump.<br>
<br>
</div>
<div>Higher load
seems to make the
time jumps happen
more often.<br>
</div>
<div><br>
</div>
Frequency between
time jumps varies
between seconds and
hours, but when a
jump occurs, it is
always 2126000 + (9
to 26) seconds.<br>
<br>
</div>
I never see the jump
in logfile timestamps
that use
os:timestamp() for
tagging log messages.
I had to start tracing
a production node
before I caught the
jump. Here are some
lines from a trace,
where the timestamp in
trace_ts is printed
using
calendar:now_to_local_time()
and then in raw tuple
format:<br>
<br>
2013-4-16
21:40:1.993399|{1366,173601,993399}<br>
2013-4-16
21:40:1.993400|{1366,173601,993400}<br>
2013-5-11
12:13:41.986961|{1368,299621,986961}<br>
2013-5-11
12:13:41.986962|{1368,299621,986962}<br>
<br>
</div>
then a bit later...<br>
<br>
2013-5-11
12:36:19.955129|{1368,300979,955129}<br>
2013-5-11
12:36:19.955130|{1368,300979,955130}<br>
2013-6-5
3:9:49.538830|{1370,426989,538830}<br>
2013-6-5
3:9:49.538833|{1370,426989,538833}<br>
<br>
</div>
</div>
</div>
</blockquote>
Gah! That's obviously not
supposed to happen... <br>
<blockquote type="cite">
<div dir="ltr">
<div>I captured many such
jumps over the course of a
day or so. Obviously from
the dates, 2 jumps
happened before I started
tracing.<br>
<br>
</div>
<div>I was able to reproduce
the bug, though not as
efficiently as my
production system, with
the following sample
program: <a
moz-do-not-send="true"
href="https://gist.github.com/garret-smith/5087169"
target="_blank">https://gist.github.com/garret-smith/5087169</a><br>
<br>
It took over an hour of
runtime before the first
time jump. I am working
on a better way to
reproduce it at the
moment, but it's hard to
test the test with a bug
so intermittent.<br>
</div>
<div><br>
</div>
<div>I am also testing
various other VM
versions. My first hope
was that this was limited
to the 64-bit version
where we first encountered
the problem, but a change
to the 32-bit version has
only made the problem
happen less often, not
eliminated it.<br>
<br>
</div>
<div>We never saw this bug
with R14B03 which we were
running previously to
R15B01. However, system
load is different so I
can't make a direct
comparison. I did notice
a few significant updates
to the Windows time
related code between
R14B03 and R15:<br>
<br>
</div>
<div>git log sys_time.c<br>
</div>
<div><br>
</div>
<div>commit
46eb4359b05b220861453a869dc734480ec045a6<br>
Author: Patrik Nyblom <<a
moz-do-not-send="true"
href="mailto:pan@erlang.org"
target="_blank">pan@erlang.org</a>><br>
Date: Tue Dec 6 19:07:16
2011 +0100<br>
<br>
Emulate localtime,
gmtime and mktime to
enable negative time_t<br>
<br>
commit
913f05af100e98a8665bbb6168e89fbcfe4ece75<br>
Author:
Bj<C3><B6>rn-Egil
Dahlberg <<a
moz-do-not-send="true"
href="mailto:egil@erlang.org"
target="_blank">egil@erlang.org</a>><br>
Date: Fri Dec 2 15:25:06
2011 +0100<br>
<br>
Teach windows
sys_localtime_r<br>
<br>
<br>
</div>
</div>
</blockquote>
Yep, that's me... But even if I
gave a totally weird time back
from those, the erlang:now logic
should have stopped this from
happening. I'll try to reproduce
using your example program. If
nothing else helps, I'll
instrument a VM that gives som
traces in the time code...<br>
<blockquote type="cite">
<div dir="ltr">
<div>I am completely
stumped. What can I do
next to help track down
the source of the bug?<br>
<br>
</div>
</div>
</blockquote>
Unfortunately, so am I.
Especially weird that it's load
related... Maybe something is
not locked as it should be...<br>
<blockquote type="cite">
<div dir="ltr">
<div>Thanks,<br>
</div>
<div>Garret Smith<br>
</div>
</div>
</blockquote>
Thanks for reporting, I'll get
back to you!<br>
<br>
Cheers,<br>
/Patrik<br>
<blockquote type="cite"> <br>
<fieldset></fieldset>
<br>
<pre>_______________________________________________
erlang-bugs mailing list
<a moz-do-not-send="true" href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a moz-do-not-send="true" href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
</blockquote>
<br>
<br>
<fieldset></fieldset>
<br>
<pre>_______________________________________________
erlang-bugs mailing list
<a moz-do-not-send="true" href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a moz-do-not-send="true" href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
</blockquote>
<br>
</div>
</div>
</div>
<br>
_______________________________________________<br>
erlang-bugs mailing list<br>
<a moz-do-not-send="true"
href="mailto:erlang-bugs@erlang.org"
target="_blank">erlang-bugs@erlang.org</a><br>
<a moz-do-not-send="true"
href="http://erlang.org/mailman/listinfo/erlang-bugs"
target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>
<br>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</body>
</html>