<div dir="ltr"><div><div><div>Patrik,<br></div><div><br></div>Our production systems are on R15B1/2, so I won't be able to verify against that, but I'll let you know what I see running my test program against R16B.<br>
<br></div>Will you be able to generate a patched R15x version? If not, I'll try to set up a build system and apply the patch locally.<br><br></div>-Garret<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">
On Mon, Mar 11, 2013 at 9:26 AM, Patrik Nyblom <span dir="ltr"><<a href="mailto:pan@erlang.org" target="_blank">pan@erlang.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<div>Hi again!<br>
<br>
I think I've found it. At least I've found one error, hopefully
that's the one you've also found :)<br>
<br>
The sys_gethrtime function has gon new uses in R15 and on, uses
where it is no longer protected by the erts_timeofday_mtx. So -
it simply needs a lock of it's own. This gives a slight
performance loss, but that could be fixed by using GetTickCount64
on win7 and win2008 at least. <br>
<br>
Can you try a version of beam.smp.dll with a lock and see if the
error is gone on your machines? If that works, I would also like
you to try an optimized version, but let's first make sure we have
the bug nailed down :)<br>
<br>
In my dropbox, there's a beam.smp.dll. If you replace
$ERL_ROOT/erts-5.10.1/bin/beam.smp.dll with that one and then
start werl, the slogan should contain [source-be0da3e]. It is for
64bit windows. The public dropbox URL is:<br>
<a href="http://dl.dropbox.com/u/17212223/beam.smp.dll" target="_blank">http://dl.dropbox.com/u/17212223/beam.smp.dll</a><br>
<br>
This should work without any special messages or such, giving a
working erlang:now/0. If it starts sending strange ERROR REPORT's
about ticks moving slightly backwards, we have a more complicated
bug, but I haven't seen any such messages since i added proper
locking.<br>
<br>
If it's possible for you to test this, I would be immensely
grateful!<br>
<br>
Cheers,<br>
/Patrik<div><div class="h5"><br>
On 03/07/2013 04:37 PM, Patrik Nyblom wrote:<br>
</div></div></div><div><div class="h5">
<blockquote type="cite">
<div>Hi Garret!<br>
<br>
I've been able to reproduce it on my freshly installed Win2008
machine! Great, now I only need to debug it and find the error
:)<br>
<br>
I'll get back to you as soon as I feel I have a fix - it might
take a few days, given the relatively long turn around time, but
we'll get there!<br>
<br>
Thank you for all the help and information!<br>
<br>
Cheers,<br>
/Patrik<br>
<br>
On 03/05/2013 09:10 PM, Garret Smith wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>On the same machine with the same steps as previous, I
reproduced the time jump on R16B.<br>
This time the jump happened with a <5 sec delta btw now()
and os:timestamp().<br>
Still jumping ~2126000 seconds.<br>
<br>
</div>
-Garret<br>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Tue, Mar 5, 2013 at 11:20 AM,
Garret Smith <span dir="ltr"><<a href="mailto:garret.smith@gmail.com" target="_blank">garret.smith@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>
<div>The gist <a href="https://gist.github.com/garret-smith/5087169" target="_blank">https://gist.github.com/garret-smith/5087169</a>
is updated with a slightly better version. I
was able to reproduce the jump in less than an
hour. I also did some more things to perturb
the timing code while the test program was
running.<br>
<br>
</div>
<div>Here is the latest info, everything I can
think of that may have the slightest effect:<br>
</div>
<div> * R15B01 64-bit build<br>
</div>
<div> * Pacific time zone (GMT -8)<br>
</div>
<div> * Xeon E5405 in an HP DL160<br>
</div>
<div> * no arguments to erl.exe<br>
</div>
<div> * bursty, high CPU load, >75% memory used
by other software<br>
</div>
<div> * running Observer on the test VM displaying
the "Load Charts" tab<br>
</div>
<div> * made some small adjustments (~ 60
seconds) to the system clock while running the
tests - now() and os:timestamp() behaved as
expected, initially showing a delta and slowly
converging<br>
</div>
<div> * w32tm /resync to fix the system clock some
time after perturbing it<br>
</div>
<br>
</div>
The time jump in now() occurred when now() was ~9
seconds behind os:timestamp() as reported by the new
test program.<br>
<br>
</div>
I'm starting to look at R16B now.<span><font color="#888888"><br>
<br>
</font></span></div>
<span><font color="#888888">-Garret Smith<br>
</font></span></div>
<div>
<div>
<div class="gmail_extra"> <br>
<br>
<div class="gmail_quote">On Tue, Mar 5, 2013 at 8:37
AM, Garret Smith <span dir="ltr"><<a href="mailto:garret.smith@gmail.com" target="_blank">garret.smith@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>
<div>I haven't seen anything unexpected
in os:timestamp(). No jumps at all.<br>
<br>
</div>
<div>CPU is an Intel Xeon X3430.<br>
<br>
I have reproduced it in the
LosAngeles/Pacific Time (GMT -8) and
US East coast time zone (GMT -5).<br>
</div>
<br>
</div>
I have not yet tried R16B. I'll be
starting that today. I'm also trying to
improve the test program, since it's
taking quite a long time between jumps for
me as well. I'll let you know as soon as
I have a better one.<br>
<br>
</div>
You have no idea how relieved I am that you
are looking into this!<br>
<br>
</div>
<div>Thanks,<br>
</div>
Garret Smith<br>
</div>
<div>
<div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote"> On Tue, Mar 5,
2013 at 3:06 AM, Patrik Nyblom <span dir="ltr"><<a href="mailto:pan@erlang.org" target="_blank">pan@erlang.org</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<div>Hi again...<br>
<br>
I'm not sure about one thing. What
happens to os:timestamp() during
these jumps? Does it stay on track
or does it also jump around?<br>
<br>
I've tried to reproduce it with
your program, but has not yet
succeeded. Have you seen this on
the R16B release as well?<br>
<br>
Is the hardware in any way fancy
(like a lot of cores, some new
processor I don't have or
something else?) or is there
anything else special about the
machine? Also the time zone you're
running in would be interesting,
as there is some time zone
specific code there...<br>
<br>
I would really like to be able to
reproduce it so you don't have to
do all the tests at your site, it
might end up being really time
consuming for you if I make to
many mistakes :) <br>
<br>
Cheers,<br>
/Patrik
<div>
<div><br>
<br>
<br>
On 03/05/2013 08:50 AM, Patrik
Nyblom wrote:<br>
</div>
</div>
</div>
<div>
<div>
<blockquote type="cite">
<div>Hi!<br>
<br>
On 03/05/2013 02:26 AM,
Garret Smith wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>
<div>
<div>
<div>
<div>
<div>I have
been beating
my head
against a wall
for weeks
tracking down
spooky
behaviour[sic]
in one of our
production
systems. I
finally
tracked it
down to
"jumps" in the
times returned
by
erlang:now(),
causing all
timers in the
system to
expire at
once. I have
witnessed this
bug on R15B01,
both 64 and
32-bit
versions
running on
Windows Server
2008 R2, both
on bare metal
and VirtualBox
VM.<br>
<br>
</div>
The time jump
is always
around 2126000
seconds, or a
little over 24
days. The
now() time
does not try
to converge
with
os:timestamp()
as the
documentation
suggests, and
as I confirmed
it does if you
just change
the system
clock.<br>
<br>
</div>
Another VM
running
concurrently on
the same machine
but with little
load (diagnostic
node &
production node)
did not time
jump.<br>
<br>
</div>
<div>Higher load
seems to make
the time jumps
happen more
often.<br>
</div>
<div><br>
</div>
Frequency between
time jumps varies
between seconds
and hours, but
when a jump
occurs, it is
always 2126000 +
(9 to 26) seconds.<br>
<br>
</div>
I never see the jump
in logfile
timestamps that use
os:timestamp() for
tagging log
messages. I had to
start tracing a
production node
before I caught the
jump. Here are some
lines from a trace,
where the timestamp
in trace_ts is
printed using
calendar:now_to_local_time()
and then in raw
tuple format:<br>
<br>
2013-4-16
21:40:1.993399|{1366,173601,993399}<br>
2013-4-16
21:40:1.993400|{1366,173601,993400}<br>
2013-5-11
12:13:41.986961|{1368,299621,986961}<br>
2013-5-11
12:13:41.986962|{1368,299621,986962}<br>
<br>
</div>
then a bit later...<br>
<br>
2013-5-11
12:36:19.955129|{1368,300979,955129}<br>
2013-5-11
12:36:19.955130|{1368,300979,955130}<br>
2013-6-5
3:9:49.538830|{1370,426989,538830}<br>
2013-6-5
3:9:49.538833|{1370,426989,538833}<br>
<br>
</div>
</div>
</div>
</blockquote>
Gah! That's obviously not
supposed to happen... <br>
<blockquote type="cite">
<div dir="ltr">
<div>I captured many such
jumps over the course of
a day or so. Obviously
from the dates, 2 jumps
happened before I
started tracing.<br>
<br>
</div>
<div>I was able to
reproduce the bug,
though not as
efficiently as my
production system, with
the following sample
program: <a href="https://gist.github.com/garret-smith/5087169" target="_blank">https://gist.github.com/garret-smith/5087169</a><br>
<br>
It took over an hour of
runtime before the first
time jump. I am working
on a better way to
reproduce it at the
moment, but it's hard to
test the test with a bug
so intermittent.<br>
</div>
<div><br>
</div>
<div>I am also testing
various other VM
versions. My first hope
was that this was
limited to the 64-bit
version where we first
encountered the problem,
but a change to the
32-bit version has only
made the problem happen
less often, not
eliminated it.<br>
<br>
</div>
<div>We never saw this bug
with R14B03 which we
were running previously
to R15B01. However,
system load is different
so I can't make a direct
comparison. I did
notice a few significant
updates to the Windows
time related code
between R14B03 and R15:<br>
<br>
</div>
<div>git log sys_time.c<br>
</div>
<div><br>
</div>
<div>commit
46eb4359b05b220861453a869dc734480ec045a6<br>
Author: Patrik Nyblom
<<a href="mailto:pan@erlang.org" target="_blank">pan@erlang.org</a>><br>
Date: Tue Dec 6
19:07:16 2011 +0100<br>
<br>
Emulate localtime,
gmtime and mktime to
enable negative time_t<br>
<br>
commit
913f05af100e98a8665bbb6168e89fbcfe4ece75<br>
Author:
Bj<C3><B6>rn-Egil
Dahlberg <<a href="mailto:egil@erlang.org" target="_blank">egil@erlang.org</a>><br>
Date: Fri Dec 2
15:25:06 2011 +0100<br>
<br>
Teach windows
sys_localtime_r<br>
<br>
<br>
</div>
</div>
</blockquote>
Yep, that's me... But even if
I gave a totally weird time
back from those, the
erlang:now logic should have
stopped this from happening.
I'll try to reproduce using
your example program. If
nothing else helps, I'll
instrument a VM that gives som
traces in the time code...<br>
<blockquote type="cite">
<div dir="ltr">
<div>I am completely
stumped. What can I do
next to help track down
the source of the bug?<br>
<br>
</div>
</div>
</blockquote>
Unfortunately, so am I.
Especially weird that it's
load related... Maybe
something is not locked as it
should be...<br>
<blockquote type="cite">
<div dir="ltr">
<div>Thanks,<br>
</div>
<div>Garret Smith<br>
</div>
</div>
</blockquote>
Thanks for reporting, I'll get
back to you!<br>
<br>
Cheers,<br>
/Patrik<br>
<blockquote type="cite"> <br>
<fieldset></fieldset>
<br>
<pre>_______________________________________________
erlang-bugs mailing list
<a href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
</blockquote>
<br>
<br>
<fieldset></fieldset>
<br>
<pre>_______________________________________________
erlang-bugs mailing list
<a href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
</blockquote>
<br>
</div>
</div>
</div>
<br>
_______________________________________________<br>
erlang-bugs mailing list<br>
<a href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a><br>
<a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>
<br>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<br>
<fieldset></fieldset>
<br>
<pre>_______________________________________________
erlang-bugs mailing list
<a href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
</blockquote>
<br>
</div></div></div>
<br>_______________________________________________<br>
erlang-bugs mailing list<br>
<a href="mailto:erlang-bugs@erlang.org">erlang-bugs@erlang.org</a><br>
<a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>
<br></blockquote></div><br></div>