<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Hi!<br>
<br>
Good! Thanks!<br>
<br>
I can build a patched R15 beam.dll for you, easiest is R15B03, but
i can do a patched beam for R15B02 if that's really needed. In the
end I'll probably build some kind of R15B03-2 and a R16B00-1 or
something, so whoever wants the patch can get binaries. However I
would like to have something tested in your real system, if that's
OK with you. So - which version is best to patch for? R15B02?<br>
<br>
/Patrik<br>
<br>
On 03/12/2013 12:48 AM, Garret Smith wrote:<br>
</div>
<blockquote
cite="mid:CAHmviK-V1WYZhDuFQ4yLR80oVPwtEb6Zd5fTSua5a2b1QRv-cg@mail.gmail.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=ISO-8859-1">
<div dir="ltr">Been running the test program all day in the same
scenario as before. No time jumps! Looking good...<br>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Mon, Mar 11, 2013 at 9:34 AM, Garret
Smith <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:garret.smith@gmail.com" target="_blank">garret.smith@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>Patrik,<br>
</div>
<div><br>
</div>
Our production systems are on R15B1/2, so I won't be
able to verify against that, but I'll let you know
what I see running my test program against R16B.<br>
<br>
</div>
Will you be able to generate a patched R15x version? If
not, I'll try to set up a build system and apply the
patch locally.<span class="HOEnZb"><font color="#888888"><br>
<br>
</font></span></div>
<span class="HOEnZb"><font color="#888888">-Garret<br>
</font></span></div>
<div class="HOEnZb">
<div class="h5">
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">
On Mon, Mar 11, 2013 at 9:26 AM, Patrik Nyblom <span
dir="ltr"><<a moz-do-not-send="true"
href="mailto:pan@erlang.org" target="_blank">pan@erlang.org</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<div>Hi again!<br>
<br>
I think I've found it. At least I've found one
error, hopefully that's the one you've also
found :)<br>
<br>
The sys_gethrtime function has gon new uses in
R15 and on, uses where it is no longer
protected by the erts_timeofday_mtx. So - it
simply needs a lock of it's own. This gives a
slight performance loss, but that could be
fixed by using GetTickCount64 on win7 and
win2008 at least. <br>
<br>
Can you try a version of beam.smp.dll with a
lock and see if the error is gone on your
machines? If that works, I would also like you
to try an optimized version, but let's first
make sure we have the bug nailed down :)<br>
<br>
In my dropbox, there's a beam.smp.dll. If you
replace $ERL_ROOT/erts-5.10.1/bin/beam.smp.dll
with that one and then start werl, the slogan
should contain [source-be0da3e]. It is for
64bit windows. The public dropbox URL is:<br>
<a moz-do-not-send="true"
href="http://dl.dropbox.com/u/17212223/beam.smp.dll"
target="_blank">http://dl.dropbox.com/u/17212223/beam.smp.dll</a><br>
<br>
This should work without any special messages
or such, giving a working erlang:now/0. If it
starts sending strange ERROR REPORT's about
ticks moving slightly backwards, we have a
more complicated bug, but I haven't seen any
such messages since i added proper locking.<br>
<br>
If it's possible for you to test this, I would
be immensely grateful!<br>
<br>
Cheers,<br>
/Patrik
<div>
<div><br>
On 03/07/2013 04:37 PM, Patrik Nyblom
wrote:<br>
</div>
</div>
</div>
<div>
<div>
<blockquote type="cite">
<div>Hi Garret!<br>
<br>
I've been able to reproduce it on my
freshly installed Win2008 machine!
Great, now I only need to debug it and
find the error :)<br>
<br>
I'll get back to you as soon as I feel I
have a fix - it might take a few days,
given the relatively long turn around
time, but we'll get there!<br>
<br>
Thank you for all the help and
information!<br>
<br>
Cheers,<br>
/Patrik<br>
<br>
On 03/05/2013 09:10 PM, Garret Smith
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>On the same machine with the same
steps as previous, I reproduced the
time jump on R16B.<br>
This time the jump happened with a
<5 sec delta btw now() and
os:timestamp().<br>
Still jumping ~2126000 seconds.<br>
<br>
</div>
-Garret<br>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Tue, Mar
5, 2013 at 11:20 AM, Garret Smith <span
dir="ltr"><<a
moz-do-not-send="true"
href="mailto:garret.smith@gmail.com"
target="_blank">garret.smith@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>
<div>The gist <a
moz-do-not-send="true"
href="https://gist.github.com/garret-smith/5087169" target="_blank">https://gist.github.com/garret-smith/5087169</a>
is updated with a
slightly better
version. I was able to
reproduce the jump in
less than an hour. I
also did some more
things to perturb the
timing code while the
test program was
running.<br>
<br>
</div>
<div>Here is the latest
info, everything I can
think of that may have
the slightest effect:<br>
</div>
<div> * R15B01 64-bit
build<br>
</div>
<div> * Pacific time zone
(GMT -8)<br>
</div>
<div> * Xeon E5405 in an
HP DL160<br>
</div>
<div> * no arguments to
erl.exe<br>
</div>
<div> * bursty, high CPU
load, >75% memory
used by other software<br>
</div>
<div> * running Observer
on the test VM
displaying the "Load
Charts" tab<br>
</div>
<div> * made some small
adjustments (~ 60
seconds) to the system
clock while running the
tests - now() and
os:timestamp() behaved
as expected, initially
showing a delta and
slowly converging<br>
</div>
<div> * w32tm /resync to
fix the system clock
some time after
perturbing it<br>
</div>
<br>
</div>
The time jump in now()
occurred when now() was ~9
seconds behind
os:timestamp() as reported
by the new test program.<br>
<br>
</div>
I'm starting to look at R16B
now.<span><font
color="#888888"><br>
<br>
</font></span></div>
<span><font color="#888888">-Garret
Smith<br>
</font></span></div>
<div>
<div>
<div class="gmail_extra"> <br>
<br>
<div class="gmail_quote">On
Tue, Mar 5, 2013 at 8:37
AM, Garret Smith <span
dir="ltr"><<a
moz-do-not-send="true"
href="mailto:garret.smith@gmail.com" target="_blank">garret.smith@gmail.com</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>
<div>I haven't
seen anything
unexpected in
os:timestamp().
No jumps at
all.<br>
<br>
</div>
<div>CPU is an
Intel Xeon
X3430.<br>
<br>
I have
reproduced it
in the
LosAngeles/Pacific
Time (GMT -8)
and US East
coast time
zone (GMT -5).<br>
</div>
<br>
</div>
I have not yet
tried R16B. I'll
be starting that
today. I'm also
trying to improve
the test program,
since it's taking
quite a long time
between jumps for
me as well. I'll
let you know as
soon as I have a
better one.<br>
<br>
</div>
You have no idea how
relieved I am that
you are looking into
this!<br>
<br>
</div>
<div>Thanks,<br>
</div>
Garret Smith<br>
</div>
<div>
<div>
<div
class="gmail_extra"><br>
<br>
<div
class="gmail_quote">
On Tue, Mar 5,
2013 at 3:06 AM,
Patrik Nyblom <span
dir="ltr"><<a
moz-do-not-send="true" href="mailto:pan@erlang.org" target="_blank">pan@erlang.org</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0
0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
<div
bgcolor="#FFFFFF"
text="#000000">
<div>Hi
again...<br>
<br>
I'm not sure
about one
thing. What
happens to
os:timestamp()
during these
jumps? Does it
stay on track
or does it
also jump
around?<br>
<br>
I've tried to
reproduce it
with your
program, but
has not yet
succeeded.
Have you seen
this on the
R16B release
as well?<br>
<br>
Is the
hardware in
any way fancy
(like a lot of
cores, some
new processor
I don't have
or something
else?) or is
there anything
else special
about the
machine? Also
the time zone
you're running
in would be
interesting,
as there is
some time zone
specific code
there...<br>
<br>
I would really
like to be
able to
reproduce it
so you don't
have to do all
the tests at
your site, it
might end up
being really
time consuming
for you if I
make to many
mistakes :) <br>
<br>
Cheers,<br>
/Patrik
<div>
<div><br>
<br>
<br>
On 03/05/2013
08:50 AM,
Patrik Nyblom
wrote:<br>
</div>
</div>
</div>
<div>
<div>
<blockquote
type="cite">
<div>Hi!<br>
<br>
On 03/05/2013
02:26 AM,
Garret Smith
wrote:<br>
</div>
<blockquote
type="cite">
<div dir="ltr">
<div>
<div>
<div>
<div>
<div>
<div>
<div>I have
been beating
my head
against a wall
for weeks
tracking down
spooky
behaviour[sic]
in one of our
production
systems. I
finally
tracked it
down to
"jumps" in the
times returned
by
erlang:now(),
causing all
timers in the
system to
expire at
once. I have
witnessed this
bug on R15B01,
both 64 and
32-bit
versions
running on
Windows Server
2008 R2, both
on bare metal
and VirtualBox
VM.<br>
<br>
</div>
The time jump
is always
around 2126000
seconds, or a
little over 24
days. The
now() time
does not try
to converge
with
os:timestamp()
as the
documentation
suggests, and
as I confirmed
it does if you
just change
the system
clock.<br>
<br>
</div>
Another VM
running
concurrently
on the same
machine but
with little
load
(diagnostic
node &
production
node) did not
time jump.<br>
<br>
</div>
<div>Higher
load seems to
make the time
jumps happen
more often.<br>
</div>
<div><br>
</div>
Frequency
between time
jumps varies
between
seconds and
hours, but
when a jump
occurs, it is
always 2126000
+ (9 to 26)
seconds.<br>
<br>
</div>
I never see
the jump in
logfile
timestamps
that use
os:timestamp()
for tagging
log messages.
I had to start
tracing a
production
node before I
caught the
jump. Here
are some lines
from a trace,
where the
timestamp in
trace_ts is
printed using
calendar:now_to_local_time()
and then in
raw tuple
format:<br>
<br>
2013-4-16
21:40:1.993399|{1366,173601,993399}<br>
2013-4-16
21:40:1.993400|{1366,173601,993400}<br>
2013-5-11
12:13:41.986961|{1368,299621,986961}<br>
2013-5-11
12:13:41.986962|{1368,299621,986962}<br>
<br>
</div>
then a bit
later...<br>
<br>
2013-5-11
12:36:19.955129|{1368,300979,955129}<br>
2013-5-11
12:36:19.955130|{1368,300979,955130}<br>
2013-6-5
3:9:49.538830|{1370,426989,538830}<br>
2013-6-5
3:9:49.538833|{1370,426989,538833}<br>
<br>
</div>
</div>
</div>
</blockquote>
Gah! That's
obviously not
supposed to
happen... <br>
<blockquote
type="cite">
<div dir="ltr">
<div>I
captured many
such jumps
over the
course of a
day or so.
Obviously from
the dates, 2
jumps happened
before I
started
tracing.<br>
<br>
</div>
<div>I was
able to
reproduce the
bug, though
not as
efficiently as
my production
system, with
the following
sample
program: <a
moz-do-not-send="true"
href="https://gist.github.com/garret-smith/5087169" target="_blank">https://gist.github.com/garret-smith/5087169</a><br>
<br>
It took over
an hour of
runtime before
the first time
jump. I am
working on a
better way to
reproduce it
at the moment,
but it's hard
to test the
test with a
bug so
intermittent.<br>
</div>
<div><br>
</div>
<div>I am also
testing
various other
VM versions.
My first hope
was that this
was limited to
the 64-bit
version where
we first
encountered
the problem,
but a change
to the 32-bit
version has
only made the
problem happen
less often,
not eliminated
it.<br>
<br>
</div>
<div>We never
saw this bug
with R14B03
which we were
running
previously to
R15B01.
However,
system load is
different so I
can't make a
direct
comparison. I
did notice a
few
significant
updates to the
Windows time
related code
between R14B03
and R15:<br>
<br>
</div>
<div>git log
sys_time.c<br>
</div>
<div><br>
</div>
<div>commit
46eb4359b05b220861453a869dc734480ec045a6<br>
Author: Patrik
Nyblom <<a
moz-do-not-send="true" href="mailto:pan@erlang.org" target="_blank">pan@erlang.org</a>><br>
Date: Tue
Dec 6 19:07:16
2011 +0100<br>
<br>
Emulate
localtime,
gmtime and
mktime to
enable
negative
time_t<br>
<br>
commit
913f05af100e98a8665bbb6168e89fbcfe4ece75<br>
Author:
Bj<C3><B6>rn-Egil
Dahlberg <<a
moz-do-not-send="true" href="mailto:egil@erlang.org" target="_blank">egil@erlang.org</a>><br>
Date: Fri
Dec 2 15:25:06
2011 +0100<br>
<br>
Teach
windows
sys_localtime_r<br>
<br>
<br>
</div>
</div>
</blockquote>
Yep, that's
me... But even
if I gave a
totally weird
time back from
those, the
erlang:now
logic should
have stopped
this from
happening.
I'll try to
reproduce
using your
example
program. If
nothing else
helps, I'll
instrument a
VM that gives
som traces in
the time
code...<br>
<blockquote
type="cite">
<div dir="ltr">
<div>I am
completely
stumped. What
can I do next
to help track
down the
source of the
bug?<br>
<br>
</div>
</div>
</blockquote>
Unfortunately,
so am I.
Especially
weird that
it's load
related...
Maybe
something is
not locked as
it should
be...<br>
<blockquote
type="cite">
<div dir="ltr">
<div>Thanks,<br>
</div>
<div>Garret
Smith<br>
</div>
</div>
</blockquote>
Thanks for
reporting,
I'll get back
to you!<br>
<br>
Cheers,<br>
/Patrik<br>
<blockquote
type="cite"> <br>
<fieldset></fieldset>
<br>
<pre>_______________________________________________
erlang-bugs mailing list
<a moz-do-not-send="true" href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a moz-do-not-send="true" href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
</blockquote>
<br>
<br>
<fieldset></fieldset>
<br>
<pre>_______________________________________________
erlang-bugs mailing list
<a moz-do-not-send="true" href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a moz-do-not-send="true" href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
</blockquote>
<br>
</div>
</div>
</div>
<br>
_______________________________________________<br>
erlang-bugs
mailing list<br>
<a
moz-do-not-send="true"
href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a><br>
<a
moz-do-not-send="true"
href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>
<br>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<br>
<fieldset></fieldset>
<br>
<pre>_______________________________________________
erlang-bugs mailing list
<a moz-do-not-send="true" href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a moz-do-not-send="true" href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
</blockquote>
<br>
</div>
</div>
</div>
<br>
_______________________________________________<br>
erlang-bugs mailing list<br>
<a moz-do-not-send="true"
href="mailto:erlang-bugs@erlang.org"
target="_blank">erlang-bugs@erlang.org</a><br>
<a moz-do-not-send="true"
href="http://erlang.org/mailman/listinfo/erlang-bugs"
target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>
<br>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</body>
</html>