[erlang-bugs] R15B01 erlang:now() jumping ~24 days into the future

Tue Mar 5 02:26:49 CET 2013

I have been beating my head against a wall for weeks tracking down spooky
behaviour[sic] in one of our production systems.  I finally tracked it down
to "jumps" in the times returned by erlang:now(), causing all timers in the
system to expire at once.  I have witnessed this bug on R15B01, both 64 and
32-bit versions running on Windows Server 2008 R2, both on bare metal and
VirtualBox VM.

The time jump is always around 2126000 seconds, or a little over 24 days.
The now() time does not try to converge with os:timestamp() as the
documentation suggests, and as I confirmed it does if you just change the
system clock.

Another VM running concurrently on the same machine but with little load
(diagnostic node & production node) did not time jump.

Higher load seems to make the time jumps happen more often.

Frequency between time jumps varies between seconds and hours, but when a
jump occurs, it is always 2126000 + (9 to 26) seconds.

I never see the jump in logfile timestamps that use os:timestamp() for
tagging log messages.  I had to start tracing a production node before I
caught the jump.  Here are some lines from a trace, where the timestamp in
trace_ts is printed using calendar:now_to_local_time() and then in raw
tuple format:

2013-4-16 21:40:1.993399|{1366,173601,993399}
2013-4-16 21:40:1.993400|{1366,173601,993400}
2013-5-11 12:13:41.986961|{1368,299621,986961}
2013-5-11 12:13:41.986962|{1368,299621,986962}

then a bit later...

2013-5-11 12:36:19.955129|{1368,300979,955129}
2013-5-11 12:36:19.955130|{1368,300979,955130}
2013-6-5 3:9:49.538830|{1370,426989,538830}
2013-6-5 3:9:49.538833|{1370,426989,538833}

I captured many such jumps over the course of a day or so.  Obviously from
the dates, 2 jumps happened before I started tracing.

I was able to reproduce the bug, though not as efficiently as my production
system, with the following sample program:
https://gist.github.com/garret-smith/5087169

It took over an hour of runtime before the first time jump.  I am working
on a better way to reproduce it at the moment, but it's hard to test the
test with a bug so intermittent.

I am also testing various other VM versions.  My first hope was that this
was limited to the 64-bit version where we first encountered the problem,
but a change to the 32-bit version has only made the problem happen less
often, not eliminated it.

We never saw this bug with R14B03 which we were running previously to
R15B01.  However, system load is different so I can't make a direct
comparison.  I did notice a few significant updates to the Windows time
related code between R14B03 and R15:

git log sys_time.c

commit 46eb4359b05b220861453a869dc734480ec045a6
Author: Patrik Nyblom <pan@REDACTED>
Date:   Tue Dec 6 19:07:16 2011 +0100

    Emulate localtime, gmtime and mktime to enable negative time_t

commit 913f05af100e98a8665bbb6168e89fbcfe4ece75
Author: Bj<C3><B6>rn-Egil Dahlberg <egil@REDACTED>
Date:   Fri Dec 2 15:25:06 2011 +0100

    Teach windows sys_localtime_r

I am completely stumped.  What can I do next to help track down the source
of the bug?

Thanks,
Garret Smith
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20130304/9a32d5fa/attachment.htm>