[erlang-bugs] R15B01 erlang:now() jumping ~24 days into the future
Garret Smith
garret.smith@REDACTED
Tue Mar 5 17:37:19 CET 2013
I haven't seen anything unexpected in os:timestamp(). No jumps at all.
CPU is an Intel Xeon X3430.
I have reproduced it in the LosAngeles/Pacific Time (GMT -8) and US East
coast time zone (GMT -5).
I have not yet tried R16B. I'll be starting that today. I'm also trying
to improve the test program, since it's taking quite a long time between
jumps for me as well. I'll let you know as soon as I have a better one.
You have no idea how relieved I am that you are looking into this!
Thanks,
Garret Smith
On Tue, Mar 5, 2013 at 3:06 AM, Patrik Nyblom <pan@REDACTED> wrote:
> Hi again...
>
> I'm not sure about one thing. What happens to os:timestamp() during these
> jumps? Does it stay on track or does it also jump around?
>
> I've tried to reproduce it with your program, but has not yet succeeded.
> Have you seen this on the R16B release as well?
>
> Is the hardware in any way fancy (like a lot of cores, some new processor
> I don't have or something else?) or is there anything else special about
> the machine? Also the time zone you're running in would be interesting, as
> there is some time zone specific code there...
>
> I would really like to be able to reproduce it so you don't have to do all
> the tests at your site, it might end up being really time consuming for you
> if I make to many mistakes :)
>
> Cheers,
> /Patrik
>
>
>
> On 03/05/2013 08:50 AM, Patrik Nyblom wrote:
>
> Hi!
>
> On 03/05/2013 02:26 AM, Garret Smith wrote:
>
> I have been beating my head against a wall for weeks tracking down
> spooky behaviour[sic] in one of our production systems. I finally tracked
> it down to "jumps" in the times returned by erlang:now(), causing all
> timers in the system to expire at once. I have witnessed this bug on
> R15B01, both 64 and 32-bit versions running on Windows Server 2008 R2, both
> on bare metal and VirtualBox VM.
>
> The time jump is always around 2126000 seconds, or a little over 24
> days. The now() time does not try to converge with os:timestamp() as the
> documentation suggests, and as I confirmed it does if you just change the
> system clock.
>
> Another VM running concurrently on the same machine but with little load
> (diagnostic node & production node) did not time jump.
>
> Higher load seems to make the time jumps happen more often.
>
> Frequency between time jumps varies between seconds and hours, but when a
> jump occurs, it is always 2126000 + (9 to 26) seconds.
>
> I never see the jump in logfile timestamps that use os:timestamp() for
> tagging log messages. I had to start tracing a production node before I
> caught the jump. Here are some lines from a trace, where the timestamp in
> trace_ts is printed using calendar:now_to_local_time() and then in raw
> tuple format:
>
> 2013-4-16 21:40:1.993399|{1366,173601,993399}
> 2013-4-16 21:40:1.993400|{1366,173601,993400}
> 2013-5-11 12:13:41.986961|{1368,299621,986961}
> 2013-5-11 12:13:41.986962|{1368,299621,986962}
>
> then a bit later...
>
> 2013-5-11 12:36:19.955129|{1368,300979,955129}
> 2013-5-11 12:36:19.955130|{1368,300979,955130}
> 2013-6-5 3:9:49.538830|{1370,426989,538830}
> 2013-6-5 3:9:49.538833|{1370,426989,538833}
>
> Gah! That's obviously not supposed to happen...
>
> I captured many such jumps over the course of a day or so. Obviously
> from the dates, 2 jumps happened before I started tracing.
>
> I was able to reproduce the bug, though not as efficiently as my
> production system, with the following sample program:
> https://gist.github.com/garret-smith/5087169
>
> It took over an hour of runtime before the first time jump. I am working
> on a better way to reproduce it at the moment, but it's hard to test the
> test with a bug so intermittent.
>
> I am also testing various other VM versions. My first hope was that
> this was limited to the 64-bit version where we first encountered the
> problem, but a change to the 32-bit version has only made the problem
> happen less often, not eliminated it.
>
> We never saw this bug with R14B03 which we were running previously to
> R15B01. However, system load is different so I can't make a direct
> comparison. I did notice a few significant updates to the Windows time
> related code between R14B03 and R15:
>
> git log sys_time.c
>
> commit 46eb4359b05b220861453a869dc734480ec045a6
> Author: Patrik Nyblom <pan@REDACTED>
> Date: Tue Dec 6 19:07:16 2011 +0100
>
> Emulate localtime, gmtime and mktime to enable negative time_t
>
> commit 913f05af100e98a8665bbb6168e89fbcfe4ece75
> Author: Bj<C3><B6>rn-Egil Dahlberg <egil@REDACTED>
> Date: Fri Dec 2 15:25:06 2011 +0100
>
> Teach windows sys_localtime_r
>
>
> Yep, that's me... But even if I gave a totally weird time back from
> those, the erlang:now logic should have stopped this from happening. I'll
> try to reproduce using your example program. If nothing else helps, I'll
> instrument a VM that gives som traces in the time code...
>
> I am completely stumped. What can I do next to help track down the
> source of the bug?
>
> Unfortunately, so am I. Especially weird that it's load related...
> Maybe something is not locked as it should be...
>
> Thanks,
> Garret Smith
>
> Thanks for reporting, I'll get back to you!
>
> Cheers,
> /Patrik
>
>
>
> _______________________________________________
> erlang-bugs mailing listerlang-bugs@REDACTED://erlang.org/mailman/listinfo/erlang-bugs
>
>
>
>
> _______________________________________________
> erlang-bugs mailing listerlang-bugs@REDACTED://erlang.org/mailman/listinfo/erlang-bugs
>
>
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20130305/b84b04c2/attachment.htm>
More information about the erlang-bugs
mailing list