[erlang-bugs] R15B01 erlang:now() jumping ~24 days into the future

Patrik Nyblom pan@REDACTED
Thu Mar 7 16:37:13 CET 2013


Hi Garret!

I've been able to reproduce it on my freshly installed Win2008 machine! 
Great, now I only need to debug it and find the error :)

I'll get back to you as soon as I feel I have a fix - it might take a 
few days, given the relatively long turn around time, but we'll get there!

Thank you for all the help and information!

Cheers,
/Patrik

On 03/05/2013 09:10 PM, Garret Smith wrote:
> On the same machine with the same steps as previous, I reproduced the 
> time jump on R16B.
> This time the jump happened with a <5 sec delta btw now() and 
> os:timestamp().
> Still jumping ~2126000 seconds.
>
> -Garret
>
>
> On Tue, Mar 5, 2013 at 11:20 AM, Garret Smith <garret.smith@REDACTED 
> <mailto:garret.smith@REDACTED>> wrote:
>
>     The gist https://gist.github.com/garret-smith/5087169 is updated
>     with a slightly better version.  I was able to reproduce the jump
>     in less than an hour. I also did some more things to perturb the
>     timing code while the test program was running.
>
>     Here is the latest info, everything I can think of that may have
>     the slightest effect:
>      * R15B01 64-bit build
>      * Pacific time zone (GMT -8)
>      * Xeon E5405 in an HP DL160
>      * no arguments to erl.exe
>      * bursty, high CPU load, >75% memory used by other software
>      * running Observer on the test VM displaying the "Load Charts" tab
>      * made some small adjustments (~ 60 seconds) to the system clock
>     while running the tests - now() and os:timestamp() behaved as
>     expected, initially showing a delta and slowly converging
>      * w32tm /resync to fix the system clock some time after perturbing it
>
>     The time jump in now() occurred when now() was ~9 seconds behind
>     os:timestamp() as reported by the new test program.
>
>     I'm starting to look at R16B now.
>
>     -Garret Smith
>
>
>     On Tue, Mar 5, 2013 at 8:37 AM, Garret Smith
>     <garret.smith@REDACTED <mailto:garret.smith@REDACTED>> wrote:
>
>         I haven't seen anything unexpected in os:timestamp().  No
>         jumps at all.
>
>         CPU is an Intel Xeon X3430.
>
>         I have reproduced it in the LosAngeles/Pacific Time (GMT -8)
>         and US East coast time zone (GMT -5).
>
>         I have not yet tried R16B.  I'll be starting that today.  I'm
>         also trying to improve the test program, since it's taking
>         quite a long time between jumps for me as well.  I'll let you
>         know as soon as I have a better one.
>
>         You have no idea how relieved I am that you are looking into this!
>
>         Thanks,
>         Garret Smith
>
>
>         On Tue, Mar 5, 2013 at 3:06 AM, Patrik Nyblom <pan@REDACTED
>         <mailto:pan@REDACTED>> wrote:
>
>             Hi again...
>
>             I'm not sure about one thing. What happens to
>             os:timestamp() during these jumps? Does it stay on track
>             or does it also jump around?
>
>             I've tried to reproduce it with your program, but has not
>             yet succeeded. Have you seen this on the R16B release as well?
>
>             Is the hardware in any way fancy (like a lot of cores,
>             some new processor I don't have or something else?) or is
>             there anything else special about the machine? Also the
>             time zone you're running in would be interesting, as there
>             is some time zone specific code there...
>
>             I would really like to be able to reproduce it so you
>             don't have to do all the tests at your site, it might end
>             up being really time consuming for you if I make to many
>             mistakes :)
>
>             Cheers,
>             /Patrik
>
>
>
>             On 03/05/2013 08:50 AM, Patrik Nyblom wrote:
>>             Hi!
>>
>>             On 03/05/2013 02:26 AM, Garret Smith wrote:
>>>             I have been beating my head against a wall for weeks
>>>             tracking down spooky behaviour[sic] in one of our
>>>             production systems.  I finally tracked it down to
>>>             "jumps" in the times returned by erlang:now(), causing
>>>             all timers in the system to expire at once.  I have
>>>             witnessed this bug on R15B01, both 64 and 32-bit
>>>             versions running on Windows Server 2008 R2, both on bare
>>>             metal and VirtualBox VM.
>>>
>>>             The time jump is always around 2126000 seconds, or a
>>>             little over 24 days.  The now() time does not try to
>>>             converge with os:timestamp() as the documentation
>>>             suggests, and as I confirmed it does if you just change
>>>             the system clock.
>>>
>>>             Another VM running concurrently on the same machine but
>>>             with little load (diagnostic node & production node) did
>>>             not time jump.
>>>
>>>             Higher load seems to make the time jumps happen more often.
>>>
>>>             Frequency between time jumps varies between seconds and
>>>             hours, but when a jump occurs, it is always 2126000 + (9
>>>             to 26) seconds.
>>>
>>>             I never see the jump in logfile timestamps that use
>>>             os:timestamp() for tagging log messages. I had to start
>>>             tracing a production node before I caught the jump. 
>>>             Here are some lines from a trace, where the timestamp in
>>>             trace_ts is printed using calendar:now_to_local_time()
>>>             and then in raw tuple format:
>>>
>>>             2013-4-16 21:40:1.993399|{1366,173601,993399}
>>>             2013-4-16 21:40:1.993400|{1366,173601,993400}
>>>             2013-5-11 12:13:41.986961|{1368,299621,986961}
>>>             2013-5-11 12:13:41.986962|{1368,299621,986962}
>>>
>>>             then a bit later...
>>>
>>>             2013-5-11 12:36:19.955129|{1368,300979,955129}
>>>             2013-5-11 12:36:19.955130|{1368,300979,955130}
>>>             2013-6-5 3:9:49.538830|{1370,426989,538830}
>>>             2013-6-5 3:9:49.538833|{1370,426989,538833}
>>>
>>             Gah! That's obviously not supposed to happen...
>>>             I captured many such jumps over the course of a day or
>>>             so.  Obviously from the dates, 2 jumps happened before I
>>>             started tracing.
>>>
>>>             I was able to reproduce the bug, though not as
>>>             efficiently as my production system, with the following
>>>             sample program: https://gist.github.com/garret-smith/5087169
>>>
>>>             It took over an hour of runtime before the first time
>>>             jump.  I am working on a better way to reproduce it at
>>>             the moment, but it's hard to test the test with a bug so
>>>             intermittent.
>>>
>>>             I am also testing various other VM versions.  My first
>>>             hope was that this was limited to the 64-bit version
>>>             where we first encountered the problem, but a change to
>>>             the 32-bit version has only made the problem happen less
>>>             often, not eliminated it.
>>>
>>>             We never saw this bug with R14B03 which we were running
>>>             previously to R15B01.  However, system load is different
>>>             so I can't make a direct comparison.  I did notice a few
>>>             significant updates to the Windows time related code
>>>             between R14B03 and R15:
>>>
>>>             git log sys_time.c
>>>
>>>             commit 46eb4359b05b220861453a869dc734480ec045a6
>>>             Author: Patrik Nyblom <pan@REDACTED
>>>             <mailto:pan@REDACTED>>
>>>             Date:   Tue Dec 6 19:07:16 2011 +0100
>>>
>>>                 Emulate localtime, gmtime and mktime to enable
>>>             negative time_t
>>>
>>>             commit 913f05af100e98a8665bbb6168e89fbcfe4ece75
>>>             Author: Bj<C3><B6>rn-Egil Dahlberg <egil@REDACTED
>>>             <mailto:egil@REDACTED>>
>>>             Date:   Fri Dec 2 15:25:06 2011 +0100
>>>
>>>                 Teach windows sys_localtime_r
>>>
>>>
>>             Yep, that's me... But even if I gave a totally weird time
>>             back from those, the erlang:now logic should have stopped
>>             this from happening. I'll try to reproduce using your
>>             example program. If nothing else helps, I'll instrument a
>>             VM that gives som traces in the time code...
>>>             I am completely stumped.  What can I do next to help
>>>             track down the source of the bug?
>>>
>>             Unfortunately, so am I. Especially weird that it's load
>>             related... Maybe something is not locked as it should be...
>>>             Thanks,
>>>             Garret Smith
>>             Thanks for reporting, I'll get back to you!
>>
>>             Cheers,
>>             /Patrik
>>>
>>>
>>>             _______________________________________________
>>>             erlang-bugs mailing list
>>>             erlang-bugs@REDACTED  <mailto:erlang-bugs@REDACTED>
>>>             http://erlang.org/mailman/listinfo/erlang-bugs
>>
>>
>>
>>             _______________________________________________
>>             erlang-bugs mailing list
>>             erlang-bugs@REDACTED  <mailto:erlang-bugs@REDACTED>
>>             http://erlang.org/mailman/listinfo/erlang-bugs
>
>
>             _______________________________________________
>             erlang-bugs mailing list
>             erlang-bugs@REDACTED <mailto:erlang-bugs@REDACTED>
>             http://erlang.org/mailman/listinfo/erlang-bugs
>
>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20130307/63f6815b/attachment.htm>


More information about the erlang-bugs mailing list