[erlang-bugs] R15B01 erlang:now() jumping ~24 days into the future

Tue Mar 5 21:10:45 CET 2013

On the same machine with the same steps as previous, I reproduced the time
jump on R16B.
This time the jump happened with a <5 sec delta btw now() and
os:timestamp().
Still jumping ~2126000 seconds.

-Garret

On Tue, Mar 5, 2013 at 11:20 AM, Garret Smith <garret.smith@REDACTED>wrote:

> The gist https://gist.github.com/garret-smith/5087169 is updated with a
> slightly better version.  I was able to reproduce the jump in less than an
> hour.  I also did some more things to perturb the timing code while the
> test program was running.
>
> Here is the latest info, everything I can think of that may have the
> slightest effect:
>  * R15B01 64-bit build
>  * Pacific time zone (GMT -8)
>  * Xeon E5405 in an HP DL160
>  * no arguments to erl.exe
>  * bursty, high CPU load, >75% memory used by other software
>  * running Observer on the test VM displaying the "Load Charts" tab
>  * made some small adjustments (~ 60 seconds) to the system clock while
> running the tests - now() and os:timestamp() behaved as expected, initially
> showing a delta and slowly converging
>  * w32tm /resync to fix the system clock some time after perturbing it
>
> The time jump in now() occurred when now() was ~9 seconds behind
> os:timestamp() as reported by the new test program.
>
> I'm starting to look at R16B now.
>
> -Garret Smith
>
>
> On Tue, Mar 5, 2013 at 8:37 AM, Garret Smith <garret.smith@REDACTED>wrote:
>
>> I haven't seen anything unexpected in os:timestamp().  No jumps at all.
>>
>> CPU is an Intel Xeon X3430.
>>
>> I have reproduced it in the LosAngeles/Pacific Time (GMT -8) and US East
>> coast time zone (GMT -5).
>>
>> I have not yet tried R16B.  I'll be starting that today.  I'm also trying
>> to improve the test program, since it's taking quite a long time between
>> jumps for me as well.  I'll let you know as soon as I have a better one.
>>
>> You have no idea how relieved I am that you are looking into this!
>>
>> Thanks,
>> Garret Smith
>>
>>
>> On Tue, Mar 5, 2013 at 3:06 AM, Patrik Nyblom <pan@REDACTED> wrote:
>>
>>>  Hi again...
>>>
>>> I'm not sure about one thing. What happens to os:timestamp() during
>>> these jumps? Does it stay on track or does it also jump around?
>>>
>>> I've tried to reproduce it with your program, but has not yet succeeded.
>>> Have you seen this on the R16B release as well?
>>>
>>> Is the hardware in any way fancy (like a lot of cores, some new
>>> processor I don't have or something else?) or is there anything else
>>> special about the machine? Also the time zone you're running in would be
>>> interesting, as there is some time zone specific code there...
>>>
>>> I would really like to be able to reproduce it so you don't have to do
>>> all the tests at your site, it might end up being really time consuming for
>>> you if I make to many mistakes :)
>>>
>>> Cheers,
>>> /Patrik
>>>
>>>
>>>
>>> On 03/05/2013 08:50 AM, Patrik Nyblom wrote:
>>>
>>> Hi!
>>>
>>> On 03/05/2013 02:26 AM, Garret Smith wrote:
>>>
>>>     I have been beating my head against a wall for weeks tracking down
>>> spooky behaviour[sic] in one of our production systems.  I finally tracked
>>> it down to "jumps" in the times returned by erlang:now(), causing all
>>> timers in the system to expire at once.  I have witnessed this bug on
>>> R15B01, both 64 and 32-bit versions running on Windows Server 2008 R2, both
>>> on bare metal and VirtualBox VM.
>>>
>>>  The time jump is always around 2126000 seconds, or a little over 24
>>> days.  The now() time does not try to converge with os:timestamp() as the
>>> documentation suggests, and as I confirmed it does if you just change the
>>> system clock.
>>>
>>>  Another VM running concurrently on the same machine but with little
>>> load (diagnostic node & production node) did not time jump.
>>>
>>>  Higher load seems to make the time jumps happen more often.
>>>
>>>  Frequency between time jumps varies between seconds and hours, but when
>>> a jump occurs, it is always 2126000 + (9 to 26) seconds.
>>>
>>>  I never see the jump in logfile timestamps that use os:timestamp() for
>>> tagging log messages.  I had to start tracing a production node before I
>>> caught the jump.  Here are some lines from a trace, where the timestamp in
>>> trace_ts is printed using calendar:now_to_local_time() and then in raw
>>> tuple format:
>>>
>>> 2013-4-16 21:40:1.993399|{1366,173601,993399}
>>> 2013-4-16 21:40:1.993400|{1366,173601,993400}
>>> 2013-5-11 12:13:41.986961|{1368,299621,986961}
>>> 2013-5-11 12:13:41.986962|{1368,299621,986962}
>>>
>>>  then a bit later...
>>>
>>> 2013-5-11 12:36:19.955129|{1368,300979,955129}
>>> 2013-5-11 12:36:19.955130|{1368,300979,955130}
>>> 2013-6-5 3:9:49.538830|{1370,426989,538830}
>>> 2013-6-5 3:9:49.538833|{1370,426989,538833}
>>>
>>>   Gah! That's obviously not supposed to happen...
>>>
>>>  I captured many such jumps over the course of a day or so.  Obviously
>>> from the dates, 2 jumps happened before I started tracing.
>>>
>>>  I was able to reproduce the bug, though not as efficiently as my
>>> production system, with the following sample program:
>>> https://gist.github.com/garret-smith/5087169
>>>
>>> It took over an hour of runtime before the first time jump.  I am
>>> working on a better way to reproduce it at the moment, but it's hard to
>>> test the test with a bug so intermittent.
>>>
>>>  I am also testing various other VM versions.  My first hope was that
>>> this was limited to the 64-bit version where we first encountered the
>>> problem, but a change to the 32-bit version has only made the problem
>>> happen less often, not eliminated it.
>>>
>>>  We never saw this bug with R14B03 which we were running previously to
>>> R15B01.  However, system load is different so I can't make a direct
>>> comparison.  I did notice a few significant updates to the Windows time
>>> related code between R14B03 and R15:
>>>
>>>  git log sys_time.c
>>>
>>>  commit 46eb4359b05b220861453a869dc734480ec045a6
>>> Author: Patrik Nyblom <pan@REDACTED>
>>> Date:   Tue Dec 6 19:07:16 2011 +0100
>>>
>>>     Emulate localtime, gmtime and mktime to enable negative time_t
>>>
>>> commit 913f05af100e98a8665bbb6168e89fbcfe4ece75
>>> Author: Bj<C3><B6>rn-Egil Dahlberg <egil@REDACTED>
>>> Date:   Fri Dec 2 15:25:06 2011 +0100
>>>
>>>     Teach windows sys_localtime_r
>>>
>>>
>>>   Yep, that's me... But even if I gave a totally weird time back from
>>> those, the erlang:now logic should have stopped this from happening. I'll
>>> try to reproduce using your example program. If nothing else helps, I'll
>>> instrument a VM that gives som traces in the time code...
>>>
>>>  I am completely stumped.  What can I do next to help track down the
>>> source of the bug?
>>>
>>>   Unfortunately, so am I. Especially weird that it's load related...
>>> Maybe something is not locked as it should be...
>>>
>>>  Thanks,
>>>  Garret Smith
>>>
>>> Thanks for reporting, I'll get back to you!
>>>
>>> Cheers,
>>> /Patrik
>>>
>>>
>>>
>>> _______________________________________________
>>> erlang-bugs mailing listerlang-bugs@REDACTED://erlang.org/mailman/listinfo/erlang-bugs
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> erlang-bugs mailing listerlang-bugs@REDACTED://erlang.org/mailman/listinfo/erlang-bugs
>>>
>>>
>>>
>>> _______________________________________________
>>> erlang-bugs mailing list
>>> erlang-bugs@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-bugs
>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20130305/6ac49118/attachment.htm>