[erlang-bugs] R15B01 erlang:now() jumping ~24 days into the future

Tue Mar 5 12:06:27 CET 2013

Hi again...

I'm not sure about one thing. What happens to os:timestamp() during 
these jumps? Does it stay on track or does it also jump around?

I've tried to reproduce it with your program, but has not yet succeeded. 
Have you seen this on the R16B release as well?

Is the hardware in any way fancy (like a lot of cores, some new 
processor I don't have or something else?) or is there anything else 
special about the machine? Also the time zone you're running in would be 
interesting, as there is some time zone specific code there...

I would really like to be able to reproduce it so you don't have to do 
all the tests at your site, it might end up being really time consuming 
for you if I make to many mistakes :)

Cheers,
/Patrik

On 03/05/2013 08:50 AM, Patrik Nyblom wrote:
> Hi!
>
> On 03/05/2013 02:26 AM, Garret Smith wrote:
>> I have been beating my head against a wall for weeks tracking down 
>> spooky behaviour[sic] in one of our production systems.  I finally 
>> tracked it down to "jumps" in the times returned by erlang:now(), 
>> causing all timers in the system to expire at once.  I have witnessed 
>> this bug on R15B01, both 64 and 32-bit versions running on Windows 
>> Server 2008 R2, both on bare metal and VirtualBox VM.
>>
>> The time jump is always around 2126000 seconds, or a little over 24 
>> days.  The now() time does not try to converge with os:timestamp() as 
>> the documentation suggests, and as I confirmed it does if you just 
>> change the system clock.
>>
>> Another VM running concurrently on the same machine but with little 
>> load (diagnostic node & production node) did not time jump.
>>
>> Higher load seems to make the time jumps happen more often.
>>
>> Frequency between time jumps varies between seconds and hours, but 
>> when a jump occurs, it is always 2126000 + (9 to 26) seconds.
>>
>> I never see the jump in logfile timestamps that use os:timestamp() 
>> for tagging log messages.  I had to start tracing a production node 
>> before I caught the jump. Here are some lines from a trace, where the 
>> timestamp in trace_ts is printed using calendar:now_to_local_time() 
>> and then in raw tuple format:
>>
>> 2013-4-16 21:40:1.993399|{1366,173601,993399}
>> 2013-4-16 21:40:1.993400|{1366,173601,993400}
>> 2013-5-11 12:13:41.986961|{1368,299621,986961}
>> 2013-5-11 12:13:41.986962|{1368,299621,986962}
>>
>> then a bit later...
>>
>> 2013-5-11 12:36:19.955129|{1368,300979,955129}
>> 2013-5-11 12:36:19.955130|{1368,300979,955130}
>> 2013-6-5 3:9:49.538830|{1370,426989,538830}
>> 2013-6-5 3:9:49.538833|{1370,426989,538833}
>>
> Gah! That's obviously not supposed to happen...
>> I captured many such jumps over the course of a day or so.  Obviously 
>> from the dates, 2 jumps happened before I started tracing.
>>
>> I was able to reproduce the bug, though not as efficiently as my 
>> production system, with the following sample program: 
>> https://gist.github.com/garret-smith/5087169
>>
>> It took over an hour of runtime before the first time jump. I am 
>> working on a better way to reproduce it at the moment, but it's hard 
>> to test the test with a bug so intermittent.
>>
>> I am also testing various other VM versions.  My first hope was that 
>> this was limited to the 64-bit version where we first encountered the 
>> problem, but a change to the 32-bit version has only made the problem 
>> happen less often, not eliminated it.
>>
>> We never saw this bug with R14B03 which we were running previously to 
>> R15B01.  However, system load is different so I can't make a direct 
>> comparison.  I did notice a few significant updates to the Windows 
>> time related code between R14B03 and R15:
>>
>> git log sys_time.c
>>
>> commit 46eb4359b05b220861453a869dc734480ec045a6
>> Author: Patrik Nyblom <pan@REDACTED <mailto:pan@REDACTED>>
>> Date:   Tue Dec 6 19:07:16 2011 +0100
>>
>>     Emulate localtime, gmtime and mktime to enable negative time_t
>>
>> commit 913f05af100e98a8665bbb6168e89fbcfe4ece75
>> Author: Bj<C3><B6>rn-Egil Dahlberg <egil@REDACTED 
>> <mailto:egil@REDACTED>>
>> Date:   Fri Dec 2 15:25:06 2011 +0100
>>
>>     Teach windows sys_localtime_r
>>
>>
> Yep, that's me... But even if I gave a totally weird time back from 
> those, the erlang:now logic should have stopped this from happening. 
> I'll try to reproduce using your example program. If nothing else 
> helps, I'll instrument a VM that gives som traces in the time code...
>> I am completely stumped.  What can I do next to help track down the 
>> source of the bug?
>>
> Unfortunately, so am I. Especially weird that it's load related... 
> Maybe something is not locked as it should be...
>> Thanks,
>> Garret Smith
> Thanks for reporting, I'll get back to you!
>
> Cheers,
> /Patrik
>>
>>
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-bugs
>
>
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20130305/4afb8b06/attachment.htm>