[erlang-bugs] R15B01 erlang:now() jumping ~24 days into the future

Mon Mar 11 17:26:05 CET 2013

Hi again!

I think I've found it. At least I've found one error, hopefully that's 
the one you've also found :)

The sys_gethrtime function has gon new uses in R15 and on, uses where it 
is no longer protected by the  erts_timeofday_mtx. So - it simply needs 
a lock of it's own. This gives a slight performance loss, but that could 
be fixed by using GetTickCount64 on win7 and win2008 at least.

Can you try a version of beam.smp.dll with a lock and see if the error 
is gone on your machines? If that works, I would also like you to try an 
optimized version, but let's first make sure we have the bug nailed down :)

In my dropbox, there's a beam.smp.dll. If you replace 
$ERL_ROOT/erts-5.10.1/bin/beam.smp.dll with that one and then start 
werl, the slogan should contain [source-be0da3e]. It is for 64bit 
windows. The public dropbox URL is:
http://dl.dropbox.com/u/17212223/beam.smp.dll

This should work without any special messages or such, giving a working 
erlang:now/0. If it starts sending strange ERROR REPORT's about ticks 
moving slightly backwards, we have a more complicated bug, but I haven't 
seen any such messages since i added proper locking.

If it's possible for you to test this, I would be immensely grateful!

Cheers,
/Patrik
On 03/07/2013 04:37 PM, Patrik Nyblom wrote:
> Hi Garret!
>
> I've been able to reproduce it on my freshly installed Win2008 
> machine! Great, now I only need to debug it and find the error :)
>
> I'll get back to you as soon as I feel I have a fix - it might take a 
> few days, given the relatively long turn around time, but we'll get there!
>
> Thank you for all the help and information!
>
> Cheers,
> /Patrik
>
> On 03/05/2013 09:10 PM, Garret Smith wrote:
>> On the same machine with the same steps as previous, I reproduced the 
>> time jump on R16B.
>> This time the jump happened with a <5 sec delta btw now() and 
>> os:timestamp().
>> Still jumping ~2126000 seconds.
>>
>> -Garret
>>
>>
>> On Tue, Mar 5, 2013 at 11:20 AM, Garret Smith <garret.smith@REDACTED 
>> <mailto:garret.smith@REDACTED>> wrote:
>>
>>     The gist https://gist.github.com/garret-smith/5087169 is updated
>>     with a slightly better version.  I was able to reproduce the jump
>>     in less than an hour.  I also did some more things to perturb the
>>     timing code while the test program was running.
>>
>>     Here is the latest info, everything I can think of that may have
>>     the slightest effect:
>>      * R15B01 64-bit build
>>      * Pacific time zone (GMT -8)
>>      * Xeon E5405 in an HP DL160
>>      * no arguments to erl.exe
>>      * bursty, high CPU load, >75% memory used by other software
>>      * running Observer on the test VM displaying the "Load Charts" tab
>>      * made some small adjustments (~ 60 seconds) to the system clock
>>     while running the tests - now() and os:timestamp() behaved as
>>     expected, initially showing a delta and slowly converging
>>      * w32tm /resync to fix the system clock some time after
>>     perturbing it
>>
>>     The time jump in now() occurred when now() was ~9 seconds behind
>>     os:timestamp() as reported by the new test program.
>>
>>     I'm starting to look at R16B now.
>>
>>     -Garret Smith
>>
>>
>>     On Tue, Mar 5, 2013 at 8:37 AM, Garret Smith
>>     <garret.smith@REDACTED <mailto:garret.smith@REDACTED>> wrote:
>>
>>         I haven't seen anything unexpected in os:timestamp().  No
>>         jumps at all.
>>
>>         CPU is an Intel Xeon X3430.
>>
>>         I have reproduced it in the LosAngeles/Pacific Time (GMT -8)
>>         and US East coast time zone (GMT -5).
>>
>>         I have not yet tried R16B.  I'll be starting that today.  I'm
>>         also trying to improve the test program, since it's taking
>>         quite a long time between jumps for me as well.  I'll let you
>>         know as soon as I have a better one.
>>
>>         You have no idea how relieved I am that you are looking into
>>         this!
>>
>>         Thanks,
>>         Garret Smith
>>
>>
>>         On Tue, Mar 5, 2013 at 3:06 AM, Patrik Nyblom <pan@REDACTED
>>         <mailto:pan@REDACTED>> wrote:
>>
>>             Hi again...
>>
>>             I'm not sure about one thing. What happens to
>>             os:timestamp() during these jumps? Does it stay on track
>>             or does it also jump around?
>>
>>             I've tried to reproduce it with your program, but has not
>>             yet succeeded. Have you seen this on the R16B release as
>>             well?
>>
>>             Is the hardware in any way fancy (like a lot of cores,
>>             some new processor I don't have or something else?) or is
>>             there anything else special about the machine? Also the
>>             time zone you're running in would be interesting, as
>>             there is some time zone specific code there...
>>
>>             I would really like to be able to reproduce it so you
>>             don't have to do all the tests at your site, it might end
>>             up being really time consuming for you if I make to many
>>             mistakes :)
>>
>>             Cheers,
>>             /Patrik
>>
>>
>>
>>             On 03/05/2013 08:50 AM, Patrik Nyblom wrote:
>>>             Hi!
>>>
>>>             On 03/05/2013 02:26 AM, Garret Smith wrote:
>>>>             I have been beating my head against a wall for weeks
>>>>             tracking down spooky behaviour[sic] in one of our
>>>>             production systems.  I finally tracked it down to
>>>>             "jumps" in the times returned by erlang:now(), causing
>>>>             all timers in the system to expire at once.  I have
>>>>             witnessed this bug on R15B01, both 64 and 32-bit
>>>>             versions running on Windows Server 2008 R2, both on
>>>>             bare metal and VirtualBox VM.
>>>>
>>>>             The time jump is always around 2126000 seconds, or a
>>>>             little over 24 days.  The now() time does not try to
>>>>             converge with os:timestamp() as the documentation
>>>>             suggests, and as I confirmed it does if you just change
>>>>             the system clock.
>>>>
>>>>             Another VM running concurrently on the same machine but
>>>>             with little load (diagnostic node & production node)
>>>>             did not time jump.
>>>>
>>>>             Higher load seems to make the time jumps happen more often.
>>>>
>>>>             Frequency between time jumps varies between seconds and
>>>>             hours, but when a jump occurs, it is always 2126000 +
>>>>             (9 to 26) seconds.
>>>>
>>>>             I never see the jump in logfile timestamps that use
>>>>             os:timestamp() for tagging log messages.  I had to
>>>>             start tracing a production node before I caught the
>>>>             jump.  Here are some lines from a trace, where the
>>>>             timestamp in trace_ts is printed using
>>>>             calendar:now_to_local_time() and then in raw tuple format:
>>>>
>>>>             2013-4-16 21:40:1.993399|{1366,173601,993399}
>>>>             2013-4-16 21:40:1.993400|{1366,173601,993400}
>>>>             2013-5-11 12:13:41.986961|{1368,299621,986961}
>>>>             2013-5-11 12:13:41.986962|{1368,299621,986962}
>>>>
>>>>             then a bit later...
>>>>
>>>>             2013-5-11 12:36:19.955129|{1368,300979,955129}
>>>>             2013-5-11 12:36:19.955130|{1368,300979,955130}
>>>>             2013-6-5 3:9:49.538830|{1370,426989,538830}
>>>>             2013-6-5 3:9:49.538833|{1370,426989,538833}
>>>>
>>>             Gah! That's obviously not supposed to happen...
>>>>             I captured many such jumps over the course of a day or
>>>>             so.  Obviously from the dates, 2 jumps happened before
>>>>             I started tracing.
>>>>
>>>>             I was able to reproduce the bug, though not as
>>>>             efficiently as my production system, with the following
>>>>             sample program:
>>>>             https://gist.github.com/garret-smith/5087169
>>>>
>>>>             It took over an hour of runtime before the first time
>>>>             jump.  I am working on a better way to reproduce it at
>>>>             the moment, but it's hard to test the test with a bug
>>>>             so intermittent.
>>>>
>>>>             I am also testing various other VM versions.  My first
>>>>             hope was that this was limited to the 64-bit version
>>>>             where we first encountered the problem, but a change to
>>>>             the 32-bit version has only made the problem happen
>>>>             less often, not eliminated it.
>>>>
>>>>             We never saw this bug with R14B03 which we were running
>>>>             previously to R15B01.  However, system load is
>>>>             different so I can't make a direct comparison.  I did
>>>>             notice a few significant updates to the Windows time
>>>>             related code between R14B03 and R15:
>>>>
>>>>             git log sys_time.c
>>>>
>>>>             commit 46eb4359b05b220861453a869dc734480ec045a6
>>>>             Author: Patrik Nyblom <pan@REDACTED
>>>>             <mailto:pan@REDACTED>>
>>>>             Date:   Tue Dec 6 19:07:16 2011 +0100
>>>>
>>>>                 Emulate localtime, gmtime and mktime to enable
>>>>             negative time_t
>>>>
>>>>             commit 913f05af100e98a8665bbb6168e89fbcfe4ece75
>>>>             Author: Bj<C3><B6>rn-Egil Dahlberg <egil@REDACTED
>>>>             <mailto:egil@REDACTED>>
>>>>             Date:   Fri Dec 2 15:25:06 2011 +0100
>>>>
>>>>                 Teach windows sys_localtime_r
>>>>
>>>>
>>>             Yep, that's me... But even if I gave a totally weird
>>>             time back from those, the erlang:now logic should have
>>>             stopped this from happening. I'll try to reproduce using
>>>             your example program. If nothing else helps, I'll
>>>             instrument a VM that gives som traces in the time code...
>>>>             I am completely stumped.  What can I do next to help
>>>>             track down the source of the bug?
>>>>
>>>             Unfortunately, so am I. Especially weird that it's load
>>>             related... Maybe something is not locked as it should be...
>>>>             Thanks,
>>>>             Garret Smith
>>>             Thanks for reporting, I'll get back to you!
>>>
>>>             Cheers,
>>>             /Patrik
>>>>
>>>>
>>>>             _______________________________________________
>>>>             erlang-bugs mailing list
>>>>             erlang-bugs@REDACTED  <mailto:erlang-bugs@REDACTED>
>>>>             http://erlang.org/mailman/listinfo/erlang-bugs
>>>
>>>
>>>
>>>             _______________________________________________
>>>             erlang-bugs mailing list
>>>             erlang-bugs@REDACTED  <mailto:erlang-bugs@REDACTED>
>>>             http://erlang.org/mailman/listinfo/erlang-bugs
>>
>>
>>             _______________________________________________
>>             erlang-bugs mailing list
>>             erlang-bugs@REDACTED <mailto:erlang-bugs@REDACTED>
>>             http://erlang.org/mailman/listinfo/erlang-bugs
>>
>>
>>
>>
>
>
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20130311/9c46cc6d/attachment.htm>