[erlang-bugs] R15B01 erlang:now() jumping ~24 days into the future

Tue Mar 12 10:57:11 CET 2013

Hi!

Good! Thanks!

I can build a patched R15 beam.dll for you, easiest is R15B03, but i can 
do a patched beam for R15B02 if that's really needed. In the end I'll 
probably build some kind of R15B03-2 and a R16B00-1 or something, so 
whoever wants the patch can get binaries. However I would like to have 
something tested in your real system, if that's OK with you. So - which 
version is best to patch for? R15B02?

/Patrik

On 03/12/2013 12:48 AM, Garret Smith wrote:
> Been running the test program all day in the same scenario as before.  
> No time jumps!  Looking good...
>
>
> On Mon, Mar 11, 2013 at 9:34 AM, Garret Smith <garret.smith@REDACTED 
> <mailto:garret.smith@REDACTED>> wrote:
>
>     Patrik,
>
>     Our production systems are on R15B1/2, so I won't be able to
>     verify against that, but I'll let you know what I see running my
>     test program against R16B.
>
>     Will you be able to generate a patched R15x version?  If not, I'll
>     try to set up a build system and apply the patch locally.
>
>     -Garret
>
>
>     On Mon, Mar 11, 2013 at 9:26 AM, Patrik Nyblom <pan@REDACTED
>     <mailto:pan@REDACTED>> wrote:
>
>         Hi again!
>
>         I think I've found it. At least I've found one error,
>         hopefully that's the one you've also found :)
>
>         The sys_gethrtime function has gon new uses in R15 and on,
>         uses where it is no longer protected by the 
>         erts_timeofday_mtx. So - it simply needs a lock of it's own.
>         This gives a slight performance loss, but that could be fixed
>         by using GetTickCount64 on win7 and win2008 at least.
>
>         Can you try a version of beam.smp.dll with a lock and see if
>         the error is gone on your machines? If that works, I would
>         also like you to try an optimized version, but let's first
>         make sure we have the bug nailed down :)
>
>         In my dropbox, there's a beam.smp.dll. If you replace
>         $ERL_ROOT/erts-5.10.1/bin/beam.smp.dll with that one and then
>         start werl, the slogan should contain [source-be0da3e]. It is
>         for 64bit windows. The public dropbox URL is:
>         http://dl.dropbox.com/u/17212223/beam.smp.dll
>
>         This should work without any special messages or such, giving
>         a working erlang:now/0. If it starts sending strange ERROR
>         REPORT's about ticks moving slightly backwards, we have a more
>         complicated bug, but I haven't seen any such messages since i
>         added proper locking.
>
>         If it's possible for you to test this, I would be immensely
>         grateful!
>
>         Cheers,
>         /Patrik
>
>         On 03/07/2013 04:37 PM, Patrik Nyblom wrote:
>>         Hi Garret!
>>
>>         I've been able to reproduce it on my freshly installed
>>         Win2008 machine! Great, now I only need to debug it and find
>>         the error :)
>>
>>         I'll get back to you as soon as I feel I have a fix - it
>>         might take a few days, given the relatively long turn around
>>         time, but we'll get there!
>>
>>         Thank you for all the help and information!
>>
>>         Cheers,
>>         /Patrik
>>
>>         On 03/05/2013 09:10 PM, Garret Smith wrote:
>>>         On the same machine with the same steps as previous, I
>>>         reproduced the time jump on R16B.
>>>         This time the jump happened with a <5 sec delta btw now()
>>>         and os:timestamp().
>>>         Still jumping ~2126000 seconds.
>>>
>>>         -Garret
>>>
>>>
>>>         On Tue, Mar 5, 2013 at 11:20 AM, Garret Smith
>>>         <garret.smith@REDACTED <mailto:garret.smith@REDACTED>> wrote:
>>>
>>>             The gist https://gist.github.com/garret-smith/5087169 is
>>>             updated with a slightly better version.  I was able to
>>>             reproduce the jump in less than an hour.  I also did
>>>             some more things to perturb the timing code while the
>>>             test program was running.
>>>
>>>             Here is the latest info, everything I can think of that
>>>             may have the slightest effect:
>>>              * R15B01 64-bit build
>>>              * Pacific time zone (GMT -8)
>>>              * Xeon E5405 in an HP DL160
>>>              * no arguments to erl.exe
>>>              * bursty, high CPU load, >75% memory used by other software
>>>              * running Observer on the test VM displaying the "Load
>>>             Charts" tab
>>>              * made some small adjustments (~ 60 seconds) to the
>>>             system clock while running the tests - now() and
>>>             os:timestamp() behaved as expected, initially showing a
>>>             delta and slowly converging
>>>              * w32tm /resync to fix the system clock some time after
>>>             perturbing it
>>>
>>>             The time jump in now() occurred when now() was ~9
>>>             seconds behind os:timestamp() as reported by the new
>>>             test program.
>>>
>>>             I'm starting to look at R16B now.
>>>
>>>             -Garret Smith
>>>
>>>
>>>             On Tue, Mar 5, 2013 at 8:37 AM, Garret Smith
>>>             <garret.smith@REDACTED <mailto:garret.smith@REDACTED>>
>>>             wrote:
>>>
>>>                 I haven't seen anything unexpected in
>>>                 os:timestamp(). No jumps at all.
>>>
>>>                 CPU is an Intel Xeon X3430.
>>>
>>>                 I have reproduced it in the LosAngeles/Pacific Time
>>>                 (GMT -8) and US East coast time zone (GMT -5).
>>>
>>>                 I have not yet tried R16B.  I'll be starting that
>>>                 today.  I'm also trying to improve the test program,
>>>                 since it's taking quite a long time between jumps
>>>                 for me as well.  I'll let you know as soon as I have
>>>                 a better one.
>>>
>>>                 You have no idea how relieved I am that you are
>>>                 looking into this!
>>>
>>>                 Thanks,
>>>                 Garret Smith
>>>
>>>
>>>                 On Tue, Mar 5, 2013 at 3:06 AM, Patrik Nyblom
>>>                 <pan@REDACTED <mailto:pan@REDACTED>> wrote:
>>>
>>>                     Hi again...
>>>
>>>                     I'm not sure about one thing. What happens to
>>>                     os:timestamp() during these jumps? Does it stay
>>>                     on track or does it also jump around?
>>>
>>>                     I've tried to reproduce it with your program,
>>>                     but has not yet succeeded. Have you seen this on
>>>                     the R16B release as well?
>>>
>>>                     Is the hardware in any way fancy (like a lot of
>>>                     cores, some new processor I don't have or
>>>                     something else?) or is there anything else
>>>                     special about the machine? Also the time zone
>>>                     you're running in would be interesting, as there
>>>                     is some time zone specific code there...
>>>
>>>                     I would really like to be able to reproduce it
>>>                     so you don't have to do all the tests at your
>>>                     site, it might end up being really time
>>>                     consuming for you if I make to many mistakes :)
>>>
>>>                     Cheers,
>>>                     /Patrik
>>>
>>>
>>>
>>>                     On 03/05/2013 08:50 AM, Patrik Nyblom wrote:
>>>>                     Hi!
>>>>
>>>>                     On 03/05/2013 02:26 AM, Garret Smith wrote:
>>>>>                     I have been beating my head against a wall for
>>>>>                     weeks tracking down spooky behaviour[sic] in
>>>>>                     one of our production systems.  I finally
>>>>>                     tracked it down to "jumps" in the times
>>>>>                     returned by erlang:now(), causing all timers
>>>>>                     in the system to expire at once.  I have
>>>>>                     witnessed this bug on R15B01, both 64 and
>>>>>                     32-bit versions running on Windows Server 2008
>>>>>                     R2, both on bare metal and VirtualBox VM.
>>>>>
>>>>>                     The time jump is always around 2126000
>>>>>                     seconds, or a little over 24 days.  The now()
>>>>>                     time does not try to converge with
>>>>>                     os:timestamp() as the documentation suggests,
>>>>>                     and as I confirmed it does if you just change
>>>>>                     the system clock.
>>>>>
>>>>>                     Another VM running concurrently on the same
>>>>>                     machine but with little load (diagnostic node
>>>>>                     & production node) did not time jump.
>>>>>
>>>>>                     Higher load seems to make the time jumps
>>>>>                     happen more often.
>>>>>
>>>>>                     Frequency between time jumps varies between
>>>>>                     seconds and hours, but when a jump occurs, it
>>>>>                     is always 2126000 + (9 to 26) seconds.
>>>>>
>>>>>                     I never see the jump in logfile timestamps
>>>>>                     that use os:timestamp() for tagging log
>>>>>                     messages. I had to start tracing a production
>>>>>                     node before I caught the jump.  Here are some
>>>>>                     lines from a trace, where the timestamp in
>>>>>                     trace_ts is printed using
>>>>>                     calendar:now_to_local_time() and then in raw
>>>>>                     tuple format:
>>>>>
>>>>>                     2013-4-16 21:40:1.993399|{1366,173601,993399}
>>>>>                     2013-4-16 21:40:1.993400|{1366,173601,993400}
>>>>>                     2013-5-11 12:13:41.986961|{1368,299621,986961}
>>>>>                     2013-5-11 12:13:41.986962|{1368,299621,986962}
>>>>>
>>>>>                     then a bit later...
>>>>>
>>>>>                     2013-5-11 12:36:19.955129|{1368,300979,955129}
>>>>>                     2013-5-11 12:36:19.955130|{1368,300979,955130}
>>>>>                     2013-6-5 3:9:49.538830|{1370,426989,538830}
>>>>>                     2013-6-5 3:9:49.538833|{1370,426989,538833}
>>>>>
>>>>                     Gah! That's obviously not supposed to happen...
>>>>>                     I captured many such jumps over the course of
>>>>>                     a day or so. Obviously from the dates, 2 jumps
>>>>>                     happened before I started tracing.
>>>>>
>>>>>                     I was able to reproduce the bug, though not as
>>>>>                     efficiently as my production system, with the
>>>>>                     following sample program:
>>>>>                     https://gist.github.com/garret-smith/5087169
>>>>>
>>>>>                     It took over an hour of runtime before the
>>>>>                     first time jump.  I am working on a better way
>>>>>                     to reproduce it at the moment, but it's hard
>>>>>                     to test the test with a bug so intermittent.
>>>>>
>>>>>                     I am also testing various other VM versions.
>>>>>                     My first hope was that this was limited to the
>>>>>                     64-bit version where we first encountered the
>>>>>                     problem, but a change to the 32-bit version
>>>>>                     has only made the problem happen less often,
>>>>>                     not eliminated it.
>>>>>
>>>>>                     We never saw this bug with R14B03 which we
>>>>>                     were running previously to R15B01. However,
>>>>>                     system load is different so I can't make a
>>>>>                     direct comparison.  I did notice a few
>>>>>                     significant updates to the Windows time
>>>>>                     related code between R14B03 and R15:
>>>>>
>>>>>                     git log sys_time.c
>>>>>
>>>>>                     commit 46eb4359b05b220861453a869dc734480ec045a6
>>>>>                     Author: Patrik Nyblom <pan@REDACTED
>>>>>                     <mailto:pan@REDACTED>>
>>>>>                     Date:   Tue Dec 6 19:07:16 2011 +0100
>>>>>
>>>>>                         Emulate localtime, gmtime and mktime to
>>>>>                     enable negative time_t
>>>>>
>>>>>                     commit 913f05af100e98a8665bbb6168e89fbcfe4ece75
>>>>>                     Author: Bj<C3><B6>rn-Egil Dahlberg
>>>>>                     <egil@REDACTED <mailto:egil@REDACTED>>
>>>>>                     Date:   Fri Dec 2 15:25:06 2011 +0100
>>>>>
>>>>>                         Teach windows sys_localtime_r
>>>>>
>>>>>
>>>>                     Yep, that's me... But even if I gave a totally
>>>>                     weird time back from those, the erlang:now
>>>>                     logic should have stopped this from happening.
>>>>                     I'll try to reproduce using your example
>>>>                     program. If nothing else helps, I'll instrument
>>>>                     a VM that gives som traces in the time code...
>>>>>                     I am completely stumped.  What can I do next
>>>>>                     to help track down the source of the bug?
>>>>>
>>>>                     Unfortunately, so am I. Especially weird that
>>>>                     it's load related... Maybe something is not
>>>>                     locked as it should be...
>>>>>                     Thanks,
>>>>>                     Garret Smith
>>>>                     Thanks for reporting, I'll get back to you!
>>>>
>>>>                     Cheers,
>>>>                     /Patrik
>>>>>
>>>>>
>>>>>                     _______________________________________________
>>>>>                     erlang-bugs mailing list
>>>>>                     erlang-bugs@REDACTED  <mailto:erlang-bugs@REDACTED>
>>>>>                     http://erlang.org/mailman/listinfo/erlang-bugs
>>>>
>>>>
>>>>
>>>>                     _______________________________________________
>>>>                     erlang-bugs mailing list
>>>>                     erlang-bugs@REDACTED  <mailto:erlang-bugs@REDACTED>
>>>>                     http://erlang.org/mailman/listinfo/erlang-bugs
>>>
>>>
>>>                     _______________________________________________
>>>                     erlang-bugs mailing list
>>>                     erlang-bugs@REDACTED
>>>                     <mailto:erlang-bugs@REDACTED>
>>>                     http://erlang.org/mailman/listinfo/erlang-bugs
>>>
>>>
>>>
>>>
>>
>>
>>
>>         _______________________________________________
>>         erlang-bugs mailing list
>>         erlang-bugs@REDACTED  <mailto:erlang-bugs@REDACTED>
>>         http://erlang.org/mailman/listinfo/erlang-bugs
>
>
>         _______________________________________________
>         erlang-bugs mailing list
>         erlang-bugs@REDACTED <mailto:erlang-bugs@REDACTED>
>         http://erlang.org/mailman/listinfo/erlang-bugs
>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20130312/0dfc7cdf/attachment.htm>