[erlang-bugs] R15B01 erlang:now() jumping ~24 days into the future

Garret Smith garret.smith@REDACTED
Tue Mar 12 00:48:28 CET 2013


Been running the test program all day in the same scenario as before.  No
time jumps!  Looking good...


On Mon, Mar 11, 2013 at 9:34 AM, Garret Smith <garret.smith@REDACTED>wrote:

> Patrik,
>
> Our production systems are on R15B1/2, so I won't be able to verify
> against that, but I'll let you know what I see running my test program
> against R16B.
>
> Will you be able to generate a patched R15x version?  If not, I'll try to
> set up a build system and apply the patch locally.
>
> -Garret
>
>
> On Mon, Mar 11, 2013 at 9:26 AM, Patrik Nyblom <pan@REDACTED> wrote:
>
>>  Hi again!
>>
>> I think I've found it. At least I've found one error, hopefully that's
>> the one you've also found :)
>>
>> The sys_gethrtime function has gon new uses in R15 and on, uses where it
>> is no longer protected by the  erts_timeofday_mtx. So - it simply needs a
>> lock of it's own. This gives a slight performance loss, but that could be
>> fixed by using GetTickCount64 on win7 and win2008 at least.
>>
>> Can you try a version of beam.smp.dll with a lock and see if the error is
>> gone on your machines? If that works, I would also like you to try an
>> optimized version, but let's first make sure we have the bug nailed down :)
>>
>> In my dropbox, there's a beam.smp.dll. If you replace
>> $ERL_ROOT/erts-5.10.1/bin/beam.smp.dll with that one and then start werl,
>> the slogan should contain [source-be0da3e]. It is for 64bit windows. The
>> public dropbox URL is:
>> http://dl.dropbox.com/u/17212223/beam.smp.dll
>>
>> This should work without any special messages or such, giving a working
>> erlang:now/0. If it starts sending strange ERROR REPORT's about ticks
>> moving slightly backwards, we have a more complicated bug, but I haven't
>> seen any such messages since i added proper locking.
>>
>> If it's possible for you to test this, I would be immensely grateful!
>>
>> Cheers,
>> /Patrik
>>
>> On 03/07/2013 04:37 PM, Patrik Nyblom wrote:
>>
>> Hi Garret!
>>
>> I've been able to reproduce it on my freshly installed Win2008 machine!
>> Great, now I only need to debug it and find the error :)
>>
>> I'll get back to you as soon as I feel I have a fix - it might take a few
>> days, given the relatively long turn around time, but we'll get there!
>>
>> Thank you for all the help and information!
>>
>> Cheers,
>> /Patrik
>>
>> On 03/05/2013 09:10 PM, Garret Smith wrote:
>>
>>  On the same machine with the same steps as previous, I reproduced the
>> time jump on R16B.
>> This time the jump happened with a <5 sec delta btw now() and
>> os:timestamp().
>> Still jumping ~2126000 seconds.
>>
>>  -Garret
>>
>>
>> On Tue, Mar 5, 2013 at 11:20 AM, Garret Smith <garret.smith@REDACTED>wrote:
>>
>>>   The gist https://gist.github.com/garret-smith/5087169 is updated with
>>> a slightly better version.  I was able to reproduce the jump in less than
>>> an hour.  I also did some more things to perturb the timing code while the
>>> test program was running.
>>>
>>>  Here is the latest info, everything I can think of that may have the
>>> slightest effect:
>>>   * R15B01 64-bit build
>>>   * Pacific time zone (GMT -8)
>>>   * Xeon E5405 in an HP DL160
>>>   * no arguments to erl.exe
>>>   * bursty, high CPU load, >75% memory used by other software
>>>   * running Observer on the test VM displaying the "Load Charts" tab
>>>   * made some small adjustments (~ 60 seconds) to the system clock
>>> while running the tests - now() and os:timestamp() behaved as expected,
>>> initially showing a delta and slowly converging
>>>   * w32tm /resync to fix the system clock some time after perturbing it
>>>
>>>  The time jump in now() occurred when now() was ~9 seconds behind
>>> os:timestamp() as reported by the new test program.
>>>
>>>  I'm starting to look at R16B now.
>>>
>>>  -Garret Smith
>>>
>>>
>>> On Tue, Mar 5, 2013 at 8:37 AM, Garret Smith <garret.smith@REDACTED>wrote:
>>>
>>>>   I haven't seen anything unexpected in os:timestamp().  No jumps at
>>>> all.
>>>>
>>>>  CPU is an Intel Xeon X3430.
>>>>
>>>> I have reproduced it in the LosAngeles/Pacific Time (GMT -8) and US
>>>> East coast time zone (GMT -5).
>>>>
>>>>  I have not yet tried R16B.  I'll be starting that today.  I'm also
>>>> trying to improve the test program, since it's taking quite a long time
>>>> between jumps for me as well.  I'll let you know as soon as I have a better
>>>> one.
>>>>
>>>>  You have no idea how relieved I am that you are looking into this!
>>>>
>>>>  Thanks,
>>>>  Garret Smith
>>>>
>>>>
>>>>  On Tue, Mar 5, 2013 at 3:06 AM, Patrik Nyblom <pan@REDACTED> wrote:
>>>>
>>>>>  Hi again...
>>>>>
>>>>> I'm not sure about one thing. What happens to os:timestamp() during
>>>>> these jumps? Does it stay on track or does it also jump around?
>>>>>
>>>>> I've tried to reproduce it with your program, but has not yet
>>>>> succeeded. Have you seen this on the R16B release as well?
>>>>>
>>>>> Is the hardware in any way fancy (like a lot of cores, some new
>>>>> processor I don't have or something else?) or is there anything else
>>>>> special about the machine? Also the time zone you're running in would be
>>>>> interesting, as there is some time zone specific code there...
>>>>>
>>>>> I would really like to be able to reproduce it so you don't have to do
>>>>> all the tests at your site, it might end up being really time consuming for
>>>>> you if I make to many mistakes :)
>>>>>
>>>>> Cheers,
>>>>> /Patrik
>>>>>
>>>>>
>>>>>
>>>>> On 03/05/2013 08:50 AM, Patrik Nyblom wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> On 03/05/2013 02:26 AM, Garret Smith wrote:
>>>>>
>>>>>     I have been beating my head against a wall for weeks tracking
>>>>> down spooky behaviour[sic] in one of our production systems.  I finally
>>>>> tracked it down to "jumps" in the times returned by erlang:now(), causing
>>>>> all timers in the system to expire at once.  I have witnessed this bug on
>>>>> R15B01, both 64 and 32-bit versions running on Windows Server 2008 R2, both
>>>>> on bare metal and VirtualBox VM.
>>>>>
>>>>>  The time jump is always around 2126000 seconds, or a little over 24
>>>>> days.  The now() time does not try to converge with os:timestamp() as the
>>>>> documentation suggests, and as I confirmed it does if you just change the
>>>>> system clock.
>>>>>
>>>>>  Another VM running concurrently on the same machine but with little
>>>>> load (diagnostic node & production node) did not time jump.
>>>>>
>>>>>  Higher load seems to make the time jumps happen more often.
>>>>>
>>>>>  Frequency between time jumps varies between seconds and hours, but
>>>>> when a jump occurs, it is always 2126000 + (9 to 26) seconds.
>>>>>
>>>>>  I never see the jump in logfile timestamps that use os:timestamp()
>>>>> for tagging log messages.  I had to start tracing a production node before
>>>>> I caught the jump.  Here are some lines from a trace, where the timestamp
>>>>> in trace_ts is printed using calendar:now_to_local_time() and then in raw
>>>>> tuple format:
>>>>>
>>>>> 2013-4-16 21:40:1.993399|{1366,173601,993399}
>>>>> 2013-4-16 21:40:1.993400|{1366,173601,993400}
>>>>> 2013-5-11 12:13:41.986961|{1368,299621,986961}
>>>>> 2013-5-11 12:13:41.986962|{1368,299621,986962}
>>>>>
>>>>>  then a bit later...
>>>>>
>>>>> 2013-5-11 12:36:19.955129|{1368,300979,955129}
>>>>> 2013-5-11 12:36:19.955130|{1368,300979,955130}
>>>>> 2013-6-5 3:9:49.538830|{1370,426989,538830}
>>>>> 2013-6-5 3:9:49.538833|{1370,426989,538833}
>>>>>
>>>>>   Gah! That's obviously not supposed to happen...
>>>>>
>>>>>  I captured many such jumps over the course of a day or so.
>>>>> Obviously from the dates, 2 jumps happened before I started tracing.
>>>>>
>>>>>  I was able to reproduce the bug, though not as efficiently as my
>>>>> production system, with the following sample program:
>>>>> https://gist.github.com/garret-smith/5087169
>>>>>
>>>>> It took over an hour of runtime before the first time jump.  I am
>>>>> working on a better way to reproduce it at the moment, but it's hard to
>>>>> test the test with a bug so intermittent.
>>>>>
>>>>>  I am also testing various other VM versions.  My first hope was that
>>>>> this was limited to the 64-bit version where we first encountered the
>>>>> problem, but a change to the 32-bit version has only made the problem
>>>>> happen less often, not eliminated it.
>>>>>
>>>>>  We never saw this bug with R14B03 which we were running previously
>>>>> to R15B01.  However, system load is different so I can't make a direct
>>>>> comparison.  I did notice a few significant updates to the Windows time
>>>>> related code between R14B03 and R15:
>>>>>
>>>>>  git log sys_time.c
>>>>>
>>>>>  commit 46eb4359b05b220861453a869dc734480ec045a6
>>>>> Author: Patrik Nyblom <pan@REDACTED>
>>>>> Date:   Tue Dec 6 19:07:16 2011 +0100
>>>>>
>>>>>     Emulate localtime, gmtime and mktime to enable negative time_t
>>>>>
>>>>> commit 913f05af100e98a8665bbb6168e89fbcfe4ece75
>>>>> Author: Bj<C3><B6>rn-Egil Dahlberg <egil@REDACTED>
>>>>> Date:   Fri Dec 2 15:25:06 2011 +0100
>>>>>
>>>>>     Teach windows sys_localtime_r
>>>>>
>>>>>
>>>>>   Yep, that's me... But even if I gave a totally weird time back from
>>>>> those, the erlang:now logic should have stopped this from happening. I'll
>>>>> try to reproduce using your example program. If nothing else helps, I'll
>>>>> instrument a VM that gives som traces in the time code...
>>>>>
>>>>>  I am completely stumped.  What can I do next to help track down the
>>>>> source of the bug?
>>>>>
>>>>>   Unfortunately, so am I. Especially weird that it's load related...
>>>>> Maybe something is not locked as it should be...
>>>>>
>>>>>  Thanks,
>>>>>  Garret Smith
>>>>>
>>>>> Thanks for reporting, I'll get back to you!
>>>>>
>>>>> Cheers,
>>>>> /Patrik
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> erlang-bugs mailing listerlang-bugs@REDACTED://erlang.org/mailman/listinfo/erlang-bugs
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> erlang-bugs mailing listerlang-bugs@REDACTED://erlang.org/mailman/listinfo/erlang-bugs
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> erlang-bugs mailing list
>>>>> erlang-bugs@REDACTED
>>>>> http://erlang.org/mailman/listinfo/erlang-bugs
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>> _______________________________________________
>> erlang-bugs mailing listerlang-bugs@REDACTED://erlang.org/mailman/listinfo/erlang-bugs
>>
>>
>>
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-bugs
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20130311/a0ef84f5/attachment.htm>


More information about the erlang-bugs mailing list