<div dir="ltr"><div><div><div><div><div><div><div>I have been beating my head against a wall for weeks tracking down spooky behaviour in one of our production systems. I finally traced it to "jumps" in the times returned by erlang:now(), which cause all timers in the system to expire at once. I have witnessed this bug on R15B01, in both the 64-bit and 32-bit versions, running on Windows Server 2008 R2, on both bare metal and a VirtualBox VM.<br>
<br></div>The time jump is always around 2126000 seconds, or a little over 24 days. After a jump, the now() time does not try to converge with os:timestamp() as the documentation suggests, although I confirmed that it does converge if you just change the system clock.<br>
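<div>One way to watch for this divergence is to sample both clocks side by side. The following is only a minimal sketch: the module name now_skew, its function names, and the polling interval and threshold arguments are all invented for illustration, and it assumes erlang:now() is still available (it is deprecated in later OTP releases, hence the compile option).<br></div>

```erlang
-module(now_skew).
%% erlang:now/0 is deprecated in newer OTP releases; silence the warning.
-compile(nowarn_deprecated_function).
-export([skew_micros/2, sample/0, watch/2]).

%% Convert a {MegaSecs, Secs, MicroSecs} tuple to whole microseconds.
to_micros({Mega, Sec, Micro}) ->
    (Mega * 1000000 + Sec) * 1000000 + Micro.

%% Signed difference NowTs - OsTs, in microseconds.
skew_micros(NowTs, OsTs) ->
    to_micros(NowTs) - to_micros(OsTs).

%% Read both clocks once and return the current skew.
sample() ->
    skew_micros(erlang:now(), os:timestamp()).

%% Poll forever, logging whenever the absolute skew exceeds the threshold.
%% IntervalMs and ThresholdMicros are hypothetical tuning knobs.
watch(IntervalMs, ThresholdMicros) ->
    Skew = sample(),
    case abs(Skew) > ThresholdMicros of
        true  -> error_logger:warning_msg("now() skew: ~p us~n", [Skew]);
        false -> ok
    end,
    timer:sleep(IntervalMs),
    watch(IntervalMs, ThresholdMicros).
```

<div>Something like spawn(now_skew, watch, [1000, 60000000]) would then log once the two clocks drift more than a minute apart, assuming the process keeps getting scheduled under load.<br></div>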
<br></div>Another VM running concurrently on the same machine but with little load (diagnostic node &amp; production node) did not experience the time jump.<br><br></div><div>Higher load seems to make the time jumps happen more often.<br>
</div><div><br></div>The interval between time jumps varies from seconds to hours, but when a jump occurs, it is always 2126000 + (9 to 26) seconds.<br><br></div>I never see the jump in logfile timestamps, which use os:timestamp() for tagging log messages. I had to start tracing a production node before I caught the jump. Here are some lines from a trace, where the timestamp in trace_ts is printed using calendar:now_to_local_time() and then in raw tuple format:<br>
<br>2013-4-16 21:40:1.993399|{1366,173601,993399}<br>2013-4-16 21:40:1.993400|{1366,173601,993400}<br>2013-5-11 12:13:41.986961|{1368,299621,986961}<br>2013-5-11 12:13:41.986962|{1368,299621,986962}<br><br></div>then a bit later...<br>
<br>2013-5-11 12:36:19.955129|{1368,300979,955129}<br>2013-5-11 12:36:19.955130|{1368,300979,955130}<br>2013-6-5 3:9:49.538830|{1370,426989,538830}<br>2013-6-5 3:9:49.538833|{1370,426989,538833}<br><br></div>I captured many such jumps over the course of a day or so. Obviously from the dates, 2 jumps happened before I started tracing.<br>
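<div>For anyone who wants to check the arithmetic on the trace lines above, here is a small sketch that subtracts two raw timestamp tuples. The module name now_gap and its functions are made up for illustration; the tuples in demo/0 are copied from the trace.<br></div>

```erlang
-module(now_gap).
-export([gap_micros/2, demo/0]).

%% Convert a {MegaSecs, Secs, MicroSecs} tuple to whole microseconds.
to_micros({Mega, Sec, Micro}) ->
    (Mega * 1000000 + Sec) * 1000000 + Micro.

%% Gap between two timestamps, in microseconds (Later - Earlier).
gap_micros(Earlier, Later) ->
    to_micros(Later) - to_micros(Earlier).

%% The two jumps quoted from the trace: the last timestamp before
%% each jump paired with the first timestamp after it.
demo() ->
    [gap_micros({1366,173601,993400}, {1368,299621,986961}),
     gap_micros({1368,300979,955130}, {1370,426989,538830})].
```

<div>The two gaps come out at roughly 2126020 and 2126010 seconds, which matches the 2126000 + (9 to 26) pattern.<br></div>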
<br></div><div>I was able to reproduce the bug, though not as readily as on my production system, with the following sample program: <a href="https://gist.github.com/garret-smith/5087169">https://gist.github.com/garret-smith/5087169</a><br>
<br>It took over an hour of runtime before the first time jump. I am working on a better way to reproduce it at the moment, but it's hard to test the test with a bug so intermittent.<br></div><div><br></div><div>I am also testing various other VM versions. My first hope was that this was limited to the 64-bit version where we first encountered the problem, but a change to the 32-bit version has only made the problem happen less often, not eliminated it.<br>
<br></div><div>We never saw this bug with R14B03, which we were running prior to R15B01. However, system load is different now, so I can't make a direct comparison. I did notice a few significant updates to the Windows time-related code between R14B03 and R15:<br>
<br></div><div>git log sys_time.c<br></div><div><br></div><div>commit 46eb4359b05b220861453a869dc734480ec045a6<br>Author: Patrik Nyblom &lt;<a href="mailto:pan@erlang.org">pan@erlang.org</a>&gt;<br>Date: Tue Dec 6 19:07:16 2011 +0100<br>
<br> Emulate localtime, gmtime and mktime to enable negative time_t<br><br>commit 913f05af100e98a8665bbb6168e89fbcfe4ece75<br>Author: Björn-Egil Dahlberg &lt;<a href="mailto:egil@erlang.org">egil@erlang.org</a>&gt;<br>
Date: Fri Dec 2 15:25:06 2011 +0100<br><br> Teach windows sys_localtime_r<br><br><br></div><div>I am completely stumped. What can I do next to help track down the source of the bug?<br><br></div><div>Thanks,<br></div>
<div>Garret Smith<br></div></div>