<p>A shot from the hip: are these servers by any chance running on a virtualized machine?<br>
On two different occasions I've run into a bug in VMware that makes the high-resolution timer return bad values: very rarely, but very confusing for anything timer-based.</p>
<div class="gmail_quote">Den 18/07/2012 14.26 skrev "Fred Hebert" <<a href="mailto:mononcqc@ferd.ca">mononcqc@ferd.ca</a>>:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I'm having the weirdest of issues right now. If you bear with me for the read, you'll know what I mean.<br>
<br>
I recently pushed an experimental version of vmstats online on our production servers, and it mostly works fine.<br>
<br>
The code in question is a gen_server sending itself messages here:<br>
<a href="https://github.com/ferd/vmstats/blob/master/src/vmstats_server.erl" target="_blank">https://github.com/ferd/vmstats/blob/master/src/vmstats_server.erl</a><br>
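For context, the self-messaging pattern in that gen_server looks roughly like this (a minimal sketch, not the actual vmstats_server code; the module name, message name, and interval are illustrative):<br>
<br>
<pre>
%% Minimal sketch of a gen_server that re-arms its own timer on each
%% loop, similar in spirit to vmstats_server (names are illustrative).
-module(tick_server).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(INTERVAL, 1000). %% milliseconds between samples

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    %% erlang:send_after/3 delivers the message to this process' mailbox
    Ref = erlang:send_after(?INTERVAL, self(), tick),
    {ok, Ref}.

handle_info(tick, _OldRef) ->
    %% sample and push over UDP here, then re-arm the timer
    Ref = erlang:send_after(?INTERVAL, self(), tick),
    {noreply, Ref}.

handle_call(_Msg, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
</pre>
<br>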
<br>
It sends the data over UDP sockets with the help of <a href="https://github.com/lpgauth/statsderl/blob/master/src/statsderl.erl" target="_blank">https://github.com/lpgauth/statsderl/blob/master/src/statsderl.erl</a><br>
<br>
Now, I said it 'mostly' works fine. The problem is that after a few hours, the vmstats-related statistics on two of the servers go dead. By that I mean that in our reports, all I see is a flat line (see <a href="http://i.imgur.com/LIJ1J.png" target="_blank">http://i.imgur.com/LIJ1J.png</a>, where h019 stops sending data near 19:16).<br>
<br>
There is no error log, no exception, and everything else runs fine. ngrep shows nothing coming from vmstats over UDP, but other parts of the server stack keep sending data through their UDP ports just fine. At this point I would suspect the server process is locked up, but here's the weird part.<br>
<br>
As soon as I connect to the virtual machine (^G -&gt; r 'nodename@host' &lt;Enter&gt;) to inspect the vmstats_server, data starts pouring in again (this is what happens in the previous image at 19:28). The mere act of connecting to the node to understand what the problem is makes it go away.<br>
<br>
I had the bug happen to me again a bit past 4 this morning. When I unstuck it around 8, I got the following data on the next few loops of the server (caught with "ngrep 'vmstats.scheduler_wall_time.1\.' udp"):<br>
<br>
##### 1<br>
vmstats.scheduler_wall_time.1.active: 189374595<br>
vmstats.scheduler_wall_time.1.total: 1002022817<br>
##### 2<br>
vmstats.scheduler_wall_time.1.active:2293308912394<br>
vmstats.scheduler_wall_time.1.total:12460747343114<br>
##### 3<br>
vmstats.scheduler_wall_time.1.active:186326615<br>
vmstats.scheduler_wall_time.1.total:1004246720<br>
<br>
The interesting thing about the numbers above is that the server loops every second or so. They come from subtracting successive scheduler_wall_time samples (available as of R15B01), taken in slices of about one second per loop. Thus the values '1002022817' and '1004246720' each correspond to roughly one second of wall time (the Erlang VM says the unit is undefined).<br>
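The delta computation is roughly the following (a sketch of the standard diffing idiom, not the exact vmstats code; note that scheduler_wall_time must first be enabled via system_flag):<br>
<br>
<pre>
%% Sketch of diffing scheduler_wall_time samples (R15B01+).
%% The unit is undefined; only Active/Total ratios and deltas matter.
erlang:system_flag(scheduler_wall_time, true),
Ts0 = lists:sort(erlang:statistics(scheduler_wall_time)),
timer:sleep(1000),
Ts1 = lists:sort(erlang:statistics(scheduler_wall_time)),
%% One {SchedulerId, ActiveDelta, TotalDelta} tuple per scheduler
[{I, A1 - A0, T1 - T0}
 || {{I, A0, T0}, {I, A1, T1}} &lt;- lists:zip(Ts0, Ts1)].
</pre>
<br>
With a one-second sleep, each TotalDelta should land near the usual ~1.0e9 seen above, which is why a single delta of 12460747343114 reads as thousands of missed loops accumulated into one sample.<br>
<br>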
<br>
If I divide the second delta, '12460747343114', by those per-second totals, I get about 12435 seconds, or roughly three and a half hours, matching the time during which the server sent nothing. I sadly have no absolute timestamp.<br>
<br>
This tells me that when things get stuck, the gen_server itself stops polling (or at least stops pushing data) until the moment I connect to the virtual machine, at which point it gets back to work. So this wouldn't be related to UDP. Hypocritical thing: it takes breaks until I notice, then goes back to work.<br>
<br>
I'm running R15B01. Only 2 out of the 6 servers we deployed it to show this issue (so far), and it happens a couple of times a day, unpredictably.<br>
<br>
Can anyone spot an obvious bug in the gen_server? Can this be some weird scheduling behavior? Or maybe a bug with erlang:statistics(scheduler_wall_time)? Anyone else ever had similar issues?<br>
(should I have sent this to erlang-bugs?)<br>
<br>
Thanks for your help,<br>
Fred.<br>
<br>
<br>
<br>
_______________________________________________<br>
erlang-questions mailing list<br>
<a href="mailto:erlang-questions@erlang.org" target="_blank">erlang-questions@erlang.org</a><br>
<a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>
</blockquote></div>