[erlang-questions] UDP sockets and gen_server are hypocritical and it makes me mad

Wed Jul 18 14:50:09 CEST 2012

A shot from the hip: are these servers by any chance running on a
virtualized machine?
I have twice run into a bug in VMware that causes the high-resolution
timer to return bad values. It happens very rarely, but it is very
confusing for timers.
On 18/07/2012 at 14:26, "Fred Hebert" <mononcqc@REDACTED> wrote:

> I'm having the weirdest of issues right now. If you bear with me for the
> read, you'll know what I mean.
>
> I recently pushed an experimental version of vmstats online on our
> production servers, and it mostly works fine.
>
> The code in question is a gen_server sending itself messages here:
> https://github.com/ferd/vmstats/blob/master/src/vmstats_server.erl
>
> And it sends through UDP sockets with the help of
> https://github.com/lpgauth/statsderl/blob/master/src/statsderl.erl
>
> Now, I said it 'mostly' works fine. The problem is that after a few hours,
> the vmstats statistics from two of the servers go dead. By that, I mean
> that in our reports, all I see is a flat line (see
> http://i.imgur.com/LIJ1J.png, where h019 stops sending data near 19:16).
>
> There is no error log, no exception, everything else runs fine. ngrep
> shows nothing coming from vmstats on UDP, but other parts of the server
> stack keep sending data through the UDP port fine. At this point I would
> suspect that the server is locked up, but here's the weird part.
>
> As soon as I connect to the virtual machine (^G -> r 'nodename@REDACTED'
> <Enter>) to inspect the vmstats_server, data starts pouring in again (this
> is what happens on the previous image at 19:28). Any mere attempt to
> connect to the node to understand what the problem is causes it to be
> resolved.
>
> I had the bug happen to me again a bit past 4 this morning. When I unstuck
> it around 8, I got the following data on the next few loops of the server
> (caught with "ngrep 'vmstats.scheduler_wall_time.1\.' udp"):
>
> ##### 1
> vmstats.scheduler_wall_time.1.active: 189374595
> vmstats.scheduler_wall_time.1.total: 1002022817
> ##### 2
> vmstats.scheduler_wall_time.1.active:2293308912394
> vmstats.scheduler_wall_time.1.total:12460747343114
> ##### 3
> vmstats.scheduler_wall_time.1.active:186326615
> vmstats.scheduler_wall_time.1.total:1004246720
>
> The interesting thing with the numbers above is that the server loops
> every second or so. The numbers come from subtracting successive
> scheduler_wall_time values (available as of R15B01), taken 1 second apart
> on each loop. Thus, the values '1002022817' and '1004246720' each
> correspond to roughly one second (the Erlang VM says the unit is undefined).
>
> If I make a ratio of them with '12460747343114' (the second value), I get
> 12435 seconds, or roughly 3h27 (3.45 hours), about as long as the server
> sent nothing. I sadly have no absolute timestamp.
>
> This tells me that when things get stuck, the gen_server itself stops
> polling (or at least pushing data), up until the point I connect to the
> virtual machine, and it gets back to work. This wouldn't be related to UDP.
> Hypocritical thing. It takes breaks until I notice it then goes back to
> work.
>
> I'm running R15B01. Only 2 of the 6 servers we deployed it to show this
> issue (so far), and it happens a couple of times a day, unpredictably.
>
> Can anyone spot an obvious bug in the gen_server? Can this be some weird
> scheduling behavior? Or maybe a bug with erlang:statistics(scheduler_wall_time)?
> Anyone else ever had similar issues?
> (should I have sent this to erlang-bugs?)
>
> Thanks for your help,
> Fred.
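
For anyone who wants to check the arithmetic in the quoted message, here is a
minimal Erlang sketch. It assumes, as Fred does, that a total-time delta of
about 1.0e9 units corresponds to one second of polling. The module and
function names (wall_time_gap, delta/2, gap_seconds/2, to_hms/1) are made up
for illustration and are not part of vmstats. The sample shape is the
documented return value of erlang:statistics(scheduler_wall_time), which only
returns data once erlang:system_flag(scheduler_wall_time, true) has been set.

```erlang
-module(wall_time_gap).
-export([delta/2, gap_seconds/2, to_hms/1]).

%% Per-scheduler difference between two scheduler_wall_time samples.
%% Each sample is a list of {SchedulerId, ActiveTime, TotalTime} tuples.
%% Both lists are sorted first because the VM does not guarantee an order.
%% Pairs whose scheduler ids do not match are dropped by the pattern filter.
delta(Old, New) ->
    [{Id, A1 - A0, T1 - T0}
     || {{Id, A0, T0}, {Id, A1, T1}}
            <- lists:zip(lists:sort(Old), lists:sort(New))].

%% Estimate the length of a gap in whole seconds, ASSUMING that
%% OneSecondTotal is the total-time delta seen over a normal 1-second loop.
gap_seconds(StuckTotal, OneSecondTotal) when OneSecondTotal > 0 ->
    StuckTotal div OneSecondTotal.

%% Split a non-negative number of seconds into {Hours, Minutes, Seconds}.
to_hms(Seconds) when Seconds >= 0 ->
    {Seconds div 3600, (Seconds rem 3600) div 60, Seconds rem 60}.
```

With the figures from the ngrep capture,
wall_time_gap:gap_seconds(12460747343114, 1002022817) gives 12435, and
wall_time_gap:to_hms(12435) gives {3,27,15}, so about 3h27 without a report.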