[erlang-questions] UDP sockets and gen_server are hypocritical and it makes me mad
Erik Søe Sørensen
Wed Jul 18 14:50:09 CEST 2012
A shot from the hip: are these servers by any chance running on a
I've on two different occasions had problems with a bug in VMWare which
causes the high-resolution timer to return bad values, very very rarely but
also very confusing for timers.
Den 18/07/2012 14.26 skrev "Fred Hebert" <>:
> I'm having the weirdest of issues right now. If you bear with me for the
> read, you'll know what I mean.
> I recently pushed an experimental version of vmstats online on our
> production servers, and it mostly works fine.
> The code in question is a gen_server sending itself messages here:
> And it sends through UDP sockets with the help of
> Now I said it 'mostly' works fine. The problem is that after a few hours,
> two of the servers' statistics related to vmstats go dead. By that, I mean
> that in our reports, all I see is a flat line (see
> http://i.imgur.com/LIJ1J.png, where h019 stops sending data near 19:16)
> There is no error log, no exception, everything else runs fine. ngrep
> shows nothing coming from vmstats on UDP, but other parts of the server
> stack keep sending data through the UDP port fine. This is where I would
> suspect things to be related to the server being locked up, but, here's the
> weird part.
> As soon as I connect to the virtual machine (^G -> r ''
> <Enter>) to inspect the vmstats_server, data starts pouring in again (this
> is what happens on the previous image at 19:28). Any mere attempt to
> connect to the node to understand what the problem is causes it to be
> I had the bug happen to me again a bit past 4 this morning. When I unstuck
> it around 8, I got the following data on the next few loops of the server
> (caught with "ngrep 'vmstats.scheduler_wall_time.**1\.' udp"):
> ##### 1
> vmstats.scheduler_wall_time.1.**active: 189374595
> vmstats.scheduler_wall_time.1.**total: 1002022817
> ##### 2
> ##### 3
> The interesting thing with the numbers above is that the server loops
> every second or so. The numbers are coming from a subtractions in
> scheduler_wall_time values as of R15B01 between slices of 1 second for each
> loop. Thus, the values '1002022817' and '1004246720' are averaging the
> equivalent of one second (the Erlang VM says the unit is undefined).
> If I make a ratio of them with '12460747343114' (the second value), I get
> 12435 seconds, or roughly 3h45, equivalent to the time the server sent
> nothing. I sadly have no absolute timestamp.
> This tells me that when things get stuck, the gen_server itself stops
> polling (or at least pushing data), up until the point I connect to the
> virtual machine, and it gets back to work. This wouldn't be related to UDP.
> Hypocritical thing. It takes breaks until I notice it then goes back to
> I'm running R15B01, only 2 out of 6 servers we deployed it to show this
> issue (so far), it happens a couple of times a day in unpredictable manners.
> Can anyone spot an obvious bug in the gen_server? Can this be some weird
> scheduling behavior? Or maybe a bug with erlang:statistics(scheduler_**wall_time)?
> Anyone else ever had similar issues?
> (should I have sent this to erlang-bugs?)
> Thanks for your help,
> erlang-questions mailing list
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the erlang-questions