[erlang-questions] UDP sockets and gen_server are hypocritical and it makes me mad
Wed Jul 18 14:26:28 CEST 2012
I'm having the weirdest of issues right now. If you bear with me for the
read, you'll know what I mean.
I recently pushed an experimental version of vmstats online on our
production servers, and it mostly works fine.
The code in question is a gen_server sending itself messages here:
And it sends through UDP sockets with the help of
Now I said it 'mostly' works fine. The problem is that after a few
hours, two of the servers' statistics related to vmstats go dead. By
that, I mean that in our reports, all I see is a flat line (see
http://i.imgur.com/LIJ1J.png, where h019 stops sending data near 19:16).
There is no error log, no exception, everything else runs fine. ngrep
shows nothing coming from vmstats on UDP, but other parts of the server
stack keep sending data through the UDP port fine. This is where I would
suspect things to be related to the server being locked up, but, here's
the weird part.
As soon as I connect to the virtual machine (^G -> r ''
<Enter>) to inspect the vmstats_server, data starts pouring in again
(this is what happens on the previous image at 19:28). Any mere attempt
to connect to the node to understand what the problem is causes it to be
I had the bug happen to me again a bit past 4 this morning. When I
unstuck it around 8, I got the following data on the next few loops of
the server (caught with "ngrep 'vmstats.scheduler_wall_time.1\.' udp"):
The interesting thing with the numbers above is that the server loops
every second or so. The numbers come from subtracting successive
scheduler_wall_time values (available as of R15B01), taken 1 second apart for
each loop. Thus, the values '1002022817' and '1004246720' are averaging
the equivalent of one second (the Erlang VM says the unit is undefined).
If I make a ratio of them with '12460747343114' (the second value), I
get 12435 seconds, or roughly 3h27, equivalent to the time the server
sent nothing. I sadly have no absolute timestamp.
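The ratio above can be double-checked with a few lines of Python. This is only a sanity-check sketch, not part of vmstats: the function and variable names are made up for illustration, and it assumes (as the reasoning above does) that each of the two smaller values corresponds to roughly one second of wall time.

```python
# Values copied from the ngrep capture discussed above.
ONE_SECOND_DELTAS = [1002022817, 1004246720]  # deltas seen on ordinary ~1 s loops
STUCK_DELTA = 12460747343114                  # delta on the first loop after the stall


def gap_seconds(stuck_delta, one_second_delta):
    """Estimate the stall length, assuming one_second_delta spans about 1 second."""
    return stuck_delta / one_second_delta


def as_hms(seconds):
    """Split a number of seconds into whole (hours, minutes, seconds)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return hours, minutes, secs


# The unit of scheduler_wall_time is undefined, so the result is only as
# good as the "one delta is about one second" assumption.
for delta in ONE_SECOND_DELTAS:
    gap = gap_seconds(STUCK_DELTA, delta)
    print("using %d as one second: %d s (%dh%02dm%02ds)"
          % ((delta, int(gap)) + as_hms(gap)))
```

Running it prints one estimate per reference delta, which gives a feel for how much the result depends on which "one second" value is used as the divisor.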
This tells me that when things get stuck, the gen_server itself stops
polling (or at least pushing data), up until the point I connect to the
virtual machine, and it gets back to work. This wouldn't be related to
UDP. Hypocritical thing. It takes breaks until I notice it then goes
back to work.
I'm running R15B01. Only 2 of the 6 servers we deployed it to show this
issue (so far), and it happens a couple of times a day, unpredictably.
Can anyone spot an obvious bug in the gen_server? Can this be some weird
scheduling behavior? Or maybe a bug with
erlang:statistics(scheduler_wall_time)? Anyone else ever had similar
(should I have sent this to erlang-bugs?)
Thanks for your help,