[erlang-questions] UDP sockets and gen_server are hypocritical and it makes me mad
Fred Hebert
mononcqc@REDACTED
Wed Jul 18 14:26:28 CEST 2012
I'm having the weirdest of issues right now. If you bear with me for the
read, you'll know what I mean.
I recently pushed an experimental version of vmstats online on our
production servers, and it mostly works fine.
The code in question is a gen_server sending itself messages here:
https://github.com/ferd/vmstats/blob/master/src/vmstats_server.erl
And it sends through UDP sockets with the help of
https://github.com/lpgauth/statsderl/blob/master/src/statsderl.erl
Now I said it 'mostly' works fine. The problem is that after a few
hours, two of the servers' statistics related to vmstats go dead. By
that, I mean that in our reports, all I see is a flat line (see
http://i.imgur.com/LIJ1J.png, where h019 stops sending data near 19:16)
There is no error log, no exception, everything else runs fine. ngrep
shows nothing coming from vmstats on UDP, but other parts of the server
stack keep sending data through the UDP port fine. This is where I would
suspect things to be related to the server being locked up, but, here's
the weird part.
As soon as I connect to the virtual machine (^G -> r 'nodename@REDACTED'
<Enter>) to inspect the vmstats_server, data starts pouring in again
(this is what happens on the previous image at 19:28). Any mere attempt
to connect to the node to understand what the problem is causes it to be
resolved.
I had the bug happen to me again a bit past 4 this morning. When I
unstuck it around 8, I got the following data on the next few loops of
the server (caught with "ngrep 'vmstats.scheduler_wall_time.1\.' udp"):
##### 1
vmstats.scheduler_wall_time.1.active: 189374595
vmstats.scheduler_wall_time.1.total: 1002022817
##### 2
vmstats.scheduler_wall_time.1.active:2293308912394
vmstats.scheduler_wall_time.1.total:12460747343114
##### 3
vmstats.scheduler_wall_time.1.active:186326615
vmstats.scheduler_wall_time.1.total:1004246720
The interesting thing with the numbers above is that the server loops
every second or so. The numbers are coming from a subtractions in
scheduler_wall_time values as of R15B01 between slices of 1 second for
each loop. Thus, the values '1002022817' and '1004246720' are averaging
the equivalent of one second (the Erlang VM says the unit is undefined).
If I make a ratio of them with '12460747343114' (the second value), I
get 12435 seconds, or roughly 3h45, equivalent to the time the server
sent nothing. I sadly have no absolute timestamp.
This tells me that when things get stuck, the gen_server itself stops
polling (or at least pushing data), up until the point I connect to the
virtual machine, and it gets back to work. This wouldn't be related to
UDP. Hypocritical thing. It takes breaks until I notice it then goes
back to work.
I'm running R15B01, only 2 out of 6 servers we deployed it to show this
issue (so far), it happens a couple of times a day in unpredictable manners.
Can anyone spot an obvious bug in the gen_server? Can this be some weird
scheduling behavior? Or maybe a bug with
erlang:statistics(scheduler_wall_time)? Anyone else ever had similar
issues?
(should I have sent this to erlang-bugs?)
Thanks for your help,
Fred.
More information about the erlang-questions
mailing list