[erlang-questions] UDP sockets and gen_server are hypocritical and it makes me mad

Wed Jul 18 14:26:28 CEST 2012

I'm having the weirdest of issues right now. If you bear with me for the 
read, you'll know what I mean.

I recently pushed an experimental version of vmstats online on our 
production servers, and it mostly works fine.

The code in question is a gen_server sending itself messages here:
https://github.com/ferd/vmstats/blob/master/src/vmstats_server.erl

And it sends through UDP sockets with the help of 
https://github.com/lpgauth/statsderl/blob/master/src/statsderl.erl

Now I said it 'mostly' works fine. The problem is that after a few 
hours, two of the servers' statistics related to vmstats go dead. By 
that, I mean that in our reports, all I see is a flat line (see 
http://i.imgur.com/LIJ1J.png, where h019 stops sending data near 19:16)

There is no error log, no exception, everything else runs fine. ngrep 
shows nothing coming from vmstats on UDP, but other parts of the server 
stack keep sending data through the UDP port fine. This is where I would 
suspect things to be related to the server being locked up, but, here's 
the weird part.

As soon as I connect to the virtual machine (^G -> r 'nodename@REDACTED' 
<Enter>) to inspect the vmstats_server, data starts pouring in again 
(this is what happens on the previous image at 19:28). Any mere attempt 
to connect to the node to understand what the problem is causes it to be 
resolved.

I had the bug happen to me again a bit past 4 this morning. When I 
unstuck it around 8, I got the following data on the next few loops of 
the server (caught with "ngrep 'vmstats.scheduler_wall_time.1\.' udp"):

##### 1
vmstats.scheduler_wall_time.1.active: 189374595
vmstats.scheduler_wall_time.1.total: 1002022817
##### 2
vmstats.scheduler_wall_time.1.active:2293308912394
vmstats.scheduler_wall_time.1.total:12460747343114
##### 3
vmstats.scheduler_wall_time.1.active:186326615
vmstats.scheduler_wall_time.1.total:1004246720

The interesting thing with the numbers above is that the server loops 
every second or so. The numbers are coming from a subtractions in 
scheduler_wall_time values as of R15B01 between slices of 1 second for 
each loop. Thus, the values '1002022817' and '1004246720' are averaging 
the equivalent of one second (the Erlang VM says the unit is undefined).

If I make a ratio of them with '12460747343114' (the second value), I 
get 12435 seconds, or roughly 3h45, equivalent to the time the server 
sent nothing. I sadly have no absolute timestamp.

This tells me that when things get stuck, the gen_server itself stops 
polling (or at least pushing data), up until the point I connect to the 
virtual machine, and it gets back to work. This wouldn't be related to 
UDP. Hypocritical thing. It takes breaks until I notice it then goes 
back to work.

I'm running R15B01, only 2 out of 6 servers we deployed it to show this 
issue (so far), it happens a couple of times a day in unpredictable manners.

Can anyone spot an obvious bug in the gen_server? Can this be some weird 
scheduling behavior? Or maybe a bug with 
erlang:statistics(scheduler_wall_time)? Anyone else ever had similar 
issues?
(should I have sent this to erlang-bugs?)

Thanks for your help,
Fred.