[erlang-questions] UDP sockets and gen_server are hypocritical and it makes me mad

Wed Jul 18 17:33:55 CEST 2012

Another shot from the hip, but is it possible that these machines have
gone to sleep? This sounds sort of like some issues I saw when someone
puts their laptop to sleep and then wakes it up. Erlang notices and
then adjusts time to match reality for a bit.

On Wed, Jul 18, 2012 at 7:26 AM, Fred Hebert <mononcqc@REDACTED> wrote:
> I'm having the weirdest of issues right now. If you bear with me for the
> read, you'll know what I mean.
>
> I recently pushed an experimental version of vmstats online on our
> production servers, and it mostly works fine.
>
> The code in question is a gen_server sending itself messages here:
> https://github.com/ferd/vmstats/blob/master/src/vmstats_server.erl
>
> And it sends through UDP sockets with the help of
> https://github.com/lpgauth/statsderl/blob/master/src/statsderl.erl
>
> Now I said it 'mostly' works fine. The problem is that after a few hours,
> two of the servers' statistics related to vmstats go dead. By that, I mean
> that in our reports, all I see is a flat line (see
> http://i.imgur.com/LIJ1J.png, where h019 stops sending data near 19:16)
>
> There is no error log, no exception, everything else runs fine. ngrep shows
> nothing coming from vmstats on UDP, but other parts of the server stack keep
> sending data through the UDP port fine. This is where I would suspect things
> to be related to the server being locked up, but, here's the weird part.
>
> As soon as I connect to the virtual machine (^G -> r 'nodename@REDACTED'
> <Enter>) to inspect the vmstats_server, data starts pouring in again (this
> is what happens on the previous image at 19:28). Any mere attempt to connect
> to the node to understand what the problem is causes it to be resolved.
>
> I had the bug happen to me again a bit past 4 this morning. When I unstuck
> it around 8, I got the following data on the next few loops of the server
> (caught with "ngrep 'vmstats.scheduler_wall_time.1\.' udp"):
>
> ##### 1
> vmstats.scheduler_wall_time.1.active: 189374595
> vmstats.scheduler_wall_time.1.total: 1002022817
> ##### 2
> vmstats.scheduler_wall_time.1.active:2293308912394
> vmstats.scheduler_wall_time.1.total:12460747343114
> ##### 3
> vmstats.scheduler_wall_time.1.active:186326615
> vmstats.scheduler_wall_time.1.total:1004246720
>
> The interesting thing with the numbers above is that the server loops every
> second or so. The numbers are coming from a subtractions in
> scheduler_wall_time values as of R15B01 between slices of 1 second for each
> loop. Thus, the values '1002022817' and '1004246720' are averaging the
> equivalent of one second (the Erlang VM says the unit is undefined).
>
> If I make a ratio of them with '12460747343114' (the second value), I get
> 12435 seconds, or roughly 3h45, equivalent to the time the server sent
> nothing. I sadly have no absolute timestamp.
>
> This tells me that when things get stuck, the gen_server itself stops
> polling (or at least pushing data), up until the point I connect to the
> virtual machine, and it gets back to work. This wouldn't be related to UDP.
> Hypocritical thing. It takes breaks until I notice it then goes back to
> work.
>
> I'm running R15B01, only 2 out of 6 servers we deployed it to show this
> issue (so far), it happens a couple of times a day in unpredictable manners.
>
> Can anyone spot an obvious bug in the gen_server? Can this be some weird
> scheduling behavior? Or maybe a bug with
> erlang:statistics(scheduler_wall_time)? Anyone else ever had similar issues?
> (should I have sent this to erlang-bugs?)
>
> Thanks for your help,
> Fred.
>
>
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions