[erlang-questions] UDP sockets and gen_server are hypocritical and it makes me mad

Fred Hebert mononcqc@REDACTED
Wed Jul 18 17:51:50 CEST 2012


Nope. They're running thousands of requests per second, and as far as 
I'm aware we're directly on the metal.

This is likely an issue with the statistics function and a sleeping 
scheduler not knowing it has to wake up and answer back to the caller 
(erlang-bugs has an e-mail on this too).
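
For reference, the call I suspect boils down to something like this (a 
minimal sketch, not vmstats' actual code; note that the flag has to be 
turned on before the statistics call returns anything):

    %% scheduler_wall_time sampling is off by default; enable it first,
    %% otherwise erlang:statistics(scheduler_wall_time) returns 'undefined'.
    erlang:system_flag(scheduler_wall_time, true),
    %% Returns a list of {SchedulerId, ActiveTime, TotalTime} tuples in an
    %% undefined unit; only differences between two samples are meaningful.
    Sample = erlang:statistics(scheduler_wall_time).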

I'm trying to reproduce it in a non-production VM to get a good core 
dump at the moment. The problem is that it can apparently take hours and 
hours before it shows up, if at all.

On 12-07-18 11:33 AM, Paul Davis wrote:
> Another shot from the hip, but is it possible that these machines have
> gone to sleep? This sounds a bit like issues I've seen when someone
> puts their laptop to sleep and then wakes it up: Erlang notices and
> then adjusts time to match reality for a while.
>
> On Wed, Jul 18, 2012 at 7:26 AM, Fred Hebert <mononcqc@REDACTED> wrote:
>> I'm having the weirdest of issues right now. If you bear with me for the
>> read, you'll know what I mean.
>>
>> I recently pushed an experimental version of vmstats online on our
>> production servers, and it mostly works fine.
>>
>> The code in question is a gen_server sending itself messages here:
>> https://github.com/ferd/vmstats/blob/master/src/vmstats_server.erl
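>>
>> For those who don't want to click through, the pattern is roughly the
>> following: a plain gen_server that re-arms a timer and messages itself on
>> every loop (a minimal sketch with a hypothetical module name, not the real
>> code):
>>
>>     -module(tick_server).  %% hypothetical name, for illustration only
>>     -behaviour(gen_server).
>>     -export([start_link/0]).
>>     -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
>>              terminate/2, code_change/3]).
>>
>>     start_link() ->
>>         gen_server:start_link(?MODULE, [], []).
>>
>>     init([]) ->
>>         erlang:send_after(1000, self(), tick),   %% schedule the first tick
>>         {ok, nostate}.
>>
>>     handle_info(tick, State) ->
>>         erlang:send_after(1000, self(), tick),   %% re-arm for the next loop
>>         %% ... poll erlang:statistics/1 here and push the result over UDP ...
>>         {noreply, State}.
>>
>>     handle_call(_Msg, _From, State) -> {reply, ok, State}.
>>     handle_cast(_Msg, State) -> {noreply, State}.
>>     terminate(_Reason, _State) -> ok.
>>     code_change(_OldVsn, State, _Extra) -> {ok, State}.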
>>
>> And it sends through UDP sockets with the help of
>> https://github.com/lpgauth/statsderl/blob/master/src/statsderl.erl
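>>
>> On the wire this is plain statsd-over-UDP; conceptually the send amounts to
>> no more than the following (a sketch with a made-up host, port, and key,
>> not statsderl's actual code):
>>
>>     {ok, Socket} = gen_udp:open(0),              %% any free local port
>>     ok = gen_udp:send(Socket, "localhost", 8125, %% statsd's default port
>>                       "vmstats.some_key:123|g"). %% statsd gauge format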
>>
>> Now I said it 'mostly' works fine. The problem is that after a few hours,
>> the vmstats-related statistics from two of the servers go dead. By that, I
>> mean that in our reports, all I see is a flat line (see
>> http://i.imgur.com/LIJ1J.png, where h019 stops sending data near 19:16).
>>
>> There is no error log, no exception; everything else runs fine. ngrep shows
>> nothing coming from vmstats on UDP, but other parts of the server stack
>> keep sending data through the same UDP port just fine. This is where I
>> would suspect the server itself to be locked up, but here's the weird part.
>>
>> As soon as I connect to the virtual machine (^G -> r 'nodename@REDACTED'
>> <Enter>) to inspect the vmstats_server, data starts pouring in again (this
>> is what happens in the image above at 19:28). The mere attempt to connect
>> to the node to understand the problem causes it to be resolved.
>>
>> I had the bug happen to me again a bit past 4 this morning. When I unstuck
>> it around 8, I got the following data on the next few loops of the server
>> (caught with "ngrep 'vmstats.scheduler_wall_time.1\.' udp"):
>>
>> ##### 1
>> vmstats.scheduler_wall_time.1.active:189374595
>> vmstats.scheduler_wall_time.1.total:1002022817
>> ##### 2
>> vmstats.scheduler_wall_time.1.active:2293308912394
>> vmstats.scheduler_wall_time.1.total:12460747343114
>> ##### 3
>> vmstats.scheduler_wall_time.1.active:186326615
>> vmstats.scheduler_wall_time.1.total:1004246720
>>
>> The interesting thing about the numbers above is that the server loops
>> every second or so. The numbers come from subtracting scheduler_wall_time
>> values (available as of R15B01) taken one second apart on each loop. Thus,
>> the totals '1002022817' and '1004246720' each represent roughly one second
>> (the Erlang VM says the unit is undefined).
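>>
>> In code, the per-loop computation is essentially the pairwise subtraction
>> shown in the erlang:statistics/1 documentation, something like this
>> (sketch):
>>
>>     %% Two samples taken ~1 second apart; sorting lines up the entries
>>     %% for the same scheduler id before subtracting them pairwise.
>>     Old = lists:sort(erlang:statistics(scheduler_wall_time)),
>>     timer:sleep(1000),
>>     New = lists:sort(erlang:statistics(scheduler_wall_time)),
>>     Diffs = [{I, A1 - A0, T1 - T0}
>>              || {{I, A0, T0}, {I, A1, T1}} <- lists:zip(Old, New)].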
>>
>> If I take the ratio of '12460747343114' (the total from the second block)
>> to those one-second totals, I get about 12435 seconds, or roughly 3.5
>> hours, close to the time during which the server sent nothing. I sadly
>> have no absolute timestamp.
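>>
>> (Checking that arithmetic with the captured numbers:
>>
>>     12460747343114 / 1002022817 ≈ 12435.6 seconds
>>     12435.6 / 3600 ≈ 3.45 hours
>> )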
>>
>> This tells me that when things get stuck, the gen_server itself stops
>> polling (or at least stops pushing data) until the point where I connect to
>> the virtual machine, at which point it gets back to work. This wouldn't be
>> related to UDP. Hypocritical thing. It takes breaks until I notice it, then
>> goes back to work.
>>
>> I'm running R15B01. Only 2 out of the 6 servers we deployed it to show this
>> issue (so far), and it happens a couple of times a day, at unpredictable
>> moments.
>>
>> Can anyone spot an obvious bug in the gen_server? Can this be some weird
>> scheduling behavior? Or maybe a bug with
>> erlang:statistics(scheduler_wall_time)? Anyone else ever had similar issues?
>> (should I have sent this to erlang-bugs?)
>>
>> Thanks for your help,
>> Fred.



