[erlang-questions] Investigate an infinite loop on production servers

Thu May 23 04:41:50 CEST 2013

Hi,

Generally, when a module is critical and heavily used, I create a pg2 "pool" of supervised gen_servers that join the group, and I pick a pid with get_closest_pid so that the load is spread over multiple processes.
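
To make that concrete, here is a minimal sketch of one such pool worker. The module name pool_worker and the group name my_critical_pool are made up for this example, the echo reply is a placeholder for the real work, and it assumes pg2 as shipped with R16B (pg2 was removed in later OTP releases):

```erlang
%% Sketch only: pool_worker and my_critical_pool are hypothetical names.
-module(pool_worker).
-behaviour(gen_server).

-export([start_link/0, request/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-define(GROUP, my_critical_pool).

%% Start one unregistered worker; a supervisor would start several.
start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Client side: ask pg2 for one member of the group and call it.
request(Req) ->
    case pg2:get_closest_pid(?GROUP) of
        Pid when is_pid(Pid) -> gen_server:call(Pid, Req);
        {error, Reason}      -> {error, Reason}
    end.

init([]) ->
    ok = pg2:create(?GROUP),        % safe to call if the group exists
    ok = pg2:join(?GROUP, self()),  % pg2 drops the member when it exits
    {ok, nostate}.

%% Placeholder: echo the request back instead of doing real work.
handle_call(Req, _From, State) ->
    {reply, {ok, Req}, State}.

handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```

With N of these under a supervisor, callers only ever use pool_worker:request/1 and never depend on a single registered process.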

Furthermore, the server has 16 GB of RAM, and when it starts to go crazy it is using 1.5 GB of RAM at most. I guess it would have to go crazy for a long time before touching swap, but I don't notice anything until another node in the cluster tells me that the frozen node has timed out.
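
For the message_queue_len check suggested in the quoted reply, something like the following could be run from a remote shell before the node stops responding. This is only a sketch: the module name mq_watch is made up, and it uses nothing beyond erlang:processes/0 and erlang:process_info/2:

```erlang
%% Sketch only: mq_watch is a hypothetical helper module.
-module(mq_watch).
-export([top/1]).

%% Return up to N processes with the longest message queues as
%% {QueueLen, Pid, RegisteredName}, longest queue first.
%% RegisteredName is [] for processes without a registered name.
top(N) ->
    Rows = [row(P) || P <- erlang:processes()],
    Alive = [R || R <- Rows, R =/= dead],
    lists:sublist(lists:reverse(lists:sort(Alive)), N).

%% A process may exit between processes/0 and process_info/2,
%% in which case process_info/2 returns undefined.
row(Pid) ->
    case erlang:process_info(Pid, [message_queue_len, registered_name]) of
        undefined ->
            dead;
        [{message_queue_len, Len}, {registered_name, Name}] ->
            {Len, Pid, Name}
    end.
```

Calling mq_watch:top(5) every few seconds and logging the result should show whether one mailbox keeps growing in the minutes before the freeze.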

However, while we are at it… I may have found something really weird while looking at my crash dump.
I'm using the emysql application.
My initialization of the emysql application is pretty basic:

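application:start(emysql),
emysql:add_pool(my_db,             % pool id
                30,                % pool size (number of connections)
                "login",           % user
                "password",        % password
                "my.db-host.com",  % host
                3306,              % port
                "table",           % database name
                latin1)            % encoding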

As you can see, I only ask for 30 connections in the pool. However, here is what I found in the fun table of the crash dump:

Module       Uniq      Index  Address             Native_address  Refc

emysql_util  8432855   1      0x00007f1d4f9f6f00                  3476
emysql_util  8432855   0      0x00007f1d4f9f7218                  3476
emysql_util  8432855   3      0x00007f1d4f9f6e48                  2
emysql_util  8432855   2      0x00007f1d4f9f6ea8                  1
emysql       79898780  0      0x00007f1d4f9b56f8                  841

Is that normal with only 30 connections in one pool?

Thank you all.

On 23 May 2013 at 04:21, Bob Ippolito <bob@REDACTED> wrote:

> This kind of thing tends to happen when you continuously send messages to a process faster than it can handle them. The most common case where I've seen this is when you have a lot of processes communicating with a single gen_server process. If your server has swap enabled, this may appear to make the node "freeze completely but not crash".
> 
> In the past I've diagnosed this by monitoring the message_queue_len of registered processes, but I'm sure there are tools that can help do this for you.
> 
> 
> On Wed, May 22, 2013 at 7:00 PM, Morgan Segalis <msegalis@REDACTED> wrote:
> Hello everyone,
> 
> I'm having a bit of an issue with my production servers.
> 
> At some point, the node seems to enter an infinite loop that I can't find or reproduce myself on the test servers.
> 
> The bug appears completely at random, 1 hour or 10 hours after restarting the Erlang node.
> The loop eats up all my server's memory in no time and, most of the time, completely freezes the Erlang node without crashing it.
> 
> Once I got a crash dump and tried to investigate it with cdv, but I didn't get much information about which process or module was eating up all the memory.
> All I know about the crash comes from the crash message: "eheap_alloc: Cannot allocate 6801972448 bytes of memory (of type "heap")."
> 
> I'm surely too new to Erlang to investigate something like this with cdv; I would really like some pointers on how to understand this problem and fix it ASAP.
> 
> If you need any information from the crash dump, let me know what you need and I'll copy/paste…
> 
> I'm using Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:true]
> 
> Thank you all for your help!
> 
> 
