[erlang-questions] Investigate an infinite loop on production servers

Dmitry Kolesnikov dmkolesnikov@REDACTED
Fri May 24 11:06:03 CEST 2013


Indeed, the list of apps is short, so the culprit is basically either ssl, emysql, or your integration layer.

If you read a lot of data from MySQL, that code might leak memory by keeping references to huge binaries.
E.g. if you do select * from xxx, the data is returned as a set of binaries.
The whole binary hangs in memory even if only the first row is used. This is one way to treat binaries received from somewhere:

case binary:referenced_byte_size(X) of
   Large when Large > 2 * byte_size(X) -> 
      binary:copy(X);
   _ ->
      X
end
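
To make the idea concrete, here is a minimal sketch of applying the same check to a whole result set. The module name bin_compact and the row shape (a list of rows, each a list of fields) are my assumptions for illustration, not emysql's actual result format:

```erlang
-module(bin_compact).
-export([compact/1, compact_rows/1]).

%% Copy X only when it keeps alive a parent binary that is more than
%% twice its own size; otherwise return it unchanged.
compact(X) when is_binary(X) ->
    case binary:referenced_byte_size(X) of
        Large when Large > 2 * byte_size(X) ->
            binary:copy(X);
        _ ->
            X
    end;
compact(Other) ->
    %% Non-binary fields (integers, atoms, ...) pass through untouched.
    Other.

%% Rows is assumed to be a list of rows, each row a list of fields.
compact_rows(Rows) ->
    [[compact(Field) || Field <- Row] || Row <- Rows].
```

Calling bin_compact:compact_rows(Rows) before storing the rows in process state should let the large parent binary be garbage collected once nothing else refers to it.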

As far as I can see, emysql does not use binary:copy anywhere, but of course they might dereference the binaries in some other way.
If you could monitor erlang:memory(binary) over time, it might reveal whether this is the case.
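
A minimal sketch of such monitoring, assuming you can load a small helper module on the node. The module name binmem_watch, the log path and the interval are hypothetical. It appends one sample of erlang:memory(binary) per interval to a file, so the history survives even if the console becomes unresponsive:

```erlang
-module(binmem_watch).
-export([start/2, sample/0]).

%% One reading: {UniversalTime, BytesAllocatedForBinaries}.
sample() ->
    {calendar:universal_time(), erlang:memory(binary)}.

%% Spawn a process that appends a sample to Path every IntervalMs ms.
%% Send it the atom stop to end the loop.
start(Path, IntervalMs) when is_integer(IntervalMs), IntervalMs > 0 ->
    spawn(fun() -> loop(Path, IntervalMs) end).

loop(Path, IntervalMs) ->
    {Time, Bytes} = sample(),
    Line = io_lib:format("~p ~p~n", [Time, Bytes]),
    %% Result ignored on purpose: a failed write should not kill the sampler.
    _ = file:write_file(Path, Line, [append]),
    receive
        stop -> ok
    after IntervalMs ->
        loop(Path, IntervalMs)
    end.
```

For example, Pid = binmem_watch:start("/tmp/binmem.log", 5000) takes a sample every five seconds, and Pid ! stop ends it. A steadily growing byte count in the log would point towards binaries being kept alive.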

BTW, is this specific to R16, or does it happen on other releases too? If it is R16 only, it might be some glitch in ssl.

- Dmitry

On May 24, 2013, at 11:34 AM, Morgan Segalis <msegalis@REDACTED> wrote:

> Thank you, I'll look into it.
> 
> Here's what application:which_applications() gives me:
> 
> [{emysql,"Emysql - Erlang MySQL driver","0.2"},
>  {ssl,"Erlang/OTP SSL application","5.2.1"},
>  {public_key,"Public key infrastructure","0.18"},
>  {crypto,"CRYPTO version 2","2.3"},
>  {stdlib,"ERTS  CXC 138 10","1.19.1"},
>  {kernel,"ERTS  CXC 138 10","2.16.1"}]
> 
> Nothing fancy, as you can see...
> 
> On May 24, 2013, at 10:31, Dmitry Kolesnikov <dmkolesnikov@REDACTED> wrote:
> 
>> Hello,
>> 
>> I am not aware of a single flag to limit the memory like in Java.
>> You can try to configure memory allocation 
>> http://www.erlang.org/doc/man/erts_alloc.html 
>> 
>> One reason for the freeze might be a huge crash dump being written.
>> See the flags at the bottom of this page for how to tune its behaviour:
>> http://www.erlang.org/doc/man/erl.html
>> 
>> If you switch off swap, it is easier to observe an OOM.
>> 
>> Could you share with the list which applications you are running?
>> application:which_applications()
>> 
>> 
>> - Dmitry
>> 
>> 
>> On May 24, 2013, at 11:13 AM, Morgan Segalis <msegalis@REDACTED> wrote:
>> 
>>> The problem is that the VM freezes completely; it does not generate a crash dump.
>>> 
>>> Is there a way to limit the memory that a VM can allocate, so that the server is not overwhelmed and can still create a crash dump?
>>> 
>>> On May 24, 2013, at 02:00, Vance Shipley <vances@REDACTED> wrote:
>>> 
>>>> Have you analyzed the crash dump file with the crash dump viewer?
>>>> On May 24, 2013 3:00 AM, "Morgan Segalis" <msegalis@REDACTED> wrote:
>>>> Yeah, you got that right! It is leaking at a huge rate at some point!
>>>> 
>>>> - The number of FDs: I don't get close to the max
>>>> # cat /proc/sys/fs/file-nr
>>>> 3264    0       6455368
>>>> 
>>>> - On the production server there is only the erlang node, no other service…
>>>> The beam.smp was through the roof at 300% CPU and 97% RAM.
>>>> The weird thing is that it got there in a second; I was looking at it when it happened.
>>>> 
>>>> - It has happened with 2000 connections, 4000 connections, and 10000 connections… 5 min after start, 5 hours after start.
>>>> 
>>>> I really can't find a pattern here…and I'm becoming a little desperate.
>>>> 
>>>> Thank you for your help again.
>>>> 
>>>> Morgan.
>>>> 
>>>> On May 23, 2013, at 23:20, Dmitry Kolesnikov <dmkolesnikov@REDACTED> wrote:
>>>> 
>>>>> Your system is definitely leaking some resources :-/
>>>>>  - Check the number of used FDs; maybe you exceeded the limit there
>>>>>  - What was the overall system memory / CPU utilisation before the crash?
>>>>>  - Check how many connections you had before the crash; maybe you can reproduce it in dev
>>>>> 
>>>>> - Dmitry
>>>>> 
>>>>> On May 24, 2013, at 12:13 AM, Morgan Segalis <msegalis@REDACTED> wrote:
>>>>> 
>>>>>> Ok, it finally got into the infinite loop…
>>>>>> 
>>>>>> And of course, the node on which I was running etop could not give me anything more, since it got disconnected from the production node.
>>>>>> 
>>>>>> So back to square one… no way to investigate correctly so far :-/
>>>>>> 
>>>>>> Morgan.
>>>>>> 
>>>>>> On May 23, 2013, at 16:34, Morgan Segalis <msegalis@REDACTED> wrote:
>>>>>> 
>>>>>>> Yeah, that's what I'm doing right now, but of course, when I'm monitoring it, it won't crash, only when I sleep!!
>>>>>>> 
>>>>>>> I'll get back to the Erlang list as soon as I have more information about this.
>>>>>>> 
>>>>>>> Thank you all !
>>>>>>> 
>>>>>>> Morgan.
>>>>>>> 
>>>>>>> On May 23, 2013, at 16:30, Vance Shipley <vances@REDACTED> wrote:
>>>>>>> 
>>>>>>>> Keep etop running and capture the output to a file (e.g. etop ... | tee stop.log). After it gets into trouble look back and see what was happening beforehand.
>>>>>>>> On May 23, 2013 6:16 PM, "Morgan Segalis" <msegalis@REDACTED> wrote:
>>>>>>>> So I should go back to R15B ?
>>>>>>>> 
>>>>>>>> erlang:memory() gives me 
>>>>>>>> 
>>>>>>>> [{total,1525779584},
>>>>>>>>  {processes,1272881427},
>>>>>>>>  {processes_used,1272789743},
>>>>>>>>  {system,252898157},
>>>>>>>>  {atom,372217},
>>>>>>>>  {atom_used,346096},
>>>>>>>>  {binary,148093608},
>>>>>>>>  {code,8274446},
>>>>>>>>  {ets,1546832}]
>>>>>>>> 
>>>>>>>> 
>>>>>>>> But keep in mind that there is no infinite loop or memory issue at this exact time…
>>>>>>>> It would be more interesting to have this when the VM is asking for 14 GB of memory, but when it does, the console is unresponsive, so I can't get anything then.
>>>>>>>> 
>>>>>>>> On May 23, 2013, at 14:39, Dmitry Kolesnikov <dmkolesnikov@REDACTED> wrote:
>>>>>>>> 
>>>>>>>>> Right, you do not have many processes. At the same time, you are running out of memory…
>>>>>>>>> 
>>>>>>>>> Unfortunately, I have had no time to play around with R16B in production…
>>>>>>>>> Could it be some issue with SSL? I recall there were some complaints on the list.
>>>>>>>>> 
>>>>>>>>> I would use entop to spot the process that has too many reductions, too long a message queue, or too large a heap.
>>>>>>>>> Once you know its pid, you can dig out more info about it using erlang:process_info(…) and/or sys:get_status(…).
>>>>>>>>> 
>>>>>>>>> BTW, what does erlang:memory() say on your production node?
>>>>>>>>> 
>>>>>>>>> - Dmitry
>>>>>>>>> 
>>>>>>>>> On May 23, 2013, at 3:25 PM, Morgan Segalis <msegalis@REDACTED> wrote:
>>>>>>>>> 
>>>>>>>>>> No, I was talking about the function I made to investigate which processes I have created, which gives me this output:
>>>>>>>>>> 
>>>>>>>>>> Dict: {dict,16,16,16,8,80,48,
>>>>>>>>>>            {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
>>>>>>>>>>            {{[[{{connector_serv,init,1},[connector_suprc42,connector,<0.42.0>]}|548]],
>>>>>>>>>>              [],
>>>>>>>>>>              [[{{supervisor,connector_sup,1},[connector,<0.42.0>]}|3],
>>>>>>>>>>               [{{connector_serv,init,1},[connector_supssl,connector,<0.42.0>]}|1460],
>>>>>>>>>>               [{{supervisor,casserl_sup,1},[connector,<0.42.0>]}|1],
>>>>>>>>>>               [{{supervisor,pushiphone_sup,1},[connector,<0.42.0>]}|2],
>>>>>>>>>>               [{{pushiphone,init,1},['pushiphone-lite',connector,<0.42.0>]}|3],
>>>>>>>>>>               [{{supervisor,clientpool_sup,1},[connector,<0.42.0>]}|1]],
>>>>>>>>>>              [],
>>>>>>>>>>              [[{{clientpool,init,1},[clientpool_sup,connector,<0.42.0>]}|1],
>>>>>>>>>>               [undefined|4]],
>>>>>>>>>>              [],
>>>>>>>>>>              [[{{supervisor,connector,1},[<0.42.0>]}|1],
>>>>>>>>>>               [{{casserl_serv,init,1},[casserl_sup,connector,<0.42.0>]}|50]],
>>>>>>>>>>              [],[],[],
>>>>>>>>>>              [[{{connector_serv,init,1},[connector_suprc4,connector,<0.42.0>]}|472],
>>>>>>>>>>               [{{ssl_connection,init,1},
>>>>>>>>>>                 [ssl_connection_sup,ssl_sup,<0.51.0>]}|
>>>>>>>>>>                1366],
>>>>>>>>>>               [{unknown,unknown}|3]],
>>>>>>>>>>              [],[],
>>>>>>>>>>              [[{{pushiphone,init,1},['pushiphone-full',connector,<0.42.0>]}|3]],
>>>>>>>>>>              [],
>>>>>>>>>>              [[{{pg2,init,1},[kernel_safe_sup,kernel_sup,<0.10.0>]}|1]]}}}
>>>>>>>>>> ok
>>>>>>>>>> 
>>>>>>>>>> I'm very satisfied with supervisor, and I don't think I have the expertise to tweak it...
>>>>>>>>>> 
>>>>>>>>>> On May 23, 2013, at 14:19, Dmitry Kolesnikov <dmkolesnikov@REDACTED> wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On May 23, 2013, at 1:04 PM, Morgan Segalis <msegalis@REDACTED> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I made a little function a while back that gets all processes and removes the ones started at the beginning…
>>>>>>>>>>> 
>>>>>>>>>>> Could you please elaborate on that? Why are you not satisfied with supervisor?
>>>>>>>>>>> 
>>>>>>>>>>> - Dmitry 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> erlang-questions mailing list
>>>>>>>> erlang-questions@REDACTED
>>>>>>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
