[erlang-questions] erlang woes

Thu Aug 5 19:07:40 CEST 2010

Hello Joe..

Thank you for the detailed response. The thing is, the example I gave was
just one specific instance. Let me give you another one.

We use RabbitMQ, and we maintain a couple of persistent connections to the
server. We also have many consumers for messages that might appear in
RabbitMQ queues. All these connection processes (max 10) are started
by a supervisor which has a 'one_for_all' policy (this is so that a socket
error ensures that all connections to the host are torn down and recreated).
Every once in a while there is a connection issue and all connections are
recreated, at which point we notice that sometimes the node just crashes
with the same "unable to allocate memory" error. The funny thing is that up
until that point, the memory usage of the node doesn't exceed 10-20% of
total RAM. So why should this bomb?
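
For reference, a setup like the one described above might look roughly like
the following. This is only a minimal sketch, not the actual production
code: the module name conn_sup_sketch, the start_conn/1 function and the
dummy connection loop are hypothetical stand-ins, and the restart intensity
(3 restarts in 10 seconds) is an assumed value, not one taken from the real
system.

```erlang
-module(conn_sup_sketch).
-behaviour(supervisor).

-export([start_link/0, init/1, start_conn/1]).

%% Start the supervisor and register it locally under the module name.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_all: if any child dies, all children are terminated and
    %% then restarted. At most 3 restarts are allowed within 10 seconds
    %% (assumed values); beyond that the supervisor itself gives up.
    RestartStrategy = {one_for_all, 3, 10},
    Children = [{{conn, N},
                 {?MODULE, start_conn, [N]},
                 permanent, 5000, worker, [?MODULE]}
                || N <- lists:seq(1, 10)],
    {ok, {RestartStrategy, Children}}.

%% Hypothetical stand-in for a real connection process. A real system
%% would start a gen_server that owns the socket instead.
start_conn(N) ->
    Pid = spawn_link(fun() -> conn_loop(N) end),
    {ok, Pid}.

conn_loop(N) ->
    receive
        _Msg -> conn_loop(N)
    end.
```

Killing any one child, for example with exit(Pid, kill), causes the
supervisor to terminate and restart all ten.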

Basically, what I am trying to point out is that almost every time the node
has crashed, what I see in the SASL logs prior to the crash is that some
process has died, bringing down a lot of other processes, and the
supervisor was in the middle of restarting them.

regards
-Arun

On Thu, Aug 5, 2010 at 3:27 AM, Joe Armstrong <erlang@REDACTED> wrote:

> On Thu, Aug 5, 2010 at 7:37 AM, Arun Suresh <arun.suresh@REDACTED> wrote:
> > Hello folks..
> >
> > I've been using Erlang for a while now, and I've got a production system
> > up and running from scratch, but there are some really annoying aspects
> > of the platform. The most fundamental is that when a node crashes it is
> > very hard to figure out exactly why. Almost ALL the time what I see in
> > the crash dump is something akin to:
> >
> > =erl_crash_dump:0.1
> > Wed Aug  4 21:50:01 2010
> > Slogan: eheap_alloc: Cannot allocate 1140328500 bytes of memory (of type "heap").
> > System version: Erlang R13B04 (erts-5.7.5) [source] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
> > Compiled: Tue May 11 12:37:38 2010
> > Taints:
> >
> >
> > at which point I start to comb the SASL logs, and 9 out of 10 times it
> > is because some critical process has died and the supervisor is busy
> > restarting it. For example, the other day my node crashed, and from the
> > SASL logs I see that the HTTP manager for a profile I had created had
> > crashed like so:
> >
> > =CRASH REPORT==== 4-Aug-2010::21:47:09 ===
> >  crasher:
> >    initial call: httpc_manager:init/1
> >    pid: <0.185.0>
> >    registered_name: httpc_manager_store
> >    exception exit: {{case_clause,
> >                         [{handler_info,#Ref<0.0.17.61372>,<0.17225.36>,
> >                              undefined,<0.15665.36>,initiating}]},
> >                     [{httpc_manager,handle_connect_and_send,5},
> >                      {httpc_manager,handle_info,2},
> >                      {gen_server,handle_msg,5},
> >                      {proc_lib,init_p_do_apply,3}]}
> >      in function  gen_server:terminate/6
> >    ancestors: [httpc_profile_sup,httpc_sup,inets_sup,<0.46.0>]
> >    messages: [{'EXIT',<0.16755.36>,normal},
> >                  {connect_and_send,<0.16752.36>,#Ref<0.0.17.61366>,
> >
> >
> > and subsequent messages were related to the supervisor trying to restart
> > the profile manager, and failing.
> >
> > Now my point is: why did the node have to crash just because the manager
> > has to be restarted? And why does the crash dump always keep telling me
> > I'm out of memory?
> >
> > The problem is, I thought Erlang was built to be fault tolerant. My
> > choice of Erlang had a LOT to do with doing away with having to code
> > defensively: "let it crash" and all that, just make sure you have a
> > supervisor that restarts your process and everything will just work fine.
> > But my experience is that most of the time simple process restarts bring
> > the whole node crashing down.
>
> Erlang was designed to be fault-tolerant. The point is that if you crash
> then "somebody else" has to diagnose the error - "you" can't diagnose the
> error because you are dead. Now if a process crashes, some other process
> can detect and fix the error. If you use OTP behaviors this mechanism is
> hidden from you in the supervisor hierarchy. But if the entire node
> crashes then some other node must fix the error. In raw Erlang (i.e. not
> using the OTP libraries) there is a spawn_link/4 primitive precisely to
> propagate errors over node boundaries.
>
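As an illustration of the spawn_link/4 point above, here is a minimal
sketch of one process watching another that may live on a different node.
The module name watch_sketch and the function watch/4 are hypothetical; the
only primitives used are spawn_link/4 and process_flag/2. If the remote node
goes down or cannot be reached, the exit reason is noconnection.

```erlang
-module(watch_sketch).
-export([watch/4]).

%% Spawn M:F(A...) on Node, linked to the caller, and block until the
%% spawned process exits. Returns {died, Reason}.
%% Note: this waits forever if the spawned process never terminates.
watch(Node, M, F, A) ->
    %% Trap exits so the link delivers a message instead of killing us.
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(Node, M, F, A),
    Result = receive
                 {'EXIT', Pid, Reason} -> {died, Reason}
             end,
    %% Restore the caller's previous trap_exit setting.
    process_flag(trap_exit, Old),
    Result.
```

For example, watch_sketch:watch(node(), erlang, exit, [boom]) runs the
process on the local node and returns {died, boom}.
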
> So to make an entire node fault-tolerant you need more than one node -
> the other node(s) should fix the error.
>
> The problem with errors is that some you can fix at run-time and others
> you can't. Running out of memory is something that you typically can't
> easily fix. If you run out of memory you have to kill something to
> reclaim memory - but what should you kill? This depends upon your
> application. In such circumstances Erlang decides it can't do anything
> sensible and the whole node dies - with hopefully a helpful error message.
>
> When things crash you're supposed to leave enough information behind so
> you can figure out why things went wrong - the default behavior of the
> system is to try and leave some helpful clues as to what went wrong (so
> you don't have to code this yourself). Mostly you don't have to mess with
> this. For example, if you try to open a non-existent file you might
> generate an 'enoent' exception. If your application only opens a single
> file then this information will be enough for you to discover and fix
> your error. But if your application opens many different files you'll
> have to add an explicit exception exit({missingFile, File}) at
> appropriate places in your code.
>
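The exit({missingFile, File}) idea above could be sketched as follows. The
module name open_sketch and the function read/1 are hypothetical, and the
{cannotRead, File, Reason} tuple for other errors is an assumed extension
that is not part of the original suggestion.

```erlang
-module(open_sketch).
-export([read/1]).

%% Read a whole file, exiting with a reason that names the file so that
%% the crash report says which of many files was the problem.
read(File) ->
    case file:read_file(File) of
        {ok, Bin} ->
            Bin;
        {error, enoent} ->
            exit({missingFile, File});
        {error, Reason} ->
            %% Assumed extension: keep the file name for other errors too.
            exit({cannotRead, File, Reason})
    end.
```
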
> No matter how good any exception and error recovery strategy is there
> will always be corner cases where automatic strategies fail, and here
> you're back to manual debugging.
>
> You have a pretty good clue - you've got a memory leak - some process is
> consuming too much memory - so now you have to start using various tools
> to figure out which process has run wild.
>
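One way to start looking for the process that has run wild is to rank all
processes on the node by memory use. The following is a minimal sketch
assuming it is run on the live node before it crashes; the module name
mem_top and the function top/1 are hypothetical, while erlang:processes/0
and erlang:process_info/2 are standard BIFs.

```erlang
-module(mem_top).
-export([top/1]).

%% Return up to N tuples {MemoryBytes, Pid, RegisteredName, MsgQueueLen},
%% largest memory first. RegisteredName is [] for unregistered processes.
top(N) ->
    Items = [memory, registered_name, message_queue_len],
    Infos = [{Mem, Pid, Name, QLen}
             || Pid <- erlang:processes(),
                %% process_info/2 returns undefined for a process that
                %% has just died; the pattern below filters those out.
                [{memory, Mem},
                 {registered_name, Name},
                 {message_queue_len, QLen}]
                    <- [erlang:process_info(Pid, Items)]],
    lists:sublist(lists:reverse(lists:sort(Infos)), N).
```

Calling mem_top:top(5) periodically and logging the result can show whether
one process, or one long message queue, grows before the node goes down.
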
> I suspect this is an uncommon error - if it were common then you'd find
> more tools for detecting rogue processes in the standard libraries -
> their absence indicates that either this is not a common problem, or
> that it is a common problem, but resolving it is easy.
>
> I know that none of what I've said helps you find your specific problem -
> but you have to understand the limitations of the system - a dead man
> cannot determine the reason why they themselves died - somebody else has
> to do this ...
>
> Good luck finding your error ..
>
> Cheers
>
> /Joe
>
>
> >
> > Would deeply appreciate it if someone could tell me if there is something
> > fundamentally wrong with the way I'm doing things, or if anyone has been
> > in my situation and has had some enlightenment.
> >
> > thanks in advance
> > -Arun
> >
>