[erlang-questions] erlang woes

Joe Armstrong erlang@REDACTED
Thu Aug 5 12:27:37 CEST 2010


On Thu, Aug 5, 2010 at 7:37 AM, Arun Suresh <arun.suresh@REDACTED> wrote:
> Hello folks..
>
> Ive been using erlang for a while now.. and ive got a production system up
> and running from scratch.. but there are some really annoying aspects of the
> platform.. the most fundamental of which is the fact that when a node
> crashes it is very hard to figure out exactly why.. Almost ALL the time what
> i see in the crash dump is something akin to :
>
> =erl_crash_dump:0.1
> Wed Aug  4 21:50:01 2010
> Slogan: eheap_alloc: Cannot allocate 1140328500 bytes of memory (of type
> "heap").
> System version: Erlang R13B04 (erts-5.7.5) [source] [smp:2:2] [rq:2]
> [async-threads:0] [hipe] [kernel-poll:false]
> Compiled: Tue May 11 12:37:38 2010
> Taints:
>
>
> at which point I start to comb the sasl logs... and 9 out of 10 times... it
> is because some critical process has died and the supervisor is busy
> restarting it.. for example, the other day.. my node crashed and from the
> sasl logs.. i see that the http manager for a profile I had created had
> crashed like so :
>
> =CRASH REPORT==== 4-Aug-2010::21:47:09 ===
>  crasher:
>    initial call: httpc_manager:init/1
>    pid: <0.185.0>
>    registered_name: httpc_manager_store
>    exception exit: {{case_clause,
>                         [{handler_info,#Ref<0.0.17.61372>,<0.17225.36>,
>                              undefined,<0.15665.36>,initiating}]},
>                     [{httpc_manager,handle_connect_and_send,5},
>                      {httpc_manager,handle_info,2},
>                      {gen_server,handle_msg,5},
>                      {proc_lib,init_p_do_apply,3}]}
>      in function  gen_server:terminate/6
>    ancestors: [httpc_profile_sup,httpc_sup,inets_sup,<0.46.0>]
>    messages: [{'EXIT',<0.16755.36>,normal},
>                  {connect_and_send,<0.16752.36>,#Ref<0.0.17.61366>,
>
>
> and subsequent messages were related to the supervisor trying to restart the
> profile manager... and failing..
>
> Now my point is... why did the node have to crash.. just because the manager
> has to be restarted ?
> and why does the crash.dump always keep telling me im out of memory..
>
> The problem is.. I thought erlang was built to be fault tolerant.. the
> choice of me using erlang had a LOT to do with doing away with the having to
> code defensively.. "let it crash" and all that .. just make sure u have a
> supervisor that restarts ur process and everything will just work fine...
> but my experience is that most of the time.. simple process restarts bring
> the whole node crashing down...

Erlang was designed to be fault-tolerant. The point is that if you crash then
"somebody else" has to diagnose the error - "you" can't diagnose the error
because you are dead. Now if a process crashes, some other process can detect
and fix the error. If you use OTP behaviors this mechanism is hidden from you
in the supervisor hierarchy. But if the entire node crashes then some other
node must fix the error. In raw Erlang (i.e. not using the OTP libraries)
there is a spawn_link/4 primitive precisely to propagate errors over node
boundaries.

So to make an entire node fault-tolerant you need more than one node - the
other node(s) should fix the error.
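
As a rough illustration of the spawn_link/4 idea, here is a minimal sketch
of a watcher that starts a worker on another node and waits for its exit
signal. The module name remote_watch and the helper crash/1 are made up for
this example; a real system would restart or take over the work rather
than just return the reason.

```erlang
-module(remote_watch).
-export([start/4, crash/1]).

%% Spawn M:F(A) on Node, linked to the caller, and block until the
%% worker exits. Trapping exits turns the exit signal into a message,
%% so the watcher survives and can act on the reason.
start(Node, M, F, A) ->
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(Node, M, F, A),
    Result = receive
                 {'EXIT', Pid, Reason} ->
                     %% Reason is noconnection if Node is unreachable
                     %% or goes down while the worker is running.
                     {worker_died, Reason}
             end,
    process_flag(trap_exit, Old),
    Result.

%% Hypothetical worker that dies straight away with the given reason.
crash(Reason) ->
    exit(Reason).
```

Calling remote_watch:start(node(), remote_watch, crash, [boom]) exercises
the same path on a single node; with a second node you would pass its name
instead of node().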

The problem with errors is that some you can fix at run-time and others you
can't. Running out of memory is something that you typically can't easily
fix. If you run out of memory you have to kill something to reclaim memory -
but what should you kill? This depends upon your application. In such
circumstances Erlang decides it can't do anything sensible and the whole node
dies - hopefully with a helpful error message.

When things crash you're supposed to leave enough information behind so you
can figure out why things went wrong - the default behavior of the system is
to try and leave some helpful clues as to what went wrong (so you don't have
to code this yourself). Mostly you don't have to mess with this. For example,
if you try to open a non-existent file you might generate an 'enoent'
exception. If your application only opens a single file then this information
will be enough for you to discover and fix your error. But if your
application opens many different files you'll have to add an explicit
exception exit({missingFile, File}) at appropriate places in your code.
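
A minimal sketch of that last point, assuming a hypothetical module
file_open_example: the wrapper turns a bare 'enoent' into an exit reason
that names the file, so the crash report says which file was missing.

```erlang
-module(file_open_example).
-export([open_or_exit/1]).

%% Open File for reading, or exit with a reason that names the file.
open_or_exit(File) ->
    case file:open(File, [read]) of
        {ok, Fd} ->
            Fd;
        {error, enoent} ->
            %% The bare 'enoent' does not say which file was missing.
            exit({missingFile, File});
        {error, Reason} ->
            %% The tag cannot_open is an illustrative choice.
            exit({cannot_open, File, Reason})
    end.
```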

No matter how good any exception and error recovery strategy is, there will
always be corner cases where automatic strategies fail, and here you're back
to manual debugging.

You have a pretty good clue - you've got a memory leak - some process is
consuming too much memory - so now you have to start using various tools to
figure out which process has run wild.
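
One way to start looking, sketched here with a made-up module name
mem_top, is to ask the runtime how much memory each process is using and
list the largest ones. It only uses erlang:processes/0 and
erlang:process_info/2, and it is a starting point for inspection on a live
node, not a diagnosis.

```erlang
-module(mem_top).
-export([top/1]).

%% Return up to N {Bytes, Pid} pairs, largest memory user first.
top(N) ->
    Sizes = [{Mem, Pid}
             || Pid <- erlang:processes(),
                %% process_info/2 returns undefined for a process that
                %% has died in the meantime; the pattern skips those.
                {memory, Mem} <- [erlang:process_info(Pid, memory)]],
    lists:sublist(lists:reverse(lists:sort(Sizes)), N).
```

Running mem_top:top(5) periodically from a shell attached to the node,
while it is still alive, should show whether one process keeps growing.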

I suspect this is an uncommon error - if it were common then you'd find more
tools for detecting rogue processes in the standard libraries - their absence
indicates that either this is not a common problem, or that it is a common
problem but resolving it is easy.

I know that none of what I've said helps you find your specific problem - but
you have to understand the limitations of the system - a dead man cannot
determine the reason why they themselves died - somebody else has to do
this ...

Good luck finding your error ..

Cheers

/Joe


>
> Would deeply appreciate it someone could tell me if there is something
> fundamentally wrong with the way im doing things.. or if anyones been in my
> situation and have had some enlightenments.
>
> thanks in advance
> -Arun
>

