[erlang-questions] Non-reproducible bug on a live erlang system

Angel J. Alvarez Miguel clist@REDACTED
Thu Jan 14 22:01:50 CET 2010


Well, Erlang is built around "let it crash"; if that sounds harsh to you, just
blame Joe!!

The well-known second part should be "and let it recover" (as Joe said).


Consider your call manager accepting calls where the billing information 
(provided by some other external system) is inaccurate or incomplete.

Well, the users will happily be able to make conference calls without being 
charged for them, and the system is perfectly usable, so "where is the
problem??" :-P

"Why is this still running?" Because unintended behaviour has to be handled 
as exceptions, so that you can recover in a structured manner.
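
To make that concrete, here is a minimal sketch of such an assertion for the
billing case. The module name, the function name and the
{billing, Account, Rate} tuple are all made up for illustration; a real call
manager will have its own data shapes.

```erlang
-module(billing_check).
-export([accept_call/1]).

%% Hypothetical entry point of a call manager. The {billing, Account, Rate}
%% tuple is an assumption made for this sketch, not a real system's format.
accept_call(Billing) ->
    %% Rigid match: anything that is not a well-formed billing tuple
    %% raises badmatch instead of being silently accepted.
    {billing, Account, Rate} = Billing,
    %% Assert the fields themselves; a false result raises {badmatch, false}.
    true = (is_binary(Account) andalso byte_size(Account) > 0),
    true = (is_number(Rate) andalso Rate > 0),
    {ok, Account}.
```

With this in place, a call with incomplete billing data fails loudly with a
badmatch at the exact point where the assumption was broken, instead of
going through uncharged.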

Before Erlang, you had to program defensively, trying to catch faulty 
conditions before they caught you.

Yesterday I had to restart an OC4J container because of a strange "max process
reached" error from Oracle JDBC that went unhandled through a very deeply
nested class in a Java app.

No developer cared to catch those exceptions... Just blame the DBA for 
not having infinite file descriptors available!!!

After Erlang, you only have to recover appropriately from those situations not 
covered by the intended behaviour, by converting them into proper exceptions
or runtime failures.
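
A rough sketch of the "let it crash, and let it recover" pair, assuming a
hypothetical module crash_recover: the worker matches rigidly on the result
of file:open/2 and dies on anything unexpected, while a second process
monitors it and decides what to do. In a real OTP system a supervisor would
normally play that second role; the {recovered, Reason} return value is just
a placeholder for whatever recovery policy you want.

```erlang
-module(crash_recover).
-export([run/1]).

%% Run the risky work in its own process and watch it from here.
run(Path) ->
    {Pid, Ref} = spawn_monitor(fun() ->
        %% Rigid match: {error, _} from file:open/2 raises badmatch
        %% and kills only this worker process.
        {ok, Fd} = file:open(Path, [read]),
        ok = file:close(Fd)
    end),
    receive
        {'DOWN', Ref, process, Pid, normal} ->
            ok;
        {'DOWN', Ref, process, Pid, Reason} ->
            %% Placeholder recovery policy (retry, raise an alarm, ...).
            {recovered, Reason}
    end.
```

The worker contains no error-handling code at all; the decision about what
to do after a failure lives in one place, in the process that watches it.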

Proper error handling becomes a tool just like an IDE, and you have to learn
how to use it properly.


/Angel
On Thursday, 14 January 2010 19:38:55 Kaiduan Xie wrote:
> "I'm with others that you need to add
> more assertions to the code just to let the Erlang runtime trigger the
>  faulty condition."
> 
> Very good point, Angel, just let it crash!
> 
> kaiduan
> 
> 2010/1/14 Angel J. Alvarez Miguel <clist@REDACTED>:
> > On Thursday, 14 January 2010 17:25:47 Kaiduan Xie wrote:
> >> Thanks Jayson and Attila for throwing light on this.
> >>
> >> To be more specific, this is a call processing system: it processes
> >> incoming messages and sends messages out. A customer reports a call
> >> failure, but the system does not generate a crash report; it is a
> >> programming logic error. As I mentioned, this is a non-reproducible
> >> issue, or at least one that is hard to reproduce.
> >>
> >> 1. If this only happens to a particular user, then Erlang's built-in
> >> tracing can help with this.
> >>
> >> 2. Otherwise, what to do?
> >>
> >> Has anyone encountered this before? How did you solve it?
> >>
> >> Thanks,
> >>
> >> kaiduan
> >>
> >> On Thu, Jan 14, 2010 at 10:35 AM, Attila Rajmund Nohl
> >> <attila.r.nohl@REDACTED> wrote:
> >> > 2010/1/14, Kaiduan Xie <kaiduanx@REDACTED>:
> >> >> Hi, all,
> >> >>
> >> >> Consider the following case: you have a live/busy Erlang system in
> >> >> production which handles thousands of transactions per second and
> >> >> millions of users, and a customer reports a non-reproducible bug. The
> >> >> problem is non-reproducible, or intermittent, or very hard to
> >> >> reproduce both in the live system and in the lab.
> >> >
> >> > Does this bug involve a crash report with a stack trace? You can
> >> > always add some assert-like statements (i.e. if you know that a
> >> > variable must not be bound to the 'undefined' atom at a certain point
> >> > in the code, you can add something like
> >> > 'true = (Variable /= undefined)') where you think something is wrong.
> >>
> >> ________________________________________________________________
> >> erlang-questions mailing list. See http://www.erlang.org/faq.html
> >> erlang-questions (at) erlang.org
> >
> > "A software bug is the common term used to describe an error, flaw,
> > mistake, failure, or fault in a computer program or system that produces
> > an incorrect or unexpected result, or causes it to behave in unintended
> > ways."
> >
> >
> > A misbehaving system can still be (or pretend to be) fully functional in
> > the sense that no exceptions are triggered. I'm with others that you need
> > to add more assertions to the code just to let the Erlang runtime
> > trigger the faulty condition.
> >
> > Holes in the software specifications allow (Type 1?) errors that are
> > difficult to trap. The mere fact that the system still handles millions of
> > users without severe degradation makes clear this is the case.
> >
> > I just remember some discussion on patterns like
> >
> > case file:open(..) of
> >        {ok,Fd} -> ...;
> >        _ -> ...
> > end
> >
> > vs the (I think) more idiomatic (inherited from Prolog?)
> > {ok,Fd} = file:open()...
> >
> > where the former is perhaps more flexible (and prone to misbehaving),
> > while the latter is rigid and safer (and needs a "try ... catch"
> > container to deal with errors in the same process, or another process to
> > watch for errors).
> >
> > /Angel
> >
> 

