[erlang-questions] Non-reproducible bug on a live erlang system

Angel J. Alvarez Miguel clist@REDACTED
Thu Jan 14 17:56:28 CET 2010

On Thursday, 14 January 2010 17:25:47 Kaiduan Xie wrote:
> Thanks Jayson and Attila for throwing light on this.
> To be more specific, this is a call processing system: it processes
> incoming messages and sends messages out. A customer reports call
> failures, but no crash report is generated; it is a programming
> logic error. As I mentioned, this is a non-reproducible, or at least
> hard-to-reproduce, issue.
> 1. If this only happens to a particular user, then erlang built-in
> trace can help on this.
> 2. Otherwise, what to do?
> Has anyone encountered this before? How did you solve it?
> Thanks,
> kaiduan
> On Thu, Jan 14, 2010 at 10:35 AM, Attila Rajmund Nohl
> <attila.r.nohl@REDACTED> wrote:
> > 2010/1/14, Kaiduan Xie <kaiduanx@REDACTED>:
> >> Hi, all,
> >>
> >> Consider the following case, you have a live/busy Erlang system in
> >> production which handles thousands of transactions per second and
> >> millions of users, and customer reported a non-reproducible bug. The
> >> problem is non-reproducible, or intermittent, or very hard to
> >> reproduce in live system and in lab.
> >
> > Does this bug involve a crash report with a stack trace? You can
> > always add some assert-like statements (i.e. if you know that a
> > variable must not be bound to the 'undefined' atom at a certain point
> > in the code, you can add something like 'Variable /= undefined') where
> > you think something is wrong.
> ________________________________________________________________
> erlang-questions mailing list. See http://www.erlang.org/faq.html
> erlang-questions (at) erlang.org

"A software bug is the common term used to describe an error, flaw, mistake, 
failure, or fault in a computer program or system that produces an incorrect 
or unexpected result, or causes it to behave in unintended ways."

A misbehaving system can still be (or pretend to be) fully functional in the 
sense that no exceptions are triggered. I'm with the others: you need to add 
more assertions to the code just to let the Erlang runtime trigger the faulty 
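
As a minimal sketch of such an assertion: the module name, the function
route_call/1 and the 'callee' key below are hypothetical, chosen only for
illustration. Note that a bare comparison such as Variable /= undefined just
evaluates to false and is thrown away; matching the result against true is
one way to make the runtime raise an error instead.

```erlang
-module(assert_sketch).
-export([route_call/1]).

%% Hypothetical call-processing step. Call is assumed to be a property
%% list such as [{callee, "bob"}]; none of these names come from the thread.
route_call(Call) ->
    %% proplists:get_value/2 returns 'undefined' when the key is missing.
    Callee = proplists:get_value(callee, Call),
    %% Assertion: if Callee is 'undefined' the comparison yields false,
    %% the match against true fails, and a badmatch error is raised here
    %% rather than letting the bad value travel further.
    true = (Callee /= undefined),
    {routed, Callee}.
```

With this in place, assert_sketch:route_call([]) fails with
{badmatch,false} at the assertion, while a well-formed call returns
{routed, Callee}.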

Holes in the software specification allow (Type 1?) errors that are difficult 
to trap. The mere fact that the system still handles millions of users without 
severe degradation makes clear this is the case.

I remember some discussion of patterns like

case file:open(..) of
	{ok,Fd} -> ...;
	_ -> ...
end

vs the (I think) more idiomatic (Prolog-inherited?) 
{ok,Fd} = file:open()...

where the former is perhaps more flexible (and prone to misbehaving), while the 
latter is rigid and safer (and needs a "try ... catch" wrapper to deal with 
errors in the same process, or another process to watch for errors).
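
To make the comparison concrete, here is a rough sketch of the two styles
plus the two ways of dealing with the resulting error. The module and
function names are made up for illustration; only file:open/2, file:close/1
and spawn_monitor/1 are real library calls.

```erlang
-module(open_sketch).
-export([lenient/1, strict/1, guarded/1, watched/1]).

%% Style 1: case with a catch-all clause. Never crashes, but every
%% failure is folded into one value that callers may forget to check.
lenient(Path) ->
    case file:open(Path, [read]) of
        {ok, Fd} ->
            ok = file:close(Fd),
            opened;
        _ ->
            not_opened
    end.

%% Style 2: assert success by pattern matching. Anything other than
%% {ok, Fd} raises a badmatch error at this line.
strict(Path) ->
    {ok, Fd} = file:open(Path, [read]),
    ok = file:close(Fd),
    opened.

%% Style 2 with try ... catch in the same process.
guarded(Path) ->
    try
        strict(Path)
    catch
        error:{badmatch, {error, Reason}} -> {error, Reason}
    end.

%% Style 2 with another process watching: the worker may crash,
%% and the caller learns about it through a monitor message.
watched(Path) ->
    {Pid, Ref} = spawn_monitor(fun() -> strict(Path) end),
    receive
        {'DOWN', Ref, process, Pid, normal} -> opened;
        {'DOWN', Ref, process, Pid, Reason} -> {crashed, Reason}
    end.
```

For a path that cannot be opened, lenient/1 quietly returns not_opened,
strict/1 raises badmatch, guarded/1 returns an {error, Reason} tuple, and
watched/1 returns {crashed, Reason} with the crash reason of the worker.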


