[erlang-questions] Latent bugs in Erlang/OTP

Kostis Sagonas <>
Thu Jan 11 17:53:23 CET 2007


Partly due to extending Dialyzer and partly wanting to satisfying my 
scientific curiosity, I've been spending a significant portion of my 
recent time looking for bugs in Erlang code, mainly in OTP.

More importantly, I am trying to get a grasp on the reasons why even the 
most competent Erlang programmers write code that contains latent bugs. 
This, in the hope that this knowledge can be fed back into Dialyzer or 
some other similar tool.  Still, the things I've run across so far are 
interesting -- perhaps even worth of a talk at some forum.


Let me share with you one of the latest ones -- discovered by the 
development version of Dialyzer which, among other things, infers 
integer ranges.

The beginning of lib/orber/src/orber_tb.erl in R11B-2 reads:

%%----------------------------------------------------------------------
-module(orber_tb).
...
-export([wait_for_tables/1, wait_for_tables/2,
	 is_loaded/0, is_loaded/1, is_running/0, is_running/1,
	 info/2, error/2, unique/1]).
%%----------------------------------------------------------------------

Note that two wait_for_tables functions are exported, those with arities 
1 and 2.  These functions are defined as follows:

%%----------------------------------------------------------------------
%% function : wait_for_tables/1
%% Arguments: Tables  - list of mnesia tables
%%            Timeout - integer (no point in allowing infinity)
%%            Attempts - integer > 0 How many times should we try
%% Returns  :
%% Exception:
%% Effect   :
%%----------------------------------------------------------------------
wait_for_tables(Tables) ->
     wait_for_tables(Tables, 30000, -1).

wait_for_tables(Tables, Timeout) ->
     wait_for_tables(Tables, Timeout, -1).

wait_for_tables(Tables, _Timeout, 0) ->
     error("Mnesia failed to load the some or all of the following"
	  "tables:~n~p", [Tables]),
     {error, "The requested Mnesia tables not yet available."};
wait_for_tables(Tables, Timeout, Attempts) ->
     case mnesia:wait_for_tables(Tables, Timeout) of
	ok ->
	    ok;
	{timeout, BadTabList} ->
	    info("Mnesia hasn't loaded the following tables (~p msec):~n~p",
		 [Timeout, BadTabList]),
	    wait_for_tables(BadTabList, Timeout, Attempts-1);
	{error, Reason} ->
	    error("Mnesia failed to load the some or all of the following"
		  "tables:~n~p", [Tables]),
	    {error, Reason}
     end.
%%----------------------------------------------------------------------

Notice the comments above the functions. Obviously, they refer to all 
three functions, not only wait_for_tables/1.  They indicate that the 
programmer intended that the second argument of the wait_for_tables/2 
function (the timeout) is an integer, but note that this intention is 
nowhere reflected in the code (e.g. in the form of an is_integer/1 guard).

The third argument of wait_for_tables/3 is even more interesting. 
(Reasons in increasing order of importance.)

   1. Its "starting" value (-1) is _inconsistent_ with the documentation

   2. This value has _NO effect whatsoever_ to the intention of the
      programmer of counting the number of attempts.
      (Recall that this is a non-exported function, thus a positive value
       of "Attempts" will never be supplied for this function.)

   3. This -1 value in conjunction with the fact that there is no
      guarantee that Timeout is an integer is a serious bug waiting to
      happen.  Imagine the effect that a call

	 wait_for_tables(SomeTables, infinity)

      will have to the system.  It will throw it to an almost endless
      loop which, thanks to the fact that Erlang has bignum arithmetic,
      will first consume all available heap and then crash the node
      running this code.  This is a bug that is very difficult to
      discover by testing.


Now, all this is currently identified by the *development* version of 
Dialyzer as:

     {orber_tb,wait_for_tables,3}:
	Type guard {integer,0} will always fail
	since variable has type neg_integer()!

which, admittedly, does not explain to the programmer the problem in
exactly the same way that I did ;-).  Instead, it simply points out
that the first clause of this function is just dead code.

There was a relatively recent comment to this mailing list by Mats 
Cronqvist:
     <quote>
	running dialyzer on well-tested code will turn up tons of
	errors. alas, that almost always turns out to be dead code.
     </quote>

I guess one point I am implicitly trying to make here is that even dead 
code might not be so harmless as its "dead" status might suggest, and 
worth taking a closer look. Typically, competent programmers do not 
write dead code.


Feedback is welcome,

Kostis



More information about the erlang-questions mailing list