[erlang-questions] Best practices for handling invalid / unexpected messages

Sat May 14 21:18:51 CEST 2011

2011/5/13, Ciprian Dorin Craciun <ciprian.craciun@REDACTED>:
>     Hello all!
>
>     Lately I've started programming a bit more seriously in Erlang,
> and after creating a few `gen_server` or `gen_fsm` components, I've
> started wondering what is the best way to handle unexpected or invalid
> requests / messages. (For example in `handle_call`, or `handle_info`
> or state handlers.) (By invalid requests I mean those that have the
> correct form (i.e. a tuple `{add, A, B}`) but invalid syntax (i.e. `A`
> is an atom); by unexpected messages I mean those that fail to even
> loosely match a valid pattern (i.e. `{some_unrecognized_operation, A,
> B}`.)
>
>     I think of three possibilities:
>     * reply back with an error:
> ~~~~
> handle_call (Request, _Sender, State) -> {reply, {error,
> {invalid_request, Request}}, State}.
> ~~~~
>     * exit with an error;
> ~~~~
> handle_call (Request, _Sender, State) -> {stop, {error,
> {invalid_request, Request}}, State}.
> ~~~~
>     * don't even create a special catch-all case and just let the
> process fail with `function_clause` error;
>
>     Now each of these have advantages or disadvantages:
>     * the first one is "polite" to the caller, letting him retry, but
> could lead to hidden bugs if the caller doesn't check the return term;
>     * the second one I think fits more the Erlang philosophy of
> "let-it-crash", but could lead to state loss (i.e. an ETS table);
>     * the last one I consider to be just rude as it doesn't give any
> information on why it failed;

Let's take a look at an example. The ssh_connection_handler is a
gen_fsm in Erlang/OTP and it receives the usual gen_tcp tuples from
the TCP connection in its handle_info clause. Only those tuples are
handled; if any other message is received, the process dies with a
function_clause error and the SSH connection is closed. This is an
application-internal process, so it is safe to assume that no other
kind of message will be received.

Or not. The ssh application lets the user specify a callback function
to authenticate the user. This function happens to be executed in the
ssh_connection_handler process. Due to pressing deadlines the author
of this callback (i.e. the user of the ssh application) chose to reuse
some already existing framework with code like this:

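Self = self(),
%% spawn/1 with a fun, so that authfun/3 runs in a new process
AuthPid = spawn(fun() -> authfun(User, Password, Self) end),
Result = receive
    auth_passed -> true;
    auth_failed -> false
after 10000 ->
    false
end,
stop_authfun(AuthPid),
Result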

The authfun/3 then sends the authentication result back to this
process. Keen eyes have probably noticed the problem: what happens if
authfun sends back the result just after the 10-second timeout, but
before it is stopped? The message is then no longer consumed by the
receive above, so it stays in the mailbox of the
ssh_connection_handler and is later delivered to its handle_info,
which can't handle the unexpected message. The process crashes, the
SSH connection abruptly goes down (during authentication) and the end
user complains. Even though the supervisors keep the SSH server
running, it's still not ideal.
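
One way to keep such a late reply out of the host process's mailbox is
sketched below. This is only an illustration, not the fix that was
applied: the module name auth_sketch, the function check/4 and the
AuthFun argument are all made up for this example. It assumes the
callback can be restructured so that the check runs in a monitored
worker and the reply carries a unique reference.

```erlang
-module(auth_sketch).
-export([check/4]).

%% Hypothetical wrapper, not part of the ssh application.
%% AuthFun is assumed to be fun(User, Password) -> boolean().
check(AuthFun, User, Password, Timeout) ->
    Self = self(),
    Tag = make_ref(),
    %% Run the check in a monitored worker; the reply is tagged with a
    %% unique reference so it cannot be confused with other messages.
    {Pid, MRef} = spawn_monitor(fun() ->
        Self ! {Tag, AuthFun(User, Password)}
    end),
    receive
        {Tag, Result} ->
            %% Drop the monitor and any 'DOWN' message already queued.
            erlang:demonitor(MRef, [flush]),
            Result =:= true;
        {'DOWN', MRef, process, Pid, _Reason} ->
            %% The worker died without replying.
            false
    after Timeout ->
        exit(Pid, kill),
        %% Wait until the worker is really gone. Anything it sent to
        %% us was delivered before this 'DOWN' message.
        receive {'DOWN', MRef, process, Pid, _} -> ok end,
        %% Throw away a reply that arrived just after the timeout.
        receive {Tag, _Late} -> ok after 0 -> ok end,
        false
    end.
```

With this shape, auth_sketch:check(fun(_U, _P) -> true end, "user",
"secret", 10000) returns true. Since AuthFun runs inside the worker,
existing code that does its own receive for auth_passed / auth_failed
could be wrapped in it unchanged: a late plain-atom reply would then
land in the worker's mailbox, not in the connection handler's.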

Because the authentication was harder to fix, I first changed
ssh_connection_handler:handle_info to just log any unexpected
message (this is what Joe advised in this thread, if I understood
correctly). That satisfied the end user, who doesn't mind one extra
message in the logs. On the other hand, I'm not quite sure this is the
right solution - if the ssh_connection_handler hadn't crashed, we
wouldn't have found this problem at all... (the end users never really
check the logs, and local black-box testing is hardly useful for this
kind of error).
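
A minimal sketch of such a log-and-continue clause, assuming the
gen_fsm handle_info/3 callback shape. This is not the actual
ssh_connection_handler code: the module name and the placeholder
clauses for the expected gen_tcp tuples are made up for illustration.

```erlang
-module(unexpected_msg_sketch).
-export([handle_info/3]).

%% Expected messages from the socket (real handling omitted here).
handle_info({tcp, _Socket, _Data}, StateName, StateData) ->
    {next_state, StateName, StateData};
handle_info({tcp_closed, _Socket}, _StateName, StateData) ->
    {stop, normal, StateData};
%% Catch-all clause: log the message and stay in the current state
%% instead of crashing with function_clause.
handle_info(Unexpected, StateName, StateData) ->
    error_logger:warning_msg("~p: unexpected message in state ~p: ~p~n",
                             [?MODULE, StateName, Unexpected]),
    {next_state, StateName, StateData}.
```

The catch-all has to be the last clause, otherwise it would shadow the
clauses for the expected tuples.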

So I think if you are absolutely, positively sure that you can't
receive unexpected messages (the pid of the process is only known to
internal code and no callbacks are executed in it), you can handle
only the expected messages. If your users do check the logs regularly,
you might be better off logging the unexpected message and keeping the
process alive.