[erlang-bugs] Race condition in cover.erl

Andrew Thompson andrew@REDACTED
Thu Apr 3 02:00:39 CEST 2014


I've been doing some pretty extreme coverage reporting lately in an
effort to help understand the coverage provided by Basho's integration
test suite riak_test.

As part of that work I've seen, several times now, an error like this:

2014-04-02 15:47:23 =ERROR REPORT====
Error in process <0.80.0> on node 'riak_test@REDACTED' with exit value:
{function_clause,[{cover,'-sync_compiled/2-lc$^0/1-0-',[ok],[{file,"cover.erl"},{line,1077}]},{cover,sync_compiled,2,[{file,"cover.erl"},{line,1077}]},{cover,main_process_loop,1,[{file,"cover.erl"},{line,819}]}]}

However, this is *extremely* hard to reproduce with my use case, taking
upwards of 15 hours, and it only happens on slower machines.

I've added some debug prints, and the result of
remote_call(Node,{remote,get_compiled}) is coming back as 'ok'.

Looking at the code for that, we can see that is clearly impossible:

https://github.com/erlang/otp/blob/maint/lib/tools/src/cover.erl#L893

#remote_state.compiled is always a list, so where is the 'ok' coming
from?

At first I thought the async reply from collect,remote was the source of
the errant 'ok', but re-reading that code, it is using the 'from'
syntax, so the collect,remote replies are going to a particular pid, not
the registered cover ?SERVER.

The problem remains, however, that cover.erl plays fast and loose with
the mailbox: requests and replies are not tagged with a ref (as they are
in gen_server), so it is possible for the receive in remote_call to get
a reply for a request it did not make:

https://github.com/erlang/otp/blob/maint/lib/tools/src/cover.erl#L570

I am pretty sure that is what is happening here, although I cannot spot
the exact cause. Mismatched requests/replies could happen quite
frequently in this module, given that most of the commands simply return
'ok' anyway.
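
For illustration, here is a rough sketch of what ref-tagged requests and
replies could look like, in the style gen_server uses. This is not the
cover.erl protocol: the module name tagged_call, the message shapes
{From, Ref, Request} and {Ref, Reply}, and the toy loop/1 server are all
hypothetical, and a real change would address {?SERVER, Node} rather than
a bare pid.

```erlang
-module(tagged_call).
-export([call/2, call/3, demo/0]).

%% Send Request to Pid and wait only for the reply that carries our ref.
call(Pid, Request) ->
    call(Pid, Request, 5000).

call(Pid, Request, Timeout) ->
    %% The monitor reference doubles as a unique tag for this request.
    Ref = erlang:monitor(process, Pid),
    Pid ! {self(), Ref, Request},
    receive
        {Ref, Reply} ->
            erlang:demonitor(Ref, [flush]),
            {ok, Reply};
        {'DOWN', Ref, process, _Pid, Reason} ->
            %% The server died before replying.
            {error, {down, Reason}}
    after Timeout ->
        erlang:demonitor(Ref, [flush]),
        {error, timeout}
    end.

%% Hypothetical stand-in for a remote cover server holding a compiled list.
loop(Compiled) ->
    receive
        {From, Ref, get_compiled} ->
            From ! {Ref, Compiled},
            loop(Compiled);
        {From, Ref, stop} ->
            From ! {Ref, ok},
            ok
    end.

demo() ->
    Pid = spawn(fun() -> loop([{mod_a, "mod_a.beam"}]) end),
    %% Simulate a stray, untagged 'ok' already sitting in our mailbox.
    self() ! ok,
    %% The tagged receive skips the stray 'ok' and matches only {Ref, _}.
    {ok, Compiled} = call(Pid, get_compiled),
    true = is_list(Compiled),
    {ok, ok} = call(Pid, stop),
    %% The stray message was never consumed as a reply.
    receive ok -> ok after 0 -> error(stray_message_was_consumed) end,
    Compiled.
```

With an untagged receive, the stray 'ok' in demo/0 could be taken as the
answer to get_compiled, which is the kind of mismatch I suspect above.
With the ref in both the request and the reply, the receive can only
match the answer to the request it actually sent.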

I'm happy to put some more time into debugging and fixing this, but I
need some more information on what I can and can't do.

1 - Can I change the messaging protocol in a backwards compatible way?
    Is running coverage across multiple nodes at once expected to work
    across OTP versions? Can I change the protocol if I keep things
    backwards compatible (by enumerating something on the spawned
    remotes to see if they can use the new protocol)?

2 - Why is this code not a gen_server? Is there some reason or is it
    just because of the age of the code? Would it be permissible to
    refactor cover.erl into 2 gen_servers (main and remote cover
    servers)?

Andrew


