[erlang-bugs] Fwd: Race condition in cover.erl
Andreas Schumacher
andreas@REDACTED
Fri Apr 4 19:03:41 CEST 2014
Thank you for the report and your initial investigation. We appreciate your
offer to fix the issue and will come back with answers to your questions
next week.
Andreas Schumacher, Erlang/OTP, Ericsson AB
-----Original Message-----
From: erlang-bugs-bounces@REDACTED [mailto:erlang-bugs-bounces@REDACTED]
On Behalf Of Andrew Thompson
Sent: 3 April 2014 02:01
To: erlang-bugs@REDACTED
Subject: [erlang-bugs] Race condition in cover.erl
I've been doing some pretty extreme coverage reporting lately in an effort
to help understand the coverage provided by Basho's integration test suite
riak_test.
As part of that work I've seen, several times now, an error like this:
2014-04-02 15:47:23 =ERROR REPORT====
Error in process <0.80.0> on node 'riak_test@REDACTED' with exit value:
{function_clause,[{cover,'-sync_compiled/2-lc$^0/1-0-',[ok],[{file,"cover.erl"},{line,1077}]},{cover,sync_compiled,2,[{file,"cover.erl"},{line,1077}]},{cover,main_process_loop,1,[{file,"cover.erl"},{line,819}]}]}
However, this is *extremely* hard to reproduce with my use case, taking
upwards of 15 hours, and it only happens on slower machines.
I've added some debug prints, and the result of
remote_call(Node,{remote,get_compiled}) is coming back as 'ok'.
Looking at the code for that, we can see that is clearly impossible:
https://github.com/erlang/otp/blob/maint/lib/tools/src/cover.erl#L893
#remote_state.compiled is always a list, so where is the 'ok' coming from?
At first I thought the async reply from collect,remote was the source of
the errant 'ok', but re-reading that code, it is using the 'from'
syntax, so the collect,remote replies are going to a particular pid, not
the registered cover ?SERVER.
The problem remains, however, that cover.erl plays fast and loose with the
mailbox: requests and replies are not tagged with a ref (as they are in
gen_server), so it is possible for the receive in remote_call to get a reply
for a request it did not make:
https://github.com/erlang/otp/blob/maint/lib/tools/src/cover.erl#L570
I am pretty sure that is what is happening here, although I cannot spot the
exact cause. Mismatched requests/replies could happen quite frequently in
this module, given that most of the commands simply return 'ok' anyway.
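To illustrate the kind of tagging meant here, the following is a minimal
sketch and not the code in cover.erl. The module name, the message shapes
{From, Ref, Request} and {Ref, Reply}, and the functions start/0, call/2 and
demo/0 are all hypothetical. It only shows how a per-request reference lets
the caller skip a stray 'ok' that is already sitting in its mailbox.

```erlang
-module(tagged_call_sketch).
-export([start/0, call/2, demo/0]).

%% Hypothetical server, for illustration only: every reply is
%% wrapped as {Ref, Reply}, using the reference sent by the caller.
start() ->
    spawn(fun loop/0).

loop() ->
    receive
        {From, Ref, get_compiled} ->
            From ! {Ref, []},          % always a list, as in the report
            loop();
        {From, Ref, _Other} ->
            From ! {Ref, ok},          % most commands simply return ok
            loop();
        stop ->
            ok
    end.

%% Tag the request with a fresh monitor reference and accept only
%% the reply carrying that same reference (or a 'DOWN' for it).
call(Pid, Request) ->
    Ref = erlang:monitor(process, Pid),
    Pid ! {self(), Ref, Request},
    receive
        {Ref, Reply} ->
            erlang:demonitor(Ref, [flush]),
            Reply;
        {'DOWN', Ref, process, Pid, Reason} ->
            {error, Reason}
    end.

%% Put stray replies in the caller's mailbox first, then make a
%% tagged call; the selective receive in call/2 ignores the strays.
demo() ->
    Pid = start(),
    self() ! ok,                       % untagged leftover reply
    self() ! {make_ref(), ok},         % reply tagged for another request
    Compiled = call(Pid, get_compiled),
    Pid ! stop,
    Compiled.
```

With an untagged protocol, the first stray 'ok' could be taken as the answer
to get_compiled; here the stray messages stay in the mailbox and call/2
returns the list sent by the server.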
I'm happy to put some more time into debugging and fixing this, but I need
some more information on what I can and can't do.
1 - Can I change the messaging protocol in a backwards compatible way?
Is running coverage across multiple nodes at once expected to work
across OTP versions? Can I change the protocol if I keep things
backwards compatible (by enumerating something on the spawned
remotes to see if they can use the new protocol)?
2 - Why is this code not a gen_server? Is there some reason or is it
just because of the age of the code? Would it be permissible to
refactor cover.erl into 2 gen_servers (main and remote cover
servers)?
Andrew