[erlang-bugs] common_test + test_server_io errors

Mon Jul 1 11:25:13 CEST 2013

Hi Peter,

On 27 Jun 2013, at 15:37, Peter Andersson wrote:

> When a process that hasn't inherited the group leader from a test case
> process (i.e typically one that hasn't been spawned directly or
> indirectly from a test case) sends a printout to common_test (by means
> of e.g ct:log), common_test will print the string to the "Unexpected I/O
> log", which you find a link to from the test overview page. This is done
> by actually sending the string to a process named test_server_io, which
> performs the printing. It is passing the message from the common_test
> logger to test_server_io that fails, because test_server_io has terminated.
> 

Ok, but this issue only shows up on R16B01, so presumably something has changed in this regard since the last release?

> Now, test_server_io should live for as long as common_test is up and
> running (well, almost). You should not get that crash report unless
> ct:log is called after the test run is finished or test_server_io has
> crashed unexpectedly because of a bug. The latter should not be a silent
> event - there should be some indication of such a crash during the test
> run. Is it possible it's the former problem you encounter, or are you
> sure the tests are still running when the printout happens? This thing
> smells timing related error to me.

Oh it most certainly is timing related. My code is basically doing the following:

1. bootstrap a bunch of configuration
2. find test cases + test config
3. install common test hooks
4. start common test

The hooks installed in (3) use the config identified in (2) to bootstrap the test environment, starting and configuring remote nodes and so on, then it monitors these nodes (and other things, all just "test resources" as far as it's concerned) for the lifetime of the test, and verifies they've behaved as expected once the test case/group/suite is completed.

Specifically, we do things like "shut down one node of a cluster whilst a client is busy publishing or consuming data and verify failover etc" using this approach. 

> There are only two short "windows"
> during a test run when the common_test logger is alive while
> test_server_io isn't: During startup and shutdown of the test run. If
> the ct:log call happens during either window, you could get this error.
> 

Well, it's certainly possible - likely even - that the logging handler which calls ct:log/2 *is* being called before a test suite/group/function. But this *does not* produce an error in previous versions of OTP - so it seems you're saying this timing issue is now something that must be considered if you're using >= R16B01. There's nothing in the documentation to warn about this, and it means that any ct-hooks you install must now be very careful not to use ct:log/2 at the wrong time. That's pretty annoying, since one of the reasons for choosing common_test for this is that you get a lot of nice output logging and whatnot after each test run.

> Is it easy for you to reproduce the error or does it only happen
> occasionally?
> 

This happens every time I run the tests. You can replicate it thus:

$ git clone https://github.com/nebularis/systest.git
$ cd systest
$ make test

The pertinent code path is:

=> systest_main:main/1
=> systest_runner:execute/1
=> systest_ct:run/3
=> ct:run_test/1

The bit that does the logging is systest_ct_log - this is adds a gen_event handler to the 'systest_event_log' event manager, which when logging occurs, will route to systest_ct_log:write_log/4 which in turn calls ct:log/2.

The ct-hook that is installed is systest_cth, and there will be plenty of logging whilst we're in calls like pre_init_per_suite, post_end_per_suite and the equivalents for test groups and test functions.

> We should try to rule out that there's a bug that causes test_server
> failure.

How can I assist in verifying that?

> Then, the timing problem I described above is quite easily
> fixed by making sure common_test prints to stdout if test_server isn't
> up and running. I'll implement a fix for this right now, in any event.
> 

Erm - ok, but that's hardly a fix. In that eventuality, I'm likely to just redirect all my logs somewhere else since I don't want my test runs on CI having their stdout cluttered with the *massive* amount of debug logging I do at the internal/framework level. I rely on the HTML logs to store the verbose output of each test, since we might launch up to 6 erlang nodes and report on all manner of things during the test run, but the operator log (i.e., stdout) should just contain the 'passed... failed... etc...' information.

Besides, this was never an issue with previous versions of common_test, so I think the right way to fix this is to identify which change(s) broke things and then take a view on how to 'un-break' them. I can assist with this by doing a bisect, but you'll need to tell me which OTP branch I should do that against!?

Cheers,
Tim

> Best regards,
> Peter
> 
> Ericsson AB, Erlang/OTP
> 
> 
> Tim Watson wrote:
>> I'm seeing some weird errors in my common_test suites since upgrading
>> to R16B01 - I'd like to know if this is a bug or if my code is doing
>> something wrong...
>> 
>> ct_util_server got EXIT from <0.227.0>: {noproc,
>>                                        {gen_server,call,
>>                                         [test_server_io,
>>                                          {print,xxxFrom,unexpected_io,
>>                                           [[[["<div
>> class=\"default\"><b>*** User 2013-06-26 16:47:39.122 ***</b>"],
>>                                              "\n",
>>                                              [91,102,114,97,109,101,119,111,
>>                                               114,107,93,32,119,97,116,99,
>>                                               104,100,111,103,58,32,110,111,
>>                                               32,112,114,111,99,115,32,116,
>>                                               111,32,107,105,108,108,"\n"]],
>>                                             "\n","</div>"]]},
>>                                          infinity]}}
>> 
>> That message originated in these lines of code, running in a process
>> started outside of common_test:
>> 
>> log(framework, "watchdog: no procs to kill~n", []);
>> 
>> which calls another module:
>> 
>> log(Scope, Fmt, Args) ->
>>   gen_event:sync_notify(systest_event_log, {Scope, Fmt, Args}).
>> 
>> which ends up in the following handler (amongst other places):
>> 
>> write_log(EvId, _Fd, What, Args) ->
>>   ct:log("[" ++ as_string(EvId) ++ "] " ++ as_string(What), Args).
>> 
>> So what's up with the "unexpected_io" error here, and *why* has this
>> brought the whole common_test run down?
>> 
>> Thanks,
>> 
>> Tim
>> 
>> 
>> 
>> ------------------------------------------------------------------------
>> 
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-bugs
>> 
>