Avoid waiting on a single test for output #72
base: master
Conversation
When running Perl's prove with multiple jobs (-j #), the running jobs can get stuck waiting on a single test to complete. This occurs when a test outputs to STDERR, but does not output to STDOUT for some time after that. The parser iterator gets stuck waiting for can_read() on that one test, instead of looking at all of the running tests. Avoid this slowdown by returning, potentially early, to ensure nothing holds up the other tests.
@ColMelvin does your change also happen to fix this deadlock? https://rt.cpan.org/Ticket/Display.html?id=108390#txn-1551927
@ribasushi It would appear so.

Before:

```
HARNESS_OPTIONS=j2 perl -MTest::Harness -e 'Test::Harness::runtests( qw( ./test.t test.t ) )'; echo $?
27245 acquiring lock
Killed
137
```

After:

```
HARNESS_OPTIONS=j2 perl -MTest::Harness -e 'Test::Harness::runtests( qw( ./test.t test.t ) )'; echo $?
27101 acquiring lock
===( 1036;0 1/? 1035/? )===========================================
27102 acquiring lock
test.t .... ok
./test.t .. ok
All tests successful.
Files=2, Tests=8200, 1 wallclock secs ( 0.59 usr 0.01 sys + 0.02 cusr 0.00 csys = 0.62 CPU)
Result: PASS
0
```
Interesting. I wouldn't have expected a workaround to be that simple :-o
@Leont what's the next step to get this workaround in?
The main reason why I haven't merged this yet is that I want to understand why it works; it kind of shouldn't. A quick dive through the source suggests that it will essentially insert an "unknown" token into the stream, and that will usually be ignored. I've been lacking tuits lately for further investigation (TAP::Harness being a maze of lambdas spread over different classes doesn't help here), let alone targeted testing.
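That reading can be checked from the outside with a small script (my illustration, not part of the PR): TAP::Parser classifies any line it cannot parse, including an empty one, as a TAP::Parser::Result::Unknown, and most consumers simply skip those tokens.

```perl
use strict;
use warnings;
use TAP::Parser;

# A blank line is not a plan/test/comment, so the grammar
# classifies it as an "unknown" token.
my $parser = TAP::Parser->new( { tap => "1..1\n\nok 1\n" } );
while ( my $result = $parser->next ) {
    printf "%-8s '%s'\n", $result->type, $result->raw;
}
# Prints (roughly):
#   plan     '1..1'
#   unknown  ''
#   test     'ok 1'
```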
I agree with Leont; there will be a gazillion unknown token objects of class TAP::Parser::Result::Unknown created, and a large rise in CPU/memory usage.

If my reading of the code is correct, that select() call is ONLY between STDOUT and STDERR of 1 proc, not a select() between multiple procs' STD FHs. If STDERR gets data, the Process.pm loop will block on STDERR or STDOUT again; you can't switch/leave the loop to another proc's FHs until the "current" proc outputs a STDOUT line. It seems to me (I didn't step-debug it) that in https://github.com/Perl-Toolchain-Gang/Test-Harness/blob/master/lib/TAP/Parser/Multiplexer.pm#L133 a select between ALL procs' ERR and OUT is done first, then another select is done between ONE proc's ERR and OUT. If one proc's ERR fires, currently it will call select in a loop for that one proc until that proc's OUT fires. In reality, Multiplexer.pm should reissue the global select: something like jumping from `unless (@ready) {` (lib/TAP/Parser/Multiplexer.pm line 139 in 62688b0) back to the outer select (lib/TAP/Parser/Multiplexer.pm line 133 in 62688b0).

If I read it literally, a parser isn't "ready" if ERR fires, only if OUT fires. Maybe get_select_handles should only be returning OUT, not OUT and ERR (I can't test this myself), but then the parser should pump its OUT and ERR together: for each OUT, try to pump ERR if anything is on ERR. But I can imagine a deadlock where a test proc fills the OS kernel buffer with ERR data and hangs in a write() to ERR, while perl sleeps in a select on that proc's OUT; there is nothing in OUT, and won't be until the harness proc drains ERR, unfreezing the test proc. Maybe the solution is not to have one IO::Handle ERR per test proc, but to have test procs inherit the harness's ERR unless the "merge" option is used.
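To make that concrete, here is a schematic paraphrase of the two-level select structure described above. This is a simplified sketch, not the actual Multiplexer.pm code: `@procs`, `owner_of`, and the loop body are invented for illustration.

```perl
use strict;
use warnings;
use IO::Select;

# Each proc is a hashref: { out => $fh, err => $fh }.
sub owner_of {
    my ( $fh, @procs ) = @_;
    return ( grep { $_->{out} == $fh or $_->{err} == $fh } @procs )[0];
}

sub next_line_for_some_proc {
    my (@procs) = @_;

    # Outer select: wakes when ANY proc's OUT or ERR is readable.
    my $outer = IO::Select->new( map { $_->{out}, $_->{err} } @procs );
    my ($fh)  = $outer->can_read;
    my $proc  = owner_of( $fh, @procs );

    # Inner select: from here on, only THIS proc's handles are watched.
    my $inner = IO::Select->new( $proc->{out}, $proc->{err} );
    while (1) {
        for my $ready ( $inner->can_read ) {
            if ( $ready == $proc->{err} ) {
                # ERR fired: drain it, then select on the SAME proc again.
                sysread $ready, my $buf, 65536;
                print STDERR $buf;
            }
            else {
                # Only OUT data releases the loop. Until then, every
                # other proc's output sits unread.
                return scalar readline $proc->{out};
            }
        }
    }
}
```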
Thinking about the design, I think the API solution is that all iterator objects must have a method the central event loop can tickle.
Edit: ultimately, thinking about what I never finished at https://github.com/bulk88/Test-Harness/blob/bulk88/w32_async/lib/TAP/Parser/Multiplexer.pm#L154, each iterator must have a method that can be tickled for event-loop purposes from one central loop, but the API must ensure that you don't always have a "result" from a tickle event dispatching to an iterator. An example of a tickle without a result is a partial line on STDOUT without a newline. Currently in Process.pm, if a select fires on STDOUT and the STDOUT has a partial line (AFAIK OS console pipes trigger the select ready bit in units of one or more newlines at a time, not once per char/byte, so this may not be observable in the real world), the inner per-proc select in Process.pm will execute again.

Theoretically, if there are multiple event loops from different classes of Iterator in the same harness proc, the tickle should be a non-blocking poll by API design, so you can poll an iterator object just by itself every x milliseconds with a usleep() between each poll. Yes, it wastes CPU, but you sometimes CAN'T integrate 2 different event loops in the same "put proc to sleep" call, so sleep timers/time-outs and non-blocking polls are the only way to service both event sources from one future, as-yet-unwritten Multiplexer.pm. If Multiplexer has special awareness of Iterator classes and can fuse them together in 1 wait-till-data syscall (select, but as a more generic concept), then no data from any test proc means no CPU time in the harness proc; after the wake-up from the fused wait-till-data() call in Multiplexer, 1 particular parser/iterator's tickle will be called, or ALL iterators have their tickles called (say the data-ready event was an OS signal handler: it's unknown which iterator fired the signal, so all iterators are tickled).
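A rough sketch of the polling API being proposed (the names `tickle`, `finished`, and `handle_result` are invented here, not an existing Test-Harness interface): each iterator exposes a non-blocking poll, the central loop services all of them, and a poll is allowed to produce no result, e.g. when it only buffered a partial line.

```perl
use strict;
use warnings;
use Time::HiRes qw(usleep);

# One central loop servicing every iterator, regardless of its class.
sub run_loop {
    my (@iterators) = @_;
    while ( my @live = grep { !$_->finished } @iterators ) {
        my $progress = 0;
        for my $it (@live) {
            # tickle() never blocks; undef means "no result yet"
            # (the tickle-without-a-result case, e.g. a partial line).
            if ( defined( my $result = $it->tickle ) ) {
                handle_result($result);
                $progress = 1;
            }
        }
        # No iterator produced anything: back off instead of spinning.
        usleep(10_000) unless $progress;
    }
}
```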
@bulk88 Yep, the use of […].

We have to return to avoid the […]. Could we change the system to ignore the empty string as a special case, instead of creating TAP::Parser::Result::Unknown objects? For example, we could modify […] as follows:

```perl
my $line = '';
$line = $self->{iterator}->next while ( defined $line && length $line == 0 );
```
Here is a stack trace of how Iterator::next is reached from Multiplexer::next.
Wouldn't that just block again on the same 2 ERR/OUT FHs? You are still inside 1 Parser, which is bound to 1 Grammar, which is bound to 1 Iterator. If the Parser (really the Iterator obj) doesn't return control to Multiplexer, or Multiplexer never calls into a Parser/Grammar/Iterator sandwich, you can't switch test procs. Except, and this is probably a joke, if you want the "deep recursion" hell of having an Iterator that can't return a line call back into Multiplexer to pop another FD/Parser off the IO::Select, then Multiplexer dispatches to a 2nd Parser, and you potentially gain 1 (or a fixed number of) stack frames for every frozen Iterator (AKA Grammar, AKA Parser) in the harness proc :-D
Correct. The interesting question would be "why does it do this in the first place?". The git archeology is confusing (6641692 and fe53db5). I'm wondering if just ignoring STDERR wouldn't be a more obvious fix.
You do not have to imagine it - I wrote a self-contained test for exactly this years ago, and referenced it higher up: https://rt.cpan.org/Ticket/Display.html?id=108390#txn-1551927. Neither is this hypothetical - this is actual code that had to be written to stop CI […]. The reason I am highlighting this: please get together and implement a solution, either the one proposed or something more palatable. It doesn't have to be pretty. Nor terribly efficient. It has to work.
IMHO, software should be, in decreasing order of importance:

1. Reliable
2. Correct
3. Fast
At present, the code is unreliable and incorrect vis-à-vis this issue. With this pull-request applied, it is reliable, though its correctness is arguable and the performance takes a hit. Iff the performance hit credibly reduces reliability (e.g. because of the OOM killer), we should close this pull-request without applying it. Otherwise, I argue this fix should be applied and a second fix can be added to improve performance. This avoids any delays and reduces the pressure for getting out a cleaner fix (thereby allowing the developer to focus on the cleanliness of the fix, rather than rushing to get a fix out). If reliability issues are discovered down the line, this change is small enough to be reverted with a minimal chance of conflicts. That said, I cannot commit to providing performance improvements. I know that can be an issue for FOSS projects sometimes.
@bulk88 Good catch on the Grammar thing. I guess we're still stuck with TAP::Parser::Result::Unknown objects :(. Other than filtering out ::Result::Unknown objects with 0 length in TAP::Harness, long after they are created, I have no ideas that might improve performance with minimal work.
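For reference, that filtering idea could be as small as the following (a hypothetical hook, not existing TAP::Harness code; `is_unknown` and `raw` are real TAP::Parser::Result accessors):

```perl
# Skip the zero-length "unknown" tokens injected by the early return,
# before the result ever reaches the formatter.
sub _keep_result {
    my ($result) = @_;
    return 0 if $result->is_unknown && !length $result->raw;
    return 1;
}
```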
I just pushed a different fix based on this idea to a branch. |
To view the problem mentioned, create a directory and populate it with the following two files (a sketch of plausible contents is given below):

00_blocker.t
01_blocked.t
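The contents of the two files are not reproduced here; a minimal sketch matching the timings described below (the blocker writes to STDERR, then stays silent on STDOUT through a 10 second sleep; the blocked test is quick) might look like:

```perl
# 00_blocker.t - hypothetical reconstruction, not the original file.
# STDERR activity followed by a long STDOUT silence is exactly the
# pattern that pins the multiplexer to this one test.
use strict;
use warnings;

print STDERR "# blocker starting\n";   # STDERR output, no STDOUT yet
sleep 10;                              # STDOUT stays quiet the whole time
print "1..1\n";
print "ok 1 - blocker finished\n";
```

```perl
# 01_blocked.t - hypothetical reconstruction, not the original file.
# A fast, ordinary test: 100 of these take ~7 seconds sequentially.
use strict;
use warnings;
use Time::HiRes ();

Time::HiRes::sleep(0.05);              # simulate a little real work
print "1..1\n";
print "ok 1 - trivial\n";
```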
Then make 98 copies of the blocked test, to emphasize the time difference.
Run `prove *_blocked.t`. On my system, this takes about 7 wallclock seconds.

Run `prove -j 2 *_blocked.t`. With the parallelization of similar files, we would expect this to take about half the time. And, indeed, on my system it takes about 3 wallclock seconds.

If we were to add 00_blocker.t to the parallelized `prove -j 2` run, we would expect it to start first and, with its 10 second sleep, force the other 99 tests to run sequentially. Thus, on my system, it would take 10 seconds to run (as the max of 10 and 7 seconds).

Running `prove --timer -j 2 *.t` before this fix actually takes 13 seconds (or 10 + 3 seconds) on my system, because of the blocking nature. However, after the fix, it runs for 10 seconds, as expected.