Avoid waiting on a single test for output #72
base: master
Conversation
When running Perl's prove with multiple jobs (-j #), the running jobs can get stuck waiting on a single test to complete. This occurs when a test outputs to STDERR, but does not output to STDOUT for some time after that. The parser iterator gets stuck waiting for can_read() on that one test, instead of looking at all of the running tests. Avoid this slowdown by returning, potentially early, to ensure nothing holds up the other tests.
@ColMelvin does your change also happen to fix this deadlock? https://rt.cpan.org/Ticket/Display.html?id=108390#txn-1551927
@ribasushi It would appear so.

Before:

```
HARNESS_OPTIONS=j2 perl -MTest::Harness -e 'Test::Harness::runtests( qw( ./test.t test.t ) )'; echo $?
27245 acquiring lock
Killed
137
```

After:

```
HARNESS_OPTIONS=j2 perl -MTest::Harness -e 'Test::Harness::runtests( qw( ./test.t test.t ) )'; echo $?
27101 acquiring lock
===( 1036;0 1/? 1035/? )===========================================
27102 acquiring lock
test.t .... ok
./test.t .. ok
All tests successful.
Files=2, Tests=8200, 1 wallclock secs ( 0.59 usr 0.01 sys + 0.02 cusr 0.00 csys = 0.62 CPU)
Result: PASS
0
```
Interesting. I wouldn't have expected a workaround to be that simple :-o
@Leont what's the next step to get this workaround in?
The main reason why I haven't merged this yet is that I want to understand why it works; it kind of shouldn't. A quick dive through the source suggests that it will essentially insert an "unknown" token into the stream, and that will usually be ignored. I've been lacking tuits lately for further investigation (TAP::Harness being a maze of lambdas spread over different classes doesn't help here), let alone targeted testing.
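That reading can be checked from the outside with a small script (my illustration, not part of the PR): TAP::Parser classifies any line it cannot parse, including an empty one, as a TAP::Parser::Result::Unknown, and most consumers simply skip those tokens.

```perl
use strict;
use warnings;
use TAP::Parser;

# A blank line is not a plan/test/comment, so the grammar
# classifies it as an "unknown" token.
my $parser = TAP::Parser->new( { tap => "1..1\n\nok 1\n" } );
while ( my $result = $parser->next ) {
    printf "%-8s '%s'\n", $result->type, $result->raw;
}
# Prints (roughly):
#   plan     '1..1'
#   unknown  ''
#   test     'ok 1'
```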
I agree with Leont; there will be a gazillion unknown token objects of class TAP::Parser::Result::Unknown created, and a large rise in CPU/memory usage.

If my reading of the code is correct, that select() call is ONLY between STDOUT and STDERR of 1 proc, not a select() between multiple procs' STD FHs. If STDERR gets data, the Process.pm loop will block on STDERR or STDOUT again; you can't switch/leave the loop to another proc's FHs until the "current" proc outputs a STDOUT line. It seems to me (I didn't step-debug it) that in https://github.com/Perl-Toolchain-Gang/Test-Harness/blob/master/lib/TAP/Parser/Multiplexer.pm#L133 a select between ALL procs' ERR and OUT is done first, then another select is done between ONE proc's ERR and OUT. If one proc's ERR fires, currently it will call select in a loop for that one proc until that proc's OUT fires. In reality, Multiplexer.pm should reissue the global select: something like jumping from `unless (@ready) {` (lib/TAP/Parser/Multiplexer.pm line 139 in 62688b0) back to the outer select (lib/TAP/Parser/Multiplexer.pm line 133 in 62688b0).

If I read it literally, a parser isn't "ready" if ERR fires, only if OUT fires. Maybe get_select_handles should only be returning OUT, not OUT and ERR (I can't test this myself), but then the parser should pump its OUT and ERR together: for each OUT, try to pump ERR if anything is on ERR. But I can imagine a deadlock where a test proc fills the OS kernel buffer with ERR data and hangs in a write() to ERR, while perl sleeps in a select on that proc's OUT; there is nothing in OUT, and won't be until the harness proc drains ERR, unfreezing the test proc. Maybe the solution is not to have one IO::Handle ERR per test proc, but to have test procs inherit the harness's ERR unless the "merge" option is used.
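To make that concrete, here is a schematic paraphrase of the two-level select structure described above. This is a simplified sketch, not the actual Multiplexer.pm code: `@procs`, `owner_of`, and the loop body are invented for illustration.

```perl
use strict;
use warnings;
use IO::Select;

# Each proc is a hashref: { out => $fh, err => $fh }.
sub owner_of {
    my ( $fh, @procs ) = @_;
    return ( grep { $_->{out} == $fh or $_->{err} == $fh } @procs )[0];
}

sub next_line_for_some_proc {
    my (@procs) = @_;

    # Outer select: wakes when ANY proc's OUT or ERR is readable.
    my $outer = IO::Select->new( map { $_->{out}, $_->{err} } @procs );
    my ($fh)  = $outer->can_read;
    my $proc  = owner_of( $fh, @procs );

    # Inner select: from here on, only THIS proc's handles are watched.
    my $inner = IO::Select->new( $proc->{out}, $proc->{err} );
    while (1) {
        for my $ready ( $inner->can_read ) {
            if ( $ready == $proc->{err} ) {
                # ERR fired: drain it, then select on the SAME proc again.
                sysread $ready, my $buf, 65536;
                print STDERR $buf;
            }
            else {
                # Only OUT data releases the loop. Until then, every
                # other proc's output sits unread.
                return scalar readline $proc->{out};
            }
        }
    }
}
```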
Thinking about the design, I think the API solution is that all iterator objects must have a method the central event loop can tickle.
Edit: ultimately, thinking about what I never finished at https://github.com/bulk88/Test-Harness/blob/bulk88/w32_async/lib/TAP/Parser/Multiplexer.pm#L154, each iterator must have a method that can be tickled for event-loop purposes from one central loop, but the API must ensure that you don't always have a "result" from a tickle event dispatching to an iterator. An example of a tickle without a result is a partial line on STDOUT without a newline. Currently in Process.pm, if a select fires on STDOUT and the STDOUT has a partial line (AFAIK OS console pipes trigger the select ready bit in units of one or more newlines at a time, not once per char/byte, so this may not be observable in the real world), the inner per-proc select in Process.pm will execute again.

Theoretically, if there are multiple event loops from different classes of Iterator in the same harness proc, the tickle should be a non-blocking poll by API design, so you can poll an iterator object just by itself every x milliseconds with a usleep() between each poll. Yes, it wastes CPU, but you sometimes CAN'T integrate 2 different event loops in the same "put proc to sleep" call, so sleep timers/time-outs and non-blocking polls are the only way to service both event sources from one future, as-yet-unwritten Multiplexer.pm. If Multiplexer has special awareness of Iterator classes and can fuse them together in 1 wait-till-data syscall (select, but as a more generic concept), then no data from any test proc means no CPU time in the harness proc; after the wake-up from the fused wait-till-data() call in Multiplexer, 1 particular parser/iterator's tickle will be called, or ALL iterators have their tickles called (say the data-ready event was an OS signal handler: it's unknown which iterator fired the signal, so all iterators are tickled).
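A rough sketch of the polling API being proposed (the names `tickle`, `finished`, and `handle_result` are invented here, not an existing Test-Harness interface): each iterator exposes a non-blocking poll, the central loop services all of them, and a poll is allowed to produce no result, e.g. when it only buffered a partial line.

```perl
use strict;
use warnings;
use Time::HiRes qw(usleep);

# One central loop servicing every iterator, regardless of its class.
sub run_loop {
    my (@iterators) = @_;
    while ( my @live = grep { !$_->finished } @iterators ) {
        my $progress = 0;
        for my $it (@live) {
            # tickle() never blocks; undef means "no result yet"
            # (the tickle-without-a-result case, e.g. a partial line).
            if ( defined( my $result = $it->tickle ) ) {
                handle_result($result);
                $progress = 1;
            }
        }
        # No iterator produced anything: back off instead of spinning.
        usleep(10_000) unless $progress;
    }
}
```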
@bulk88 Yep, the use of […].

We have to return to avoid the […]. Could we change the system to ignore the empty string as a special case, instead of creating TAP::Parser::Result::Unknown objects? For example, we could modify […] as follows:

```perl
my $line = '';
$line = $self->{iterator}->next while ( defined $line && length $line == 0 );
```
Here is a stack trace of how Iterator::next is reached from Multiplexer::next.
Wouldn't that just block again on the same 2 ERR/OUT FHs? You are still inside 1 Parser, which is bound to 1 Grammar, which is bound to 1 Iterator. If the Parser (really the Iterator obj) doesn't return control to Multiplexer, or Multiplexer never calls into a Parser/Grammar/Iterator sandwich, you can't switch test procs. Except, and this is probably a joke, if you want the "deep recursion" hell of having an Iterator that can't return a line call back into Multiplexer to pop another FD/Parser off the IO::Select, then Multiplexer dispatches to a 2nd Parser, and you potentially gain 1 (or a fixed number of) stack frames for every frozen Iterator (AKA Grammar, AKA Parser) in the harness proc :-D
Correct. The interesting question would be "why does it do this in the first place?". The git archeology is confusing (6641692 and fe53db5). I'm wondering if just ignoring STDERR wouldn't be a more obvious fix.
You do not have to imagine it - I wrote a self-contained test for exactly this years ago, and referenced it higher up: https://rt.cpan.org/Ticket/Display.html?id=108390#txn-1551927. Neither is this hypothetical - this is actual code that had to be written to stop CI […]. The reason I am highlighting this: please get together and implement a solution, either the one proposed or something more palatable. It doesn't have to be pretty. Nor terribly efficient. It has to work.
IMHO, software should be, in decreasing order of importance:

1. Reliable
2. Correct
3. Fast
At present, the code is unreliable and incorrect vis-à-vis this issue. With this pull-request applied, it is reliable, though its correctness is arguable and the performance takes a hit. Iff the performance hit credibly reduces reliability (e.g. because of the OOM killer), we should close this pull-request without applying it. Otherwise, I argue this fix should be applied and a second fix can be added to improve performance. This avoids any delays and reduces the pressure for getting out a cleaner fix (thereby allowing the developer to focus on the cleanliness of the fix, rather than rushing to get a fix out). If reliability issues are discovered down the line, this change is small enough to be reverted with a minimal chance of conflicts. That said, I cannot commit to providing performance improvements. I know that can be an issue for FOSS projects sometimes.
@bulk88 Good catch on the Grammar thing. I guess we're still stuck with TAP::Parser::Result::Unknown objects :(. Other than filtering out ::Result::Unknown objects with 0 length in TAP::Harness, long after they are created, I have no ideas that might improve performance with minimal work.
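For reference, that filtering idea could be as small as the following (a hypothetical hook, not existing TAP::Harness code; `is_unknown` and `raw` are real TAP::Parser::Result accessors):

```perl
# Skip the zero-length "unknown" tokens injected by the early return,
# before the result ever reaches the formatter.
sub _keep_result {
    my ($result) = @_;
    return 0 if $result->is_unknown && !length $result->raw;
    return 1;
}
```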
I just pushed a different fix based on this idea to a branch. |
To view the problem mentioned, create a directory and populate it with the following two files (a sketch of plausible contents is given below):

00_blocker.t
01_blocked.t
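The contents of the two files are not reproduced here; a minimal sketch matching the timings described below (the blocker writes to STDERR, then stays silent on STDOUT through a 10 second sleep; the blocked test is quick) might look like:

```perl
# 00_blocker.t - hypothetical reconstruction, not the original file.
# STDERR activity followed by a long STDOUT silence is exactly the
# pattern that pins the multiplexer to this one test.
use strict;
use warnings;

print STDERR "# blocker starting\n";   # STDERR output, no STDOUT yet
sleep 10;                              # STDOUT stays quiet the whole time
print "1..1\n";
print "ok 1 - blocker finished\n";
```

```perl
# 01_blocked.t - hypothetical reconstruction, not the original file.
# A fast, ordinary test: 100 of these take ~7 seconds sequentially.
use strict;
use warnings;
use Time::HiRes ();

Time::HiRes::sleep(0.05);              # simulate a little real work
print "1..1\n";
print "ok 1 - trivial\n";
```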
Then make 98 copies of the blocked test, to emphasize the time difference.
Run `prove *_blocked.t`. On my system, this takes about 7 wallclock seconds.

Run `prove -j 2 *_blocked.t`. With the parallelization of similar files, we would expect this to take about half the time. And, indeed, on my system it takes about 3 wallclock seconds.

If we were to add 00_blocker.t to the parallelized `prove -j 2` run, we would expect it to start first and, with its 10 second sleep, force the other 99 tests to run sequentially. Thus, on my system, it would take 10 seconds to run (as the max of 10 and 7 seconds).

Running `prove --timer -j 2 *.t` before this fix actually takes 13 seconds (or 10 + 3 seconds) on my system, because of the blocking nature. However, after the fix, it runs for 10 seconds, as expected.