Add individual tasks input files to TaskArray #5726
✅ Deploy Preview for nextflow-docs-staging canceled.
@pditommaso Since you were able to fix the other task array issue so quickly, may I ask you for your feedback on this?
I have reproduced the error and checked that the proposed fix works. The solution looks fine, but it could also be implemented in the TaskBean, as was done for the working dir. You would not need to create the input parameters, just add the paths. The pipeline suggested for testing could be added as a validation test. @pditommaso do you think it is needed?
Yeah, the TaskBean looks like a better solution.
I agree with that assessment. I am only just getting to know the Nextflow codebase, so thank you for your feedback! I'll change my PR. Regarding a test: I would feel better if the test bucket were owned by seqera/nextflow rather than by my company. Do you have a second bucket to test with? Then I can also adapt one of the tests.
You can use this bucket: nextflow/validation/google.config, line 11 in 0a29236
Even simpler, just override
    @Override
    Map<String,Path> getInputFilesMap() {
        Map<String,Path> result = [:]
        for( final handler : children )
            result.putAll(handler.task.getInputFilesMap())
        return result
    }
Everything else should work out-of-the-box. My only concern is that if two tasks stage an input file with the same name, they will clobber each other here.
We might need to refactor this code here: nextflow/plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchScriptLauncher.groovy, lines 73 to 76 in 039c8f3
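The clobbering concern can be reproduced in isolation. This is a minimal sketch, not Nextflow code: the two maps are hypothetical stand-ins for what `getInputFilesMap()` might return for two child tasks, assuming the map is keyed by the staged file name.

```groovy
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical input file maps of two child tasks (stage name -> source path).
// Both tasks stage a file called 'sample.txt', but from different locations.
Map<String,Path> taskA = ['sample.txt': Paths.get('/bucket-a/run1/sample.txt')]
Map<String,Path> taskB = ['sample.txt': Paths.get('/bucket-b/run2/sample.txt')]

// Same merge strategy as the suggested override: putAll into one map.
Map<String,Path> merged = [:]
for( Map<String,Path> inputs : [taskA, taskB] )
    merged.putAll(inputs)

// Only one entry survives; the later task overwrote the earlier one.
assert merged.size() == 1
assert merged['sample.txt'] == Paths.get('/bucket-b/run2/sample.txt')
```

Whether this matters in practice depends on what the launcher reads from the merged map; if it only needs the source locations, losing one of two same-named entries could still drop a bucket that has to be mounted.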
When the task is an array, we don't need to set the bean input files, we just need to call
Should we just add another field to the TaskBean?
    List<Path> arrayInputFiles
023486d to 530b11c
I just switched the branches to use the TaskBean-based implementation suggested above. I think that's the easiest option: since it's a list, files with the same name from different tasks cannot collide and replace each other. @pditommaso For the test case, I would need a second, different bucket that does not host the workDir.
Is this an issue with AWS Batch as well? I didn't touch that code because the TaskBeans
I'd suggest that @jorgee take over this and review all cases.
AWS Batch should not need to be changed because it does not need the bucket mounts. Google Batch needs them only because it uses gcsfuse by default to access the inputs.
Fusion is technically doing a similar thing to Google Batch: nextflow/modules/nextflow/src/main/groovy/nextflow/fusion/FusionScriptLauncher.groovy, lines 56 to 59 in 0a29236
But I don't think the array parent needs the input files for Fusion. They are only used to populate the staging commands in the wrapper script, which the array job doesn't need.
So in summary, this new
fixes nextflow-io#5701
TaskBean has been modified to generate a list of files which are staged for the individual tasks for taskArrays. GoogleBatchScriptLauncher then takes the paths and mounts the required buckets.
Signed-off-by: Christian Romberg <[email protected]>
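To illustrate the idea in this commit message, here is a rough sketch of deriving the set of buckets to mount from a flat list of staged files. It is not the actual GoogleBatchScriptLauncher code: `bucketsOf` is a hypothetical helper, and the paths are modelled as plain `gs://` strings as a simplifying assumption, whereas Nextflow works with Path objects. Only the name `arrayInputFiles` comes from the discussion above.

```groovy
// Hypothetical helper: collect the distinct bucket names from gs:// URIs.
// getAuthority() is used instead of getHost() because bucket names may
// contain underscores, for which getHost() returns null.
Set<String> bucketsOf(List<String> uris) {
    Set<String> result = new LinkedHashSet<>()
    for( String uri : uris ) {
        final parsed = new URI(uri)
        if( parsed.scheme == 'gs' && parsed.authority )
            result.add(parsed.authority)
    }
    return result
}

// Assumed example data: two tasks stage a file with the same name
// from different buckets, plus one more file from the first bucket.
List<String> arrayInputFiles = [
    'gs://bucket-a/inputs/sample1.fq',
    'gs://bucket-b/inputs/sample1.fq',
    'gs://bucket-a/inputs/sample2.fq'
]

// Both buckets are kept even though two files share a name.
assert bucketsOf(arrayInputFiles) == ['bucket-a', 'bucket-b'] as Set
```

Because the input is a list rather than a map keyed by file name, same-named files from different tasks both contribute their bucket.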
530b11c to b43785b
Closing in favour of #5739
Update TaskArrayCollector.groovy, fixes #5701
Making this a draft PR since I don't know if this is the best way