Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add individual tasks input files to TaskArray #5726

Closed
wants to merge 1 commit into from

Conversation

chris-mbiomics
Copy link
Contributor

Update TaskArrayCollector.groovy, fixes #5701

Making this a draft PR since I don't know if this is the best way

Copy link

netlify bot commented Jan 30, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit b43785b
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/67a0aba7477ce10008f728f6

@chris-mbiomics
Copy link
Contributor Author

@pditommaso Since you were able to fix the other task array issue so quickly, may I ask you for your feedback on this?

@pditommaso pditommaso requested a review from jorgee January 30, 2025 11:02
Copy link
Contributor

@jorgee jorgee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reproduced the error and checked that the proposed fix is working. The solution looks fine but it could also be implemented in the taskbean as done for the working dir. You would not need to create the input parameters, just add the paths. The pipeline suggested to test could be added as validation test. @pditommaso do you think it is needed?

@pditommaso
Copy link
Member

Yeah, the TaskBean looks a better solution

@chris-mbiomics
Copy link
Contributor Author

I agree with that assessment. Just got to know the nextflow codebase, thank you for your feedback!

I'll change my PR.

Regarding a test: I would feel better if the test bucket was not owned by my company, but also owned by seqera/nextflow. Do you have a second bucket to test with? Then I can also adapt one of the tests.

@pditommaso
Copy link
Member

You can use this bucket

workDir = 'gs://rnaseq-nf/scratch'

@bentsherman
Copy link
Member

Even simpler, just override getInputFilesMap() in TaskArrayRun:

    @Override
    Map<String,Path> getInputFilesMap() {
        Map<String,Path> result = [:]
        for( final handler : children )
            result.putAll(handler.task.getInputFilesMap())
        return result
    }

Everything else should work out-of-the-box. My only concern is that if two tasks stage an input file with the same name, they will clobber each other here.

We might need to refactor this code here:

// remap input files to container mounted paths
for( Map.Entry<String,Path> entry : new HashMap<>(bean.inputFiles).entrySet() ) {
bean.inputFiles.put( entry.key, toContainerMount(entry.value, true) )
}

When the task is an array, we don't need to set the bean input files, we just need to call toContainerMount() on all of the input file paths. But the problem is that at this point we no longer have the TaskRun.

Should we just add another field to the TaskBean?

List<Path> arrayInputFiles

@chris-mbiomics
Copy link
Contributor Author

just switched out the branches to use the taskbean based implementation suggested above. I think that's the easiest – since it's a list it doesn't have issues with multiple files with the same names from different tasks colliding and replacing each other.
It also follows the same pattern as with the working directory.

@pditommaso For the test case, I would need a second, different bucket, which would not be host to the workDir

@chris-mbiomics chris-mbiomics marked this pull request as ready for review January 31, 2025 07:25
@chris-mbiomics
Copy link
Contributor Author

Is this an issue with aws batch aswell? I didn't touch that code because the TaskBeans arrayWorkDirs wasn't used by aws batch

@pditommaso
Copy link
Member

I'd suggest @jorgee to take over this and review all cases

@bentsherman
Copy link
Member

AWS Batch should not need to be changed because it does not need the bucket mounts. Google Batch needs them only because it uses gcsfuse by default to access the inputs.

Fusion is technically doing a similar thing to Google Batch:

// remap input files to container mounted paths
for( Map.Entry<String,Path> entry : new HashMap<>(bean.inputFiles).entrySet() ) {
bean.inputFiles.put( entry.key, toContainerMount(entry.value, scheme) )
}

But I don't think the array parent needs the input files for Fusion. They are only used to populate the staging commands in the wrapper script, which the array job doesn't need.

So in summary, this new arrayInputFiles should only be needed for Google Batch on gcsfuse.

fixes nextflow-io#5701
TaskBean has been modified to generate a list of files which are staged for the individual tasks for taskArrays. GoogleBatchScriptLauncher then takes the paths and mounts the required buckets.

Signed-off-by: Christian Romberg <[email protected]>
@pditommaso
Copy link
Member

Closing in favour of #5739

@pditommaso pditommaso closed this Feb 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Google batch arrays don't mount all required mounts
4 participants