NullPointerException when retrying task with an input transfer failure #5727

Open
jorgee opened this issue Jan 30, 2025 · 5 comments
@jorgee
Contributor

jorgee commented Jan 30, 2025

Bug report

A NullPointerException is thrown during a task retry when there is a failure in the foreign file staging of an input file managed by the file porter. (#5690 (comment))

Expected behavior and actual behavior

When a task requires an input file transfer and the transfer fails, Nextflow throws an exception in the invokeTask method before the task hash is generated. The exception handling then triggers a retry, but the retry fails because the hash is null.
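
The ordering described above can be illustrated with a minimal, self-contained Java sketch. This is not Nextflow code: `TaskSketch`, `stageInputs`, `invokeTask`, and `resubmit` are hypothetical stand-ins. Only the sequence comes from the report: staging fails before the hash is assigned, then the retry path dereferences the hash.

```java
import java.io.IOException;

public class NullHashRetrySketch {

    /** Hypothetical stand-in for a task; the hash stays null until it is computed. */
    static class TaskSketch {
        byte[] hash;   // assigned only after input staging succeeds
    }

    /** Hypothetical staging step that always fails, mimicking a read timeout. */
    static void stageInputs(TaskSketch task) throws IOException {
        throw new IOException("Read timed out");
    }

    /** Staging runs first, so a staging failure leaves task.hash unset. */
    static void invokeTask(TaskSketch task) throws IOException {
        stageInputs(task);
        task.hash = new byte[] {1, 2, 3};   // never reached in this sketch
    }

    /** Retry path that assumes the hash already exists. */
    static int resubmit(TaskSketch task) {
        return task.hash.length;   // NullPointerException when hash is null
    }

    /** Returns a short description of what happened on the retry. */
    static String run() {
        TaskSketch task = new TaskSketch();
        try {
            invokeTask(task);
            return "invoked";
        } catch (IOException stagingFailure) {
            try {
                resubmit(task);
                return "resubmitted";
            } catch (NullPointerException e) {
                return "retry failed: hash is null";
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch the retry is attempted even though the task was never fully prepared, which is the situation the stack trace in this issue points to.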

Steps to reproduce the problem

The original failure happened when a read timeout exception was thrown while transferring a file from an S3 bucket to a working dir in Google Cloud. I have reproduced the same error by using a non-existent S3 file as a task input and running the pipeline in Google Cloud. When running the same pipeline locally, the task fails without retrying.

Program output

See attached log.

Environment

  • Nextflow version: 24.12.0-edge
  • Java version: 21
  • Operating system: Linux
@jorgee jorgee changed the title NullPointerException when retrying task with a input transfer failure NullPointerException when retrying task with an input transfer failure Jan 30, 2025
@jorgee
Contributor Author

jorgee commented Jan 30, 2025

When there is an IOException, the FilePorter failure generates a ProcessStageException, which extends ProcessException. This is why Nextflow retries the task execution. However, when the failure happens during the invocation, the task context has not been fully generated, which makes the task not retryable, at least with the resumeOrDie method. In this case, I think we should retry by cleaning the task and invoking it again.
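
A rough sketch of that idea, in plain Java rather than Nextflow's Groovy. `TaskSketch` and `retry` are hypothetical names and the returned strings only label the chosen path; the assumption is that a null hash is a usable signal that the invocation never completed.

```java
public class RetryDispatchSketch {

    /** Hypothetical task state; hash is null when the invocation failed early. */
    static class TaskSketch {
        String hash;
        int attempt = 1;
    }

    /** Chooses the retry path based on whether the task was fully prepared. */
    static String retry(TaskSketch task) {
        task.attempt++;
        if (task.hash == null) {
            // Invocation never completed: start over instead of re-submitting.
            return "invoke-again";
        }
        // Task context is complete: the existing hash can be reused.
        return "resubmit";
    }

    public static void main(String[] args) {
        TaskSketch failedEarly = new TaskSketch();
        System.out.println(retry(failedEarly));   // invoke-again

        TaskSketch prepared = new TaskSketch();
        prepared.hash = "ab12cd";
        System.out.println(retry(prepared));      // resubmit
    }
}
```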
Moreover, the read timeout error is a SocketTimeoutException, which is a subclass of InterruptedIOException. It is not retried because of this statement. I think the read timeout is caused by a temporary network outage, so it could be retried.
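
The class relationship can be checked with JDK classes alone. In the sketch below, `isRetryable` and `isRetryableWithTimeouts` are hypothetical helpers, not Nextflow's actual logic. They only show that a check excluding `InterruptedIOException` also excludes read timeouts, and that testing the more specific type first is one way to treat timeouts as transient.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

public class RetryClassificationSketch {

    /** Hypothetical policy: retry I/O errors, but never interrupted I/O. */
    static boolean isRetryable(Throwable error) {
        if (error instanceof InterruptedIOException) {
            return false;   // also matches SocketTimeoutException
        }
        return error instanceof IOException;
    }

    /** Hypothetical variant: treat read timeouts as transient before the broader check. */
    static boolean isRetryableWithTimeouts(Throwable error) {
        if (error instanceof SocketTimeoutException) {
            return true;
        }
        return isRetryable(error);
    }

    public static void main(String[] args) {
        Throwable timeout = new SocketTimeoutException("Read timed out");
        System.out.println(isRetryable(timeout));              // false
        System.out.println(isRetryableWithTimeouts(timeout));  // true
        System.out.println(isRetryableWithTimeouts(
                new InterruptedIOException("interrupted")));   // false
    }
}
```

Whether a genuine thread interruption should still be excluded from retries is a separate design question; the variant above keeps that exclusion.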

@pditommaso
Member

Any chance to include the related error stack trace? There are many in the log file.

@jorgee
Contributor Author

jorgee commented Jan 30, 2025

This is the error stack trace:

Jan-26 21:49:16.792 [Actor Thread 3] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=NFCORE_SAREK:PREPARE_GENOME:TABIX_DBSNP (1); work-dir=null
  error [nextflow.exception.ProcessStageException]: Can't stage file s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz -- reason: Read timed out
Jan-26 21:49:16.872 [Actor Thread 3] INFO  nextflow.processor.TaskProcessor - [null] NOTE: Can't stage file s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz -- reason: Read timed out -- Execution is retried (1)
Jan-26 21:49:16.882 [pool-2-thread-2] ERROR nextflow.processor.TaskProcessor - Unable to re-submit task `NFCORE_SAREK:PREPARE_GENOME:TABIX_DBSNP (1)`
java.lang.NullPointerException: Cannot invoke "com.google.common.hash.HashCode.asBytes()" because "hash" is null
	at nextflow.processor.TaskProcessor.checkCachedOrLaunchTask(TaskProcessor.groovy:804)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:342)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
	at nextflow.processor.TaskProcessor$_checkErrorStrategy_closure20.doCall(TaskProcessor.groovy:1169)
	at nextflow.processor.TaskProcessor$_checkErrorStrategy_closure20.doCall(TaskProcessor.groovy)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:279)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at groovy.lang.Closure.call(Closure.java:433)
	at groovy.lang.Closure.call(Closure.java:412)
	at groovy.lang.Closure.run(Closure.java:505)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

@pditommaso
Member

pditommaso commented Jan 30, 2025

Is this meant to be solved by #5723?

@jorgee
Contributor Author

jorgee commented Jan 30, 2025

It is included in #5690, mainly to allow @ejseqera to do the stress test and check both issues at the same time.

@jorgee jorgee self-assigned this Jan 30, 2025