
[Bug]: The submission_environment_dependencies.txt file does not get staged when running with Flink runner on Dataproc #32743

Open
liferoad opened this issue Oct 10, 2024 · 12 comments · Fixed by #32752

@liferoad
Contributor

liferoad commented Oct 10, 2024

What happened?

In some cases, "submission_environment_dependencies.txt" might not be staged.

#32752 added a workaround to ignore the error for a missing artifact, but we should root-cause why it did not get staged.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@tvalentyn
Contributor

Do we crash? Seems like we should just print something:

bufLogger.Printf(ctx, "couldn't fetch the submission environment dependencies: %v", err)

@liferoad
Contributor Author

Flink on Dataproc returns this:
failed to retrieve staged files: failed to retrieve /tmp/staged in 3 attempts: failed to retrieve chunk for /tmp/staged/submission_environment_dependencies.txt
The job then fails.

@tvalentyn
Contributor

OK - then the problem is in the materialization of staged artifacts: we have a file that is added to the manifest but is then not available when we try to materialize it.

It should either not be staged (and not included in the manifest), or it should be available in the staging location.
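
To make that invariant concrete, here is a hypothetical sanity check (validate_manifest and the paths are made up for illustration; this is not Beam code) that flags manifest entries which cannot be materialized from the staging location - covering both symptoms reported in this thread (missing file, empty file):

    import os

    def validate_manifest(staging_dir, artifact_names):
        """Return manifest entries that cannot be materialized from staging_dir."""
        problems = []
        for name in artifact_names:
            path = os.path.join(staging_dir, name)
            if not os.path.exists(path):
                problems.append(name + ": listed in manifest but missing from staging")
            elif os.path.getsize(path) == 0:
                problems.append(name + ": present but empty (0 bytes)")
        return problems

    # The failure in this issue would surface as:
    # validate_manifest("/tmp/staged", ["submission_environment_dependencies.txt"])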

@tvalentyn
Contributor

Workaround: supply --experiments=disable_logging_submission_environment

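For a Python pipeline, that experiment can be passed through PipelineOptions. A minimal sketch, assuming a portable Flink setup (the job endpoint and environment type below are placeholders, not taken from this thread; only the --experiments flag is the actual workaround):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",  # assumed Flink job service address
        "--environment_type=DOCKER",      # placeholder environment
        "--experiments=disable_logging_submission_environment",
    ])

    with beam.Pipeline(options=options) as p:
        _ = p | beam.Create([1, 2, 3]) | beam.Map(print)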

@github-actions github-actions bot added this to the 2.61.0 Release milestone Oct 11, 2024
@tvalentyn tvalentyn reopened this Oct 11, 2024
@tvalentyn tvalentyn changed the title [Bug]: when submission_environment_dependencies.txt somehow does not exist, the error should be ignored [Bug]: The submission_environment_dependencies.txt file does not get staged when running with Flink runner on Dataproc Oct 11, 2024
@tvalentyn tvalentyn removed this from the 2.61.0 Release milestone Oct 11, 2024
@liferoad liferoad self-assigned this Oct 11, 2024
@liferoad
Copy link
Contributor Author

Thanks, @tvalentyn. Let me investigate this later when I have time.

@zendesk-kjaanson

I also ran into this bug when trying to run Beam 2.59.0 using the PortableRunner on Kubernetes with the Apache Flink Operator. Looking into the task manager pods, the submission_environment_dependencies.txt file at /tmp/staging seems to be there but is completely empty. All spawned task managers simply sit idle after logging this error.

@liferoad
Contributor Author

liferoad commented Jan 8, 2025

You can disable this by using --experiments disable_logging_submission_environment

@zendesk-kjaanson

I solved the issue for myself; not sure how relevant it is to the issue at hand here. In my case, when trying to use the PortableRunner with Flink via the Apache Flink Operator, the staging volume was not accessible/shared between the job manager and the task managers/workers. For some reason this causes empty files for submission_environment_dependencies.txt (and, if you use save_main_session, also for pickled_main_session) to appear in /tmp/staged, which then results in the "failed to retrieve staged files: failed to retrieve /tmp/staged in 3 attempts: failed to retrieve chunk for /tmp/staged/submission_environment_dependencies.txt" error when the worker process tries to load them.

My issue was solved when I was able to create a working shared staging volume across pods.

Fun side note that might be helpful for someone: when you try to create a host-mounted PersistentVolume with the ReadWriteMany access mode on Google's GKE and use it as a volume, it never actually tells you that you can't do that - it simply mounts random (different) volumes across the pods. The docs mention that it is not supposed to be supported :D. I went with the FUSE CSI driver, which solved the issue on GKE.


@liferoad
Contributor Author

https://github.com/liferoad/beam-ml-flink/blob/main/Makefile#L159 is another workaround for my case.

@kennknowles
Member

I was just trying to understand this, but I am really naive here - is this file not just transmitted over the artifact API? We should really never rely on shared storage or on sharing files with Docker containers, etc.

@kennknowles
Member

What is "semi persist dir" anyhow?
