[Bug]: Periodic Impulse with Fixed Window Stalls When Used as a Side Input in a Session Window Pipeline #34010

Open
1 of 17 tasks
SanjayPanda opened this issue Feb 18, 2025 · 3 comments

SanjayPanda commented Feb 18, 2025

What happened?

I am using the Apache Beam Python SDK with the Dataflow runner. Our pipeline is designed to periodically fetch lookup data and use it to enrich the main processing pipeline. The main pipeline is configured as a session window, while the lookup side-input is provided using Periodic Impulse with a fixed window of 2 hours.
Currently, every time we run the pipeline, it executes for approximately 45 minutes before stopping at the enrich ParDo step, where the side input is passed using AsMultiMap.

I investigated the issue and found two Stack Overflow discussions that describe similar behaviour:
google cloud dataflow - Apache beam blocked on unbounded side input - Stack Overflow
python - Apache Beam Cloud Dataflow Streaming Stuck Side Input - Stack Overflow

Would appreciate any insights or suggestions on resolving this problem.
Image

Here is the data freshness graph
Image
The Enrich ParDo stopped processing until the next pulse was triggered.

Questions:

  1. Am I overlooking anything?
  2. Do Session Window and Fixed Window side inputs work together as expected? If not, why?
  3. What alternative approaches can I use for this use case?

Appreciate your help and support. Thanks 😊

I attempted to email using the provided email ID ('[email protected]'), but it didn't work, so I decided to create this issue instead.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@SanjayPanda SanjayPanda changed the title [Bug]: Periodic Impulse with fixed window getting stuck after sometime when used side input to a session window main pipeline [Bug]: Periodic Impulse with Fixed Window Stalls When Used as a Side Input in a Session Window Pipeline Feb 18, 2025
@liferoad
Contributor

Have you tried https://cloud.google.com/dataflow/docs/guides/enrichment? For your case, you might need to use with_redis_cache with TTL: https://beam.apache.org/documentation/transforms/python/elementwise/enrichment/

Also, to check whether Session Window and Fixed Window side inputs work, you probably can test it by removing windowing. I do not see why it shouldn't work.
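On the question of how a session-windowed main input lines up with a fixed-window side input: as far as I understand the default window mapping, each main-input window reads the side-input window that contains the main window's maximum timestamp, and the consuming ParDo waits until that side-input window is ready. Below is a minimal plain-Python sketch of that lookup. It does not call Beam; `Window`, `fixed_window_for`, and `side_input_window` are hypothetical names used only for illustration, and it does not by itself diagnose the stall reported above.

```python
from dataclasses import dataclass

MICROS = 1_000_000              # timestamps are integer microseconds
TWO_HOURS = 2 * 60 * 60 * MICROS


@dataclass(frozen=True)
class Window:
    start: int  # inclusive, microseconds
    end: int    # exclusive, microseconds

    @property
    def max_timestamp(self) -> int:
        # Last timestamp that still belongs to the window.
        return self.end - 1


def fixed_window_for(timestamp: int, size: int) -> Window:
    """Return the fixed window of the given size that contains timestamp."""
    start = timestamp - (timestamp % size)
    return Window(start, start + size)


def side_input_window(main_window: Window, side_size: int) -> Window:
    """Map a main-input window to a fixed side-input window via its end."""
    return fixed_window_for(main_window.max_timestamp, side_size)


# A session that starts at 01:50:00 and closes at 02:05:00.
session = Window(start=6600 * MICROS, end=7500 * MICROS)
mapped = side_input_window(session, TWO_HOURS)
print(mapped.start // MICROS, mapped.end // MICROS)  # 7200 14400
```

In this sketch, a session that began in the first 2-hour window but ended in the second one reads the side input for the second window, so it can only proceed once that window of lookup data has been produced.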

@SanjayPanda
Author

SanjayPanda commented Feb 18, 2025

Hi @liferoad

Thanks for the reply :)

I tried processing the data without windowing, but it still gets stalled at Enrich ParDo. Using triggers with a custom window seems to work but consumes too many resources.

Here is a screenshot of where it stalls. After this step, no data is flowing.
Image

This is the data freshness for this step:
Image

Here is an example of one Periodic Impulse step:
Image

I have around 21 sources, each under 30MB, and the data is fairly static over 24 hours. Do you think the Enrichment transform would be better suited to this, or a side input?

@liferoad
Contributor

https://beam.apache.org/documentation/patterns/side-inputs/ has a Python example showing how to use side inputs. I suggest checking it as well. I would recommend the Enrichment transform if possible, since it is developed by the Beam team and has been tested on Dataflow.
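To make the "cache with a TTL" idea concrete for small, slowly changing lookup data, here is a minimal plain-Python sketch. It is not the Enrichment transform or its Redis cache API; `TtlLookup`, `loader`, and `ttl_seconds` are hypothetical names, and the assumption is that something like this would be created once per worker (for example in a DoFn's `setup`) and queried while processing each element.

```python
import time
from typing import Callable, Dict, List, Optional


class TtlLookup:
    """Hold lookup data in memory and reload it after ttl_seconds."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, List[str]]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader        # hypothetical: fetches one lookup source
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the refresh can be tested
        self._data: Optional[Dict[str, List[str]]] = None
        self._loaded_at = 0.0

    def get(self, key: str) -> List[str]:
        """Return all values for key, reloading the data if it has expired."""
        now = self._clock()
        if self._data is None or now - self._loaded_at >= self._ttl:
            self._data = self._loader()
            self._loaded_at = now
        return self._data.get(key, [])


# Usage: reload at most once every 2 hours, matching the refresh period above.
lookup = TtlLookup(lambda: {"user-1": ["gold"]}, ttl_seconds=2 * 60 * 60)
print(lookup.get("user-1"))
```

The trade-off in this sketch is that each worker reloads on its own schedule, so workers may briefly see different versions of the lookup data, but the main input never waits on a side-input window.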
