
Consumers in a consumer group stuck for 4 days after ErrOffsetOutOfRange error #2855

Open
shubham-dogra-s1 opened this issue Apr 6, 2024 · 7 comments
Labels: needs-investigation (Issues that require followup from maintainers)


shubham-dogra-s1 commented Apr 6, 2024

Description

We recently noticed in our staging and production environments that consumer groups got stuck for more than 4 days and were not consuming messages from their partitions. After restarting the pods, they started working again.

Related Issue: #2682

Versions
Sarama: 1.42.1
Kafka: 3.4.1
Go: 1.21
Configuration
cfg.Consumer.Group.Rebalance.GroupStrategies = []sarama.BalanceStrategy{sarama.NewBalanceStrategyRoundRobin()}
cfg.Consumer.Offsets.Initial = sarama.OffsetNewest
cfg.Consumer.Group.Session.Timeout = time.Second * time.Duration(SESSION_TIMEOUT)
cfg.Consumer.Group.Heartbeat.Interval = time.Second * time.Duration(CONSUMER_HEARTBEAT)
cfg.Consumer.Return.Errors = true
cfg.Consumer.Fetch.Min = 100 * 1024          // 100 KB
cfg.Consumer.Fetch.Default = 2 * 1024 * 1024 // 2 MB
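
For context, with Consumer.Return.Errors enabled, errors are surfaced on the consumer group's Errors() channel (which is how the log lines below are produced). A minimal sketch of that wiring — brokers and groupID are placeholders, not our real values:

// Sketch only; requires "log" and "github.com/IBM/sarama".
group, err := sarama.NewConsumerGroup(brokers, groupID, cfg)
if err != nil {
	log.Fatalf("failed to create consumer group: %v", err)
}
// With Consumer.Return.Errors = true, errors are delivered on Errors()
// instead of only going to the sarama logger, so read and log them here.
go func() {
	for err := range group.Errors() {
		log.Printf("[Sarama Consumer Error]: %v", err)
	}
}()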
Logs
[ERRR] 2024/04/04 17:45:01 [Sarama Consumer Error]: kafka: error while consuming results.default/5: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 17:44:47 [Sarama Consumer Error]: kafka: error while consuming results.default/1: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 17:44:38 [Sarama Consumer Error]: kafka: error while consuming results.default/13: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 17:44:37 [Sarama Consumer Error]: kafka: error while consuming results.default/11: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 17:43:56 [Sarama Consumer Error]: kafka: error while consuming results.default/6: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 17:43:28 [Sarama Consumer Error]: kafka: error while consuming results.default/3: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 17:39:16 [Sarama Consumer Error]: kafka: error while consuming results.default/0: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 17:39:13 [Sarama Consumer Error]: kafka: error while consuming results.default/14: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 17:39:03 [Sarama Consumer Error]: kafka: error while consuming results.default/4: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 17:38:51 [Sarama Consumer Error]: kafka: error while consuming results.default/12: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 17:38:42 [Sarama Consumer Error]: kafka: error while consuming results.default/10: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 17:38:34 [Sarama Consumer Error]: kafka: error while consuming results.default/2: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 16:45:02 [Sarama Consumer Error]: kafka: error while consuming results.default/8: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 16:44:53 [Sarama Consumer Error]: kafka: error while consuming results.default/15: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 16:44:07 [Sarama Consumer Error]: kafka: error while consuming results.default/9: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition
[ERRR] 2024/04/04 16:44:02 [Sarama Consumer Error]: kafka: error while consuming results.default/7: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition

Additional Context

We faced the same behaviour with this error as well: Request exceeded the user-specified time limit in the request.

shubham-dogra-s1 (Author) commented

@dnwe can you help with this issue? We have faced the same problem again: consumers getting stuck for a long period of time.

@dnwe dnwe self-assigned this Apr 11, 2024
@dnwe dnwe added the needs-investigation Issues that require followup from maintainers label Apr 11, 2024
dnwe (Collaborator) commented Apr 11, 2024

@shubham-dogra-s1 👋🏻 thanks for getting in touch

The first thing to double-check would be your consumer group lag vs the topic retention. If the group's committed offsets have fallen too far behind to keep up with retention, then it is possible the log has been truncated and your client is trying to consume from an older offset that no longer exists.
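
One rough way to check this programmatically is to compare each partition's committed offset against the oldest offset the broker still retains; a committed offset below "oldest" means the log was already truncated past the group's position. Sketch only, with illustrative names:

// Requires "fmt" and "github.com/IBM/sarama".
func checkLagVsRetention(brokers []string, group, topic string, partitions []int32) error {
	client, err := sarama.NewClient(brokers, sarama.NewConfig())
	if err != nil {
		return err
	}
	defer client.Close()

	admin, err := sarama.NewClusterAdminFromClient(client)
	if err != nil {
		return err
	}

	// Committed offsets for the group on this topic.
	resp, err := admin.ListConsumerGroupOffsets(group, map[string][]int32{topic: partitions})
	if err != nil {
		return err
	}

	for _, p := range partitions {
		// Oldest offset the broker still has for this partition.
		oldest, err := client.GetOffset(topic, p, sarama.OffsetOldest)
		if err != nil {
			return err
		}
		block := resp.GetBlock(topic, p)
		if block != nil && block.Offset >= 0 && block.Offset < oldest {
			fmt.Printf("partition %d: committed %d is below oldest retained %d\n", p, block.Offset, oldest)
		}
	}
	return nil
}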

shubham-dogra-s1 (Author) commented Apr 11, 2024

@dnwe yes that is possible.

I can see that in the library code the offset is reset when an ErrOffsetOutOfRange error occurs:

if errors.Is(err, ErrOffsetOutOfRange) && sess.parent.config.Consumer.Group.ResetInvalidOffsets {

Even though the error is handled there, it still somehow results in an infinite loop.

But we recently faced the same issue with another error, Request exceeded the user-specified time limit in the request. I guess the same thing is happening there as well. Logs are attached below, e.g. error while consuming results.priority/53: read tcp i/o timeout. The consumer on results.priority partition 53 keeps retrying the fetch and the request times out (possibly because the requested offset is no longer available).

On restarting the pods, the consumers started working again.
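
For completeness, the reset behaviour is controlled by these two settings; a minimal sketch (I believe ResetInvalidOffsets defaults to true, but worth confirming for the sarama version in use):

cfg := sarama.NewConfig()
// Reset the partition consumer when the committed offset is out of range.
cfg.Consumer.Group.ResetInvalidOffsets = true
// The reset target: with OffsetNewest, anything between the truncated
// position and "newest" is skipped rather than replayed.
cfg.Consumer.Offsets.Initial = sarama.OffsetNewest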

shubham-dogra-s1 (Author) commented Apr 11, 2024

Attaching some more logs regarding the Request exceeded the user-specified time limit in the request error for easier debugging.

Logs from client

[ERRR][Sarama Consumer Error]: kafka: error while consuming results.priority/53: read tcp  i/o timeout
[ERRR][Sarama Consumer Error]: kafka: error while consuming results.priority/0: read tcp i/o timeout
[ERRR][Sarama Consumer Error]: kafka: error while consuming results.on_demand/49: read tcp  i/o timeout
[ERRR][Sarama Consumer Error]: kafka: error while consuming results.on_demand/41: read tcp i/o timeout
[ERRR][Sarama Consumer Error]: kafka: error while consuming results.on_demand/0: read tcp  i/o timeout

Kafka Exporter Logs

E0411 07:12:22.868864       1 kafka_exporter.go:598] Cannot get offset of group results.on_demand tcp i/o timeout

github-actions bot commented

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur.
Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

@github-actions github-actions bot added the stale Issues and pull requests without any recent activity label Jul 10, 2024
@dnwe dnwe removed the stale Issues and pull requests without any recent activity label Jul 10, 2024

github-actions bot commented Oct 8, 2024

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur.
Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

@github-actions github-actions bot added the stale Issues and pull requests without any recent activity label Oct 8, 2024
dharmjit commented

Hi @shubham-dogra-s1, I faced a similar issue (ErrOffsetOutOfRange) in my app as well. For us this was related to Kafka purging the topic partitions based on the retention policy/params, as @dnwe suggested. Increasing the retention size for the partitions made the error go away for us.
Also make sure that the consume loop runs indefinitely and is able to recover from such errors:

	for {
		ll.Info("consume loop started")

		// This will block until a rebalance or an error occurs
		if err := cg.group.Consume(cg.ctx, topics, cg.handler); err != nil {
			ll.Err(err).Error("consumer error")
		}

		// Check if the context is done before starting a new consume cycle
		if cg.ctx.Err() != nil {
			// Context is canceled, exit the loop
			ll.Info("context canceled, stopping consume loop")
			return
		}
		ll.Info("restarting consumer session")

		// Reset the handler.ready channel when existing session ends
		cg.handler.ready = make(chan bool, 1)
	}
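
For reference, a handler along these lines pairs with that loop (illustrative sketch, not the exact code): Setup closes the ready channel that the loop re-creates after each session, and ConsumeClaim marks messages as they are processed.

// Sketch only; requires "github.com/IBM/sarama".
type handler struct {
	ready chan bool
}

// Setup runs at the start of each session; closing ready lets callers wait
// for the first successful session before proceeding.
func (h *handler) Setup(sarama.ConsumerGroupSession) error {
	close(h.ready)
	return nil
}

func (h *handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

// ConsumeClaim reads messages for one claimed partition and marks them so
// their offsets get committed.
func (h *handler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for msg := range claim.Messages() {
		// ... process msg ...
		sess.MarkMessage(msg, "")
	}
	return nil
}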

@github-actions github-actions bot removed the stale Issues and pull requests without any recent activity label Oct 23, 2024