Stabilize adaptive rate limit by considering current rate #1027

Merged: 10 commits merged into opensearch-project:main on Feb 6, 2025

Conversation

ykmr1224 (Collaborator) commented Feb 1, 2025

Description

  • Based on testing the adaptive rate limit implementation with data, I identified the following issues:
  • The rate limit was increased even when the current limit was not fully utilized. This led to an excessively high rate limit, and decreasing the limit took a long time to become effective.
  • Since the rate limit is adjusted when a request finishes, the adjustment frequency depends on the current request rate (the rate limit is increased slowly when the limit is low, and quickly when it is high).
  • This PR addresses the first issue. (The second issue is not critical as long as we don't set the minimum rate to a lower value; the current value is 5000.)
  • BulkRequestRateMeter collects data points for the current rate over a recent 3-second window and calculates the estimated current rate. If the current rate is lower than 80% of the rate limit, the limit won't be increased. (See the sketch below.)
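
To make this concrete, here is a minimal sketch of the idea, not the actual implementation: BulkRequestRateMeter, addDataPoint, removeOldDataPoints, the DataPoint queue, and the 3-second window come from this PR, while getCurrentEstimatedRate, the AdaptiveRateLimiterSketch wrapper, and the increase step are illustrative assumptions.

// Simplified sketch of the rate-meter idea in this PR; getCurrentEstimatedRate and the
// AdaptiveRateLimiterSketch wiring below are illustrative names, not copied from the change.
import java.util.LinkedList;
import java.util.Queue;

class BulkRequestRateMeter {
  private static final long WINDOW_MILLIS = 3_000; // recent 3-second window

  private static class DataPoint {
    final long timestamp;
    final long requestCount;
    DataPoint(long timestamp, long requestCount) {
      this.timestamp = timestamp;
      this.requestCount = requestCount;
    }
  }

  private final Queue<DataPoint> dataPoints = new LinkedList<>();
  private long currentSum = 0;

  synchronized void addDataPoint(long timestamp, long requestCount) {
    dataPoints.add(new DataPoint(timestamp, requestCount));
    currentSum += requestCount;
    removeOldDataPoints(timestamp); // evict points that fell out of the window
  }

  private void removeOldDataPoints(long now) {
    while (!dataPoints.isEmpty() && dataPoints.peek().timestamp < now - WINDOW_MILLIS) {
      currentSum -= dataPoints.remove().requestCount;
    }
  }

  // Estimated requests per second over the recent window (illustrative method name).
  synchronized long getCurrentEstimatedRate(long now) {
    removeOldDataPoints(now);
    return currentSum * 1000 / WINDOW_MILLIS;
  }
}

// Illustrative gate, not the PR's class: only raise the limit when it is actually utilized.
class AdaptiveRateLimiterSketch {
  private final BulkRequestRateMeter meter = new BulkRequestRateMeter();
  private long rateLimit = 5_000; // current minimum rate mentioned in the description

  void onRequestSuccess(long now) {
    // Increase only if the estimated current rate is at least 80% of the current limit.
    if (meter.getCurrentEstimatedRate(now) >= rateLimit * 0.8) {
      rateLimit += 500; // illustrative increase step
    }
  }
}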

Related Issues

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

ykmr1224 (Collaborator Author) commented Feb 3, 2025

I'll address the integration test failure.

Signed-off-by: Tomoyuki Morita <[email protected]>
@@ -543,6 +543,8 @@ In the index mapping, the `_meta` and `properties`field stores meta and schema i
- `spark.datasource.flint.read.scroll_size`: default value is 100.
- `spark.datasource.flint.read.scroll_duration`: default value is 5 minutes. scroll context keep alive duration.
- `spark.datasource.flint.retry.max_retries`: max retries on failed HTTP request. default value is 3. Use 0 to disable retry.
- `spark.datasource.flint.retry.bulk.max_retries`: max retries on failed bulk request. default value is 10. Use 0 to disable retry.
- `spark.datasource.flint.retry.bulk.initial_backoff`: initial backoff in seconds for bulk request retry, default is 4.
Collaborator commented:

Any reason to choose 4s as the default value?

ykmr1224 (Collaborator Author) replied:

It was a fixed value, and I made it a configuration. The original intention is to have a higher initial backoff so the rate is reduced quickly.
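
For illustration only (this snippet is not part of the PR): like the other Flint retry options listed in the docs diff above, the new initial backoff would be supplied as a Spark configuration. A minimal sketch assuming a standard SparkSession setup; the class and app names are hypothetical, while the key names and default values come from the documentation above.

// Illustrative only: setting the Flint bulk-retry options when building a SparkSession.
import org.apache.spark.sql.SparkSession;

public class FlintBulkRetryConfigExample { // hypothetical class name
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("flint-bulk-retry-example")                              // hypothetical app name
        .config("spark.datasource.flint.retry.bulk.max_retries", "10")    // default per the docs above
        .config("spark.datasource.flint.retry.bulk.initial_backoff", "4") // seconds, default per the docs above
        .getOrCreate();
    spark.stop();
  }
}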

private Queue<DataPoint> dataPoints = new LinkedList<>();
private long currentSum = 0;

synchronized void addDataPoint(long timestamp, long requestCount) {
Collaborator commented:

Nit: could we consider reducing the number of data points maintained by using bucket-based aggregation, which has bounded memory usage?

ykmr1224 (Collaborator Author) replied:

The number of data points tends to be small (around 1 to 20), since each batch contains around 5k requests and we add one data point per batch, so the memory usage is almost negligible.

ykmr1224 (Collaborator Author) replied:

Added removeOldDataPoints in addDataPoint.
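
For comparison, a minimal sketch of the bucket-based aggregation the reviewer suggested (not what this PR adopts): per-second buckets in a fixed-size ring bound memory regardless of how many data points arrive. All names here are illustrative.

// Illustrative sketch of the reviewer's bucket-based suggestion; not the PR's implementation.
class BucketedRateMeter { // hypothetical class name
  private static final int WINDOW_SECONDS = 3; // same 3-second window as the PR
  private final long[] buckets = new long[WINDOW_SECONDS];      // one bucket per second, fixed memory
  private final long[] bucketSecond = new long[WINDOW_SECONDS]; // which second each bucket holds

  synchronized void addDataPoint(long timestampMillis, long requestCount) {
    long second = timestampMillis / 1000;
    int idx = (int) (second % WINDOW_SECONDS);
    if (bucketSecond[idx] != second) { // bucket still holds data from an older second; reset it
      buckets[idx] = 0;
      bucketSecond[idx] = second;
    }
    buckets[idx] += requestCount;
  }

  synchronized long getCurrentRate(long nowMillis) {
    long nowSecond = nowMillis / 1000;
    long sum = 0;
    for (int i = 0; i < WINDOW_SECONDS; i++) {
      if (nowSecond - bucketSecond[i] < WINDOW_SECONDS) { // only count buckets inside the window
        sum += buckets[i];
      }
    }
    return sum / WINDOW_SECONDS; // estimated requests per second over the window
  }
}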


synchronized void addDataPoint(long timestamp, long requestCount) {
dataPoints.add(new DataPoint(timestamp, requestCount));
currentSum += requestCount;
Collaborator commented:

We could call removeOldDataPoint during add to reduce memory usage.

ykmr1224 (Collaborator Author) replied Feb 5, 2025:

Same as above. The memory usage is almost negligible.

Signed-off-by: Tomoyuki Morita <[email protected]>
dai-chen (Collaborator) left a review:

Thanks for the changes!

Signed-off-by: Tomoyuki Morita <[email protected]>
YANG-DB (Member) left a review:

@ykmr1224 Are you familiar with resilience4j? I think it could be an interesting option for better resilience when managing fault tolerance for remote communications. We already use it in the SQL repo.

ykmr1224 (Collaborator Author) commented Feb 6, 2025

I checked resilience4j earlier. Do we see any additional benefit over the Failsafe library we are currently using?
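
For reference, a minimal sketch of how the Failsafe library mentioned above expresses a retry policy; this is illustrative only, not code from this repository, and assumes the Failsafe 3.x builder API. The 4-second initial backoff and 10 max retries mirror the bulk-retry defaults discussed earlier, while the exception type, backoff cap, and class name are assumptions.

// Illustrative only: a Failsafe 3.x retry policy roughly mirroring the bulk-retry defaults above.
import dev.failsafe.Failsafe;
import dev.failsafe.RetryPolicy;
import java.time.Duration;

public class BulkRetryExample { // hypothetical class name
  public static void main(String[] args) {
    RetryPolicy<Object> retryPolicy = RetryPolicy.builder()
        .handle(RuntimeException.class)                              // assumed failure type for illustration
        .withBackoff(Duration.ofSeconds(4), Duration.ofSeconds(30))  // 4s initial backoff; 30s cap is an assumption
        .withMaxRetries(10)                                          // default bulk max retries per the docs above
        .build();

    Failsafe.with(retryPolicy).run(() -> {
      // illustrative placeholder for submitting a bulk request
      System.out.println("submitting bulk request");
    });
  }
}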

Signed-off-by: Tomoyuki Morita <[email protected]>
YANG-DB (Member) commented Feb 6, 2025

IMO there is a benefit of using the same library in both repos (SQL/spark)...

ykmr1224 merged commit 77da4a7 into opensearch-project:main on Feb 6, 2025
4 checks passed