Stabilize adaptive rate limit by considering current rate #1027

Merged: 10 commits merged into opensearch-project:main on Feb 6, 2025

Conversation

ykmr1224 (Collaborator) commented Feb 1, 2025

Description

  • Based on testing the adaptive rate limit implementation with data, I identified the following issues:
  • The rate limit was increased even when the current limit was not fully utilized. This led to an excessively high rate limit, and decreasing the limit took a long time to become effective.
  • Since the rate limit is adjusted when a request finishes, the adjustment frequency depends on the current request rate (the rate limit is increased slowly when the limit is low, and quickly when it is high).
  • This PR addresses the first issue. (The second issue is not critical as long as we don't set the minimum rate to a lower value; the current value is 5000.)
  • BulkRequestRateMeter collects data points for the current rate over a recent 3-second window and calculates the estimated current rate. If the current rate is lower than 80% of the rate limit, the limit won't be increased. (See the sketch below.)
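
To make this concrete, here is a minimal sketch of the idea, not the actual implementation: BulkRequestRateMeter, addDataPoint, removeOldDataPoints, the DataPoint queue, and the 3-second window come from this PR, while getCurrentEstimatedRate, the AdaptiveRateLimiterSketch wrapper, and the increase step are illustrative assumptions.

// Simplified sketch of the rate-meter idea in this PR; getCurrentEstimatedRate and the
// AdaptiveRateLimiterSketch wiring below are illustrative names, not copied from the change.
import java.util.LinkedList;
import java.util.Queue;

class BulkRequestRateMeter {
  private static final long WINDOW_MILLIS = 3_000; // recent 3-second window

  private static class DataPoint {
    final long timestamp;
    final long requestCount;
    DataPoint(long timestamp, long requestCount) {
      this.timestamp = timestamp;
      this.requestCount = requestCount;
    }
  }

  private final Queue<DataPoint> dataPoints = new LinkedList<>();
  private long currentSum = 0;

  synchronized void addDataPoint(long timestamp, long requestCount) {
    dataPoints.add(new DataPoint(timestamp, requestCount));
    currentSum += requestCount;
    removeOldDataPoints(timestamp); // evict points that fell out of the window
  }

  private void removeOldDataPoints(long now) {
    while (!dataPoints.isEmpty() && dataPoints.peek().timestamp < now - WINDOW_MILLIS) {
      currentSum -= dataPoints.remove().requestCount;
    }
  }

  // Estimated requests per second over the recent window (illustrative method name).
  synchronized long getCurrentEstimatedRate(long now) {
    removeOldDataPoints(now);
    return currentSum * 1000 / WINDOW_MILLIS;
  }
}

// Illustrative gate, not the PR's class: only raise the limit when it is actually utilized.
class AdaptiveRateLimiterSketch {
  private final BulkRequestRateMeter meter = new BulkRequestRateMeter();
  private long rateLimit = 5_000; // current minimum rate mentioned in the description

  void onRequestSuccess(long now) {
    // Increase only if the estimated current rate is at least 80% of the current limit.
    if (meter.getCurrentEstimatedRate(now) >= rateLimit * 0.8) {
      rateLimit += 500; // illustrative increase step
    }
  }
}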

Related Issues

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

ykmr1224 (Collaborator Author) commented Feb 3, 2025

I'll address the integration test failure.

Signed-off-by: Tomoyuki Morita <[email protected]>
@@ -543,6 +543,8 @@ In the index mapping, the `_meta` and `properties`field stores meta and schema i
- `spark.datasource.flint.read.scroll_size`: default value is 100.
- `spark.datasource.flint.read.scroll_duration`: default value is 5 minutes. scroll context keep alive duration.
- `spark.datasource.flint.retry.max_retries`: max retries on failed HTTP request. default value is 3. Use 0 to disable retry.
- `spark.datasource.flint.retry.bulk.max_retries`: max retries on failed bulk request. default value is 10. Use 0 to disable retry.
- `spark.datasource.flint.retry.bulk.initial_backoff`: initial backoff in seconds for bulk request retry, default is 4.
Collaborator commented:

Any reason to choose 4s as the default value?

ykmr1224 (Collaborator Author) replied:

It was a fixed value, and I made it a configuration. The original intention is to have a higher initial backoff so the rate is reduced quickly.
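
For illustration only (this snippet is not part of the PR): like the other Flint retry options listed in the docs diff above, the new initial backoff would be supplied as a Spark configuration. A minimal sketch assuming a standard SparkSession setup; the class and app names are hypothetical, while the key names and default values come from the documentation above.

// Illustrative only: setting the Flint bulk-retry options when building a SparkSession.
import org.apache.spark.sql.SparkSession;

public class FlintBulkRetryConfigExample { // hypothetical class name
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("flint-bulk-retry-example")                              // hypothetical app name
        .config("spark.datasource.flint.retry.bulk.max_retries", "10")    // default per the docs above
        .config("spark.datasource.flint.retry.bulk.initial_backoff", "4") // seconds, default per the docs above
        .getOrCreate();
    spark.stop();
  }
}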

private Queue<DataPoint> dataPoints = new LinkedList<>();
private long currentSum = 0;

synchronized void addDataPoint(long timestamp, long requestCount) {
Collaborator commented:

Nit: could we consider reducing the number of data points maintained by using bucket-based aggregation, which has bounded memory usage?

ykmr1224 (Collaborator Author) replied:

The number of data points tends to be small (around 1 to 20), since each batch contains around 5k requests and we add one data point per batch, so the memory usage is almost negligible.

ykmr1224 (Collaborator Author) replied:

Added removeOldDataPoints in addDataPoint.
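
For comparison, a minimal sketch of the bucket-based aggregation the reviewer suggested (not what this PR adopts): per-second buckets in a fixed-size ring bound memory regardless of how many data points arrive. All names here are illustrative.

// Illustrative sketch of the reviewer's bucket-based suggestion; not the PR's implementation.
class BucketedRateMeter { // hypothetical class name
  private static final int WINDOW_SECONDS = 3; // same 3-second window as the PR
  private final long[] buckets = new long[WINDOW_SECONDS];      // one bucket per second, fixed memory
  private final long[] bucketSecond = new long[WINDOW_SECONDS]; // which second each bucket holds

  synchronized void addDataPoint(long timestampMillis, long requestCount) {
    long second = timestampMillis / 1000;
    int idx = (int) (second % WINDOW_SECONDS);
    if (bucketSecond[idx] != second) { // bucket still holds data from an older second; reset it
      buckets[idx] = 0;
      bucketSecond[idx] = second;
    }
    buckets[idx] += requestCount;
  }

  synchronized long getCurrentRate(long nowMillis) {
    long nowSecond = nowMillis / 1000;
    long sum = 0;
    for (int i = 0; i < WINDOW_SECONDS; i++) {
      if (nowSecond - bucketSecond[i] < WINDOW_SECONDS) { // only count buckets inside the window
        sum += buckets[i];
      }
    }
    return sum / WINDOW_SECONDS; // estimated requests per second over the window
  }
}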


synchronized void addDataPoint(long timestamp, long requestCount) {
dataPoints.add(new DataPoint(timestamp, requestCount));
currentSum += requestCount;
Collaborator commented:

We could call removeOldDataPoint during add to reduce memory usage.

ykmr1224 (Collaborator Author) replied Feb 5, 2025:

Same as above. The memory usage is almost negligible.

Signed-off-by: Tomoyuki Morita <[email protected]>
dai-chen (Collaborator) left a review:

Thanks for the changes!

Signed-off-by: Tomoyuki Morita <[email protected]>
YANG-DB (Member) left a review:

@ykmr1224 Are you familiar with resilience4j? I think it could be an interesting option for better resilience when managing fault tolerance for remote communications. We already use it in the SQL repo.

ykmr1224 (Collaborator Author) commented Feb 6, 2025

I checked resilience4j earlier. Do we see any additional benefit over the Failsafe library we are currently using?
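
For reference, a minimal sketch of how the Failsafe library mentioned above expresses a retry policy; this is illustrative only, not code from this repository, and assumes the Failsafe 3.x builder API. The 4-second initial backoff and 10 max retries mirror the bulk-retry defaults discussed earlier, while the exception type, backoff cap, and class name are assumptions.

// Illustrative only: a Failsafe 3.x retry policy roughly mirroring the bulk-retry defaults above.
import dev.failsafe.Failsafe;
import dev.failsafe.RetryPolicy;
import java.time.Duration;

public class BulkRetryExample { // hypothetical class name
  public static void main(String[] args) {
    RetryPolicy<Object> retryPolicy = RetryPolicy.builder()
        .handle(RuntimeException.class)                              // assumed failure type for illustration
        .withBackoff(Duration.ofSeconds(4), Duration.ofSeconds(30))  // 4s initial backoff; 30s cap is an assumption
        .withMaxRetries(10)                                          // default bulk max retries per the docs above
        .build();

    Failsafe.with(retryPolicy).run(() -> {
      // illustrative placeholder for submitting a bulk request
      System.out.println("submitting bulk request");
    });
  }
}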

Signed-off-by: Tomoyuki Morita <[email protected]>
YANG-DB (Member) commented Feb 6, 2025

IMO there is a benefit of using the same library in both repos (SQL/spark)...

ykmr1224 merged commit 77da4a7 into opensearch-project:main on Feb 6, 2025
4 checks passed