
[Enhancement] Use an LRU cache to limit the number of starlet filesystem instances #55845

Open

wants to merge 1 commit into main from fs_instance

Conversation

@xiangguangyxg (Contributor) commented Feb 12, 2025

Why I'm doing:

When there are too many partitions, there are also a lot of starlet filesystem instances, which may consume too much memory.

What I'm doing:

Use an LRU cache to limit the number of starlet filesystem instances.
Add the BE config starlet_filesystem_instance_cache_capacity to configure the LRU cache capacity (default 10000).

Fixes #55765
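
For illustration, here is a minimal sketch of the caching idea, assuming a plain std::list + std::unordered_map LRU. This is not the PR's actual implementation (which builds on the BE's LRU cache utility), and FileSystem / new_filesystem() are stand-ins for the starlet fslib types:

#include <list>
#include <memory>
#include <string>
#include <unordered_map>

// Stand-ins for the starlet fslib types (illustrative only).
struct FileSystem {};
std::shared_ptr<FileSystem> new_filesystem(const std::string& /*key*/) {
    return std::make_shared<FileSystem>();
}

// Capacity-bounded LRU: once full, the least recently used filesystem
// instance is dropped, bounding the total memory held by fs instances.
class FsLruCache {
public:
    explicit FsLruCache(size_t capacity) : _capacity(capacity) {}

    std::shared_ptr<FileSystem> get_or_create(const std::string& key) {
        auto it = _index.find(key);
        if (it != _index.end()) {
            // Hit: move the entry to the front (most recently used).
            _lru.splice(_lru.begin(), _lru, it->second);
            return it->second->second;
        }
        // Miss: build a new instance and insert it at the front.
        auto fs = new_filesystem(key);
        _lru.emplace_front(key, fs);
        _index[key] = _lru.begin();
        if (_lru.size() > _capacity) {
            // Evict the least recently used entry.
            _index.erase(_lru.back().first);
            _lru.pop_back();
        }
        return fs;
    }

private:
    size_t _capacity;
    std::list<std::pair<std::string, std::shared_ptr<FileSystem>>> _lru;
    std::unordered_map<std::string, decltype(_lru)::iterator> _index;
};

The capacity here would come from the new starlet_filesystem_instance_cache_capacity config.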

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This PR needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport PR

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

@kevincai (Contributor) left a comment

This could introduce a different issue with the current implementation: if a few shards/tablets are deleted, their filesystems will never be used again. However, if the LRU cache still has plenty of free slots, those filesystem instances will not be evicted at all, which in turn means the related objects are not released until a service restart.

be/src/common/config.h (outdated, resolved)
@xiangguangyxg (Contributor, Author) commented Feb 13, 2025

> This could introduce a different issue with the current implementation: if a few shards/tablets are deleted, their filesystems will never be used again. However, if the LRU cache still has plenty of free slots, those filesystem instances will not be evicted at all, which in turn means the related objects are not released until a service restart.

@kevincai Yes, it has that issue. Once a cache slot is used, it will not be released actively, only evicted passively. That is a bit like starcache.

@kevincai (Contributor) commented

> This could introduce a different issue with the current implementation: if a few shards/tablets are deleted, their filesystems will never be used again. However, if the LRU cache still has plenty of free slots, those filesystem instances will not be evicted at all, which in turn means the related objects are not released until a service restart.
>
> @kevincai Yes, it has that issue. Once a cache slot is used, it will not be released actively, only evicted passively. That is a bit like starcache.

Sounds like an LRU with auto-expiration would perfectly match this use case.
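
For reference, a rough sketch of what an LRU with auto-expiration could look like; the TTL field and sweep loop are assumptions for illustration, not code from this PR:

#include <chrono>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>

struct FileSystem {};
using Clock = std::chrono::steady_clock;

// Each entry carries a last-access timestamp so idle filesystems can be
// expired even when the cache is far below capacity.
struct Entry {
    std::string key;
    std::shared_ptr<FileSystem> fs;
    Clock::time_point last_access;
};

class ExpiringFsCache {
public:
    ExpiringFsCache(size_t capacity, std::chrono::seconds ttl)
            : _capacity(capacity), _ttl(ttl) {}

    // Drop entries idle longer than the TTL; could run periodically in a
    // background thread or piggyback on each lookup. Because the list is
    // kept in recency order, the stalest entries are always at the back.
    void evict_expired() {
        const auto now = Clock::now();
        while (!_lru.empty() && now - _lru.back().last_access > _ttl) {
            _index.erase(_lru.back().key);
            _lru.pop_back();
        }
    }

    // insert/lookup would mirror a plain LRU, refreshing last_access on hit.

private:
    size_t _capacity;
    std::chrono::seconds _ttl;
    std::list<Entry> _lru;
    std::unordered_map<std::string, std::list<Entry>::iterator> _index;
};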

@kevincai (Contributor) left a comment

Do we want to get the LRU with expiration done in this PR, or do it later?

be/src/http/action/update_config_action.cpp (outdated, resolved)
be/src/service/staros_worker.cpp (outdated, resolved)
kevincai previously approved these changes Feb 13, 2025
@kevincai (Contributor) commented

Think about adding a unit test to verify this new behavior, using a syncpoint or failpoint to mock the underlying new_filesystem(); check the correctness of the caching and LRU eviction.
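
A rough outline of such a test, assuming the BE's RocksDB-style SyncPoint utility; the sync point name and the commented-out worker calls are hypothetical placeholders:

#include <gtest/gtest.h>

#include "testutil/sync_point.h"  // StarRocks BE's RocksDB-style sync points

// Count how many times the (mocked) underlying new_filesystem() runs, then
// verify that repeated lookups with the same key hit the cache instead of
// building a new instance. "StarOSWorker::new_filesystem" is a hypothetical
// sync point name.
TEST(StarletFsCacheTest, CacheHitAvoidsRebuild) {
    int build_count = 0;
    auto* sp = starrocks::SyncPoint::GetInstance();
    sp->SetCallBack("StarOSWorker::new_filesystem",
                    [&build_count](void*) { ++build_count; });
    sp->EnableProcessing();

    // First access: cache miss, builds the filesystem.
    // auto fs1 = worker->get_shard_filesystem(shard_id, conf);
    // Second access with the same key: expected cache hit.
    // auto fs2 = worker->get_shard_filesystem(shard_id, conf);
    // EXPECT_EQ(1, build_count);
    // EXPECT_EQ(fs1.get(), fs2.get());

    sp->DisableProcessing();
    sp->ClearAllCallBacks();
}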

wyb previously approved these changes Feb 14, 2025
@wyb enabled auto-merge (squash) February 14, 2025 02:17
auto-merge was automatically disabled February 14, 2025 09:55 (head branch was pushed to by a user without write access)

@xiangguangyxg dismissed stale reviews from wyb and kevincai via 8fe6945 February 14, 2025 09:55
@xiangguangyxg (Contributor, Author) commented

> Think about adding a unit test to verify this new behavior, using a syncpoint or failpoint to mock the underlying new_filesystem(); check the correctness of the caching and LRU eviction.

Added active cache deletion and a unit test.

@xiangguangyxg force-pushed the fs_instance branch 3 times, most recently from a844b28 to fda98ea on February 17, 2025 02:02
if (!fs_or.ok()) {
    return fs_or.status();
}
return fs_or->second;
Contributor:

The cache-key shared_ptr is not held by anyone, so after this function call returns, the shared_ptr cache key will expire and the fs instance will be removed from the cache immediately. Is this the expected behavior? Better to add comments here to make this behavior explicit.

Contributor Author:

Yes, on a cache miss the newly created fs instance will be removed from the cache immediately after the call returns.

Contributor Author:

Added comments.
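
To make the key-lifetime mechanics discussed in this thread concrete, here is a runnable sketch of the idea as described: the cache value holds only a weak_ptr to the key, callers hold the owning shared_ptr, and its deleter performs the active cache deletion once the last holder releases it. All names are illustrative, not the PR's code:

#include <functional>
#include <iostream>
#include <memory>
#include <string>

// The cache entry holds a weak_ptr to the key; only callers own it.
struct CacheValue {
    std::weak_ptr<std::string> key;   // expires when no caller holds the key
    std::shared_ptr<void> fs;         // the cached filesystem instance
};

// Build the owned key whose deleter erases the cache entry on last release,
// so an fs instance nobody references does not linger until LRU eviction.
std::shared_ptr<std::string> make_owned_key(
        const std::string& key, std::function<void(const std::string&)> erase) {
    return {new std::string(key), [erase = std::move(erase)](std::string* k) {
                erase(*k);  // active cache deletion
                delete k;
            }};
}

int main() {
    auto owned = make_owned_key("s3://bucket/1234", [](const std::string& k) {
        std::cout << "erasing cache entry for " << k << "\n";
    });
    CacheValue v{owned, nullptr};
    owned.reset();  // last holder gone: deleter runs, weak key expires
    std::cout << std::boolalpha << v.key.expired() << "\n";  // prints: true
}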

absl::StatusOr<std::pair<std::shared_ptr<std::string>, std::shared_ptr<fslib::FileSystem>>>
StarOSWorker::find_fs_cache(const std::string& key) {
    if (key.empty()) {
        return absl::NotFoundError(key + " not found");
Contributor:

With an empty key this yields the message " not found", which is odd; make the error message more accurate.

Contributor Author:

Updated.

auto value = static_cast<CacheValue*>(_fs_cache->value(handle));
_fs_cache->release(handle);

return std::make_pair(value->key.lock(), value->fs);
Contributor:

What if value->key has expired? Then value->key.lock() will return an empty shared_ptr.

Contributor Author:

Is this behavior OK? Do you mean an error status should be returned in this situation?

Contributor:

It will be OK, but then the caller will get an empty shared_ptr for the key. Shall we reset value->key, or replace the value item with a valid one?

Contributor Author:

In this situation the caller will get an empty key shared_ptr but a valid fs instance. It means the cache item is about to be removed but has not yet been removed by the other thread. If we reset value->key, the item will still be removed by that thread.

Contributor:

So what should we do? Treat it as a cache miss and rebuild the fs?

Contributor Author:

The current code effectively treats it as a cache miss but reuses the fs instance.

@xiangguangyxg (Contributor, Author) commented Feb 18, 2025

Added some comments to explain this situation.
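
One plausible reading of the resolution above, as a runnable sketch (illustrative names, not the PR's code): when the weak key has expired, treat the lookup as a cache miss but reuse the live fs instance instead of rebuilding it.

#include <memory>
#include <string>
#include <utility>

struct FileSystem {};

struct CacheValue {
    std::weak_ptr<std::string> key;
    std::shared_ptr<FileSystem> fs;
};

// Hypothetical helper: re-register the still-valid fs under a fresh owned key.
std::shared_ptr<std::string> reinsert_with_new_key(
        const std::string& key, const std::shared_ptr<FileSystem>& /*fs*/) {
    return std::make_shared<std::string>(key);
}

// If the entry is being torn down by another thread (expired key), hand the
// caller a fresh key but keep the existing instance instead of rebuilding it.
std::pair<std::shared_ptr<std::string>, std::shared_ptr<FileSystem>>
resolve(const std::string& key, CacheValue* value) {
    auto locked_key = value->key.lock();
    if (locked_key == nullptr) {
        locked_key = reinsert_with_new_key(key, value->fs);
    }
    return {locked_key, value->fs};
}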


[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)


[FE Incremental Coverage Report]

pass : 0 / 0 (0%)


[BE Incremental Coverage Report]

fail : 59 / 76 (77.63%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 be/src/http/action/update_config_action.cpp 1 6 16.67% [373, 374, 375, 376, 378]
🔵 be/src/service/staros_worker.cpp 57 69 82.61% [87, 88, 89, 144, 240, 272, 273, 274, 278, 386, 387, 418]
🔵 be/src/service/staros_worker.h 1 1 100.00% []


Successfully merging this pull request may close these issues.

[shared-data] create table failed with hdfs storage volume
3 participants