[Enhancement] Using lru cache to limit the number of starlet filesystem instance #55845
base: main
The current implementation could introduce a different issue. For example, if a few shards/tablets are deleted, their filesystem will never be used again. However, if the LRU cache still has plenty of free slots, the filesystem instance in the cache will not be evicted at all, so the related objects are not released until the service restarts.
@kevincai Yes, it has that issue. Once a cache slot is used, it is not released actively; it is only evicted passively. That is a bit like the starcache.
1ece582
to
419435a
Compare
Sounds like an LRU with auto expiration perfectly matches this use scenario.
Do we want to get the LRU with expiration done in this PR, or do it later?
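To make the "LRU with auto expiration" idea concrete, here is a minimal single-threaded sketch, assuming an idle-time TTL and a caller-supplied clock. `ExpiringLruCache`, `purge_expired`, and the millisecond timestamps are hypothetical names for illustration; they are not part of StarRocks, starlet, or the cache used in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical illustration only; not the cache class used by this PR.
template <typename V>
class ExpiringLruCache {
public:
    ExpiringLruCache(std::size_t capacity, int64_t ttl_ms) : _capacity(capacity), _ttl_ms(ttl_ms) {}

    // Inserts or replaces an entry, evicting the least recently used one when full.
    void put(const std::string& key, V value, int64_t now_ms) {
        auto it = _index.find(key);
        if (it != _index.end()) {
            _entries.erase(it->second);
            _index.erase(it);
        }
        _entries.push_front(Entry{key, std::move(value), now_ms});
        _index[key] = _entries.begin();
        if (_entries.size() > _capacity) {
            _index.erase(_entries.back().key);
            _entries.pop_back();
        }
    }

    // Returns the value on a hit. An entry idle for longer than the TTL is
    // dropped and reported as a miss.
    std::optional<V> get(const std::string& key, int64_t now_ms) {
        auto it = _index.find(key);
        if (it == _index.end()) return std::nullopt;
        if (now_ms - it->second->last_used_ms > _ttl_ms) {
            _entries.erase(it->second);
            _index.erase(it);
            return std::nullopt;
        }
        it->second->last_used_ms = now_ms;
        _entries.splice(_entries.begin(), _entries, it->second);  // mark most recently used
        return it->second->value;
    }

    // Drops every entry idle for longer than the TTL, oldest first.
    // Assumes now_ms never goes backwards, so the list tail is always the oldest entry.
    std::size_t purge_expired(int64_t now_ms) {
        std::size_t removed = 0;
        while (!_entries.empty() && now_ms - _entries.back().last_used_ms > _ttl_ms) {
            _index.erase(_entries.back().key);
            _entries.pop_back();
            ++removed;
        }
        return removed;
    }

    std::size_t size() const { return _entries.size(); }

private:
    struct Entry {
        std::string key;
        V value;
        int64_t last_used_ms;
    };
    std::size_t _capacity;
    int64_t _ttl_ms;
    std::list<Entry> _entries;  // front = most recently used
    std::unordered_map<std::string, typename std::list<Entry>::iterator> _index;
};
```

With this shape, a background task calling `purge_expired` would release instances for deleted shards even when the cache never fills up, which is the gap raised in this thread. A real version would also need locking.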
419435a
to
fb3ce7d
Compare
Consider adding a unit test that verifies this new behavior, using a syncpoint or failpoint to mock the underlying new_filesystem(). Check the correctness of both the caching and the LRU eviction.
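One way such a test could be shaped, sketched here with an injected factory instead of a syncpoint or failpoint: the factory stands in for new_filesystem() and counts how often it is called. `CachingFsProvider`, `FakeFileSystem`, and `FsFactory` are hypothetical names for this illustration and do not appear in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct FakeFileSystem {  // hypothetical stand-in for fslib::FileSystem
    std::string key;
};
using FsPtr = std::shared_ptr<FakeFileSystem>;
using FsFactory = std::function<FsPtr(const std::string&)>;  // stand-in for new_filesystem()

// Hypothetical LRU-backed provider; single-threaded for brevity.
class CachingFsProvider {
public:
    CachingFsProvider(std::size_t capacity, FsFactory factory)
            : _capacity(capacity), _factory(std::move(factory)) {}

    FsPtr get(const std::string& key) {
        auto it = _index.find(key);
        if (it != _index.end()) {
            _lru.splice(_lru.begin(), _lru, it->second);  // hit: mark most recently used
            return it->second->second;
        }
        FsPtr fs = _factory(key);  // miss: build through the injected factory
        _lru.emplace_front(key, fs);
        _index[key] = _lru.begin();
        if (_lru.size() > _capacity) {  // over capacity: evict the least recently used
            _index.erase(_lru.back().first);
            _lru.pop_back();
        }
        return fs;
    }

private:
    using LruList = std::list<std::pair<std::string, FsPtr>>;
    std::size_t _capacity;
    FsFactory _factory;
    LruList _lru;  // front = most recently used
    std::unordered_map<std::string, LruList::iterator> _index;
};
```

A test can then assert that repeated lookups of one key build the filesystem once, and that looking up an evicted key builds it again.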
Head branch was pushed to by a user without write access
fb3ce7d
to
8fe6945
Compare
Added active cache deletion and unit test.
a844b28
to
fda98ea
Compare
if (!fs_or.ok()) {
    return fs_or.status();
}
return fs_or->second;
The cache key shared_ptr is not held by anyone, so after this function returns, the key expires and the fs instance is removed from the cache immediately. This is the expected behavior, right? Better to add a comment here to make this behavior explicit.
Yes. If the lookup does not hit the fs cache, the new fs instance is removed from the cache immediately.
added comments.
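The behavior discussed in this thread can be illustrated with a small sketch, assuming the cache entry's lifetime is tied to a shared_ptr key whose deleter erases the entry. `FsCache`, `FakeFileSystem`, and `build_without_holding_key` are hypothetical names, and the mechanism shown is an assumption based on the comments here, not the PR's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct FakeFileSystem {  // hypothetical stand-in for fslib::FileSystem
    std::string conf;
};

// Hypothetical cache; single-threaded, and it must outlive every key handle.
class FsCache {
public:
    using Key = std::shared_ptr<std::string>;
    using Fs = std::shared_ptr<FakeFileSystem>;

    // Inserts fs and returns a key handle. The entry stays cached only while
    // some copy of the handle is alive. Simplification: keys are assumed unique.
    std::pair<Key, Fs> insert(const std::string& key, Fs fs) {
        Key handle(new std::string(key), [this](std::string* k) {
            _entries.erase(*k);  // active deletion when the last handle goes away
            delete k;
        });
        _entries[key] = Entry{handle, fs};
        return {handle, fs};
    }

    std::size_t size() const { return _entries.size(); }

private:
    struct Entry {
        std::weak_ptr<std::string> key;  // weak, so the cache itself does not keep the key alive
        Fs fs;
    };
    std::unordered_map<std::string, Entry> _entries;
};

// Mirrors the reviewed call site: only the filesystem is returned, so the key
// handle dies at the end of the function and the entry is erased right away.
std::shared_ptr<FakeFileSystem> build_without_holding_key(FsCache& cache, const std::string& key) {
    auto key_and_fs = cache.insert(key, std::make_shared<FakeFileSystem>(FakeFileSystem{key}));
    return key_and_fs.second;
}
```

The caller still receives a usable filesystem instance; it is simply not retained in the cache, which is the behavior the reviewer asked to document in a comment.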
be/src/service/staros_worker.cpp
Outdated
absl::StatusOr<std::pair<std::shared_ptr<std::string>, std::shared_ptr<fslib::FileSystem>>> StarOSWorker::find_fs_cache(
        const std::string& key) {
    if (key.empty()) {
        return absl::NotFoundError(key + " not found");
This will yield a " not found" message, which is odd. Make the error message more accurate.
updated
be/src/service/staros_worker.cpp
Outdated
auto value = static_cast<CacheValue*>(_fs_cache->value(handle));
_fs_cache->release(handle);

return std::make_pair(value->key.lock(), value->fs);
What if value->key is expired? value->key.lock() will return an empty shared_ptr.
Is this behavior OK? Do you mean an error status should be returned in this situation?
It will be OK, but then the caller will get an empty shared_ptr for the key. Should we reset value->key, or replace the cache item with a valid one there?
In this situation the caller gets an empty key shared_ptr and a valid fs instance. It means the cache item is about to be removed by another thread but has not been removed yet. If we reset value->key, the item will still be removed by that other thread.
So what should we do? Treat it as a cache miss and rebuild the fs?
The current code effectively treats it as a cache miss but reuses the fs instance.
added some comments to explain this situation.
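The expired-key case from this thread can be sketched as follows, assuming a cache value that stores a weak_ptr key next to the filesystem. `CacheValue` here is a simplified stand-in for the struct in the diff, and `read_value` and `reuse_or_rekey` are hypothetical helpers for illustration only.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

struct FakeFileSystem {};  // hypothetical stand-in for fslib::FileSystem

// Simplified stand-in for the CacheValue in the diff.
struct CacheValue {
    std::weak_ptr<std::string> key;
    std::shared_ptr<FakeFileSystem> fs;
};

using KeyAndFs = std::pair<std::shared_ptr<std::string>, std::shared_ptr<FakeFileSystem>>;

// Same shape as the reviewed return statement: the key may come back empty
// when every holder has released it, while the fs instance is still valid.
KeyAndFs read_value(const CacheValue& value) {
    return {value.key.lock(), value.fs};
}

// Caller-side handling assumed here: an empty key is treated like a cache
// miss, but the existing fs instance is reused under a freshly created key.
KeyAndFs reuse_or_rekey(const CacheValue& value, const std::string& key_text) {
    KeyAndFs found = read_value(value);
    if (found.first == nullptr) {
        found.first = std::make_shared<std::string>(key_text);
    }
    return found;
}
```

The point of the sketch is only that `lock()` on an expired weak_ptr yields an empty shared_ptr rather than an error, so the caller has to decide explicitly what an empty key means.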
fda98ea
to
f96e2f7
Compare
…em instance Signed-off-by: xiangguangyxg <[email protected]>
f96e2f7
to
b8c08bf
Compare
[Java-Extensions Incremental Coverage Report] ✅ pass : 0 / 0 (0%)
[FE Incremental Coverage Report] ✅ pass : 0 / 0 (0%)
[BE Incremental Coverage Report] ❌ fail : 59 / 76 (77.63%)
Why I'm doing:
When there are too many partitions, there are also many starlet filesystem instances, which may take too much memory.
What I'm doing:
Use an LRU cache to limit the number of starlet filesystem instances.
Add the BE config starlet_filesystem_instance_cache_capacity to set the LRU cache capacity (default 10000).
Fixes #55765