[Feature Request] [Tiered Caching 3.0] Additional IRC stats #17366

peteralfonsi · 2025-02-14T21:08:21Z

Is your feature request related to a problem? Please describe

As part of tiered caching's 3.0 changes (#17303) we want to expose some new stats for the IndicesRequestCache.

Most importantly, we should report the fraction of requests that are or aren't cacheable as decided in IndicesService.canCache(). This can help users decide whether to enable tiered caching (if the cacheable fraction is low, it may not help much). It can also help them decide whether to change indices.requests.cache.maximum_cacheable_size to allow some size > 0 queries into the cache, and it can help verify that more queries are entering the cache after making that change.

We also might expose some caching latency stat. This could expose issues, for example if the disk cache is performing poorly.

Tracking these is simple but exposing them in an API has some complications. Currently (for the IRC) there are 2 relevant APIs: _nodes/stats/indices/request_cache and nodes/stats/caches/request_cache.

The first API is older and is tracked by ShardRequestCache, which has callbacks that the IndicesRequestCache class directly calls. Responses can be aggregated by index or shard. We plan to remove this code in 3.0, and ideally, hitting this API would just return the exact same content as the second API. So, we shouldn't add new functionality here.

The second API is newer and the stats are tracked by the ICache implementation itself. CacheService holds a map from cache type (for example REQUEST_CACHE) to ICache instance, and when the API is hit, the instance provides its stats. This allows more info, for example, aggregating by tier in the TieredSpilloverCache. In the future, other caches which use the ICache framework will get a _nodes/stats/caches/... API as well. Responses can be aggregated by index or shard (for the IRC), and ICache implementations can allow other dimensions like tier.

However, the cacheable-fraction stat is specific to the request cache, because it relies on IndicesService's logic. It's not like hits, misses, etc which make sense for every cache and are already tracked by the ICache implementation. So the ICache implementation can't be the one responsible for tracking it. It also doesn't really make sense to aggregate it by the levels we can use to aggregate the rest of the cache stats response. For example it makes no sense to aggregate this value by cache tier. It would make sense to aggregate by index or shard, but I'm not sure it would be useful and it would make tracking more complex.

Caching latency on the other hand might make sense to be tracked by every ICache implementation, in the way hits, misses, etc, are right now.

Describe the solution you'd like

I'm not totally happy with any solution I've come up with so please leave feedback if you have ideas.

The best solution I've come up with is to allow some non-aggregateable fields in the newer cache stats API. These would be specific to each cache type (REQUEST_CACHE, QUERY_CACHE, etc) and would not be tracked by the ICache implementation. They would be tracked at the CacheService level. IndicesService already requires a CacheService to be passed in through the constructor so it can instantiate its IRC in the first place, and the same will be true for any other services that need to instantiate an ICache, so communicating the stats to the CacheService shouldn't be difficult.

For the request cache these extra fields would be something like "total_requests" and "cacheable_requests". We'd expect "cacheable_requests" to equal hits + misses so it may not be necessary. So the response from _nodes/stats/caches/request_cache might look like:

"request_cache" : {
  "size_in_bytes" : 5000000,
  "evictions" : 500,
  "hit_count" : 700,
  "miss_count" : 800,
  "item_count" : 400,
  "total_requests": 4300,          // new field
  "cacheable_requests": 1500,      // new field
  "store_name" : "tiered_spillover"
}

And if you aggregated by some value, like _nodes/stats/caches/request_cache?level=tier, that aggregation would show up underneath the total values like it does now, but the new extra fields wouldn't appear in the aggregation:

"request_cache" : {
  "size_in_bytes" : 5000000,
  "evictions" : 500,
  "hit_count" : 700,
  "miss_count" : 800,
  "item_count" : 400,
  "total_requests": 4300,          // new field
  "cacheable_requests": 1500,      // new field
  "tier" : {
      "disk" : {
        "size_in_bytes" : 3000000,
        "evictions" : 500,
        "hit_count" : 400,
        "miss_count" : 800,
        "item_count" : 200
      },
      "on_heap" : {
        "size_in_bytes" : 2000000,
        "evictions" : 1000,
        "hit_count" : 300,
        "miss_count" : 1200,
        "item_count" : 200
      }
  },
  "store_name" : "tiered_spillover"
}

With this setup it's simple to add new non-generic stats for other future caches. For example in the query cache if we decided to track something around the frequency-based entry policy, which similarly doesn't make sense to aggregate by shard/index/tier, that could just show up in the response body like it does for the request cache:

"query_cache" : {
  "size_in_bytes" : 4000000,
  "evictions" : 800,
  "hit_count" : 100,
  "miss_count" : 2900,
  "item_count" : 300,
  "policy_frequency_threshold": 5,          // new field
  "store_name" : "tiered_spillover"
}

If we decided caching latency made sense for all ICaches to track we could just add some cumulative total_caching_latency into the StatsHolder objects, and it'd show up in both the total and aggregated sections of the response.

Related component

Search:Performance

Describe alternatives you've considered

Have ShardRequestCache track it directly. This allows aggregation by shard/index along with the rest of the old cache stats API. But aggregation by these levels probably isn't useful, and we're getting rid of this code very soon anyway. It's also not easily extendable to new caches/stats and couldn't easily appear in the new cache stats API.
Use some notification-based method rather than tracking these stats in memory and exposing them through an API. I thought I'd heard the otel plugin could be used like this for stats, but from what I can tell it's just for more detailed request traces? Maybe I'm missing something here.
Have IndicesService track the cacheability stats itself, and break it down by index/shard so it can be aggregated according to whatever levels the overall API call requests. Then provide it to CacheService somehow on API call. I think this would be way overcomplicating it for little gain, and it wouldn't work for all combinations of levels (what happens when a user aggregates by shard+tier? Do we make two separate response sections, one for stats that can be aggregated by tier, and one for the cacheability stats that can't?) This is also not easily extensible.

Additional context

No response

The text was updated successfully, but these errors were encountered:

peteralfonsi added enhancement Enhancement or improvement to existing feature or request untriaged labels Feb 14, 2025

peteralfonsi self-assigned this Feb 14, 2025

github-actions bot added the Search:Performance label Feb 14, 2025

github-project-automation bot added this to Search Project Board Feb 14, 2025

github-project-automation bot moved this to 🆕 New in Search Project Board Feb 14, 2025

peteralfonsi mentioned this issue Feb 12, 2025

[META] Tiered Caching 3.0 changes #17303

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request] [Tiered Caching 3.0] Additional IRC stats #17366

[Feature Request] [Tiered Caching 3.0] Additional IRC stats #17366

peteralfonsi commented Feb 14, 2025

[Feature Request] [Tiered Caching 3.0] Additional IRC stats #17366

[Feature Request] [Tiered Caching 3.0] Additional IRC stats #17366

Comments

peteralfonsi commented Feb 14, 2025

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Related component

Describe alternatives you've considered

Additional context