Prefix Caching with HBM and latency test #1278
base: main
Conversation
TESTED=unittest
Stores prefix tokens as a trie for fast lookup of the PrefixCache index stored in the cache. Inserting a longer key replaces a shorter key, so the longer key becomes the longest-common-prefix match. The shorter key will never be returned, even if the longer key is erased, and it should get evicted in the future. Keys are assumed to have the same length as the tokens, so they can be used to slice the prompt and the cached Value. The caller should check the common prefix length of the returned key. Erasing a key that is not a leaf does nothing. If the erased key matches a leaf, the node is deleted and its ancestors may become leaves after the deletion. TESTED=unittest
Clone to avoid reusing the same jax array. TESTED=unittest
Values are moved into the cache, which means the same value reference cannot be used after add_to_cache. JAX may modify the value even when it is stored under another Python reference. If the value needs to be used after add_to_cache, make sure to copy it before calling add_to_cache. Values returned from the cache are copies; the value is always copied before being returned to avoid modifying the value held in the cache. TESTED=unittest
jax.profiler shows that tree copy and size calculation consume a lot of time in clone.
Needs more test dimensions and review.
MaxText/configs/base.yml
Outdated
@@ -499,6 +499,7 @@ vertex_tensorboard_region: ""
max_checkify: False

# Inference
inference_microbenchmark_cache_num: 100
Is this the number of entries in the cache? Can we make it num_entries_in_cache? That way it will not be restricted to microbenchmarking.
The number was intended to make it easy to insert various suffixes into the cache, but I don't think it's suitable as a hard limit.
We can optimize memory usage by saving identical prefix values shared across multiple keys. For instance, [1, 2, 3] and [1, 2, 3, 4] can share the same [1, 2, 3] value, so we only need to store one [1, 2, 3, 4] value for both entries. Adding a parameter with a maximum entries limitation could complicate the use case. I believe the optimal cache size should be determined through benchmarking under various scenarios rather than imposing a limit in the API.
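For illustration, a hedged sketch of how a shorter prompt can reuse the value stored under a longer key by checking the common prefix length (the helper below is hypothetical and not part of this PR):

def common_prefix_len(a: tuple[int, ...], b: tuple[int, ...]) -> int:
  """Length of the longest common prefix of two token sequences."""
  n = 0
  for x, y in zip(a, b):
    if x != y:
      break
    n += 1
  return n

stored_key = (1, 2, 3, 4)   # only the longest key needs to be kept in the cache
prompt = (1, 2, 3)          # a shorter prompt arriving later
matched = common_prefix_len(prompt, stored_key)
assert matched == 3         # the caller slices the cached value to the first 3 tokens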
MaxText/inference_microbenchmark.py
Outdated
)
prefix_size_bytes_gb = value.prefix_size_bytes / 1024 / 1024 / 1024
prefix_cache_inst = prefix_cache.PrefixCache(cache_num * value.prefix_size_bytes)
common_len = prefill_length // 2
Can you abstract the common_len factor into the config?
MaxText/tests/prefix_cache_test.py
Outdated
assert hbm_cache.retrieve_from_cache((1)) == value1
assert hbm_cache.retrieve_from_cache((2)) == value2

def test_evict_cache(self):
Can you elaborate on this test: add multiple values and then delete some?
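A hedged sketch of such an extended test; make_test_value() is a hypothetical stand-in for however the existing tests build a Value, and the exact HBMCache signatures are assumptions based on the names visible in this PR:

def test_evict_cache_with_multiple_values(self):
  hbm_cache = prefix_cache.HBMCache(max_size_bytes=1024 * 1024)
  values = {(i,): make_test_value() for i in range(3)}  # hypothetical helper
  for key, value in values.items():
    hbm_cache.add_to_cache(key, value)

  # Evicting one key should return its value and remove only that entry.
  assert hbm_cache.evict_cache((0,)) is not None
  assert hbm_cache.retrieve_from_cache((0,)) is None
  for key in ((1,), (2,)):
    assert hbm_cache.retrieve_from_cache(key) is not None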
Looks good, I left a few comments.
MaxText/prefix_cache.py
Outdated
self._remain_size_bytes = max_size_bytes
self._saved_values: dict[Key, Value] = {}

def is_enough_space_remain(self, value: Value) -> bool:
nit: rename to has_enough_space()
To avoid modified value in the cache, always copied the value before return.
"""
if key in self._saved_values:
  return self._saved_values[key].clone()
That looks inefficient; you are making a copy of a large JAX array.
"""Save key/value to the cache.""" | ||
logger.debug("save key=%r", key) | ||
if not self._hbm_cache.is_enough_space_remain(value): | ||
self._evict_cache() |
Pass value.prefix_size_bytes into self._evict_cache() to make sure you evicted enough bytes. See below.
self._trie = PrefixCacheTrie()
self._cache_strategy = LRUStrategy()

def _evict_cache(self) -> Optional[Value]:
You need to evict until you make enough space:
def _evict_cache(self, bytes_needed: int) -> Optional[List[Value]]:
  evicted_bytes = 0
  evicted_values = []
  while evicted_bytes < bytes_needed:
    key = self._cache_strategy.evict()
    if key is None:
      logger.debug("no key to evict")
      return None
    logger.debug("evict key=%r", key)
    value = self._hbm_cache.evict_cache(key)
    if value is None:
      logger.warning("key=%r should exist in HBM cache.", key)
    self._trie.erase(key)
    if value is not None:
      evicted_bytes += value.prefix_size_bytes  # count freed bytes so the loop terminates
      evicted_values.append(value)
  return evicted_values
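For context, a hedged sketch of the call site in save() that would pass the byte count in (locking and the trie/LRU bookkeeping are omitted here, and add_to_cache returning a bool is an assumption):

def save(self, key: Key, value: Value) -> bool:
  """Save key/value, evicting LRU entries first if the value does not fit."""
  logger.debug("save key=%r", key)
  if not self._hbm_cache.is_enough_space_remain(value):
    # Free at least as many bytes as the incoming value needs.
    if self._evict_cache(value.prefix_size_bytes) is None:
      return False  # could not free enough space; skip caching this value
  return self._hbm_cache.add_to_cache(key, value)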
# init in clear()
self._hbm_cache: HBMCache = None
self._trie: PrefixCacheTrie = None
self._cache_strategy: LRUStrategy = None
You need a lock since the cache will get updated from multiple threads.
self._lock = threading.RLock()
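A hedged sketch of where that could live; the constructor argument and attribute names here are assumptions:

def __init__(self, max_size_bytes: int):  # assumes `import threading` at module top
  self._max_size_bytes = max_size_bytes
  # Guards the HBM cache, the trie, and the LRU state against concurrent use.
  self._lock = threading.RLock()
  self.clear()  # clear() creates _hbm_cache, _trie, and _cache_strategy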
def save(self, key: Key, value: Value) -> bool:
  """Save key/value to the cache."""
  logger.debug("save key=%r", key)
Grab lock before saving:
with self._lock:
  if not self._hbm_cache.is_enough_space_remain(value):
    self._evict_cache()
  ...
class LRUStrategy:
  """Least recently used cache strategy manage key."""

  def __init__(self):
    self._order: OrderedDict[Key, None] = OrderedDict()

  def evict(self) -> Optional[Key]:
    """Return and pop the least recently used key."""
    if len(self._order) == 0:
      return None
    return self._order.popitem(last=False)[0]

  def use(self, key: Key) -> None:
    """Updated the usage history."""
    if key not in self._order:
      self._order[key] = None
    else:
      self._order.move_to_end(key, last=True)
+1 for abstracting this into a separate class. We should pass the LRUStrategy into HBMCache and let HBMCache evict keys based on the strategy.
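A hedged sketch of that refactor; the HBMCache constructor parameter, the evict_one() method, and the size bookkeeping are assumptions layered on top of the names shown in this PR:

class HBMCache:
  """HBM-backed store that delegates eviction ordering to a pluggable strategy."""

  def __init__(self, max_size_bytes: int, strategy: Optional[LRUStrategy] = None):
    self._remain_size_bytes = max_size_bytes
    self._saved_values: dict[Key, Value] = {}
    self._strategy = strategy or LRUStrategy()

  def retrieve_from_cache(self, key: Key) -> Optional[Value]:
    if key in self._saved_values:
      self._strategy.use(key)  # record the access so LRU ordering stays current
      return self._saved_values[key].clone()
    return None

  def evict_one(self) -> Optional[Value]:
    """Evict whichever key the strategy chooses, or None if the cache is empty."""
    key = self._strategy.evict()
    if key is None:
      return None
    value = self._saved_values.pop(key, None)
    if value is not None:
      self._remain_size_bytes += value.prefix_size_bytes  # reclaim the freed space
    return value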
def load(self, key: Key) -> Optional[Value]:
  """Returns Value stored with key or None if not found."""
  logger.debug("load key=%r", key)
Same here, grab the lock...
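A hedged sketch of the locked version (it assumes load() goes through retrieve_from_cache() and the _lock suggested above):

def load(self, key: Key) -> Optional[Value]:
  """Returns Value stored with key or None if not found."""
  logger.debug("load key=%r", key)
  with self._lock:  # serialize with concurrent save() and eviction
    value = self._hbm_cache.retrieve_from_cache(key)
    if value is not None:
      self._cache_strategy.use(key)  # mark the key as recently used
    return value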
…ive limit in long context
Description
Implements prefix caching in HBM and a latency test in inference_microbenchmark.
Stores prefix tokens as a trie for fast lookup of the PrefixCache index stored in the cache.
Inserting a longer key replaces a shorter key, so the longer key becomes the longest-common-prefix match.
The shorter key will never be returned, even if the longer key is erased, and it should get evicted in the future.
Keys are assumed to have the same length as the tokens, so they can be used to slice the prompt and the cached Value.
The caller should check the common prefix length of the returned key.
Erasing a key that is not a leaf does nothing.
If the erased key matches a leaf, the node is deleted and its ancestors may become leaves after the deletion.
Values are moved into the cache, which means the same value reference cannot be used after add_to_cache.
JAX may modify the value even when it is stored under another Python reference.
If the value needs to be used after add_to_cache, make sure to copy it before calling add_to_cache.
Values returned from the cache are copies; the value is always copied before being returned to avoid modifying the value held in the cache.
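A hedged usage sketch of those ownership rules; the PrefixCache constructor argument and the clone() call follow what is visible in this PR, but exact signatures are assumptions:

cache = prefix_cache.PrefixCache(8 * 1024**3)  # cache budget in bytes

my_copy = value.clone()   # keep a private copy before handing the value over
cache.save(key, value)    # the cache now owns `value`; do not reuse that reference

loaded = cache.load(key)  # load() hands back a copy, safe to use without touching the cache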
Adds a prefix caching benchmark test in inference_microbenchmark.
Uses half of the prefill_length as the common prefix and saves 100 prefixes in the cache (a sketch of this setup follows).
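A hedged sketch of that setup; prefill_length, cache_num, and value come from the surrounding benchmark code, and the key construction here is illustrative only:

common_len = prefill_length // 2  # the first half of every prompt is shared
prefix_cache_inst = prefix_cache.PrefixCache(cache_num * value.prefix_size_bytes)

common_prefix = tuple(range(common_len))  # illustrative token ids
for i in range(cache_num):
  suffix = (i,) * (prefill_length - common_len)  # a distinct suffix per entry
  prefix_cache_inst.save(common_prefix + suffix, value.clone())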
Loading the cache (including jax.array.copy) appears to be independent of the prefill_length (tested with 128 and 1024), even though the saved cache sizes are different.
Using jax.profiler shows that the copy operation consumes a similar amount of time on TPU. This might be because the sizes aren't large or different enough to see a significant impact.
Part of the results are shown below.
FIXES: b/389788256
TESTED: unittest
Checklist
Before submitting this PR, please make sure (put X in square brackets):