make postgres a continuously checkpointing datastore #2247

vroldanbet · 2025-02-20T16:13:21Z

postgres datastore used pg_current_snapshot(), which does only moves when transactions take place.

By using "pg_current_xact_id()", the statement becomes a transaction, and we can reliably return the timestamp and the XID because we know that transaction ID won't be used by any other tx.

postgres datastore does not offer a continuous checkpointing Watch API, because postgres revisions are not time based. In the other databases, any timestamp represents a fully fledged revision, while in the postgres datastore a timestamp does not necessarily represent a transaction. This is only relevant for consumers of the Watch API, and more particularly, consumers of the datastore Watch API, which exposes the concept of Watch Checkpoints. Consumers of the postgres Watch API don't see observe the passing of time, and this is important for any system that needs to report lag as a health metric. This change introduces a new way to report checkpoints in postgres watch by running an actual transaction. When no new application revisions are detected, a method will be invoked to create a new synthetic transaction, which is not written to the transactions table, it's just returned as the new checkpoint. Because we know that there were no new revisions at the time of the evaluation of the query, we can return a new revision. This is not different from what HeadRevision does, it is in fact the same, but the difference is that the transaction is actually assigned an ID. The new logic leverages the pg_current_xact_id() function, which assigns a transaction ID to the current running transaction, even if not committed. We compute the transactionID, the snapshot and timestamp, all within the same transaction that determines if there are new revisions to make sure there isn't a race. The operation will be run only if no new application revisions were determined. The concern of this approach is consumption of the XID8 space. XID8 is a 64 bit ID space, and thus it would take 584.5 years to consume the space if we were able to generate 1 XID per nanosecond, which totally unrealistic. The expectation is that the watch API poll interval happens at a maximum of 1 poll per second. The other concern is that this makes `HeadRevision` move continuously, which impacts selection of optimized revision, which keeps moving continuously and affecting cache hit rates. This will be addressed by changing how OptimizedRevision is computed, by making sure it quantizes only application revisions, and not any out-of-band transaction generated in Postgres. This was in fact identified as an issue for SpiceDB deployments where some other workloads run on it, causing the HeadRevision to continuously move. Finally, this change fixes a latent bug that was identified while reasoning about all the different pieces in place to support postgres datastore watch API. In commit 74dc7b2 a change was introduced to address the lack of checkpoints when the postgres datastore moved its pg_current_snapshot() outside of application changes (e.g. VACUUM operations, or other local databases in the same postgres instance). The inconsistency identified is that pg_current_snapshot() will be emitted as a checkpoint, but it's entirely possible for a new transaction to come in and be evaluated at that transaction, because pg_current_snapshot() returns _the snapshot at which the query is evaluated_, and that does not consume it if not transaction ID was assigned (e.g. there was no commit involved). As a consequence, it's possible for observe one checkpoint at revision T, and then a change that comes after that at the exact same revision T, which violates the semantics of the checkpoint acting as a high-watermark (client must not observe pop-in after the checkpoint). This commit fixes it because the checkpoint being returned is now part of an actual transaction, so no pop-in is possible. One phenomenon to bear in mind is that a client might call Watch API with the revision out of a previous call to HeadRevision. If that revision was never committed, then the creation of the new revision will actually consume it. This is fine, but we need to make sure the checkpoint is not emitted via the Watch API, because the expectation from the client is that only revisios above the provided argument revision will be emitted.

github-actions bot added the area/datastore Affects the storage system label Feb 20, 2025

vroldanbet force-pushed the continuous-checkpointing-pg branch 2 times, most recently from 947e9ef to 7c2054d Compare February 21, 2025 16:16

github-actions bot added the area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) label Feb 21, 2025

vroldanbet force-pushed the continuous-checkpointing-pg branch from 66f76f3 to b0d05a2 Compare February 21, 2025 17:17

vroldanbet added 2 commits February 21, 2025 17:47

expand pgSnapshot tests

28ab8fb

vroldanbet force-pushed the continuous-checkpointing-pg branch from b0d05a2 to 28ab8fb Compare February 21, 2025 17:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

make postgres a continuously checkpointing datastore #2247

make postgres a continuously checkpointing datastore #2247

vroldanbet commented Feb 20, 2025

make postgres a continuously checkpointing datastore #2247

Are you sure you want to change the base?

make postgres a continuously checkpointing datastore #2247

Conversation

vroldanbet commented Feb 20, 2025