Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make postgres a continuously checkpointing datastore #2247

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

vroldanbet
Copy link
Contributor

postgres datastore used pg_current_snapshot(), which does only moves when transactions take place.

By using "pg_current_xact_id()", the statement becomes a transaction, and we can reliably return the timestamp and the XID because we know that transaction ID won't be used by any other tx.

@github-actions github-actions bot added the area/datastore Affects the storage system label Feb 20, 2025
@vroldanbet vroldanbet force-pushed the continuous-checkpointing-pg branch 2 times, most recently from 947e9ef to 7c2054d Compare February 21, 2025 16:16
@github-actions github-actions bot added the area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) label Feb 21, 2025
@vroldanbet vroldanbet force-pushed the continuous-checkpointing-pg branch from 66f76f3 to b0d05a2 Compare February 21, 2025 17:17
postgres datastore does not offer a continuous checkpointing
Watch API, because postgres revisions are not time based.
In the other databases, any timestamp represents a fully
fledged revision, while in the postgres datastore a timestamp
does not necessarily represent a transaction.

This is only relevant for consumers of the Watch API, and more
particularly, consumers of the datastore Watch API, which exposes
the concept of Watch Checkpoints. Consumers of the postgres Watch
API don't see observe the passing of time, and this is important
for any system that needs to report lag as a health metric.

This change introduces a new way to report checkpoints in postgres
watch by running an actual transaction. When no new application
revisions are detected, a method will be invoked to create a new
synthetic transaction, which is not written to the transactions table,
it's just returned as the new checkpoint. Because we know that
there were no new revisions at the time of the evaluation of the
query, we can return a new revision. This is not different from
what HeadRevision does, it is in fact the same, but the difference
is that the transaction is actually assigned an ID.

The new logic leverages the pg_current_xact_id() function, which assigns a transaction ID
to the current running transaction, even if not committed. We compute
the transactionID, the snapshot and timestamp, all within the same
 transaction that determines if there are new revisions to make sure
 there isn't a race. The operation will be run only if no new
 application revisions were determined.

The concern of this approach is consumption of the XID8 space. XID8
is a 64 bit ID space, and thus it would take 584.5 years to consume
the space if we were able to generate 1 XID per nanosecond, which
totally unrealistic. The expectation is that the watch API poll interval
happens at a maximum of 1 poll per second.

The other concern is that this makes `HeadRevision` move continuously,
which impacts selection of optimized revision, which keeps moving
continuously and affecting cache hit rates. This will be addressed
by changing how OptimizedRevision is computed, by making sure it
quantizes only application revisions, and not any out-of-band
transaction generated in Postgres. This was in fact identified as
an issue for SpiceDB deployments where some other workloads run on
it, causing the HeadRevision to continuously move.

Finally, this change fixes a latent bug that was identified while
reasoning about all the different pieces in place to support postgres
datastore watch API. In commit
74dc7b2
a change was introduced to address the lack of checkpoints when
the postgres datastore moved its pg_current_snapshot() outside of
application changes (e.g. VACUUM operations, or other local databases
in the same postgres instance). The inconsistency identified is that
pg_current_snapshot() will be emitted as a checkpoint, but it's entirely
possible for a new transaction to come in and be evaluated at that
transaction, because pg_current_snapshot() returns _the snapshot at which
the query is evaluated_, and that does not consume it if not transaction ID
was assigned (e.g. there was no commit involved). As a consequence, it's
possible for observe one checkpoint at revision T, and then a change
that comes after that at the exact same revision T, which violates
the semantics of the checkpoint acting as a high-watermark (client must
not observe pop-in after the checkpoint).

This commit fixes it because the checkpoint being returned is now
part of an actual transaction, so no pop-in is possible. One phenomenon
to bear in mind is that a client might call Watch API with the revision
out of a previous call to HeadRevision. If that revision was never committed, then the creation of the new revision will actually consume it.
This is fine, but we need to make sure the checkpoint is not emitted
via the Watch API, because the expectation from the client is that only
revisios above the provided argument revision will be emitted.
@vroldanbet vroldanbet force-pushed the continuous-checkpointing-pg branch from b0d05a2 to 28ab8fb Compare February 21, 2025 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datastore Affects the storage system area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant