docs: Add ingest data landing page #31468
base: main
depending on the volume of the upstream changes, Materialize may lag behind the
upstream system. If the lag is significant, queries may block until Materialize
has caught up sufficiently with the upstream system when using the default
[isolation level](/get-started/isolation-level/) of [strict
FYI -- I'm going to bubble up the isolation + arrangements pages (i.e., make an architecture section in the docs) sometime next week.
@@ -0,0 +1,244 @@
---
title: "Ingesting data"
identifier: ingest-monitoring
parent: ingest-data
weight: 39
---
Only one grammar change, otherwise just some take it or leave it comments 😄
cluster.

- Consider using a larger cluster size during snapshotting. Once the
snapshottingis complete, you can downsize the cluster to align with the volume
Needs a space between snapshottingis
{{% /tip %}}

If you create your source from the Materialize Console, the overview page for
Should we link out to docs about monitoring in the Console? Like here? https://materialize.com/docs/console/data/
Ah ignore, I just saw the next link :)
If you create your source from the Materialize Console, the overview page for the source displays the snapshotting progress.
For shared context, this isn't correct. The Console will always display snapshotting progress, regardless of what interface you used to create the source.
In the Materialize Console, you can see a source's data freshness from the
**Data Explorer** screen. Alternatively, you can run a query to monitor the lag.
See [Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion).
Should we instead link to the steady state portion of this page? https://preview.materialize.com/materialize/31468/ingest-data/monitoring-data-ingestion/#monitoring-rehydrationdata-freshness-status
**Data Explorer** screen. Alternatively, you can run a query to monitor the lag.
See [Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion).

## Rehydration
We no longer use the term "rehydration", only "hydration".
As for hydration vs rehydration ... I did ask about that since you had mentioned it to me in an earlier pr. There seemed to be some ambiguity as to when we use the terms. There actually was a draft where I went with Hydration/rehydration and even (re)Hydration 😄
I could go with Hydration and in that first sentence, add "(sometimes referred to as rehydration)."
(sometimes referred to as rehydration).
We should have a single term for user-facing documentation (or anything), and we've previously agreed that it should be simply hydration. Sources are either snapshotting (from the external system) or hydrating (from storage), compute objects just hydrate. No need to differentiate and potentially confuse users, since this is just our internal lingo coming through; you can see that hydration is consistently used in the system catalog, too.
Thanks for putting this up, @kay-kim! This PR reads more like reference documentation than guidance, and is missing some important bits we laid out in the previous draft, in particular recommendations around reducing the amount of data that is ingested. That should be the main recommendation and most immediate takeaway of this page. It's also not clear who this page is geared towards: I'd expect it to be an intro for new users to increase the chance of a successful onboarding, rather than a path to production (e.g. how you spread objects across clusters is less important when you start, and we typically wouldn't want new users to think about clusters until they have to). I'll take some time to work in some of the suggestions above this week.
Hey @morsapaes -- thanks much!
operation](#snapshotting) to initially populate the source in Materialize.
Snapshotting is a resource-intensive operation that can require a significant
amount of CPU and memory. Consider using a larger cluster size during
snapshotting. Once the snapshotting operation is complete, you can downsize
For future verifications w.r.t. increasing size:
- It may not be the case for certain sources.
- Maybe after a certain size, it might not buy as much improvement.
- For upsert sources, emphasize this a bit more + dedicated cluster for upsert sources (as in separate these sources from other sources)
- If possible, schedule creating new sources during off-peak hours to mitigate
the impact of snapshotting on both the upstream system and the Materialize
cluster.
Need to verify about scheduling (namely, if there are exceptions that can be made (if necessary) to the rule depending on the source type)
This is a weird one, and again only applies to production environments. Even then, it should be safe to e.g. add a new subsource to a source during peak hours.
#### Dedicate a cluster for the sources

If possible, dedicate a cluster just for sources. That is, avoid using the same
couple of things for future:
- even within the umbrella of dedicated source clusters, further division with dedicated upsert source clusters.
If using a single source for all three ...
- OOM'ing for compute, etc. may impact upstream.
- won't be able to do blue/green
- So, you have a source cluster and it's sized for current steady state. What is the happy route for adding a new source to that cluster.
@@ -0,0 +1,244 @@
---
title: "Ingesting data"
- title: "Ingesting data"
+ title: "Ingest data"
@@ -0,0 +1,244 @@
---
title: "Ingesting data"
description: "How to ingest data into Materialize from external systems."
- description: "How to ingest data into Materialize from external systems."
+ description: "Best practices for ingesting data into Materialize from external systems."
Materialize can ingest data from various external systems:

{{< multilinkbox >}}
{{< linkbox title="Databases (CDC)" >}}
- [PostgreSQL](/ingest-data/postgres/)
- [MySQL](/ingest-data/mysql/)
- [SQL Server](/ingest-data/cdc-sql-server/)
- [CockroachDB](/ingest-data/cdc-cockroachdb/)
{{</ linkbox >}}
{{< linkbox title="Message Brokers" >}}
- [Kafka](/ingest-data/kafka/)
- [Redpanda](/sql/create-source/kafka)
- [Other message brokers](/integrations/#message-brokers)
{{</ linkbox >}}
{{< linkbox title="Webhooks" >}}
- [Amazon EventBridge](/ingest-data/webhooks/amazon-eventbridge/)
- [Segment](/ingest-data/webhooks/segment/)
- [Other webhooks](/sql/create-source/webhook)
{{</ linkbox >}}
{{</ multilinkbox >}}
- Materialize can ingest data from various external systems:
- {{< multilinkbox >}}
- {{< linkbox title="Databases (CDC)" >}}
- - [PostgreSQL](/ingest-data/postgres/)
- - [MySQL](/ingest-data/mysql/)
- - [SQL Server](/ingest-data/cdc-sql-server/)
- - [CockroachDB](/ingest-data/cdc-cockroachdb/)
- {{</ linkbox >}}
- {{< linkbox title="Message Brokers" >}}
- - [Kafka](/ingest-data/kafka/)
- - [Redpanda](/sql/create-source/kafka)
- - [Other message brokers](/integrations/#message-brokers)
- {{</ linkbox >}}
- {{< linkbox title="Webhooks" >}}
- - [Amazon EventBridge](/ingest-data/webhooks/amazon-eventbridge/)
- - [Segment](/ingest-data/webhooks/segment/)
- - [Other webhooks](/sql/create-source/webhook)
- {{</ linkbox >}}
- {{</ multilinkbox >}}
+ You can ingest data into Materialize from various external systems:
+ {{< multilinkbox >}}
+ {{< linkbox title="Databases (CDC)" >}}
+ - [PostgreSQL](/ingest-data/postgres/)
+ - [MySQL](/ingest-data/mysql/)
+ - [SQL Server](/ingest-data/cdc-sql-server/)
+ - [MongoDB](https://github.com/MaterializeIncLabs/materialize-mongodb-debezium)
+ - [CockroachDB](/ingest-data/cdc-cockroachdb/)
+ - [Other databases](/integrations/#databases)
+ {{</ linkbox >}}
+ {{< linkbox title="Message Brokers" >}}
+ - [Kafka](/ingest-data/kafka/)
+ - [Redpanda](/sql/create-source/kafka)
+ - [Other message brokers](/integrations/#message-brokers)
+ {{</ linkbox >}}
+ {{< linkbox title="Webhooks" >}}
+ - [Amazon EventBridge](/ingest-data/webhooks/amazon-eventbridge/)
+ - [Segment](/ingest-data/webhooks/segment/)
+ - [Other webhooks](/sql/create-source/webhook)
+ {{</ linkbox >}}
+ {{</ multilinkbox >}}
As a follow-up: MongoDB and Azure EventHubs should also have integration guides, since they come up somewhat frequently. I'll update the state of Azure EventHubs in the integrations overview, which isn't accurate.
Snapshotting refers to the initial population of a source in Materialize. When
you create a new source, Materialize takes a snapshot of the data from the
upstream (i.e., external) system at a given offset and loads it into
Materialize.

For the offset, Materialize chooses the latest available offset. For the
available offset, Materialize uses:

- Log Sequence Number (LSN) for PostgreSQL sources.

- The number of transactions committed across all servers in the cluster for
MySQL sources.

- Kafka message offset for Kafka sources.

Materialize captures/commits a snapshot of all historical data up to that point.
Snapshot is persisted in the storage layer, and all records in this initial load
have the same ingestion timestamp.
- Snapshotting refers to the initial population of a source in Materialize. When
- you create a new source, Materialize takes a snapshot of the data from the
- upstream (i.e., external) system at a given offset and loads it into
- Materialize.
- For the offset, Materialize chooses the latest available offset. For the
- available offset, Materialize uses:
- - Log Sequence Number (LSN) for PostgreSQL sources.
- - The number of transactions committed across all servers in the cluster for
- MySQL sources.
- - Kafka message offset for Kafka sources.
- Materialize captures/commits a snapshot of all historical data up to that point.
- Snapshot is persisted in the storage layer, and all records in this initial load
- have the same ingestion timestamp.
+ When a new source is created, Materialize performs a sync of all data available
+ in the external system before it starts ingesting new data, an operation known
+ as _snapshotting_. Because the initial snapshot is persisted in the storage
+ layer atomically (i.e., at the same ingestion timestamp), you are **not able to
+ query the source until snapshotting is complete**.
+
+ Depending on the volume of data in the initial snapshot and the size of the
+ cluster the source is hosted in, this operation can take anywhere from a few
+ minutes up to several hours, and might require more compute resources than
+ steady-state.
I don't think the unit of progress per source is relevant here (this is documented in the reference documentation for each source type). Also, I'd keep the original blurb from the previous PR, which gives a more comprehensive overview of the snapshotting operation and gives a TL;DR of the sections below.
For completeness, in MySQL the unit of progress is not the number of transactions, but the transaction ID + the lower and upper range of GTIDs.
The snapshotting operation duration depends on the snapshot dataset size and the
size of the Materialize cluster that is hosting the source. For very large
sources that contain hundreds of GB of data, it can take up to an hour or even
several hours to complete.

{{% tip %}}

- If possible, schedule creating new sources during off-peak hours to mitigate
the impact of snapshotting on both the upstream system and the Materialize
cluster.

- Consider using a larger cluster size during snapshotting. Once the
snapshottingis complete, you can downsize the cluster to align with the volume
of changes being replicated from your upstream in steady-state.

- See also [Best practices](#best-practices).

{{% /tip %}}

If you create your source from the Materialize Console, the overview page for
the source displays the snapshotting progress. Alternatively, you can run a
query to monitor its progress. See [Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion/#monitoring-the-snapshotting-progress).
- The snapshotting operation duration depends on the snapshot dataset size and the
- size of the Materialize cluster that is hosting the source. For very large
- sources that contain hundreds of GB of data, it can take up to an hour or even
- several hours to complete.
- {{% tip %}}
- - If possible, schedule creating new sources during off-peak hours to mitigate
- the impact of snapshotting on both the upstream system and the Materialize
- cluster.
- - Consider using a larger cluster size during snapshotting. Once the
- snapshottingis complete, you can downsize the cluster to align with the volume
- of changes being replicated from your upstream in steady-state.
- - See also [Best practices](#best-practices).
- {{% /tip %}}
- If you create your source from the Materialize Console, the overview page for
- the source displays the snapshotting progress. Alternatively, you can run a
- query to monitor its progress. See [Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion/#monitoring-the-snapshotting-progress).
+ The duration of the snapshotting operation depends on the volume of data in the
+ initial snapshot and the size of the cluster the source is hosted in. To reduce
+ the operational burden of snapshotting on the upstream system and guarantee
+ you're only bringing in the volume of data you effectively need in Materialize,
+ we recommend:
+
+ - If possible, running source creation operations during **off-peak hours** to
+ minimize operational risk in both the upstream system and Materialize.
+ - **Limiting the volume of data** that is synced into Materialize on source
+ creation. This will help speed up snapshotting, as well as make data
+ exploration more lightweight. See [Limit the volume of data](#limit-the-volume-of-data)
+ for best practices.
+ - **Overprovisioning the ingestion cluster** for snapshotting, then
+ right-sizing once the snapshot is complete and you have a better grasp on the
+ resource needs of your source(s) in steady-state. See [Use a larger cluster
+ for snapshotting](#use-a-larger-cluster-for-snapshotting) for best practices.
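For illustration only, the "overprovision, then right-size" recommendation could be scripted roughly as below. This is a minimal Python sketch, not something the PR proposes: the helper name `resize_statements`, the cluster name, and the size values are hypothetical placeholders, and the `ALTER CLUSTER ... SET (SIZE = ...)` statement shape is an assumption to check against the `ALTER CLUSTER` reference before use.

```python
from typing import Tuple


def resize_statements(cluster: str, snapshot_size: str, steady_size: str) -> Tuple[str, str]:
    """Build two statements: size up for snapshotting, size down afterwards.

    All arguments are placeholders supplied by the caller; nothing here
    talks to a database.
    """

    def alter(size: str) -> str:
        # Assumed statement shape; verify against the ALTER CLUSTER reference.
        return f"ALTER CLUSTER {cluster} SET (SIZE = '{size}');"

    return alter(snapshot_size), alter(steady_size)


if __name__ == "__main__":
    # Hypothetical cluster name and sizes.
    size_up, size_down = resize_statements("ingest", "400cc", "100cc")
    print(size_up)    # run before creating the source
    print(size_down)  # run once snapshotting has completed
```

Keeping the two statements together makes it harder to forget the downsize step once the source reaches steady-state.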
Not a big fan of jamming a lot of text into a tip annotation — it feels like a sign that we need to improve how the text is organized.
Also, I think we should already be surfacing best practices like limiting data under this section, so it's less likely they'll be missed.
### CPU and memory utilization

Snapshotting may require more compute resources from the Materialize cluster
than steady-state. If there are signs of resource exhaustion (that is, the
cluster restarts because it ran out of memory), resize the cluster.

{{% tip %}}
Consider using a larger cluster size during snapshotting. Once the snapshotting
operation is complete, you can downsize your source cluster to align with the
volume of changes being replicated from your upstream in steady-state.
{{% /tip %}}
- ### CPU and memory utilization
- Snapshotting may require more compute resources from the Materialize cluster
- than steady-state. If there are signs of resource exhaustion (that is, the
- cluster restarts because it ran out of memory), resize the cluster.
- {{% tip %}}
- Consider using a larger cluster size during snapshotting. Once the snapshotting
- operation is complete, you can downsize your source cluster to align with the
- volume of changes being replicated from your upstream in steady-state.
- {{% /tip %}}
+ ### Monitoring progress
+
+ While snapshotting is taking place, you can monitor the progress of the
+ operation in the **overview page** for the source in the
+ [Materialize Console](https://console.materialize.com/). Alternatively, you can
+ manually keep track of it using information from the system catalog. See
+ [Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion/#monitoring-the-snapshotting-progress)
+ for guidance.
+
+ It's also important to **monitor CPU and memory utilization** for the cluster
+ hosting the source during snapshotting. If there are signs of resource
+ exhaustion, you may need to [resize the cluster](#use-a-larger-cluster-for-snapshotting).
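As a rough illustration of what "keeping track manually" might boil down to, the Python sketch below derives a completion percentage from two counters: the number of records expected in the snapshot and the number read so far. The function name and both inputs are hypothetical; the suggestion above only says the information comes from the system catalog, so how the counters are fetched is deliberately left out.

```python
from typing import Optional


def snapshot_progress_pct(staged: int, known: Optional[int]) -> Optional[float]:
    """Return snapshot progress as a percentage, or None if it is undefined.

    `staged` (records read so far) and `known` (records expected in the
    snapshot) are hypothetical counters supplied by the caller.
    """
    if known is None or known <= 0:
        # The total has not been reported yet, or the snapshot is empty.
        return None
    # Clamp at 100 in case the staged count overshoots the estimate.
    return min(100.0, round(100.0 * staged / known, 2))


if __name__ == "__main__":
    print(snapshot_progress_pct(250_000, 1_000_000))
```

Returning `None` rather than `0` when the total is unknown avoids reporting a misleading "0% complete" before any estimate is available.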
While a source is snapshotting, the source (and the associated subsources)
cannot serve queries. That is, queries issued to the snapshotting source (and
its subsources) will return after the snapshotting completes (unless the user
breaks out of the query). If the user does not break out of the query, the
returned query results will reflect the data from the snapshot.
- While a source is snapshotting, the source (and the associated subsources)
- cannot serve queries. That is, queries issued to the snapshotting source (and
- its subsources) will return after the snapshotting completes (unless the user
- breaks out of the query). If the user does not break out of the query, the
- returned query results will reflect the data from the snapshot.
+ Because the initial snapshot is persisted atomically, you are **not able to
+ query the source until snapshotting is complete**. This means that queries
+ issued against (sub)sources undergoing snapshotting will hang until the
+ operation completes. Once the initial snapshot has been ingested, you can start
+ querying your (sub)sources and Materialize will continue ingesting any new data
+ as it arrives, in real time.
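The blocking behavior described above can be pictured with a toy model. The Python sketch below is purely illustrative and is not how Materialize is implemented: `ToySource` is a hypothetical class in which reads wait on a flag that is set only once the whole snapshot has been committed in a single step.

```python
import threading
from typing import List, Optional


class ToySource:
    """Toy model: reads block until the whole snapshot is committed at once."""

    def __init__(self) -> None:
        self._committed = threading.Event()
        self._rows: List[int] = []

    def commit_snapshot(self, rows: List[int]) -> None:
        # All snapshot rows become visible together, never partially.
        self._rows = list(rows)
        self._committed.set()

    def query(self, timeout: Optional[float] = None) -> List[int]:
        # Waits for the commit; with a timeout, the caller can give up early.
        if not self._committed.wait(timeout):
            raise TimeoutError("source is still snapshotting")
        return list(self._rows)


if __name__ == "__main__":
    source = ToySource()
    try:
        source.query(timeout=0.05)  # snapshot not committed yet
    except TimeoutError as exc:
        print(exc)
    source.commit_snapshot([1, 2, 3])
    print(source.query())
```

The timeout plays the role of a user breaking out of a hanging query; without it, the call simply returns once the snapshot is in place.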
## Running/steady-state

Once snapshotting completes, Materialize transitions to Running state. During
this state, Materialize continually ingests changes from the upstream system.
- ## Running/steady-state
- Once snapshotting completes, Materialize transitions to Running state. During
- this state, Materialize continually ingests changes from the upstream system.
Rebranched from #27061 + added updates.