docs: Add ingest data landing page #31468

Open
wants to merge 10 commits into
base: main
from

Conversation

Contributor

@kay-kim kay-kim commented Feb 12, 2025

Rebranched from #27061 + added updates.

@kay-kim kay-kim requested a review from a team as a code owner February 12, 2025 15:25
depending on the volume of the upstream changes, Materialize may lag behind the
upstream system. If the lag is significant, queries may block until Materialize
has caught up sufficiently with the upstream system when using the default
[isolation level](/get-started/isolation-level/) of [strict
Contributor Author

FYI -- I'm going to bubble up the isolation + arrangements pages (i.e., make an architecture section in the docs) sometime next week.

@@ -0,0 +1,244 @@
---
title: "Ingesting data"
Contributor Author

identifier: ingest-monitoring
parent: ingest-data
weight: 39
---
Contributor Author

@ala2134 ala2134 left a comment

Only one grammar change, otherwise just some take it or leave it comments 😄

cluster.

- Consider using a larger cluster size during snapshotting. Once the
snapshottingis complete, you can downsize the cluster to align with the volume

Needs a space between snapshottingis


{{% /tip %}}

If you create your source from the Materialize Console, the overview page for

Should we link out to docs about monitoring in the Console? Like here? https://materialize.com/docs/console/data/


Ah ignore, I just saw the next link :)

Contributor

If you create your source from the Materialize Console, the overview page for the source displays the snapshotting progress.

For shared context, this isn't correct. The Console will always display snapshotting progress, regardless of what interface you used to create the source.


In the Materialize Console, you can see a source's data freshness from the
**Data Explorer** screen. Alternatively, you can run a query to monitor the lag.
See [Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion).

**Data Explorer** screen. Alternatively, you can run a query to monitor the lag.
See [Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion).

## Rehydration
Contributor

We no longer use the term "rehydration", only "hydration".

Contributor Author

@kay-kim kay-kim Feb 19, 2025

As for hydration vs rehydration ... I did ask about that since you had mentioned it to me in an earlier pr. There seemed to be some ambiguity as to when we use the terms. There actually was a draft where I went with Hydration/rehydration and even (re)Hydration 😄

Contributor Author

I could go with Hydration and in that first sentence, add "(sometimes referred to as rehydration)."

Contributor

@morsapaes morsapaes Feb 20, 2025

(sometimes referred to as rehydration).

We should have a single term for user-facing documentation (or anything), and we've previously agreed that it should be simply hydration. Sources are either snapshotting (from the external system) or hydrating (from storage), compute objects just hydrate. No need to differentiate and potentially confuse users, since this is just our internal lingo coming through; you can see that hydration is consistently used in the system catalog, too.

@morsapaes
Contributor

Thanks for putting this up, @kay-kim! This PR reads more like reference documentation than guidance, and is missing some important bits we laid out in the previous draft — in particular, recommendations around reducing the amount of data that is ingested. That should be the main recommendation and most immediate takeaway of this page. It's also not clear who this page is geared towards: I'd expect it to be an intro for new users to increase the chance of a successful onboarding, rather than a path to production (e.g. how you spread objects across clusters is less important when you start, and we typically wouldn't want new users to think about clusters until they have to).

I'll take some time to work some of the suggestions above in this week.

@kay-kim
Contributor Author

kay-kim commented Feb 19, 2025

Hey @morsapaes -- thanks much!

  • w.r.t. limiting the amount of data ... it's still there in the best practice section in the Limit the volume of data section
  • As for intended purpose ...
    • I was going to pop out the best practices later into either operational guide. There's some slack discussion - I will be pulling all the best practices/recommendations/etc. into a centralized place -- some will be duplicated ... others will be single sourced into the central location. But, since I wasn't going to start that until end of Feb ... I figured I'd have some of the recommendations mentioned in this PR vetted here. This way, I have enough material to help better organize the new guide.
    • I also don't think the page should be only for new users. For new users onboarding ... we can include things geared towards them (like the limit the data volume) into the tutorials themselves as that's where new users would tend to spend time and we can best help them onboard.

operation](#snapshotting) to initially populate the source in Materialize.
Snapshotting is a resource-intensive operation that can require a significant
amount of CPU and memory. Consider using a larger cluster size during
snapshotting. Once the snapshotting operation is complete, you can downsize
Contributor Author

@kay-kim kay-kim Feb 20, 2025

For future verifications w.r.t. increasing size:

  • It may not be the case for certain sources.
  • Maybe after a certain size, it might not buy as much improvement.
  • For upsert sources, emphasize this a bit more + dedicated cluster for upsert sources (as in separate these sources from other sources)


- If possible, schedule creating new sources during off-peak hours to mitigate
the impact of snapshotting on both the upstream system and the Materialize
cluster.
Contributor Author

Need to verify about scheduling (namely, if there are exceptions that can be made (if necessary) to the rule depending on the source type)

Contributor

This is a weird one, and again only applies to production environments. Even then, it should be safe to e.g. add a new subsource to a source during peak hours.


#### Dedicate a cluster for the sources

If possible, dedicate a cluster just for sources. That is, avoid using the same
Contributor Author

@kay-kim kay-kim Feb 20, 2025

couple of things for future:

  1. even within the umbrella of dedicated source clusters, further division with dedicated upsert source clusters.
     If using a single source for all three ...
       • OOM'ing for compute, etc. may impact upstream.
       • won't be able to do blue/green
  2. So, you have a source cluster and it's sized for current steady state. What is the happy route for adding a new source to that cluster.

@@ -0,0 +1,244 @@
---
title: "Ingesting data"
Contributor

Suggested change
title: "Ingesting data"
title: "Ingest data"

@@ -0,0 +1,244 @@
---
title: "Ingesting data"
description: "How to ingest data into Materialize from external systems."
Contributor

Suggested change
description: "How to ingest data into Materialize from external systems."
description: "Best practices for ingesting data into Materialize from external systems."

Comment on lines +12 to +31
Materialize can ingest data from various external systems:

{{< multilinkbox >}}
{{< linkbox title="Databases (CDC)" >}}
- [PostgreSQL](/ingest-data/postgres/)
- [MySQL](/ingest-data/mysql/)
- [SQL Server](/ingest-data/cdc-sql-server/)
- [CockroachDB](/ingest-data/cdc-cockroachdb/)
{{</ linkbox >}}
{{< linkbox title="Message Brokers" >}}
- [Kafka](/ingest-data/kafka/)
- [Redpanda](/sql/create-source/kafka)
- [Other message brokers](/integrations/#message-brokers)
{{</ linkbox >}}
{{< linkbox title="Webhooks" >}}
- [Amazon EventBridge](/ingest-data/webhooks/amazon-eventbridge/)
- [Segment](/ingest-data/webhooks/segment/)
- [Other webhooks](/sql/create-source/webhook)
{{</ linkbox >}}
{{</ multilinkbox >}}
Contributor

Suggested change
Materialize can ingest data from various external systems:
{{< multilinkbox >}}
{{< linkbox title="Databases (CDC)" >}}
- [PostgreSQL](/ingest-data/postgres/)
- [MySQL](/ingest-data/mysql/)
- [SQL Server](/ingest-data/cdc-sql-server/)
- [CockroachDB](/ingest-data/cdc-cockroachdb/)
{{</ linkbox >}}
{{< linkbox title="Message Brokers" >}}
- [Kafka](/ingest-data/kafka/)
- [Redpanda](/sql/create-source/kafka)
- [Other message brokers](/integrations/#message-brokers)
{{</ linkbox >}}
{{< linkbox title="Webhooks" >}}
- [Amazon EventBridge](/ingest-data/webhooks/amazon-eventbridge/)
- [Segment](/ingest-data/webhooks/segment/)
- [Other webhooks](/sql/create-source/webhook)
{{</ linkbox >}}
{{</ multilinkbox >}}
You can ingest data into Materialize from various external systems:
{{< multilinkbox >}}
{{< linkbox title="Databases (CDC)" >}}
- [PostgreSQL](/ingest-data/postgres/)
- [MySQL](/ingest-data/mysql/)
- [SQL Server](/ingest-data/cdc-sql-server/)
- [MongoDB](https://github.com/MaterializeIncLabs/materialize-mongodb-debezium)
- [CockroachDB](/ingest-data/cdc-cockroachdb/)
- [Other databases](/integrations/#databases)
{{</ linkbox >}}
{{< linkbox title="Message Brokers" >}}
- [Kafka](/ingest-data/kafka/)
- [Redpanda](/sql/create-source/kafka)
- [Other message brokers](/integrations/#message-brokers)
{{</ linkbox >}}
{{< linkbox title="Webhooks" >}}
- [Amazon EventBridge](/ingest-data/webhooks/amazon-eventbridge/)
- [Segment](/ingest-data/webhooks/segment/)
- [Other webhooks](/sql/create-source/webhook)
{{</ linkbox >}}
{{</ multilinkbox >}}

Contributor

As a follow-up: MongoDB and Azure EventHubs should also have integration guides, since they come up somewhat frequently. I'll update the state of Azure EventHubs in the integrations overview, which isn't accurate.

Comment on lines +57 to +74
Snapshotting refers to the initial population of a source in Materialize. When
you create a new source, Materialize takes a snapshot of the data from the
upstream (i.e., external) system at a given offset and loads it into
Materialize.

For the offset, Materialize chooses the latest available offset. For the
available offset, Materialize uses:

- Log Sequence Number (LSN) for PostgreSQL sources.

- The number of transactions committed across all servers in the cluster for
MySQL sources.

- Kafka message offset for Kafka sources.

Materialize captures/commits a snapshot of all historical data up to that point.
Snapshot is persisted in the storage layer, and all records in this initial load
have the same ingestion timestamp.
Contributor

Suggested change
Snapshotting refers to the initial population of a source in Materialize. When
you create a new source, Materialize takes a snapshot of the data from the
upstream (i.e., external) system at a given offset and loads it into
Materialize.
For the offset, Materialize chooses the latest available offset. For the
available offset, Materialize uses:
- Log Sequence Number (LSN) for PostgreSQL sources.
- The number of transactions committed across all servers in the cluster for
MySQL sources.
- Kafka message offset for Kafka sources.
Materialize captures/commits a snapshot of all historical data up to that point.
Snapshot is persisted in the storage layer, and all records in this initial load
have the same ingestion timestamp.
When a new source is created, Materialize performs a sync of all data available
in the external system before it starts ingesting new data — an operation known
as _snapshotting_. Because the initial snapshot is persisted in the storage
layer atomically (i.e., at the same ingestion timestamp), you are **not able to
query the source until snapshotting is complete**.
Depending on the volume of data in the initial snapshot and the size of the
cluster the source is hosted in, this operation can take anywhere from a few
minutes up to several hours, and might require more compute resources than
steady-state.

Contributor

I don't think the unit of progress per source is relevant here (this is documented in the reference documentation for each source type). Also, I'd keep the original blurb from the previous PR, which gives a more comprehensive overview of the snapshotting operation and gives a TL;DR of the sections below.

Contributor

For completeness, in MySQL the unit of progress is not the number of transactions, but the transaction ID + the lower and upper range of GTIDs.

Comment on lines +78 to +99
The snapshotting operation duration depends on the snapshot dataset size and the
size of the Materialize cluster that is hosting the source. For very large
sources that contain hundreds of GB of data, it can take up to an hour or even
several hours to complete.

{{% tip %}}

- If possible, schedule creating new sources during off-peak hours to mitigate
the impact of snapshotting on both the upstream system and the Materialize
cluster.

- Consider using a larger cluster size during snapshotting. Once the
snapshottingis complete, you can downsize the cluster to align with the volume
of changes being replicated from your upstream in steady-state.

- See also [Best practices](#best-practices).

{{% /tip %}}

If you create your source from the Materialize Console, the overview page for
the source displays the snapshotting progress. Alternatively, you can run a
query to monitor its progress. See [Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion/#monitoring-the-snapshotting-progress).
Contributor

Suggested change
The snapshotting operation duration depends on the snapshot dataset size and the
size of the Materialize cluster that is hosting the source. For very large
sources that contain hundreds of GB of data, it can take up to an hour or even
several hours to complete.
{{% tip %}}
- If possible, schedule creating new sources during off-peak hours to mitigate
the impact of snapshotting on both the upstream system and the Materialize
cluster.
- Consider using a larger cluster size during snapshotting. Once the
snapshottingis complete, you can downsize the cluster to align with the volume
of changes being replicated from your upstream in steady-state.
- See also [Best practices](#best-practices).
{{% /tip %}}
If you create your source from the Materialize Console, the overview page for
the source displays the snapshotting progress. Alternatively, you can run a
query to monitor its progress. See [Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion/#monitoring-the-snapshotting-progress).
The duration of the snapshotting operation depends on the volume of data in the
initial snapshot and the size of the cluster the source is hosted in. To reduce
the operational burden of snapshotting on the upstream system and guarantee
you're only bringing in the volume of data you effectively need in Materialize,
we recommend:
- If possible, running source creation operations during **off-peak hours** to
minimize operational risk in both the upstream system and Materialize.
- **Limiting the volume of data** that is synced into Materialize on source
creation. This will help speed up snapshotting, as well as make data
exploration more lightweight. See [Limit the volume of data](#limit-the-volume-of-data)
for best practices.
- **Overprovisioning the ingestion cluster** for snapshotting, then
right-sizing once the snapshot is complete and you have a better grasp on the
resource needs of your source(s) in steady-state. See [Use a larger cluster for
snapshotting](#use-a-larger-cluster-for-snapshotting) for best practices.

Contributor

Not a big fan of jamming a lot of text into a tip annotation — it feels like a sign that we need to improve how the text is organized.

Contributor

Also, I think we should already be surfacing best practices like limiting data under this section, so it's less likely they'll be missed.



Comment on lines +101 to +111
### CPU and memory utilization

Snapshotting may require more compute resources from the Materialize cluster
than steady-state. If there are signs of resource exhaustion (that is, the
cluster restarts because it ran out of memory), resize the cluster.

{{% tip %}}
Consider using a larger cluster size during snapshotting. Once the snapshotting
operation is complete, you can downsize your source cluster to align with the
volume of changes being replicated from your upstream in steady-state.
{{% /tip %}}
Contributor

Suggested change
### CPU and memory utilization
Snapshotting may require more compute resources from the Materialize cluster
than steady-state. If there are signs of resource exhaustion (that is, the
cluster restarts because it ran out of memory), resize the cluster.
{{% tip %}}
Consider using a larger cluster size during snapshotting. Once the snapshotting
operation is complete, you can downsize your source cluster to align with the
volume of changes being replicated from your upstream in steady-state.
{{% /tip %}}
### Monitoring progress
While snapshotting is taking place, you can monitor the progress of the
operation in the **overview page** for the source in the [Materialize Console](https://console.materialize.com/).
Alternatively, you can manually keep track of it using information from the
system catalog. See [Monitoring the snapshotting
progress](/ingest-data/monitoring-data-ingestion/#monitoring-the-snapshotting-progress)
for guidance.
It's also important to **monitor CPU and memory utilization** for the cluster
hosting the source during snapshotting. If there are signs of resource
exhaustion, you may need to [resize the cluster](#use-a-larger-cluster-for-snapshotting).
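
For illustration only, here is a minimal Python sketch of how the manual tracking mentioned above could be reduced to a single percentage. It assumes you have already read two counters from the system catalog: the number of records staged so far and the number of records known to be in the snapshot. The function and parameter names are hypothetical and are not part of any Materialize API.

```python
from typing import Optional


def snapshot_progress_pct(records_staged: int, records_known: Optional[int]) -> Optional[float]:
    """Return snapshot completion as a percentage, or None if the total is unknown."""
    # Assumption: the total may be missing (None) or still 0 early in the snapshot.
    if not records_known:
        return None
    pct = 100.0 * records_staged / records_known
    # Clamp to 100 in case the staged count briefly overshoots the known total.
    return round(min(pct, 100.0), 2)
```

Returning `None` rather than 0 when the total is not yet known keeps "no progress" distinguishable from "progress cannot be computed yet".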

Comment on lines +115 to +119
While a source is snapshotting, the source (and the associated subsources)
cannot serve queries. That is, queries issued to the snapshotting source (and
its subsources) will return after the snapshotting completes (unless the user
breaks out of the query). If the user does not break out of the query, the
returned query results will reflect the data from the snapshot.
Contributor

@morsapaes morsapaes Feb 20, 2025

Suggested change
While a source is snapshotting, the source (and the associated subsources)
cannot serve queries. That is, queries issued to the snapshotting source (and
its subsources) will return after the snapshotting completes (unless the user
breaks out of the query). If the user does not break out of the query, the
returned query results will reflect the data from the snapshot.
Because the initial snapshot is persisted atomically, you are **not able to
query the source until snapshotting is complete**. This means that queries
issued against (sub)sources undergoing snapshotting will hang until the
operation completes. Once the initial snapshot has been ingested, you can start
querying your (sub)sources and Materialize will continue ingesting any new data
as it arrives, in real time.
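
To make the blocking behavior concrete, here is a toy Python model. It is a sketch under stated assumptions, not how Materialize is implemented: the class and method names are hypothetical, and a `threading.Event` stands in for "the snapshot has been committed". It shows the two properties described above: every row of the initial load shares one ingestion timestamp, and a query waits until the whole snapshot is stored.

```python
import threading


class SnapshottingSource:
    """Toy model of a source whose initial snapshot becomes visible atomically."""

    def __init__(self):
        self._snapshot_committed = threading.Event()
        self._rows = []

    def commit_snapshot(self, rows, timestamp):
        # Every row in the initial load gets the same ingestion timestamp,
        # and waiting readers are released only once all rows are stored.
        self._rows = [(row, timestamp) for row in rows]
        self._snapshot_committed.set()

    def query(self, timeout=None):
        # Block until the snapshot is committed. With a timeout, give up
        # (similar to a user cancelling the query) and return None.
        if not self._snapshot_committed.wait(timeout):
            return None
        return list(self._rows)


source = SnapshottingSource()
early = source.query(timeout=0.01)   # snapshot not committed yet, so this is None
source.commit_snapshot(["a", "b"], timestamp=1)
rows = source.query()                # returns both rows with the same timestamp
```

In the model, a query with no timeout simply waits, which mirrors a query that hangs until snapshotting completes.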

Comment on lines +121 to +124
## Running/steady-state

Once snapshotting completes, Materialize transitions to Running state. During
this state, Materialize continually ingests changes from the upstream system.
Contributor

Suggested change
## Running/steady-state
Once snapshotting completes, Materialize transitions to Running state. During
this state, Materialize continually ingests changes from the upstream system.

4 participants