docs: Add ingest data landing page #31468

kay-kim · 2025-02-12T15:25:15Z

Rebranched from #27061 + added updates.

kay-kim · 2025-02-12T15:30:31Z

doc/user/content/ingest-data/_index.md

+depending on the volume of the upstream changes, Materialize may lag behind the
+upstream system. If the lag is significant, queries may block until Materialize
+has caught up sufficiently with the upstream system when using the default
+[isolation level](/get-started/isolation-level/) of [strict


FYI -- I'm going to bubble up the isolation + arrangements pages (i.e., make an architecture section in the docs) sometime next week.

kay-kim · 2025-02-12T15:39:15Z

doc/user/content/ingest-data/_index.md

@@ -0,0 +1,244 @@
+---
+title: "Ingesting data"


https://preview.materialize.com/materialize/31468/ingest-data/

kay-kim · 2025-02-12T15:39:53Z

doc/user/content/ingest-data/monitoring-data-ingestion.md

+    identifier: ingest-monitoring
+    parent: ingest-data
+    weight: 39
+---


https://preview.materialize.com/materialize/31468/ingest-data/monitoring-data-ingestion/

ala2134

Only one grammar change; otherwise, just some take-it-or-leave-it comments 😄

ala2134 · 2025-02-14T13:47:29Z

doc/user/content/ingest-data/_index.md

+  cluster.
+
+- Consider using a larger cluster size during snapshotting. Once the
+  snapshottingis complete, you can downsize the cluster to align with the volume


Needs a space between "snapshotting" and "is".

ala2134 · 2025-02-14T13:49:36Z

doc/user/content/ingest-data/_index.md

+
+{{% /tip %}}
+
+If you create your source from the Materialize Console, the overview page for


Should we link out to docs about monitoring in the Console? Like here? https://materialize.com/docs/console/data/

Ah ignore, I just saw the next link :)

If you create your source from the Materialize Console, the overview page for the source displays the snapshotting progress.

For shared context, this isn't correct. The Console will always display snapshotting progress, regardless of what interface you used to create the source.

ala2134 · 2025-02-14T13:53:39Z

doc/user/content/ingest-data/_index.md

+
+In the Materialize Console, you can see a source's data freshness from the
+**Data Explorer** screen. Alternatively, you can run a query to monitor the lag.
+See [Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion).


Should we instead link to the steady state portion of this page? https://preview.materialize.com/materialize/31468/ingest-data/monitoring-data-ingestion/#monitoring-rehydrationdata-freshness-status

morsapaes · 2025-02-19T12:06:05Z

doc/user/content/ingest-data/_index.md

+**Data Explorer** screen. Alternatively, you can run a query to monitor the lag.
+See [Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion).
+
+## Rehydration


We no longer use the term "rehydration", only "hydration".

As for hydration vs. rehydration: I did ask about that, since you had mentioned it to me in an earlier PR. There seemed to be some ambiguity as to when we use each term. There was actually a draft where I went with Hydration/rehydration and even (re)Hydration 😄

I could go with Hydration and in that first sentence, add "(sometimes referred to as rehydration)."

(sometimes referred to as rehydration).

We should have a single term for user-facing documentation (or anything), and we've previously agreed that it should be simply hydration. Sources are either snapshotting (from the external system) or hydrating (from storage), compute objects just hydrate. No need to differentiate and potentially confuse users, since this is just our internal lingo coming through; you can see that hydration is consistently used in the system catalog, too.
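One way to keep the agreed terminology from drifting is a small check over the docs content. The following is a minimal sketch, not part of this PR: the `PREFERRED_TERMS` mapping and the `find_deprecated_terms` helper are hypothetical names, and the only rule encoded is the one stated above ("hydration", not "rehydration").

```python
import re

# Deprecated term -> preferred term. The single entry reflects the agreement
# stated in this thread; the mapping itself is a hypothetical illustration.
PREFERRED_TERMS = {"rehydration": "hydration"}


def find_deprecated_terms(text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, found_term, preferred_term) for each occurrence."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for deprecated, preferred in PREFERRED_TERMS.items():
            # \b restricts matches to whole words; the check is case-insensitive.
            pattern = rf"\b{re.escape(deprecated)}\b"
            for match in re.finditer(pattern, line, re.IGNORECASE):
                findings.append((line_number, match.group(0), preferred))
    return findings


sample = (
    "## Rehydration\n"
    "\n"
    "Hydration happens after a restart.\n"
    "Monitor rehydration status."
)
for line_number, found, preferred in find_deprecated_terms(sample):
    print(f"line {line_number}: replace '{found}' with '{preferred}'")
```

Run against the sample, this flags the heading on line 1 and the sentence on line 4, and leaves the plain "Hydration" on line 3 alone.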

morsapaes · 2025-02-19T12:14:22Z

Thanks for putting this up, @kay-kim! This PR reads more like reference documentation than guidance, and is missing some important bits we laid out in the previous draft — in particular, recommendations around reducing the amount of data that is ingested. That should be the main recommendation and most immediate takeaway of this page. It's also not clear who this page is geared towards: I'd expect it to be an intro for new users to increase the chance of a successful onboarding, rather than a path to production (e.g. how you spread objects across clusters is less important when you start, and we typically wouldn't want new users to think about clusters until they have to).

I'll take some time to work some of the suggestions above in this week.

kay-kim · 2025-02-19T13:43:51Z

Hey @morsapaes -- thanks much!

W.r.t. limiting the amount of data: it's still there, in the "Limit the volume of data" section under best practices.

As for intended purpose:
- I was going to pop out the best practices later into either operational guide. There's some slack discussion - I will be pulling all the best practices/recommendations/etc. into a centralized place -- some will be duplicated ... others will be single sourced into the central location. But, since I wasn't going to start that until end of Feb ... I figured I'd have some of the recommendations mentioned in this PR vetted here. This way, I have enough material to help better organize the new guide.
- I also don't think the page should be only for new users. For new users onboarding ... we can include things geared towards them (like the limit the data volume) into the tutorials themselves as that's where new users would tend to spend time and we can best help them onboard.

kay-kim · 2025-02-20T02:03:58Z

doc/user/content/ingest-data/_index.md

+  operation](#snapshotting) to initially populate the source in Materialize.
+  Snapshotting is a resource-intensive operation that can require a significant
+  amount of CPU and memory. Consider using a larger cluster size during
+  snapshotting. Once the snapshotting operation is complete, you can downsize


For future verifications w.r.t. increasing size:

It may not be the case for certain sources.

Maybe after a certain size, it might not buy as much improvement.

For upsert sources, emphasize this a bit more + dedicated cluster for upsert sources (as in separate these sources from other sources)

kay-kim · 2025-02-20T02:05:57Z

doc/user/content/ingest-data/_index.md

+
+- If possible, schedule creating new sources during off-peak hours to mitigate
+  the impact of snapshotting on both the upstream system and the Materialize
+  cluster.


Need to verify about scheduling (namely, if there are exceptions that can be made (if necessary) to the rule depending on the source type)

This is a weird one, and again only applies to production environments. Even then, it should be safe to e.g. add a new subsource to a source during peak hours.

kay-kim · 2025-02-20T02:08:02Z

doc/user/content/ingest-data/_index.md

+
+#### Dedicate a cluster for the sources
+
+If possible, dedicate a cluster just for sources. That is, avoid using the same


couple of things for future:

even within the umbrella of dedicated source clusters, further division with dedicated upsert source clusters.
If using a single source for all three ...

OOM'ing for compute, etc. may impact upstream.

won't be able to do blue/green

So, you have a source cluster and it's sized for the current steady state. What is the happy path for adding a new source to that cluster?

morsapaes · 2025-02-20T16:17:13Z

doc/user/content/ingest-data/_index.md

@@ -0,0 +1,244 @@
+---
+title: "Ingesting data"


Suggested change

-title: "Ingesting data"
+title: "Ingest data"

morsapaes · 2025-02-20T16:19:09Z

doc/user/content/ingest-data/_index.md

@@ -0,0 +1,244 @@
+---
+title: "Ingesting data"
+description: "How to ingest data into Materialize from external systems."


Suggested change

-description: "How to ingest data into Materialize from external systems."
+description: "Best practices for ingesting data into Materialize from external systems."

morsapaes · 2025-02-20T16:25:44Z

doc/user/content/ingest-data/_index.md

+Materialize can ingest data from various external systems:
+
+{{< multilinkbox >}}
+{{< linkbox title="Databases (CDC)" >}}
+- [PostgreSQL](/ingest-data/postgres/)
+- [MySQL](/ingest-data/mysql/)
+- [SQL Server](/ingest-data/cdc-sql-server/)
+- [CockroachDB](/ingest-data/cdc-cockroachdb/)
+{{</ linkbox >}}
+{{< linkbox title="Message Brokers" >}}
+- [Kafka](/ingest-data/kafka/)
+- [Redpanda](/sql/create-source/kafka)
+- [Other message brokers](/integrations/#message-brokers)
+{{</ linkbox >}}
+{{< linkbox title="Webhooks" >}}
+- [Amazon EventBridge](/ingest-data/webhooks/amazon-eventbridge/)
+- [Segment](/ingest-data/webhooks/segment/)
+- [Other webhooks](/sql/create-source/webhook)
+{{</ linkbox >}}
+{{</ multilinkbox >}}


Suggested change

-Materialize can ingest data from various external systems:
-
-{{< multilinkbox >}}
-{{< linkbox title="Databases (CDC)" >}}
-- [PostgreSQL](/ingest-data/postgres/)
-- [MySQL](/ingest-data/mysql/)
-- [SQL Server](/ingest-data/cdc-sql-server/)
-- [CockroachDB](/ingest-data/cdc-cockroachdb/)
-{{</ linkbox >}}
-{{< linkbox title="Message Brokers" >}}
-- [Kafka](/ingest-data/kafka/)
-- [Redpanda](/sql/create-source/kafka)
-- [Other message brokers](/integrations/#message-brokers)
-{{</ linkbox >}}
-{{< linkbox title="Webhooks" >}}
-- [Amazon EventBridge](/ingest-data/webhooks/amazon-eventbridge/)
-- [Segment](/ingest-data/webhooks/segment/)
-- [Other webhooks](/sql/create-source/webhook)
-{{</ linkbox >}}
-{{</ multilinkbox >}}
+You can ingest data into Materialize from various external systems:
+
+{{< multilinkbox >}}
+{{< linkbox title="Databases (CDC)" >}}
+- [PostgreSQL](/ingest-data/postgres/)
+- [MySQL](/ingest-data/mysql/)
+- [SQL Server](/ingest-data/cdc-sql-server/)
+- [MongoDB](https://github.com/MaterializeIncLabs/materialize-mongodb-debezium)
+- [CockroachDB](/ingest-data/cdc-cockroachdb/)
+- [Other databases](/integrations/#databases)
+{{</ linkbox >}}
+{{< linkbox title="Message Brokers" >}}
+- [Kafka](/ingest-data/kafka/)
+- [Redpanda](/sql/create-source/kafka)
+- [Other message brokers](/integrations/#message-brokers)
+{{</ linkbox >}}
+{{< linkbox title="Webhooks" >}}
+- [Amazon EventBridge](/ingest-data/webhooks/amazon-eventbridge/)
+- [Segment](/ingest-data/webhooks/segment/)
+- [Other webhooks](/sql/create-source/webhook)
+{{</ linkbox >}}
+{{</ multilinkbox >}}

As a follow-up: MongoDB and Azure EventHubs should also have integration guides, since they come up somewhat frequently. I'll update the state of Azure EventHubs in the integrations overview, which isn't accurate.

morsapaes · 2025-02-20T16:43:49Z

doc/user/content/ingest-data/_index.md

+Snapshotting refers to the initial population of a source in Materialize. When
+you create a new source, Materialize takes a snapshot of the data from the
+upstream (i.e., external) system at a given offset and loads it into
+Materialize.
+
+For the offset, Materialize chooses the latest available offset. For the
+available offset, Materialize uses:
+
+- Log Sequence Number (LSN) for PostgreSQL sources.
+
+- The number of transactions committed across all servers in the cluster for
+  MySQL sources.
+
+- Kafka message offset for Kafka sources.
+
+Materialize captures/commits a snapshot of all historical data up to that point.
+Snapshot is persisted in the storage layer, and all records in this initial load
+have the same ingestion timestamp.


Suggested change

-Snapshotting refers to the initial population of a source in Materialize. When
-you create a new source, Materialize takes a snapshot of the data from the
-upstream (i.e., external) system at a given offset and loads it into
-Materialize.
-
-For the offset, Materialize chooses the latest available offset. For the
-available offset, Materialize uses:
-
-- Log Sequence Number (LSN) for PostgreSQL sources.
-
-- The number of transactions committed across all servers in the cluster for
-  MySQL sources.
-
-- Kafka message offset for Kafka sources.
-
-Materialize captures/commits a snapshot of all historical data up to that point.
-Snapshot is persisted in the storage layer, and all records in this initial load
-have the same ingestion timestamp.
+When a new source is created, Materialize performs a sync of all data available
+in the external system before it starts ingesting new data — an operation known
+as _snapshotting_. Because the initial snapshot is persisted in the storage
+layer atomically (i.e., at the same ingestion timestamp), you are **not able to
+query the source until snapshotting is complete**.
+
+Depending on the volume of data in the initial snapshot and the size of the
+cluster the source is hosted in, this operation can take anywhere from a few
+minutes up to several hours, and might require more compute resources than
+steady-state.

I don't think the unit of progress per source is relevant here (this is documented in the reference documentation for each source type). Also, I'd keep the original blurb from the previous PR, which gives a more comprehensive overview of the snapshotting operation and gives a TL;DR of the sections below.

For completeness, in MySQL the unit of progress is not the number of transactions, but the transaction ID + the lower and upper range of GTIDs.

morsapaes · 2025-02-20T17:08:23Z

doc/user/content/ingest-data/_index.md

+The snapshotting operation duration depends on the snapshot dataset size and the
+size of the Materialize cluster that is hosting the source. For very large
+sources that contain hundreds of GB of data, it can take up to an hour or even
+several hours to complete.
+
+{{% tip %}}
+
+- If possible, schedule creating new sources during off-peak hours to mitigate
+  the impact of snapshotting on both the upstream system and the Materialize
+  cluster.
+
+- Consider using a larger cluster size during snapshotting. Once the
+  snapshottingis complete, you can downsize the cluster to align with the volume
+  of changes being replicated from your upstream in steady-state.
+
+- See also [Best practices](#best-practices).
+
+{{% /tip %}}
+
+If you create your source from the Materialize Console, the overview page for
+the source displays the snapshotting progress. Alternatively, you can run a
+query to monitor its progress. See [Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion/#monitoring-the-snapshotting-progress).


Suggested change

-The snapshotting operation duration depends on the snapshot dataset size and the
-size of the Materialize cluster that is hosting the source. For very large
-sources that contain hundreds of GB of data, it can take up to an hour or even
-several hours to complete.
-
-{{% tip %}}
-
-- If possible, schedule creating new sources during off-peak hours to mitigate
-  the impact of snapshotting on both the upstream system and the Materialize
-  cluster.
-
-- Consider using a larger cluster size during snapshotting. Once the
-  snapshottingis complete, you can downsize the cluster to align with the volume
-  of changes being replicated from your upstream in steady-state.
-
-- See also [Best practices](#best-practices).
-
-{{% /tip %}}
-
-If you create your source from the Materialize Console, the overview page for
-the source displays the snapshotting progress. Alternatively, you can run a
-query to monitor its progress. See [Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion/#monitoring-the-snapshotting-progress).
+The duration of the snapshotting operation depends on the volume of data in the
+initial snapshot and the size of the cluster the source is hosted in. To reduce
+the operational burden of snapshotting on the upstream system and guarantee
+you're only bringing in the volume of data you effectively need in Materialize,
+we recommend:
+
+- If possible, running source creation operations during **off-peak hours** to
+  minimize operational risk in both the upstream system and Materialize.
+
+- **Limiting the volume of data** that is synced into Materialize on source
+  creation. This will help speed up snapshotting, as well as make data
+  exploration more lightweight. See [Limit the volume of data](#limit-the-volume-of-data)
+  for best practices.
+
+- **Overprovisioning the ingestion cluster** for snapshotting, then
+  right-sizing once the snapshot is complete and you have a better grasp on the
+  resource needs of your source(s) in steady-state. See [Use a larger cluster
+  for snapshotting](#use-a-larger-cluster-for-snapshotting) for best practices.

Not a big fan of jamming a lot of text into a tip annotation — it feels like a sign that we need to improve how the text is organized.

Also, I think we should already be surfacing best practices like limiting data under this section, so it's less likely they'll be missed.
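To make the "overprovision, then right-size" recommendation concrete, here is a minimal sketch that builds the two resize statements. It assumes the `ALTER CLUSTER ... SET (SIZE = ...)` form; the cluster name `ingest` and the sizes `400cc` and `100cc` are illustrative placeholders, and `resize_statement` and `snapshot_resize_plan` are hypothetical helpers, so check the `ALTER CLUSTER` reference before relying on any of it.

```python
def resize_statement(cluster: str, size: str) -> str:
    """Build an ALTER CLUSTER statement that sets the cluster size.

    Assumption: the `SET (SIZE = '...')` option form is accepted.
    """
    # Quote the identifier and the literal so embedded quotes cannot
    # terminate them early.
    ident = '"' + cluster.replace('"', '""') + '"'
    literal = "'" + size.replace("'", "''") + "'"
    return f"ALTER CLUSTER {ident} SET (SIZE = {literal});"


def snapshot_resize_plan(cluster: str, snapshot_size: str, steady_size: str) -> list[str]:
    """Return the statements to run before and after snapshotting."""
    return [
        resize_statement(cluster, snapshot_size),  # run before creating the source
        resize_statement(cluster, steady_size),    # run once snapshotting completes
    ]


# Illustrative values only; pick sizes based on your own workload.
for statement in snapshot_resize_plan("ingest", "400cc", "100cc"):
    print(statement)
```

The second statement should only be issued after the snapshot is confirmed complete, which is why the sketch returns the statements rather than executing them.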

morsapaes · 2025-02-20T17:20:44Z

doc/user/content/ingest-data/_index.md

+
+{{% /tip %}}
+
+If you create your source from the Materialize Console, the overview page for


If you create your source from the Materialize Console, the overview page for the source displays the snapshotting progress.

For shared context, this isn't correct. The Console will always display snapshotting progress, regardless of what interface you used to create the source.

morsapaes · 2025-02-20T17:22:04Z

doc/user/content/ingest-data/_index.md

+### CPU and memory utilization
+
+Snapshotting may require more compute resources from the Materialize cluster
+than steady-state. If there are signs of resource exhaustion (that is, the
+cluster restarts because it ran out of memory), resize the cluster.
+
+{{% tip %}}
+Consider using a larger cluster size during snapshotting. Once the snapshotting
+operation is complete, you can downsize your source cluster to align with the
+volume of changes being replicated from your upstream in steady-state.
+{{% /tip %}}


Suggested change

-### CPU and memory utilization
-
-Snapshotting may require more compute resources from the Materialize cluster
-than steady-state. If there are signs of resource exhaustion (that is, the
-cluster restarts because it ran out of memory), resize the cluster.
-
-{{% tip %}}
-Consider using a larger cluster size during snapshotting. Once the snapshotting
-operation is complete, you can downsize your source cluster to align with the
-volume of changes being replicated from your upstream in steady-state.
-{{% /tip %}}
+### Monitoring progress
+
+While snapshotting is taking place, you can monitor the progress of the
+operation in the **overview page** for the source in the
+[Materialize Console](https://console.materialize.com/). Alternatively, you can
+manually keep track of it using information from the system catalog. See
+[Monitoring the snapshotting progress](/ingest-data/monitoring-data-ingestion/#monitoring-the-snapshotting-progress)
+for guidance.
+
+It's also important to **monitor CPU and memory utilization** for the cluster
+hosting the source during snapshotting. If there are signs of resource
+exhaustion, you may need to [resize the cluster](#use-a-larger-cluster-for-snapshotting).
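For the "manually keep track" path, the arithmetic is simple once you have two counters for a source. The sketch below assumes the system catalog exposes a count of records staged so far and a count of records known to be in the snapshot; the parameter names `records_staged` and `records_known` and the helper itself are hypothetical, not the catalog's actual column names.

```python
from typing import Optional


def snapshot_progress_pct(records_staged: int, records_known: int) -> Optional[float]:
    """Return snapshot completion as a percentage, or None if not yet known.

    Both inputs are assumed counters read from the system catalog.
    """
    if records_known <= 0:
        # The total is not known yet, so a percentage would be meaningless.
        return None
    # Clamp at 100 in case the staged count briefly exceeds the known total.
    return round(min(100.0, 100.0 * records_staged / records_known), 2)


print(snapshot_progress_pct(250, 1000))  # a quarter of the snapshot staged
print(snapshot_progress_pct(0, 0))       # total not known yet
```

Returning `None` rather than `0.0` when the total is unknown keeps "not started" distinguishable from "no estimate available".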

morsapaes · 2025-02-20T17:27:56Z

doc/user/content/ingest-data/_index.md

+While a source is snapshotting, the source (and the associated subsources)
+cannot serve queries. That is, queries issued to the snapshotting source (and
+its subsources) will return after the snapshotting completes (unless the user
+breaks out of the query). If the user does not break out of the query, the
+returned query results will reflect the data from the snapshot.


Suggested change

-While a source is snapshotting, the source (and the associated subsources)
-cannot serve queries. That is, queries issued to the snapshotting source (and
-its subsources) will return after the snapshotting completes (unless the user
-breaks out of the query). If the user does not break out of the query, the
-returned query results will reflect the data from the snapshot.
+Because the initial snapshot is persisted atomically, you are **not able to
+query the source until snapshotting is complete**. This means that queries
+issued against (sub)sources undergoing snapshotting will hang until the
+operation completes. Once the initial snapshot has been ingested, you can start
+querying your (sub)sources and Materialize will continue ingesting any new data
+as it arrives, in real time.

morsapaes · 2025-02-20T17:28:59Z

doc/user/content/ingest-data/_index.md

+## Running/steady-state
+
+Once snapshotting completes, Materialize transitions to Running state. During
+this state, Materialize continually ingests changes from the upstream system.


Suggested change

-## Running/steady-state
-
-Once snapshotting completes, Materialize transitions to Running state. During
-this state, Materialize continually ingests changes from the upstream system.

sthm and others added 10 commits February 12, 2025 10:23

doc/user: add guidance for creating large sources

8dacb28

Remove trailing whitespace

bebafcc

Fix title

0b9b937

Tidy intro

ec87e08

Add hydration to troubleshooting section

43698a0

Typos

213ea72

Fix typos

09ae475

Minor tweaks

2f05dbf

Dump local changes

af40f33

Updates to ingest landing page + add a monitoring page

74a93f4

kay-kim requested a review from a team as a code owner February 12, 2025 15:25

kay-kim requested a review from frankmcsherry February 12, 2025 15:27

kay-kim commented Feb 12, 2025

ala2134 requested changes Feb 14, 2025

morsapaes reviewed Feb 19, 2025

kay-kim commented Feb 20, 2025

morsapaes requested changes Feb 20, 2025



