[DO NOT MERGE] Experimental Iceberg sharding runs #34020

Draft: wants to merge 2 commits into master

ahmedabu98 (Contributor) commented Feb 19, 2025

Experimenting with GroupIntoBatches' key parallelism when writing to Iceberg.

The main method is in IcebergShardingRuns.java.
Choose your own PROJECT, DATASET, and WAREHOUSE, then configure the following options to run under different scenarios:

// ======== experiment with these numbers ===========
int numShards = 1;
long payloadSize = 1 << 10; // 1KB
int numIcebergPartitions = 0;
// ==================================================
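
For intuition about what numShards controls, here is a minimal plain-Java sketch (no Beam dependency) of spreading records across a fixed number of shard keys. GroupIntoBatches operates on keyed input, so the number of distinct keys caps how many groups can be batched in parallel. The class ShardKeySketch and its round-robin key assignment are hypothetical illustrations and are not taken from IcebergShardingRuns.java.

```java
import java.util.HashMap;
import java.util.Map;

public class ShardKeySketch {
    // Hypothetical helper: maps a record index onto one of numShards keys (round-robin).
    static int shardKey(long recordIndex, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        return (int) Math.floorMod(recordIndex, (long) numShards);
    }

    // Sums the payload bytes that land on each shard key for numRecords records
    // of payloadSize bytes each.
    static Map<Integer, Long> bytesPerShard(long numRecords, long payloadSize, int numShards) {
        Map<Integer, Long> totals = new HashMap<>();
        for (long i = 0; i < numRecords; i++) {
            totals.merge(shardKey(i, numShards), payloadSize, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        long payloadSize = 1 << 10; // 1 KiB, as in the PR description
        // One shard: every byte goes through a single key.
        System.out.println(bytesPerShard(1000, payloadSize, 1));
        // Four shards: the same bytes are split evenly across four keys.
        System.out.println(bytesPerShard(1000, payloadSize, 4));
    }
}
```

With more shards, each key carries a smaller share of the total bytes, which is the trade-off between parallelism and file size that these options are meant to explore.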

The goal is to find the throughput-per-shard that produces large files while maintaining good performance.
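
As a rough back-of-the-envelope aid for comparing runs, the sketch below estimates how many input bytes end up in one file for a given configuration. It rests on several assumptions that are not stated in the PR: records arrive at a steady rate, each shard flushes one file per Iceberg partition every flushIntervalSeconds, and records are spread evenly across shards and partitions. It also counts raw payload bytes only and ignores file-format encoding and compression. The class and parameter names (ShardSizingSketch, recordsPerSecond, flushIntervalSeconds) are hypothetical.

```java
public class ShardSizingSketch {
    // Raw payload bytes flowing through a single shard per second,
    // assuming records are spread evenly across shards.
    static double bytesPerShardPerSecond(double recordsPerSecond, long payloadSize, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        return recordsPerSecond * payloadSize / numShards;
    }

    // Estimated raw bytes per written file, assuming each shard writes one file
    // per partition per flush interval. numIcebergPartitions == 0 is treated as
    // an unpartitioned table (fan-out of 1).
    static double estimatedFileBytes(double recordsPerSecond, long payloadSize, int numShards,
                                     int numIcebergPartitions, double flushIntervalSeconds) {
        double perShard =
            bytesPerShardPerSecond(recordsPerSecond, payloadSize, numShards) * flushIntervalSeconds;
        int fanout = Math.max(1, numIcebergPartitions);
        return perShard / fanout;
    }

    public static void main(String[] args) {
        long payloadSize = 1 << 10; // 1 KiB, as in the PR description
        // Hypothetical load: 10,000 records/s, flushed every 60 s.
        System.out.println(estimatedFileBytes(10_000, payloadSize, 1, 0, 60));
        System.out.println(estimatedFileBytes(10_000, payloadSize, 4, 10, 60));
    }
}
```

Under these assumptions, raising numShards or numIcebergPartitions shrinks the estimated file size proportionally, so measured results can be checked against this simple model.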
