Support source/sink for plain Parquet/ORC/Avro Tables #166

Open
anoopj opened this issue Nov 3, 2023 · 9 comments
Labels
enhancement, good first issue

Comments

@anoopj

anoopj commented Nov 3, 2023

Supporting plain Parquet/ORC/Avro tables (partitioned as well as unpartitioned) as a source may be useful for "upgrading" legacy data to table formats. A sink may be useful for exporting a specific snapshot for interoperability reasons.

This feature is lower priority, as Iceberg, Delta, etc. have native support for metadata-only conversions and offer Spark procedures.

@the-other-tim-brown
Contributor

@anoopj what would the metadata look like for a sink export?

I like the idea of a generic bootstrap so that users could take existing data and try out all 3 formats if they want to do some testing with other tools.

@anoopj
Author

anoopj commented Nov 6, 2023

@anoopj what would the metadata look like for a sink export?

Sink could be based on manifest files in SymlinkTextInputFormat. BigQuery also now supports manifest files.

I like the idea of a generic bootstrap so that users could take existing data and try out all 3 formats if they want to do some testing with other tools.

Yes, bootstrap is probably higher priority than sink.
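
To make the manifest-based sink idea concrete, here is a minimal sketch in Python. It is not XTable code. It assumes that a manifest readable through SymlinkTextInputFormat is a plain text file listing one data file path per line, with one manifest per partition directory, and it borrows the `_symlink_format_manifest` directory name from Delta Lake's manifest generation. The function names `build_manifests` and `write_manifests` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path
import posixpath

# Assumption: directory name borrowed from Delta Lake's symlink manifest layout.
MANIFEST_DIR = "_symlink_format_manifest"


def build_manifests(table_root: str, data_files: list[str]) -> dict[str, str]:
    """Group a snapshot's data files by partition directory.

    Returns a mapping from the partition path (relative to the table root,
    "" for an unpartitioned table) to the manifest text: one data file
    path per line.
    """
    root = table_root.rstrip("/") + "/"
    grouped: dict[str, list[str]] = defaultdict(list)
    for uri in data_files:
        if not uri.startswith(root):
            raise ValueError(f"{uri} is not under {table_root}")
        partition = posixpath.dirname(uri[len(root):])
        grouped[partition].append(uri)
    return {
        partition: "\n".join(sorted(uris)) + "\n"
        for partition, uris in grouped.items()
    }


def write_manifests(output_root: Path, manifests: dict[str, str]) -> list[Path]:
    """Write one file named 'manifest' per partition under MANIFEST_DIR."""
    written = []
    for partition, text in sorted(manifests.items()):
        target = output_root / MANIFEST_DIR / partition / "manifest"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

In this sketch the list of data files would come from whichever snapshot is being exported; how that list is obtained from the source table format is left out on purpose.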

the-other-tim-brown added the enhancement and good first issue labels on Nov 6, 2023
@the-other-tim-brown
Contributor

@jackwener any interest in looking into something like this?

@marqub

marqub commented Apr 30, 2024

@the-other-tim-brown I'm trying to find a good first issue to ramp up on XTable. Can I take a look at this one? Perhaps we can split it into different issues. One initial task could be to add support for the Parquet input data format, for example. I'm not sure what the code looks like, but ultimately we can create something modular enough to extend to Avro or other formats later, if it hasn't been done already. I would be interested in discussing the possible approaches to filling in the partitioning and statistics info...

However, just to check that I understand the scenario correctly: if today I wanted to bootstrap 2 different systems, Hudi and Iceberg, with existing Parquet files, couldn't I use the native capabilities of either system for a 1st initial import, and then use the current XTable to generate the metafiles for the remaining system?
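
On the partitioning info raised above, one possible starting point (an assumption, not something decided in this thread) is to infer partition values from Hive-style `key=value` directory names in each file path. Column statistics would be a separate step, since Parquet keeps min/max and null counts in its file footers. The helper name `parse_hive_partitions` below is hypothetical, and the sketch only covers tables that follow the Hive directory convention.

```python
import posixpath
from typing import Optional
from urllib.parse import unquote

# Hive's placeholder directory value for a null partition value.
HIVE_NULL = "__HIVE_DEFAULT_PARTITION__"


def parse_hive_partitions(table_root: str, file_path: str) -> dict[str, Optional[str]]:
    """Extract partition values from a Hive-style path.

    Values are returned as strings (or None for Hive's null placeholder);
    mapping them to typed columns would need the table schema.
    """
    root = table_root.rstrip("/") + "/"
    if not file_path.startswith(root):
        raise ValueError(f"{file_path} is not under {table_root}")
    relative_dir = posixpath.dirname(file_path[len(root):])
    values: dict[str, Optional[str]] = {}
    for segment in filter(None, relative_dir.split("/")):
        key, sep, raw = segment.partition("=")
        if not sep or not key:
            raise ValueError(f"not a key=value partition directory: {segment!r}")
        # Hive percent-encodes special characters in partition values.
        values[key] = None if raw == HIVE_NULL else unquote(raw)
    return values
```

For example, `parse_hive_partitions("s3://b/t", "s3://b/t/year=2024/month=04/part-0.parquet")` yields `year` and `month` as strings, while a file directly under the table root yields an empty mapping, which is how this sketch treats an unpartitioned table.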

@the-other-tim-brown
Contributor

@the-other-tim-brown I'm trying to find a good first issue to ramp up on XTable. Can I take a look at this one? Perhaps we can split it into different issues. One initial task could be to add support for the Parquet input data format, for example. I'm not sure what the code looks like, but ultimately we can create something modular enough to extend to Avro or other formats later, if it hasn't been done already. I would be interested in discussing the possible approaches to filling in the partitioning and statistics info...

I think it makes sense to start with just one of the file formats like Parquet. We can discuss how to get the info you would need.

However, just to check that I understand the scenario correctly: if today I wanted to bootstrap 2 different systems, Hudi and Iceberg, with existing Parquet files, couldn't I use the native capabilities of either system for a 1st initial import, and then use the current XTable to generate the metafiles for the remaining system?

Yes you could do that as well.

There is another issue I had my eye on that I could guide you through as well if you are interested: #411

@marqub

marqub commented May 2, 2024

I think it makes sense to start with just one of the file formats like Parquet. We can discuss how to get the info you would need.

However, just to check that I understand the scenario correctly: if today I wanted to bootstrap 2 different systems, Hudi and Iceberg, with existing Parquet files, couldn't I use the native capabilities of either system for a 1st initial import, and then use the current XTable to generate the metafiles for the remaining system?

Yes you could do that as well.

Ok, if you agree that we want to move away from this workaround approach, then I think supporting Parquet is a good first issue for me to smooth the learning curve.

There is another issue I had my eye on that I could guide you through as well if you are interested: #411

ok, this one could be a good next step, but for now, I prefer to limit the amount of novelty.

I should have some time to start on the parquet issue next week.
How do you prefer to communicate? Is there a slack channel?

@the-other-tim-brown
Contributor

@marqub we do not have a Slack set up for the project yet. I can shoot you an email to connect and discuss any of the details in the meantime.

@Reactor11

Hi, is someone working on this? I am new to this project and would like to get started.

@the-other-tim-brown
Contributor

Hi, is someone working on this? I am new to this project and would like to get started.

@Reactor11 there is a similar effort for a parquet file source that is being worked on: #553
