Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replacing SessionState with Session and progress towards moving FileFormatFactory out of datasource #14517

Closed
wants to merge 7 commits into from

Conversation

logan-keede
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

  1. SessionState needs to be replaced with Session in datasource to split it out of core.
  2. get_file_format_factory needs to be implemented for Session
    pub trait FileFormatFactory: Sync + Send + GetExt + Debug {
  3. for 2, we need to move FileScanConfig to datafusion-catalog-listing(or some other place, but out of core)

What changes are included in this PR?

Refactoring,

  • minor SessionState -> dyn Session + some downcast to be removed later.
  • major move FileScanConfig to datafusion-catalog-listing

Are these changes tested?

Are there any user-facing changes?

There should not be

@logan-keede logan-keede marked this pull request as draft February 5, 2025 18:13
@github-actions github-actions bot added the core Core DataFusion crate label Feb 5, 2025
#[derive(Debug)]
struct FileGroupsDisplay<'a>(&'a [Vec<PartitionedFile>]);

impl DisplayAs for FileGroupsDisplay<'_> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe, keeping a separate file for addition on old file_scan_config would be better?

@logan-keede
Copy link
Contributor Author

@alamb do you think this is correct approach to make progress?
PS: I am working on fixing CI

@logan-keede logan-keede marked this pull request as ready for review February 5, 2025 19:29
Copy link
Member

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just a first pass super quick skim of this and I'm not seeing anything crazy. I'll have to read through again while I have time to think harder and double check the places where code is moving before I give a +1 though,.

@@ -26,13 +26,13 @@

use std::sync::Arc;

use crate::datasource::listing::PartitionedFile;
use crate::PartitionedFile;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great

@@ -28,6 +28,7 @@ rust-version.workspace = true
version.workspace = true

[dependencies]
apache-avro = { version = "0.17", optional = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is unfortunate -- is there any way we can avoid the (direct) avro dependency on the listing table? If not it is fine for this PR I was just wondering

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I think this OR had major conflicts with

@logan-keede is there any chance you are willing to take a look at the new structure and rework this PR?

@logan-keede
Copy link
Contributor Author

Unfortunately, I think this OR had major conflicts with

@logan-keede is there any chance you are willing to take a look at the new structure and rework this PR?

Sure, I was kinda expecting this.

@logan-keede
Copy link
Contributor Author

logan-keede commented Feb 6, 2025

https://github.com/apache/datafusion/blame/bab0f54daa99830339c0000a19c1b4e3489278ad/datafusion/core/src/datasource/physical_plan/file_scan_config.rs#L605

this downcast introduced in #14224, probably needs to be moved/removed(for this issue), but I am not sure where/how, or can we make do without it(like before)?

@mertak-synnada do you have any Idea?

@logan-keede
Copy link
Contributor Author

One solution, I can think of right now, is to have multiple SpecificFileScanConfig structs and make current struct GenericFileScanConfig a trait and then add specific impl, among other thing that should add some flexibility for future work or user specific implementation. (if it works like I think, I am new to rust's idea of polymorphism)

@mertak-synnada
Copy link
Contributor

mertak-synnada commented Feb 7, 2025

this downcast introduced in #14224, probably needs to be moved/removed(for this issue), but I am not sure where/how, or can we make do without it(like before)?

Sorry I'm pretty new to this issue's context, Is it a problem because of dependency? If so, with an ugly solution, we can check the self.source.file_type() == "avro" here. Or we may create, another level of abstraction and ask that API the file type. In the early implementation of #14224, file_type() was returning a dyn FileType trait but since it was only being used in to_str and these kind type checkings we removed it.

Edit: On a second thought, calling self.source.supports_repartition() here and defining in FileSource

    fn supports_repartition(&self) -> bool {
        true
    }

And overriding this in AvroSource is a better solution

@logan-keede
Copy link
Contributor Author

I figured it would be better to start anew, so #14543

@logan-keede logan-keede closed this Feb 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants