Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: move DataSource to datafusion-datasource #14671

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

logan-keede
Copy link
Contributor

@logan-keede logan-keede commented Feb 14, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

refactor of DataSource

Are these changes tested?

not completely, only cargo check work for now

Are there any user-facing changes?

yes, but only if they are building from source.

@logan-keede logan-keede marked this pull request as draft February 14, 2025 18:27
@github-actions github-actions bot added physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate substrait proto Related to proto crate labels Feb 14, 2025
@alamb
Copy link
Contributor

alamb commented Feb 15, 2025

So exciting...

@logan-keede
Copy link
Contributor Author

logan-keede commented Feb 15, 2025

@alamb any Idea on what to do with unit tests in physical-plan? I cant think of anything except pulling functions that depend on the component that I moved, to core_integration tests or maybe to datasource crate. is there a better method?

@alamb
Copy link
Contributor

alamb commented Feb 15, 2025

@alamb any Idea on what to do with unit tests in physical-plan? I cant think of anything except pulling functions that depend on the component that I moved, to core_integration tests or maybe to datasource crate. is there a better method?

I looked at the errors, and most of them appear to be related to the tests in physical_plan using MemoryExec (well now DataSource with a MemoryExec).

Some other ideas:

  1. Add a dev-dependency between datafusion-physical-plananddatafusion-datasource`
  2. (better) Add a mock ExecutionPlan only for testing in physical_plan -- this would end up with some amount of duplicate code with DataSourceExec but I think that would be ok for testing.
  3. Move the tests, as you say, into tests in core_integration or something

@alamb
Copy link
Contributor

alamb commented Feb 15, 2025

BTW thank you for working on this -- I suggest we try the "make a mock ExecutionPlan" approach -- that would be the cleanest I think and would be a good way for you to learn more about how ExecutionPlans worked

I can try and find time to sketch it out if needed

@logan-keede
Copy link
Contributor Author

BTW thank you for working on this -- I suggest we try the "make a mock ExecutionPlan" approach -- that would be the cleanest I think and would be a good way for you to learn more about how ExecutionPlans worked

Thanks for the suggestions. I will try to make MockExecutionPlan.

@logan-keede
Copy link
Contributor Author

@alamb, I have implemented a MockMemorySourceConfig and MockDataSource which is essentially same as what was previously in physical_plan but now for tests only. You can find the implementation here. The tests are passing on my PC, although there are some warnings, so cleaning up that part is still pending.

I think this is different from what you were suggesting. It would be pretty helpful if you can point me in the right direction.

Thanks,
Logan

@alamb
Copy link
Contributor

alamb commented Feb 16, 2025

@alamb, I have implemented a MockMemorySourceConfig and MockDataSource which is essentially same as what was previously in physical_plan but now for tests only. You can find the implementation here. The tests are passing on my PC, although there are some warnings, so cleaning up that part is still pending.

I think this is different from what you were suggesting. It would be pretty helpful if you can point me in the right direction.

Thanks, Logan

I think this sounds reasonable -- I was just saying that the split between MockMemorySourceConfig and MockDataSource is likely not necessary. There is a split in the current DataSourceConfig and DataSourceExec to permit sharing between difefrent data sources

@logan-keede
Copy link
Contributor Author

I think this sounds reasonable -- I was just saying that the split between MockMemorySourceConfig and MockDataSource is likely not necessary.

It does look that way, I will try to make it one Mock struct.

@logan-keede logan-keede changed the title WIP: move DataSource to datafusion-datasource refactor: move DataSource to datafusion-datasource Feb 18, 2025
@logan-keede logan-keede marked this pull request as ready for review February 18, 2025 19:11
@logan-keede
Copy link
Contributor Author

@alamb What do we want to do with benches? (they are causing circular dependency)
except that I think this is ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules physical-expr Physical Expressions proto Related to proto crate substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants