Add a "Generate" relation #745
Comments
I wonder whether this should simply be a table function relation plus table function definitions. I definitely think it is unrelated to project, and the window example is not a good comparison. (People think of relations as record-level operations, but they are set-level operations. A window function may have visibility over the entire set, but it doesn't change input cardinality, same as any other scalar expression.)
Doing more research it looks like the following is true:
Hi, I implemented unnest in Flowtide, which uses Substrait. For reference, I created this custom relation:
If it has an input as in:
It uses ExtensionSingleRel. If it is:
it uses ExtensionLeafRel. I'm not sure it's the best implementation, but I wanted to post it as a reference for this discussion.
Hello! I just found this discussion while trying to see if there is any general I am going to hack something together in the short term, but wanted to share some preliminary thoughts here. I'll open a separate issue in which I will be more detailed about my approach and try to publicly share my iterations. I think table functions should be able to go anywhere any other relational operator can go in the plan. E.g. in DuckDB, reading data from an Arrow IPC file can be done in a
Looking at @Ulimo's approach above, I think there are definitely a variety of things to design and agree upon before anything gets upstreamed to Substrait, but hopefully some quick-ish hacking will highlight some do's and don'ts.
We'd need something that can support Explode/Unnest, i.e. taking a row and generating multiple rows based on it, for example by splitting an array column into one element per row.
This is separate from Expand, at least in that in "Generate", each input row can produce a different number of output rows, including 0.
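To make the cardinality point concrete, here is a minimal sketch in plain Python, not Substrait or Spark code. The names `generate` and `explode_tags` and the sample rows are made up for illustration; the only assumption is that a generator is a function from one input row to zero or more output rows.

```python
from typing import Callable, Iterable, Iterator

Row = dict  # a row is modelled as a column-name -> value mapping


def generate(rows: Iterable[Row],
             generator: Callable[[Row], Iterable[Row]]) -> Iterator[Row]:
    """Emit one output row per row the generator yields for each input row."""
    for row in rows:
        for extra in generator(row):
            # Input columns are carried along next to the generated columns.
            yield {**row, **extra}


def explode_tags(row: Row) -> Iterable[Row]:
    """Hypothetical generator: one output row per element of the 'tags' array."""
    return [{"tag": t} for t in row["tags"]]


rows = [
    {"id": 1, "tags": ["a", "b"]},  # produces 2 rows
    {"id": 2, "tags": []},          # produces 0 rows
    {"id": 3, "tags": ["c"]},       # produces 1 row
]

result = list(generate(rows, explode_tags))
for r in result:
    print(r["id"], r["tag"])
```

Each input row here contributes a different number of output rows (2, 0 and 1), which is the property that distinguishes this from a fixed fan-out.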
Spark calls the relation GenerateExec. DataFusion has implemented the array-unnesting with LogicalPlan::Unnest. "Generate" sounds more general, and in fact Spark allows e.g. user-defined generators, while "Unnest" is probably rather a specific case of a generator.
Should we add a GenerateRel?
As @EpsilonPrime pointed out, Gluten has one in their fork of Substrait, we could probably use the same:
- `Expression generator` is the function that takes in a row and produces multiple rows; in Spark that could be e.g. `explode` or `explode_outer`, in DF `unnest`. I guess it'd be basically always a ScalarFunction? Or a new type of a function?
- `bool outer` indicates whether a row should be produced for the cases where `generator` produces an empty set of rows, or not.
- Not sure what the `child_output` is for here yet 😅

An alternative would be to just include the generator functions in Project clauses. This would be somewhat analogous to having WindowFunctionInvocations in a Project. The producer and consumer would likely need to map from some special relation (e.g. Spark's Generate) into a Project, and then back to another special relation (e.g. DF's Unnest) from the Substrait Project. It seems to me that this "should work", but whether it's the right thing to do or not is a different question. It also doesn't allow for specifying e.g. "bool outer", but maybe that can be handled through the function invocations. Or maybe there should be a GeneratorFunctionInvokation that's one option for an Expression and can be included in a ProjectRel?
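For the "bool outer" flag mentioned above, a rough sketch of the semantics as I understand them, again in plain Python rather than any real Substrait or Spark API. The `generate` function, its `output_names` parameter and the sample data are assumptions for illustration only: when `outer` is set and the generator yields nothing for a row, one row is still emitted, with the generated columns set to null.

```python
from typing import Callable, Iterable, Iterator, Sequence

Row = dict  # a row is modelled as a column-name -> value mapping


def generate(rows: Iterable[Row],
             generator: Callable[[Row], Iterable[Row]],
             output_names: Sequence[str],
             outer: bool = False) -> Iterator[Row]:
    """Apply a generator per row; with outer=True keep rows that generate nothing."""
    for row in rows:
        produced = False
        for extra in generator(row):
            produced = True
            yield {**row, **extra}
        if outer and not produced:
            # Assumed behaviour: pad the generated columns with nulls.
            yield {**row, **{name: None for name in output_names}}


def explode_tags(row: Row) -> Iterable[Row]:
    """Hypothetical generator: one output row per element of the 'tags' array."""
    return [{"tag": t} for t in row["tags"]]


rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},
    {"id": 3, "tags": ["c"]},
]

inner_rows = list(generate(rows, explode_tags, ["tag"]))
outer_rows = list(generate(rows, explode_tags, ["tag"], outer=True))
```

With `outer=False` the row with `id` 2 disappears from the output; with `outer=True` it is kept with a null `tag`. Whether the flag lives on the relation or is folded into the function (as with `explode` vs `explode_outer`) is exactly the open design question here.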
Ref https://substrait.slack.com/archives/C02D7CTQXHD/p1731956935857829