Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a "Generate" relation #745

Open
Blizzara opened this issue Nov 18, 2024 · 4 comments
Open

Add a "Generate" relation #745

Blizzara opened this issue Nov 18, 2024 · 4 comments

Comments

@Blizzara
Copy link
Contributor

Blizzara commented Nov 18, 2024

We'd need something that can support Explode/Unnest, ie. taking a row and generating multiple rows based on it, for example by splitting an array column into one element per row.

This is separate from Expand, at least in that in "Generate", each input row can produce a different number of output rows, including 0.

Spark calls the relation GenerateExec. DataFusion has implemented the array-unnesting with LogicalPlan::Unnest. "Generate" sounds more general, and in fact Spark allows e.g. user-defined generators, while "Unnest" is probably rather a specific case of a generator.

Should we add a GenerateRel?

As @EpsilonPrime pointed out, Gluten has one in their fork of Substrait, we could probably use the same:

message GenerateRel {
  RelCommon common = 1;
  Rel input = 2;
  
  Expression generator = 3;
  repeated Expression child_output = 4;
  bool outer = 5;

  substrait.extensions.AdvancedExtension advanced_extension = 10;
}

Expression generator is the function that takes in a row and produces multiple rows, in Spark that could be e.g. explode or explode_outer, in DF unnest. I guess it'd be basically always a ScalarFunction? Or a new type of a function?

bool outer indicates whether a row should be produced for the cases where generator produces an empty set of rows, or not.

Not sure what the child_output is for here yet 😅

An alternative would be to just include the generator functions in Project clauses. This would be somewhat analogous to having WindowFunctionInvocations in a Project. The producer and consumer would likely need to map from some special relation (e.g. Spark's Generate) into a Project, and then back to another special relation (e.g. DF's Unnest) from the Substrait Project. It seems to me that this "should work", but whether it's the right thing to do or not is a different question. It also doesn't allow for specifying e.g. "bool outer", but maybe that can be handled through the function invocations. Or maybe there should be a GeneratorFunctionInvokation that's one option for an Expression and can be included in a ProjectRel?

Ref https://substrait.slack.com/archives/C02D7CTQXHD/p1731956935857829

@jacques-n
Copy link
Contributor

I wonder whether this should simply be a table function relation and table function definitions.

Definitely think it is unrelated to project and the window example is not a good comparison. (People think of relations as record level operations but they are set level operations. A window function may have visibility over the entire set but it doesn't change input cardinality, same as any other scalar expression.)

@jacques-n
Copy link
Contributor

Doing more research it looks like the following is true:

  • spark has generate relation and generators. Generate is very much akin to a table function relation. Generators are table functions. Unnest is called explode. One of the generators is HiveUDTF (supporting the fact they are the same kind of thing).
  • DuckDB has unnest specifically. Not clear table functions as yet.

@Ulimo
Copy link

Ulimo commented Dec 2, 2024

Hi, I implemented unnest in flowtide that uses substrait, there as reference I created this custom relation:

message TableFunctionRelation {
    // Table function to use
    TableFunction table_function = 1;
    

    // Only required if used with an input.
    // Only left and inner join supported at this time.
    JoinType type = 6;

    // Optional, but only usable with an input.
    substrait.Expression join_condition = 4;

    enum JoinType {
      JOIN_TYPE_UNSPECIFIED = 0;
      JOIN_TYPE_INNER = 1;
      JOIN_TYPE_LEFT = 3;
    }
}

message TableFunction {
    // Points to a function_anchor defined in this plan, which must refer
    // to a table function in the associated YAML file. Required; 0 is
    // considered to be a valid anchor/reference.
    uint32 function_reference = 1;

    // Schema of the output table from the function
    substrait.NamedStruct table_schema = 2;

    // Arguments for the table function
    repeated substrait.FunctionArgument arguments = 7;
}

If it has an input as in:

SELECT * from table t
JOIN unnest(t.c1) element_value

It uses ExtensionSingleRel.

If it is:

SELECT * FROM unnest(list(1,2,3))

it uses ExtensionLeafRel.

Not sure if its the best implementation, but I wanted to post it as reference for this discussion.

@drin
Copy link
Member

drin commented Dec 6, 2024

Hello! I just found this discussion while trying to see if there is any general TableFunction design, documentation, or implementation yet available (or at least in progress).

I am going to hack something together in the short term, but wanted to share some preliminary thoughts here. I'll open a separate issue in which I will be more detailed about my approach and try to publicly share my iterations.

I think table functions should be able to go anywhere any other relational operator can go in the plan. E.g. in duckdb reading data from an Arrow IPC file can be done in a TableFunction that fills the role of a ReadRel operator (scan_arrow_ipc documentation and registration of scan_arrow_ipc table function). Expanding on this a bit more, I think the various extension rels are a great place to start and I am going to explore the following design:

  • A generic (maybe even "typical") table function as an ExtensionSingleRel.
  • A table function in the spirit of scan_arrow_ipc as an ExtensionLeafRel
  • A table function in the spirit of fused operators (e.g. "GroupJoin" in Accelerating Queries with... Join by GroupJoin) as an ExtensionMultiRel

Looking at @Ulimo 's approach above, I think there are definitely a variety of things to design and agree upon before anything gets upstreamed to substrait, but hopefully some quick-ish hacking will highlight some do's and don'ts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants