Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

array_agg cannot perform both distinct and order_by #12371

Open
timsaucer opened this issue Sep 7, 2024 · 4 comments · May be fixed by #14413
Open

array_agg cannot perform both distinct and order_by #12371

timsaucer opened this issue Sep 7, 2024 · 4 comments · May be fixed by #14413
Assignees
Labels
bug Something isn't working

Comments

@timsaucer
Copy link
Contributor

timsaucer commented Sep 7, 2024

Describe the bug

Currently we can use array_agg with distinct() or order_by() but not both. Attempting to do so will create an error.

To Reproduce

This minimal code will reproduce the problem:


use arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::functions_aggregate::array_agg::array_agg;
use datafusion::prelude::SessionContext;
use datafusion_expr::{col, ExprFunctionExt};
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<()> {

    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int32, false),
        Field::new("b", DataType::Int32, false),
    ]));

    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3])),
            Arc::new(Int32Array::from(vec![4, 4, 6])),
        ],
    )?;

    let ctx = SessionContext::new();

    ctx.register_batch("t", batch)?;
    let df = ctx.table("t").await?;

    df.aggregate(vec![], vec![
        array_agg(col("b"))
            .order_by(vec![col("a").sort(true, true)])
            .build()?
            .alias("order_by"),
        array_agg(col("b"))
            .distinct()
            .build()?
            .alias("distinct"),
        // array_agg(col("b"))
        //     .order_by(vec![col("a").sort(true, true)])
        //     .distinct()
        //     .build()?
        //     .alias("both"),
    ])?.show().await?;

    Ok(())
}

You can run the code as is and it produces the expected result. If you uncomment the section where we try to do both you get

Error: Internal("expects single batch")

Expected behavior

A user should be able to apply both order_by and distinct in a single operation.

Additional context

In the code in datafusion/functions-aggregate/src/array_agg.rs

We have either DistinctArrayAggAccumulator or OrderSensitiveArrayAggAccumulator but no option for both.

This is related to #8583

@timsaucer timsaucer added the bug Something isn't working label Sep 7, 2024
@Rachelint
Copy link
Contributor

take

@Rachelint
Copy link
Contributor

Plan to impl it following the behavior in duckdb.

D select id, list(distinct foo order by bar) from (   values (1, '10', 2), (1, '10', 1), (1, '15', 1), (2, '10', 1) , (1, '10', 1)) v (id, foo, bar) group by all order by id;
┌───────┬─────────────────────────────────┐
│  id   │ list(DISTINCT foo ORDER BY bar) │
│ int32 │            varchar[]            │
├───────┼─────────────────────────────────┤
│     1 │ [10, 15, 10]                    │
│     2 │ [10]                            │
└───────┴─────────────────────────────────┘
D select id, list(distinct foo order by bar) from (   values (1, '10', 2), (1, '10', 1), (1, '15', 1), (2, '10', 1) , (1, '11', 1)) v (id, foo, bar) group by all order by id;
┌───────┬─────────────────────────────────┐
│  id   │ list(DISTINCT foo ORDER BY bar) │
│ int32 │            varchar[]            │
├───────┼─────────────────────────────────┤
│     1 │ [10, 15, 11, 10]                │
│     2 │ [10]                            │
└───────┴─────────────────────────────────┘

@gabotechs gabotechs linked a pull request Feb 2, 2025 that will close this issue
@gabotechs
Copy link
Contributor

gabotechs commented Feb 2, 2025

Hi @Rachelint 👋, not sure if you were still looking at this, I need support for DISTINCT + ORDER BY in ARRAY_AGG as a prerequisite for #14412, so I ended up implementing it myself in #14413. If you are still working on this please let me know.

@Rachelint
Copy link
Contributor

Hi @Rachelint 👋, not sure if you were still looking at this, I need support for DISTINCT + ORDER BY in ARRAY_AGG as a prerequisite for #14412, so I ended up implementing it myself in #14413. If you are still working on this please let me know.

@gabotechs Oh, sorry, i forget this actually...
Just feel free to take it.

@Rachelint Rachelint assigned gabotechs and unassigned Rachelint Feb 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants