Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement predicate pruning for not like expressions #14567

Merged
merged 4 commits into from
Feb 15, 2025
Merged

Conversation

UBarney
Copy link
Contributor

@UBarney UBarney commented Feb 10, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

For predicate col NOT LIKE 'const_prefix%', we rewrite it as (col_min NOT LIKE 'const_prefix%' OR col_max NOT LIKE 'const_prefix%'). If both col_min and col_max have the prefix foo, we skip the entire row group (as we can be certain that all data in this row group has the prefix foo).

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate labels Feb 10, 2025
@UBarney UBarney marked this pull request as ready for review February 10, 2025 04:48
@alamb
Copy link
Contributor

alamb commented Feb 10, 2025

FYI @adriangb

Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this should also work with no wildcards col not like 'foo'? Or do we optimize like expressions without wildcards into an equality check? If we don't already do that optimization we probably should 😄

Comment on lines +113 to +118
#[tokio::test]
async fn test_utf8_not_like_ecsape() {
Utf8Test::new(|value| col("a").not_like(lit(format!("\\%{}%", value))))
.run()
.await;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love these fuzz tests! These 3 lines of code give me great confidence that this PR does the right thing 😄

Operator::NotLikeMatch => {
build_not_like_match(expr_builder).ok_or_else(|| {
plan_datafusion_err!(
"The NOT LIKE expression with wildcards is only supported at the end of the pattern"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a general note, not necessarily for this PR: I do think we should somehow make these errors (also for the existing LIKE case) a bit better if there is any other reason they might be rejected? Or if this is the only reason, a more generic error might be better? I realize these errors likely never surface to users but I think it'd be unfortunate to see an error like this one when the actual reason it wasn't supported has nothing to do with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Now it return Result<Arc<dyn PhysicalExpr>>

Comment on lines 1737 to 1734
// Example: For pattern "foo%bar", the row group might include values like
// ["foobar", "food", "foodbar"], making it unsafe to prune.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand upon the examples here? My naive gut instinct (which I think is wrong, I tried before 😓) is that it should be able to truncate foo%bar to foo%. Maybe an example that can show what would go wrong if you tried that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to explain why it goes wrong. Please take another look.

Copy link
Contributor

@adriangb adriangb Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense thanks. @findepi do you know if Trino handles this and if so does it do it in similar way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trino currently does not generate a pruning predicate for NOT LIKE
https://github.com/trinodb/trino/blob/232916b75d415a5eb643cf922492eb8513d99aae/core/trino-main/src/main/java/io/trino/sql/planner/DomainTranslator.java#L1016-L1018

The logic is easy to write for <constant-prefix>% pattern case. And perhaps infeasible for any other pattern class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Pitor.

It seems to me then that it is sensible to move forward with this. The generated predicates are an internal implementation detail so if we want to change this or add support for more complex cases in the future we can do so. There is little downside to supporting only the <constant-prefix>% pattern case for now.

@UBarney UBarney marked this pull request as draft February 11, 2025 01:54
@UBarney UBarney marked this pull request as ready for review February 11, 2025 05:40
@UBarney
Copy link
Contributor Author

UBarney commented Feb 11, 2025

I assume this should also work with no wildcards col not like 'foo'?

Yes. add some test to demonstrate it
There's also an optimization to rewrite col not like 'foo' as col != 'foo'

> explain select * from values ('foo'), ('bar') where column1 not like 'foo';
+---------------+-------------------------------------------------------+
| plan_type     | plan                                                  |
+---------------+-------------------------------------------------------+
| logical_plan  | Filter: column1 != Utf8("foo")                        |
|               |   Values: (Utf8("foo")), (Utf8("bar"))                |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192           |
|               |   FilterExec: column1@0 != foo                        |
|               |     DataSourceExec: partitions=1, partition_sizes=[1] |
|               |                                                       |
+---------------+-------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.006 seconds.

@adriangb

@UBarney UBarney requested a review from adriangb February 11, 2025 05:42
}
}

if chars.last() == Some(&'_') && (chars.len() > 1 && chars[chars.len() - 2] != '\\') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling of constant LIKE patterns (no wildcards or escaped wildcards) here only complicates the code. This should be handled separately.

Here we should be dealing with constant-prefix followed by single unescaped %.
It would be great if the code's told this intention more directly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for example we could have a function

/// Returns unescaped constant prefix of a LIKE pattern (possibly empty) and the remaining pattern (possibly empty)
fn split_constant_prefix<'a>(like_pattern: &'a str) -> (String /* or Cow */, &'a str) {
  todo!()
}

then the logic would be

let (constant_prefix, rest_of_the_pattern) = split_constant_prefix(pattern);
if &constant_prefix == "" || (rest_of_the_pattern != "" && rest_of_the_pattern != "%") [
 give up
} else {
 produce replacement
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that handling constant LIKE expressions should only be done at the optimizer level and it's safe for us to just not support it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it only handle constant-prefix% . Adding split_constant_prefix

@UBarney UBarney marked this pull request as draft February 12, 2025 04:48
@UBarney UBarney force-pushed the prune branch 2 times, most recently from b61c35c to 78dce9e Compare February 12, 2025 11:14
@UBarney UBarney marked this pull request as ready for review February 12, 2025 11:20
@UBarney UBarney marked this pull request as draft February 12, 2025 11:23
@UBarney UBarney marked this pull request as ready for review February 12, 2025 11:52
@UBarney UBarney requested review from findepi and adriangb February 12, 2025 13:16
@@ -1710,6 +1711,76 @@ fn build_like_match(
Some(combined)
}

// For predicate `col NOT LIKE 'const_prefix%'`, we rewrite it as `(col_min NOT LIKE 'const_prefix%' OR col_max NOT LIKE 'const_prefix%')`. If both col_min and col_max have the prefix const_prefix, we skip the entire row group (as we can be certain that all data in this row group has the prefix const_prefix).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is entirely not true. Please update to reflect actual logic

fn build_not_like_match(
expr_builder: &mut PruningExpressionBuilder<'_>,
) -> Result<Arc<dyn PhysicalExpr>> {
// col NOT LIKE 'const_prefix%' -> !(col_min LIKE 'const_prefix%' && col_max LIKE 'const_prefix%') -> (col_min NOT LIKE 'const_prefix%' || col_max NOT LIKE 'const_prefix%')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not true either

the truth is (or should be):

col NOT LIKE 'const_prefix%' ->
   col_max < 'const_prefix' OR col_min >= 'const_prefiy'
   (notice the last character of the rightmost constant is different)

Copy link
Contributor Author

@UBarney UBarney Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not true either

Why? Does it prune row groups that contain data that does not match the pattern? Or is it just inefficient?

First, let's clarify one point: if a row group contains any data that does not match the pattern, then this row group must not be pruned. The expected behavior is to return all data that does not match. If the row group gets pruned, data loss will occur (see this PR).

For `col NOT LIKE 'const_prefix%'`:  
It applies the condition:  
`col_max < 'const_prefix' OR col_min >= 'const_prefiy'`  
(Note that the last character of the rightmost constant is different.)  

I think this mistakenly prunes row groups that contain data not matching the pattern.

Consider this case: col NOT LIKE 'const_prefix%'. The row group might include values such as ["aaa", "b", "const_prefix"]. The condition col_max < 'const_prefix' OR col_min >= 'const_prefiy' evaluates to false, causing the row group to be pruned. However, this is incorrect because "aaa" does not match the pattern and should be included in the result.

Comment on lines +1731 to +1741
// we can not handle `%` at the beginning or in the middle of the pattern
// Example: For pattern "foo%bar", the row group might include values like
// ["foobar", "food", "foodbar"], making it unsafe to prune.
// Even if the min/max values in the group (e.g., "foobar" and "foodbar")
// match the pattern, intermediate values like "food" may not
// match the full pattern "foo%bar", making pruning unsafe.
// (truncate foo%bar to foo% have same problem)

// we can not handle pattern containing `_`
// Example: For pattern "foo_", row groups might contain ["fooa", "fooaa", "foob"],
// which means not every row is guaranteed to match the pattern.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good explanation for the code we didn't write (or we did, but it was deleted).
This can be written more teresly as

Suggested change
// we can not handle `%` at the beginning or in the middle of the pattern
// Example: For pattern "foo%bar", the row group might include values like
// ["foobar", "food", "foodbar"], making it unsafe to prune.
// Even if the min/max values in the group (e.g., "foobar" and "foodbar")
// match the pattern, intermediate values like "food" may not
// match the full pattern "foo%bar", making pruning unsafe.
// (truncate foo%bar to foo% have same problem)
// we can not handle pattern containing `_`
// Example: For pattern "foo_", row groups might contain ["fooa", "fooaa", "foob"],
// which means not every row is guaranteed to match the pattern.
// We handle `NOT LIKE '<constant-prefix>%'` case only, because only this case
// can be converted to range predicates that can be then compared with row group min/max values.

Comment on lines +1747 to +1751
let min_col_not_like_epxr = Arc::new(phys_expr::LikeExpr::new(
true,
false,
Arc::clone(&min_column_expr),
Arc::clone(scalar_expr),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be

col_min >= 'const_prefiy'

(and renamed)

Comment on lines +1754 to +1759
let max_col_not_like_expr = Arc::new(phys_expr::LikeExpr::new(
true,
false,
Arc::clone(&max_column_expr),
Arc::clone(scalar_expr),
));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be

col_max < 'const_prefix'

@alamb
Copy link
Contributor

alamb commented Feb 12, 2025

This is on my list of PRs to review tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @UBarney @findepi and @adriangb

I am not sure that the transformation in this PR is correct. Here is a counter example.

Given the data

col
aa
ab
ac

And a predicate like col NOT LIKE 'ac%' which matches the first two rows

> select x, x NOT LIKE 'ac%' from foo;
+----+----------------------------+
| x  | foo.x NOT LIKE Utf8("ac%") |
+----+----------------------------+
| aa | true                       |
| ab | true                       |
| ac | false                      |
+----+----------------------------+
3 row(s) fetched.
Elapsed 0.001 seconds.

This PR would rewrite the predicate like

col NOT LIKE 'ac%'

-- PruningPredicate rewrite -->

(col_min NOT LIKE 'ac%' OR  col_min NOT LIKE 'ac%') 

-- Substitution -->

('aa' NOT LIKE 'ac%' OR 'ac' NOT LIKE 'ac%') 

-- Evaluation -->

(true OR false) 

-- Evaluation -->

true (prune container)

However, the values aa and ab both match the predicate so this should not be pruned.

Maybe i am missing something 🤔

@UBarney
Copy link
Contributor Author

UBarney commented Feb 13, 2025

@alamb Thanks for such a detailed example and explanation.
I think this transformation is correct in this case.

according to this doc, returning True will KEEP the container.

When the min/max values are actually substituted in to this expression and evaluated, the result means
true: there MAY be rows that pass the predicate, KEEPS the container

Here's end to end test (manually😓)

> explain select * from 't.parquet' where c1 not like 'ac%';

+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                              |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: t.parquet.c1 NOT LIKE Utf8View("ac%")                                                                                                                                                                                                                                                     |
|               |   TableScan: t.parquet projection=[c1], partial_filters=[t.parquet.c1 NOT LIKE Utf8View("ac%")]                                                                                                                                                                                                   |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                       |
|               |   FilterExec: c1@0 NOT LIKE ac%                                                                                                                                                                                                                                                                   |
|               |     RepartitionExec: partitioning=RoundRobinBatch(24), input_partitions=1                                                                                                                                                                                                                         |
|               |       DataSourceExec: file_groups={1 group: [[home/lv/code/datafusion/datafusion-cli/t.parquet]]}, projection=[c1], file_type=parquet, predicate=c1@0 NOT LIKE ac%, pruning_predicate=c1_null_count@2 != row_count@3 AND (c1_min@0 NOT LIKE ac% OR c1_max@1 NOT LIKE ac%), required_guarantees=[] |
|               |                                                                                                                                                                                                                                                                                                   |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.012 seconds.
> select * from 't.parquet' where c1 not like 'ac%';
+----+
| c1 |
+----+
| aa |
| ab |
+----+
2 row(s) fetched. 
Elapsed 0.013 seconds.

t.parquet in is zip file (github doesn't allow upload parquet file 😓)

@alamb
Copy link
Contributor

alamb commented Feb 13, 2025

@alamb Thanks for such a detailed example and explanation. I think this transformation is correct in this case.

according to this doc, returning True will KEEP the container.

When the min/max values are actually substituted in to this expression and evaluated, the result means
true: there MAY be rows that pass the predicate, KEEPS the container

yes, I apologize, you are correct 🤦

I am thinking about other potential counter examples. I am also trying to come up with some example where this transformation would not prune a container that a transoformation like the one from @findepi would do

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @UBarney . It is so cool to see the pruning logic getting better and better -- this is quite a nice trick.

After some thought in the 🚿 I have convinced myself this transformation is true and understand the intuition.

Thank you for your patience

In terms of @findepi 's suggestion that there may be other equivalent / better rewrites I think we can address that in a follow on PR. I wasn't able to come up with an example where some other expression on the min/max values would be able to prune containers where this one would not.

datafusion/physical-optimizer/src/pruning.rs Outdated Show resolved Hide resolved
];
prune_with_expr(expr, &schema, &statistics, expected_ret);

let expr = col("s1").not_like(lit("A\\%%"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add some negative tests that verify predicates like NOT LIKE 'foo%bar' and NOT LIKE 'foo%bar_ (aka non empty additional patterns) can not be used to prune anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add not like A\u{10ffff}%\u{10ffff}_. I think existing testcase col("s1").not_like(lit("A\u{10ffff}%\u{10ffff}")) is equivalent to NOT LIKE 'foo%bar'

@alamb
Copy link
Contributor

alamb commented Feb 13, 2025

t.parquet in is zip file (github doesn't allow upload parquet file 😓)

BTW you can create such files using datafusion via

> copy (values ('aa'), ('ab'), ('ac')) to 't.parquet';

@UBarney
Copy link
Contributor Author

UBarney commented Feb 14, 2025

@alamb Here's an example where @findepi's transformation would prune a container, but this won't.
However, all values match the predicate so this should not be pruned.

Given data

col
a
b
z

And a predicate col NOT LIKE 'const_prefix%'

@UBarney's transformation @findepi's transformation
PruningPredicate rewrite (col_min NOT LIKE 'const_prefix%' OR  col_min NOT LIKE 'const_prefix%') col_max < 'const_prefix' OR col_min >= 'const_prefiy
Substitution 'a' NOT LIKE 'const_prefix%' OR  'z' NOT LIKE 'const_prefix%' 'z' < 'const_prefix' OR 'a' >= 'const_prefiy'
Evaluation true OR true -> true false OR false -> false
Result  the container was KEPT  the container was PRUNED

@alamb
Copy link
Contributor

alamb commented Feb 14, 2025

However, all values match the predicate so this should not be pruned.

I agree -- the transformation as specifically proposed by @findepi would incorrectly filter out some ranges.

I also tried to think of any cases where we could prove that no values could match the predicate based on min/max values but the transformation proposed in this PR would not prune. I could not think of any

@alamb
Copy link
Contributor

alamb commented Feb 14, 2025

I'll plan to merge this PR later today or tomorrow unless anyone would like additional time to review

@alamb alamb merged commit c6656f2 into apache:main Feb 15, 2025
24 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 15, 2025

Thanks again @UBarney and @adriangb and @findepi

@UBarney UBarney deleted the prune branch February 17, 2025 03:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support predicate pruning for NOT LIKE expressions
4 participants