-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement predicate pruning for not like expressions #14567
Conversation
FYI @adriangb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this should also work with no wildcards col not like 'foo'
? Or do we optimize like expressions without wildcards into an equality check? If we don't already do that optimization we probably should 😄
#[tokio::test] | ||
async fn test_utf8_not_like_ecsape() { | ||
Utf8Test::new(|value| col("a").not_like(lit(format!("\\%{}%", value)))) | ||
.run() | ||
.await; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love these fuzz tests! These 3 lines of code give me great confidence that this PR does the right thing 😄
Operator::NotLikeMatch => { | ||
build_not_like_match(expr_builder).ok_or_else(|| { | ||
plan_datafusion_err!( | ||
"The NOT LIKE expression with wildcards is only supported at the end of the pattern" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a general note, not necessarily for this PR: I do think we should somehow make these errors (also for the existing LIKE case) a bit better if there is any other reason they might be rejected? Or if this is the only reason, a more generic error might be better? I realize these errors likely never surface to users but I think it'd be unfortunate to see an error like this one when the actual reason it wasn't supported has nothing to do with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Now it return Result<Arc<dyn PhysicalExpr>>
// Example: For pattern "foo%bar", the row group might include values like | ||
// ["foobar", "food", "foodbar"], making it unsafe to prune. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expand upon the examples here? My naive gut instinct (which I think is wrong, I tried before 😓) is that it should be able to truncate foo%bar
to foo%
. Maybe an example that can show what would go wrong if you tried that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment to explain why it goes wrong. Please take another look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense thanks. @findepi do you know if Trino handles this and if so does it do it in similar way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trino currently does not generate a pruning predicate for NOT LIKE
https://github.com/trinodb/trino/blob/232916b75d415a5eb643cf922492eb8513d99aae/core/trino-main/src/main/java/io/trino/sql/planner/DomainTranslator.java#L1016-L1018
The logic is easy to write for <constant-prefix>%
pattern case. And perhaps infeasible for any other pattern class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Pitor.
It seems to me then that it is sensible to move forward with this. The generated predicates are an internal implementation detail so if we want to change this or add support for more complex cases in the future we can do so. There is little downside to supporting only the <constant-prefix>%
pattern case for now.
|
} | ||
} | ||
|
||
if chars.last() == Some(&'_') && (chars.len() > 1 && chars[chars.len() - 2] != '\\') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handling of constant LIKE patterns (no wildcards or escaped wildcards) here only complicates the code. This should be handled separately.
Here we should be dealing with constant-prefix followed by single unescaped %
.
It would be great if the code's told this intention more directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for example we could have a function
/// Returns unescaped constant prefix of a LIKE pattern (possibly empty) and the remaining pattern (possibly empty)
fn split_constant_prefix<'a>(like_pattern: &'a str) -> (String /* or Cow */, &'a str) {
todo!()
}
then the logic would be
let (constant_prefix, rest_of_the_pattern) = split_constant_prefix(pattern);
if &constant_prefix == "" || (rest_of_the_pattern != "" && rest_of_the_pattern != "%") [
give up
} else {
produce replacement
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that handling constant LIKE expressions should only be done at the optimizer level and it's safe for us to just not support it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it only handle constant-prefix%
. Adding split_constant_prefix
b61c35c
to
78dce9e
Compare
@@ -1710,6 +1711,76 @@ fn build_like_match( | |||
Some(combined) | |||
} | |||
|
|||
// For predicate `col NOT LIKE 'const_prefix%'`, we rewrite it as `(col_min NOT LIKE 'const_prefix%' OR col_max NOT LIKE 'const_prefix%')`. If both col_min and col_max have the prefix const_prefix, we skip the entire row group (as we can be certain that all data in this row group has the prefix const_prefix). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is entirely not true. Please update to reflect actual logic
fn build_not_like_match( | ||
expr_builder: &mut PruningExpressionBuilder<'_>, | ||
) -> Result<Arc<dyn PhysicalExpr>> { | ||
// col NOT LIKE 'const_prefix%' -> !(col_min LIKE 'const_prefix%' && col_max LIKE 'const_prefix%') -> (col_min NOT LIKE 'const_prefix%' || col_max NOT LIKE 'const_prefix%') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not true either
the truth is (or should be):
col NOT LIKE 'const_prefix%' ->
col_max < 'const_prefix' OR col_min >= 'const_prefiy'
(notice the last character of the rightmost constant is different)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not true either
Why? Does it prune row groups that contain data that does not match the pattern? Or is it just inefficient?
First, let's clarify one point: if a row group contains any data that does not match the pattern, then this row group must not be pruned. The expected behavior is to return all data that does not match. If the row group gets pruned, data loss will occur (see this PR).
For `col NOT LIKE 'const_prefix%'`:
It applies the condition:
`col_max < 'const_prefix' OR col_min >= 'const_prefiy'`
(Note that the last character of the rightmost constant is different.)
I think this mistakenly prunes row groups that contain data not matching the pattern.
Consider this case: col NOT LIKE 'const_prefix%'
. The row group might include values such as ["aaa", "b", "const_prefix"]
. The condition col_max < 'const_prefix' OR col_min >= 'const_prefiy'
evaluates to false
, causing the row group to be pruned. However, this is incorrect because "aaa"
does not match the pattern and should be included in the result.
// we can not handle `%` at the beginning or in the middle of the pattern | ||
// Example: For pattern "foo%bar", the row group might include values like | ||
// ["foobar", "food", "foodbar"], making it unsafe to prune. | ||
// Even if the min/max values in the group (e.g., "foobar" and "foodbar") | ||
// match the pattern, intermediate values like "food" may not | ||
// match the full pattern "foo%bar", making pruning unsafe. | ||
// (truncate foo%bar to foo% have same problem) | ||
|
||
// we can not handle pattern containing `_` | ||
// Example: For pattern "foo_", row groups might contain ["fooa", "fooaa", "foob"], | ||
// which means not every row is guaranteed to match the pattern. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good explanation for the code we didn't write (or we did, but it was deleted).
This can be written more teresly as
// we can not handle `%` at the beginning or in the middle of the pattern | |
// Example: For pattern "foo%bar", the row group might include values like | |
// ["foobar", "food", "foodbar"], making it unsafe to prune. | |
// Even if the min/max values in the group (e.g., "foobar" and "foodbar") | |
// match the pattern, intermediate values like "food" may not | |
// match the full pattern "foo%bar", making pruning unsafe. | |
// (truncate foo%bar to foo% have same problem) | |
// we can not handle pattern containing `_` | |
// Example: For pattern "foo_", row groups might contain ["fooa", "fooaa", "foob"], | |
// which means not every row is guaranteed to match the pattern. | |
// We handle `NOT LIKE '<constant-prefix>%'` case only, because only this case | |
// can be converted to range predicates that can be then compared with row group min/max values. |
let min_col_not_like_epxr = Arc::new(phys_expr::LikeExpr::new( | ||
true, | ||
false, | ||
Arc::clone(&min_column_expr), | ||
Arc::clone(scalar_expr), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be
col_min >= 'const_prefiy'
(and renamed)
let max_col_not_like_expr = Arc::new(phys_expr::LikeExpr::new( | ||
true, | ||
false, | ||
Arc::clone(&max_column_expr), | ||
Arc::clone(scalar_expr), | ||
)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be
col_max < 'const_prefix'
This is on my list of PRs to review tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @UBarney @findepi and @adriangb
I am not sure that the transformation in this PR is correct. Here is a counter example.
Given the data
col |
---|
aa |
ab |
ac |
And a predicate like col NOT LIKE 'ac%'
which matches the first two rows
> select x, x NOT LIKE 'ac%' from foo;
+----+----------------------------+
| x | foo.x NOT LIKE Utf8("ac%") |
+----+----------------------------+
| aa | true |
| ab | true |
| ac | false |
+----+----------------------------+
3 row(s) fetched.
Elapsed 0.001 seconds.
This PR would rewrite the predicate like
col NOT LIKE 'ac%'
-- PruningPredicate rewrite -->
(col_min NOT LIKE 'ac%' OR col_min NOT LIKE 'ac%')
-- Substitution -->
('aa' NOT LIKE 'ac%' OR 'ac' NOT LIKE 'ac%')
-- Evaluation -->
(true OR false)
-- Evaluation -->
true (prune container)
However, the values aa
and ab
both match the predicate so this should not be pruned.
Maybe i am missing something 🤔
@alamb Thanks for such a detailed example and explanation. according to this doc, returning True will KEEP the container.
Here's end to end test (manually😓)
|
yes, I apologize, you are correct 🤦 I am thinking about other potential counter examples. I am also trying to come up with some example where this transformation would not prune a container that a transoformation like the one from @findepi would do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @UBarney . It is so cool to see the pruning logic getting better and better -- this is quite a nice trick.
After some thought in the 🚿 I have convinced myself this transformation is true and understand the intuition.
Thank you for your patience
In terms of @findepi 's suggestion that there may be other equivalent / better rewrites I think we can address that in a follow on PR. I wasn't able to come up with an example where some other expression on the min/max values would be able to prune containers where this one would not.
]; | ||
prune_with_expr(expr, &schema, &statistics, expected_ret); | ||
|
||
let expr = col("s1").not_like(lit("A\\%%")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add some negative tests that verify predicates like NOT LIKE 'foo%bar'
and NOT LIKE 'foo%bar_
(aka non empty additional patterns) can not be used to prune anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add not like A\u{10ffff}%\u{10ffff}_
. I think existing testcase col("s1").not_like(lit("A\u{10ffff}%\u{10ffff}"))
is equivalent to NOT LIKE 'foo%bar'
BTW you can create such files using datafusion via > copy (values ('aa'), ('ab'), ('ac')) to 't.parquet'; |
Co-authored-by: Andrew Lamb <[email protected]>
@alamb Here's an example where @findepi's transformation would prune a container, but this won't. Given data
And a predicate
|
I agree -- the transformation as specifically proposed by @findepi would incorrectly filter out some ranges. I also tried to think of any cases where we could prove that no values could match the predicate based on min/max values but the transformation proposed in this PR would not prune. I could not think of any |
I'll plan to merge this PR later today or tomorrow unless anyone would like additional time to review |
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
For predicate
col NOT LIKE 'const_prefix%'
, we rewrite it as(col_min NOT LIKE 'const_prefix%' OR col_max NOT LIKE 'const_prefix%')
. If bothcol_min
andcol_max
have the prefixfoo
, we skip the entire row group (as we can be certain that all data in this row group has the prefixfoo
).Are these changes tested?
Yes
Are there any user-facing changes?
No