Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always use StringViewArray as output of substr #14498

Merged
merged 5 commits into from
Feb 8, 2025

Conversation

Kev1n8
Copy link
Contributor

@Kev1n8 Kev1n8 commented Feb 5, 2025

Which issue does this PR close?

Rationale for this change

Generate StringViewArray whatever input type is for efficiency.

What changes are included in this PR?

  1. Behavior of substr, return type is always Utf8View.
  2. Unittest updated.
  3. Some changes in datafusion/sqllogictest/test_files/tpch/plans/q22.slt.part.

Are these changes tested?

Yes.

Are there any user-facing changes?

I'm not sure, the return type of substr is fixed to Utf8View now.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions labels Feb 5, 2025
@2010YOUY01
Copy link
Contributor

Thank you, it looks good to me.
Could you resolve the conflict?

@Kev1n8
Copy link
Contributor Author

Kev1n8 commented Feb 7, 2025

Could you resolve the conflict?

Sure. Done.

@alamb
Copy link
Contributor

alamb commented Feb 7, 2025

Thank you @Kev1n8 and @2010YOUY01 -- I am running some benchmarks on this PR to see if it makes any difference

@alamb
Copy link
Contributor

alamb commented Feb 7, 2025

My benchmark results show no change in Q22 (but also clearly the data is quite noisy 🤔 )

--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ substr-always-output-utf8view ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  220.66ms │                      217.19ms │     no change │
│ QQuery 2     │  120.04ms │                      118.99ms │     no change │
│ QQuery 3     │  117.00ms │                      133.73ms │  1.14x slower │
│ QQuery 4     │   82.79ms │                       84.31ms │     no change │
│ QQuery 5     │  161.05ms │                      158.10ms │     no change │
│ QQuery 6     │   43.91ms │                       61.73ms │  1.41x slower │
│ QQuery 7     │  196.49ms │                      208.38ms │  1.06x slower │
│ QQuery 8     │  161.86ms │                      168.99ms │     no change │
│ QQuery 9     │  239.92ms │                      240.70ms │     no change │
│ QQuery 10    │  207.67ms │                      203.80ms │     no change │
│ QQuery 11    │   96.42ms │                       94.76ms │     no change │
│ QQuery 12    │  104.79ms │                      119.54ms │  1.14x slower │
│ QQuery 13    │  203.66ms │                      208.01ms │     no change │
│ QQuery 14    │   71.92ms │                       70.81ms │     no change │
│ QQuery 15    │  118.13ms │                      130.72ms │  1.11x slower │
│ QQuery 16    │   68.90ms │                       72.23ms │     no change │
│ QQuery 17    │  216.46ms │                      213.50ms │     no change │
│ QQuery 18    │  319.42ms │                      322.68ms │     no change │
│ QQuery 19    │  116.85ms │                      115.95ms │     no change │
│ QQuery 20    │  135.26ms │                      123.83ms │ +1.09x faster │
│ QQuery 21    │  279.46ms │                      275.56ms │     no change │
│ QQuery 22    │   68.17ms │                       67.74ms │     no change │
└──────────────┴───────────┴───────────────────────────────┴───────────────┘

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me -- thank you @Kev1n8 and @2010YOUY01

@2010YOUY01
Copy link
Contributor

My benchmark results show no change in Q22 (but also clearly the data is quite noisy 🤔 )

--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ substr-always-output-utf8view ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  220.66ms │                      217.19ms │     no change │
│ QQuery 2     │  120.04ms │                      118.99ms │     no change │
│ QQuery 3     │  117.00ms │                      133.73ms │  1.14x slower │
│ QQuery 4     │   82.79ms │                       84.31ms │     no change │
│ QQuery 5     │  161.05ms │                      158.10ms │     no change │
│ QQuery 6     │   43.91ms │                       61.73ms │  1.41x slower │
│ QQuery 7     │  196.49ms │                      208.38ms │  1.06x slower │
│ QQuery 8     │  161.86ms │                      168.99ms │     no change │
│ QQuery 9     │  239.92ms │                      240.70ms │     no change │
│ QQuery 10    │  207.67ms │                      203.80ms │     no change │
│ QQuery 11    │   96.42ms │                       94.76ms │     no change │
│ QQuery 12    │  104.79ms │                      119.54ms │  1.14x slower │
│ QQuery 13    │  203.66ms │                      208.01ms │     no change │
│ QQuery 14    │   71.92ms │                       70.81ms │     no change │
│ QQuery 15    │  118.13ms │                      130.72ms │  1.11x slower │
│ QQuery 16    │   68.90ms │                       72.23ms │     no change │
│ QQuery 17    │  216.46ms │                      213.50ms │     no change │
│ QQuery 18    │  319.42ms │                      322.68ms │     no change │
│ QQuery 19    │  116.85ms │                      115.95ms │     no change │
│ QQuery 20    │  135.26ms │                      123.83ms │ +1.09x faster │
│ QQuery 21    │  279.46ms │                      275.56ms │     no change │
│ QQuery 22    │   68.17ms │                       67.74ms │     no change │
└──────────────┴───────────┴───────────────────────────────┴───────────────┘

I believe this is expected to have no change:
We already have StringView -> substr() -> StringView
This PR just changed StringArray -> substr() -> StringArray to StringArray -> substr() -> StringView

The benchmark runs on Parquet, which read to StringView by default.
The TPCH test result changes due to this PR because it runs on CSV (.tbl). Allowing the CSV scanner to produce StringView seems like an optimization we could do

@2010YOUY01 2010YOUY01 merged commit 91c0975 into apache:main Feb 8, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use StringViewArray as output of substr when input was StringArray
3 participants