[Enhancement] introduce show data distribution command to display data distribution #55588

Open
MatthewH00 wants to merge 19 commits into base: main

MatthewH00
Contributor

@MatthewH00 MatthewH00 commented Feb 6, 2025

Why I'm doing:

When data skew occurs, it can lead to cluster instability, but currently there are no good methods to detect it.

What I'm doing:

Introduce a SHOW DATA DISTRIBUTION command that displays data distribution at the bucket level.
Note: the command is aimed at regular users (such as data analysts). When data skew occurs, they can use the command to detect it and then adjust the table schema (bucket key) to fix it.
Syntax: SHOW DATA DISTRIBUTION FROM [db_name.]tbl_name [PARTITION (p1, ...)];
Output columns:
PartitionName: the partition name for a partitioned table; the table name for an unpartitioned table
BucketId: the bucket ID
RowCount: the number of rows in the bucket
RowCount%: the bucket's row count divided by the total row count of the current partition
DataSize: the data size of the bucket
DataSize%: the bucket's data size divided by the total data size of the current partition
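
As a rough illustration of how the two percentage columns are defined, the following Python sketch computes them from per-bucket statistics. The `BucketStat` type, the `with_percentages` helper, and the choice to report 0.0 for an empty partition are hypothetical; none of them is taken from this PR's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketStat:
    """Hypothetical per-bucket statistics; not a type from the PR."""
    partition_name: str
    bucket_id: int
    row_count: int
    data_size: int  # bytes


def percent(part, total):
    """Share of `part` in `total` as a percentage (0.0 if the total is zero)."""
    return 0.0 if total == 0 else 100.0 * part / total


def with_percentages(stats):
    """Return (PartitionName, BucketId, RowCount, RowCount%, DataSize, DataSize%) rows.

    Both percentages are relative to the totals of the bucket's own partition.
    """
    row_totals = defaultdict(int)
    size_totals = defaultdict(int)
    for s in stats:
        row_totals[s.partition_name] += s.row_count
        size_totals[s.partition_name] += s.data_size

    rows = []
    for s in stats:
        rows.append((
            s.partition_name,
            s.bucket_id,
            s.row_count,
            percent(s.row_count, row_totals[s.partition_name]),
            s.data_size,
            percent(s.data_size, size_totals[s.partition_name]),
        ))
    return rows
```

Because the totals are kept per partition, a bucket holding 900 of the 1000 rows in `p1` reports 90% regardless of how many rows other partitions hold.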

1. Partitioned table:
1) Entire table
e.g. show data distribution from partition_tbl_name
2) Single partition
e.g. show data distribution from partition_tbl_name partition(p1)
3) Several partitions
e.g. show data distribution from partition_tbl_name partition(p1,p2)

2. Unpartitioned table:
1) Entire table
e.g. show data distribution from unpartition_tbl_name
e.g. show data distribution from unpartition_tbl_name partition(unpartition_tbl_name)
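
The statement forms listed above can be approximated with a small regular expression. This is only a sketch for illustration: the real grammar lives in the project's SQL parser, and the `parse_show_data_distribution` helper, along with its simplified identifier rule (no backtick quoting), is an assumption made here.

```python
import re

# Simplified identifier rule (assumption): letters, digits, underscores.
_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"

_PATTERN = re.compile(
    rf"^\s*SHOW\s+DATA\s+DISTRIBUTION\s+FROM\s+"
    rf"(?:(?P<db>{_IDENT})\.)?(?P<table>{_IDENT})"
    rf"(?:\s+PARTITION\s*\(\s*(?P<parts>{_IDENT}(?:\s*,\s*{_IDENT})*)\s*\))?"
    rf"\s*;?\s*$",
    re.IGNORECASE,
)


def parse_show_data_distribution(sql):
    """Return (db, table, partitions), or None if the text does not match.

    `db` is None when no database prefix is given; `partitions` is an
    empty list when no PARTITION clause is present.
    """
    m = _PATTERN.match(sql)
    if m is None:
        return None
    parts = m.group("parts")
    partitions = [p.strip() for p in parts.split(",")] if parts else []
    return m.group("db"), m.group("table"), partitions
```

A statement that omits `FROM`, such as `show data distribution table_name`, does not match and yields `None`.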

3. Special cases:
1) Database does not exist
Returns an error like: Getting analyzing error. Detail message: Database db_name does not exsit.
2) Table does not exist
Returns an error like: Getting analyzing error. Detail message: Table does not exist or is not native table: table_name.
3) Partition does not exist
Returns an error like: Getting analyzing error. Detail message: Partition does not exist: partition_name.
4) No privilege on the table
Returns an error like: Access denied; you need (at least one of) the ANY privilege(s) on TABLE table_name for this operation.
5) Invalid SQL
Returns an error like: Getting syntax error at line 1, column 15. Detail message: Unexpected input 'table_name', the most similar input is {'FROM'}.

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

Signed-off-by: hmx <[email protected]>
@MatthewH00
Contributor Author

@kevincai Hi, could you help review this PR?
Data skew can lead to cluster instability, but currently there are no good methods to detect it.
This PR makes data skew detectable when it occurs by introducing a new command (syntax: show data distribution from ...).
Could you also help check the UT? It failed, but the UT report shows that all cases passed.

@kevincai
Contributor

kevincai commented Feb 6, 2025

Maybe change it to admin show data distribution from ...; there is already an admin show tablets distribution ..., so the syntax would be consistent.

@MatthewH00
Contributor Author

Maybe change it to admin show data distribution from ...; there is already an admin show tablets distribution ..., so the syntax would be consistent.

I think the two syntaxes target two different roles.

  1. admin show tablets distribution ...: I think this suits cluster administrators, because an administrator needs to know how tablets are distributed across the BE nodes. The command's result is therefore the tablet distribution at the BE node level.
  2. show data distribution from ...: I think this suits users such as data analysts, who operate on data in the cluster. When data skew occurs, they can use the command to detect it and then adjust the table schema to fix it. The command's result is therefore the data distribution at the bucket level.
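
One way an analyst might turn bucket-level row counts into a skew signal is to compare the largest bucket with the average bucket. The Python sketch below is a hedged illustration only: the `skew_ratio` and `is_skewed` helpers and the default threshold of 2.0 are assumptions and are not part of this PR.

```python
def skew_ratio(row_counts):
    """Ratio of the largest bucket to the mean bucket size.

    1.0 means the rows are spread perfectly evenly; larger values mean
    that one bucket holds disproportionately many rows.
    """
    if not row_counts:
        raise ValueError("need at least one bucket")
    total = sum(row_counts)
    if total == 0:
        return 1.0  # an empty partition is treated as not skewed
    mean = total / len(row_counts)
    return max(row_counts) / mean


def is_skewed(row_counts, threshold=2.0):
    """Flag a partition whose largest bucket is >= threshold times the mean.

    The default threshold is an arbitrary illustrative value.
    """
    return skew_ratio(row_counts) >= threshold
```

With four buckets holding 700, 100, 100, and 100 rows, the largest bucket is 2.8 times the mean, so the partition would be flagged under the illustrative threshold.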

I have also found the error in the UT and will fix it.

Signed-off-by: hmx <[email protected]>

sonarqubecloud bot commented Feb 8, 2025

Quality Gate failed

Failed conditions
10.6% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud


github-actions bot commented Feb 8, 2025

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)


github-actions bot commented Feb 8, 2025

[FE Incremental Coverage Report]

pass : 99 / 101 (98.02%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 com/starrocks/sql/ast/ShowDataDistributionStmt.java | 19 | 21 | 90.48% | [38, 39] |
| 🔵 com/starrocks/sql/analyzer/ShowStmtAnalyzer.java | 4 | 4 | 100.00% | [] |
| 🔵 com/starrocks/catalog/MetadataViewer.java | 61 | 61 | 100.00% | [] |
| 🔵 com/starrocks/sql/ast/AstVisitor.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/sql/analyzer/AuthorizerStmtVisitor.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/qe/ShowExecutor.java | 13 | 13 | 100.00% | [] |


github-actions bot commented Feb 8, 2025

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

@MatthewH00
Contributor Author

@kevincai Hi, could you please help review this PR when you have some free time?
Data skew poses risks to cluster stability. The new command introduced in this PR may be a good way for regular users (such as data analysts) to detect data skew, which would help improve cluster stability.
