
[SUPPORT] Does Hudi re-create record level index during an upsert operation? #12783

Open
dataproblems opened this issue Feb 5, 2025 · 4 comments
Labels
index, metadata, metadata table, priority:critical (production down; pipelines stalled; need help asap)

Comments

@dataproblems

Describe the problem you faced

I created a Hudi table with the record level index and performed an upsert on it. The first time I performed the upsert, it read the record index files, figured out which files needed an update, and wrote the updated files to S3. The second time I performed an upsert on the same table, I saw the record_index folder being deleted and recreated under the metadata folder. My Hudi table is quite large, and re-creating the entire record level index is too expensive to support during an upsert operation.

To Reproduce

Steps to reproduce the behavior:

  1. Create a Hudi table using insert mode with the record level index enabled (see the spark-shell sketch after this list)
  2. Perform an upsert
  3. Perform another upsert
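
A minimal spark-shell (Scala) sketch of the three steps above, using the options from "Additional context" below. The base path, field names, and the initialDf / upsertDf1 / upsertDf2 DataFrames are placeholders, not taken from the original report:

    import org.apache.spark.sql.SaveMode

    val basePath = "s3://bucket/path/to/table" // placeholder

    // Common Hudi options, mirroring the config shared in "Additional context".
    val hudiOpts = Map(
      "hoodie.table.name"                           -> "my_table",
      "hoodie.datasource.write.recordkey.field"     -> "record_key",
      "hoodie.datasource.write.partitionpath.field" -> "partition_field",
      "hoodie.datasource.write.precombine.field"    -> "timestamp_field",
      "hoodie.datasource.write.table.type"          -> "COPY_ON_WRITE",
      "hoodie.metadata.enable"                      -> "true",
      "hoodie.metadata.record.index.enable"         -> "true",
      "hoodie.index.type"                           -> "RECORD_INDEX"
    )

    // 1. Create the table with an insert, which builds the record level index.
    initialDf.write.format("hudi")
      .options(hudiOpts + ("hoodie.datasource.write.operation" -> "insert"))
      .mode(SaveMode.Overwrite)
      .save(basePath)

    // 2. First upsert: the record index is updated incrementally, as expected.
    upsertDf1.write.format("hudi")
      .options(hudiOpts + ("hoodie.datasource.write.operation" -> "upsert"))
      .mode(SaveMode.Append)
      .save(basePath)

    // 3. Second upsert: this is where .hoodie/metadata/record_index was observed
    //    being deleted and rebuilt instead of being updated incrementally.
    upsertDf2.write.format("hudi")
      .options(hudiOpts + ("hoodie.datasource.write.operation" -> "upsert"))
      .mode(SaveMode.Append)
      .save(basePath)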

Expected behavior

I expect any number of upserts after the initial creation of the record level index to simply update the index as required, not re-create the whole index.

Environment Description

  • Hudi version : 0.15.0

  • Spark version : 3.4.1

  • Hive version :

  • Hadoop version : 3.3.6

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

Config I used for the upsert operation:

    hoodie.embed.timeline.server -> false, 
    hoodie.parquet.small.file.limit -> 1073741824, 
    hoodie.metadata.record.index.enable -> true, 
    hoodie.datasource.write.precombine.field -> $timestampField, 
    hoodie.datasource.write.payload.class -> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload,   
    hoodie.metadata.index.column.stats.enable -> true, 
    hoodie.parquet.max.file.size -> 2147483648, 
    hoodie.metadata.enable -> true, 
    hoodie.index.type -> RECORD_INDEX, 
    hoodie.datasource.write.operation -> upsert, 
    hoodie.parquet.compression.codec -> snappy, 
    hoodie.datasource.write.recordkey.field -> $recordKeyField, 
    hoodie.table.name -> $tableName, 
    hoodie.datasource.write.table.type -> COPY_ON_WRITE, 
    hoodie.datasource.write.hive_style_partitioning -> true, 
    hoodie.write.markers.type -> DIRECT, 
    hoodie.populate.meta.fields -> true, 
    hoodie.datasource.write.keygenerator.class -> org.apache.hudi.keygen.SimpleKeyGenerator, 
    hoodie.upsert.shuffle.parallelism -> 10000, 
    hoodie.datasource.write.partitionpath.field -> $partitionField

I do not see any errors, but it does not make sense that Hudi would clear away my index and recreate it.

@danny0405
Contributor

The MDT (metadata table) is updated incrementally on each upsert; if the MDT got re-initialized, the reason should be some data consistency issue between the MDT and the data table.
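
A minimal spark-shell sketch (assuming the standard metadata table layout under <basePath>/.hoodie/metadata; the path is a placeholder) for confirming whether the record index file groups and the metadata table timeline were rebuilt rather than updated incrementally:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import java.net.URI

    val basePath = "s3://bucket/path/to/table"   // placeholder
    val mdtPath  = s"$basePath/.hoodie/metadata" // standard metadata table location

    val fs = FileSystem.get(new URI(basePath), spark.sparkContext.hadoopConfiguration)

    // File groups backing the record index; a full re-initialization replaces these,
    // so their modification times would jump to the second upsert.
    fs.listStatus(new Path(s"$mdtPath/record_index"))
      .foreach(s => println(s"${s.getPath.getName}  ${s.getModificationTime}"))

    // The metadata table's own timeline; if it effectively starts at the second
    // upsert's commit time, the MDT was re-bootstrapped rather than updated.
    fs.listStatus(new Path(s"$mdtPath/.hoodie"))
      .map(_.getPath.getName).sorted.foreach(println)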

@dataproblems
Author

Is there a log line I could search for to determine what might have caused it? Since there is only one writer writing to this Hudi table, is there a way to know what caused the inconsistency?
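
A minimal spark-shell sketch (assuming hoodie.properties carries the hoodie.table.metadata.partitions and hoodie.table.metadata.partitions.inflight keys used by recent releases; the path is a placeholder) for checking which metadata partitions the data table currently marks as complete vs. inflight; a record_index entry dropping out of the completed list would line up with a re-initialization:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import java.net.URI
    import java.util.Properties

    val basePath = "s3://bucket/path/to/table" // placeholder
    val fs = FileSystem.get(new URI(basePath), spark.sparkContext.hadoopConfiguration)

    // hoodie.properties records which metadata table partitions are complete/inflight.
    val props = new Properties()
    val in = fs.open(new Path(s"$basePath/.hoodie/hoodie.properties"))
    try props.load(in) finally in.close()

    println("completed MDT partitions: " + props.getProperty("hoodie.table.metadata.partitions"))
    println("inflight MDT partitions:  " + props.getProperty("hoodie.table.metadata.partitions.inflight"))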

@ad1happy2go
Collaborator

@dataproblems Are you saying the record_index directory itself is getting deleted and recreated? If yes, that is definitely not expected.

Can you share your Hudi configurations? Can you also share the driver logs?

ad1happy2go added the index, metadata, metadata table, and priority:critical (production down; pipelines stalled; need help asap) labels on Feb 7, 2025
@dataproblems
Author

@ad1happy2go the Hudi configuration I used is the same one listed under "Additional context" in the issue description above.

I may not be able to share the driver logs, since this happened on our production table. However, is there a specific error message I can search for in the logs? I can confirm whether something like that exists.
