We encountered the following three issues when using Hudi MOR bucketed tables:
1. After loading historical data with Spark in bulk_insert mode, we started a Flink job in upsert mode to write incremental data. We found that compaction could only complete once every bucket had data, and after compaction each bucket contained only one file.
2. When querying the Hudi table through Hive, only the data written after compaction was readable. If the number of buckets is large, the time taken by compaction significantly delays data availability.
3. After compaction, each bucket contains only one file and the historical files are cleaned up, even though we have configured the following cleaning policies:
options.put("hoodie.clean.automatic", "true");
options.put("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS");
options.put("hoodie.cleaner.commits.retained", "5");
options.put("hoodie.clean.async", "true");
We would like to get answers to the following questions:
1. Does the compaction operation have to wait until every bucket has files before it can complete?
2. Is it expected behavior that Hive can only read data after compaction has completed?
3. After compaction, is it expected that each bucket contains only one file? Is there a way to retain more historical files?
To Reproduce
Steps to reproduce:
1. Use Spark in bulk_insert mode to write the historical data into the Hudi MOR bucketed table (see the Spark write sketch after these steps).
2. Start Flink in upsert mode to incrementally write new data.
3. After the incremental data is written, trigger the compaction operation.
4. Query the Hudi table through Hive and observe that only data written after compaction is readable.
5. Check the file storage and observe that each bucket retains only one file and the historical files have been cleaned up.
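For reference, a minimal sketch of the Spark bulk_insert write in step 1 (Java Dataset API). The table name, path, source data, and record key / precombine fields below are placeholders, not taken from our actual job:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("hudi-bulk-insert").getOrCreate();
// Placeholder source of historical records
Dataset<Row> historical = spark.read().parquet("/path/to/historical");

historical.write()
    .format("hudi")
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.table.name", "demo_table")                   // placeholder table name
    .option("hoodie.datasource.write.recordkey.field", "id")     // placeholder record key
    .option("hoodie.datasource.write.precombine.field", "ts")    // placeholder precombine field
    .option("hoodie.index.type", "BUCKET")
    .option("hoodie.bucket.index.num.buckets", "10")
    .mode(SaveMode.Append)
    .save("/warehouse/hudi/demo_table");                         // placeholder path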
Flink-related parameters:
options.put("hoodie.write.concurrency.mode","optimistic_concurrency_control" );
options.put("hoodie.upsert.shuffle.parallelism", "20");
options.put("hoodie.insert.shuffle.parallelism", "20");
options.put("write.operation", "upsert");
options.put("write.tasks", "2");
options.put("index.type","BUCKET");
options.put("hoodie.bucket.index.num.buckets","10");
options.put("hoodie.index.bucket.engine","SIMPLE");
options.put("hoodie.clean.automatic", "true");
options.put("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS");
options.put("hoodie.cleaner.commits.retained", "5");
options.put("hoodie.clean.async", "true");
options.put("hoodie.archive.min.commits", "20");
options.put("hoodie.archive.max.commits", "30");
options.put("hoodie.clean.parallelism", "20");
options.put("hoodie.archive.parallelism", "20");
options.put("hoodie.compact.inline", "false");
options.put("hoodie.compact.inline.max.delta.commits", "1");
options.put("hoodie.compact.schedule.inline", "true");
Expected behavior
Environment Description
● Hudi version: 0.14.0
● Spark version: 3.2.1
● Hive version: 3.1.2
● Hadoop version: 3.2.2
● Storage: HDFS
● Running on Docker?: No
Additional context
Stacktrace
There are no specific error logs; the issue is a question about functional behavior.