[SUPPORT] Queries are very memory intensive due to low read parallelism in HoodieMergeOnReadRDD #12434

mzheng-plaid · 2024-12-05T20:04:22Z

Describe the problem you faced

We have jobs that read from a MOR table using the following pyspark pseudo-code (event_table_rt is the MOR table):

partitions = ["2023-11-13", "2023-11-14", "2023-11-15", "2023-11-16", "2023-11-17"]
event_df = spark.sql("select * from event_table_rt").filter(
    F.col("dt").isin(partitions)
)
user_df = spark.read.format("csv").option("header", "true").load(users_path)
filtered_events_df = df.join(
    F.broadcast(user_df),
    on=df["user_id"] == user_df["id"],
    how="inner",
)
filtered_events_df.write.format("parquet").save("s3://...")

We're running into a bottleneck on HoodieMergeOnReadRDD (https://github.com/apache/hudi/blob/release-0.14.2/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala#L37) where the number of tasks in the stage reading event_df seems to be non-configurable and (I think) equal to the number of files being read. This is causing massive disk/memory spill and bottlenecking performance.

Is it possible to configure the read parallelism to be higher or is this a fundamental limitation of Hudi with MOR tables? What is the recommendation for how to tune resourcing for readers of MOR tables?

Environment Description

Hudi version : 0.14.1-amzn-1 (EMR 7.2.0)
Spark version : 3.5.1
Hive version : 3.1.3
Hadoop version : 3.3.6
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no

The text was updated successfully, but these errors were encountered:

ad1happy2go · 2024-12-06T14:18:49Z

@mzheng-plaid The parallelism should be equal to the number of file groups as we can have one task reading one parquet file. Want to understand more if the parquet files / log files are properly sized then whey you will face bottleneck at task level?

mzheng-plaid · 2024-12-06T18:01:22Z

We set these to:

            "hoodie.parquet.max.file.size": (512 * 1024 * 1024),  # 512 mb
            "hoodie.parquet.block.size": (
                512 * 1024 * 1024
            ),  # 512 mb
            "hoodie.parquet.small.file.limit": (256 * 1024 * 1024),

So with parquet there is spark.sql.files.maxPartitionBytes and if I'm understanding apache/iceberg#8922 (comment) correctly Iceberg also supports split size - does Hudi have any similar support for reading large files? As far as I can tell spark.sql.files.maxPartitionBytes is ignored

ad1happy2go · 2024-12-08T14:05:42Z

@mzheng-plaid I dont think HoodieMergeOnReadRDD has a way to split filegroups further during read. Any way it will be difficult with snapshot read, as log files has to be applied on the parquet records.

mzheng-plaid · 2024-12-09T21:08:08Z

This is problematic even on the read optimized table (ie. just the base parquet files), which is really surprising

I tried:

A read-optimized query on the Hudi table
Calling spark.read.format("parquet").load({s3_path})

And just reading the parquet files directly was much less memory intensive and faster (ie. not spilling to disk) when I tuned spark.sql.files.maxPartitionBytes. I understand this will read multiple versions of the file groups but its surprising how much worse read performance is with Hudi.

mzheng-plaid · 2025-01-06T17:47:16Z

@ad1happy2go bump on this, is there any workaround for read-optimized queries? That behavior is surprising

ad1happy2go added performance reader-core labels Dec 6, 2024

ad1happy2go added this to Hudi Issue Support Dec 6, 2024

github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Dec 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SUPPORT] Queries are very memory intensive due to low read parallelism in HoodieMergeOnReadRDD #12434

[SUPPORT] Queries are very memory intensive due to low read parallelism in HoodieMergeOnReadRDD #12434

mzheng-plaid commented Dec 5, 2024

ad1happy2go commented Dec 6, 2024

mzheng-plaid commented Dec 6, 2024

ad1happy2go commented Dec 8, 2024

mzheng-plaid commented Dec 9, 2024

mzheng-plaid commented Jan 6, 2025

[SUPPORT] Queries are very memory intensive due to low read parallelism in HoodieMergeOnReadRDD #12434

[SUPPORT] Queries are very memory intensive due to low read parallelism in HoodieMergeOnReadRDD #12434

Comments

mzheng-plaid commented Dec 5, 2024

ad1happy2go commented Dec 6, 2024

mzheng-plaid commented Dec 6, 2024

ad1happy2go commented Dec 8, 2024

mzheng-plaid commented Dec 9, 2024

mzheng-plaid commented Jan 6, 2025