
[HUDI-8664] adapt TestSparkSqlCoreFlow for hudi stream API #12602

Open
wants to merge 2 commits into master from HUDI-8664
Conversation

Davis-Zhang-Onehouse (Contributor):

Change Logs

Fix the broken test, as we changed hudi_table_change to use Spark streaming.

Impact

TestSparkSqlCoreFlow is all green now.

Risk level (write none, low, medium, or high below)

none

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions bot added the size:M label (PR with lines of changes in (100, 300]) on Jan 8, 2025
@@ -76,14 +76,22 @@ public static List<String> listCommitsSince(HoodieStorage storage, String basePa

// this is used in the integration test script: docker/demo/sparksql-incremental.commands
public static List<String> listCompletionTimeSince(FileSystem fs, String basePath,
String instantTimestamp) {
String instantTimestamp) {
Contributor:

Could this method use listCompletedInstantSince (to return Stream<HoodieInstant>) to avoid code duplication on the same logic?

Author:

done
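The refactor agreed above can be sketched in isolation. This is a hedged illustration, not Hudi's actual code: `CommitInstant` is a hypothetical stand-in for `HoodieInstant`, and the two methods mirror `listCompletedInstantSince` and `listCompletionTimeSince` only in shape, showing how the second becomes a thin projection over the first instead of duplicating the timeline filtering.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for Hudi's HoodieInstant, for illustration only.
record CommitInstant(String timestamp, String completionTime) {}

public class CompletionTimes {
    // Stand-in for listCompletedInstantSince: the single place that filters the timeline.
    static Stream<CommitInstant> listCompletedInstantSince(List<CommitInstant> timeline, String since) {
        return timeline.stream().filter(i -> i.timestamp().compareTo(since) > 0);
    }

    // listCompletionTimeSince becomes a projection over that stream instead of
    // re-implementing the same filtering logic.
    static List<String> listCompletionTimeSince(List<CommitInstant> timeline, String since) {
        return listCompletedInstantSince(timeline, since)
                .map(CommitInstant::completionTime)
                .collect(Collectors.toList());
    }
}
```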

* Returns the last successful write operation's completed instant.
*/
@PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
public static HoodieInstant latestCompletedCommitCompletionTime(FileSystem fs, String basePath) {
Contributor:

Similar here on extracting the common functionality with latestCommit (two methods above).

Author:

done
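The extraction suggested here — sharing the common lookup between latestCommit and the new completed-commit variant — can be sketched as follows. All names are hypothetical stand-ins (`TimelineInstant` for `HoodieInstant`, a `List` for the timeline); the point is only the shape: both public lookups reduce to different filters over one shared helper.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for Hudi's HoodieInstant, for illustration only.
record TimelineInstant(String timestamp, boolean completed) {}

public class LatestInstantLookup {
    // One shared helper holds the "last instant of a filtered timeline" logic ...
    static TimelineInstant latestInstant(List<TimelineInstant> timeline, Predicate<TimelineInstant> filter) {
        return timeline.stream()
                .filter(filter)
                .reduce((first, second) -> second)   // keep the last matching instant
                .orElseThrow(() -> new IllegalStateException("no matching instant"));
    }

    // ... so the two public lookups differ only in the filter they pass.
    static TimelineInstant latestCommit(List<TimelineInstant> timeline) {
        return latestInstant(timeline, i -> true);
    }

    static TimelineInstant latestCompletedCommit(List<TimelineInstant> timeline) {
        return latestInstant(timeline, TimelineInstant::completed);
    }
}
```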

@@ -30,16 +30,15 @@ import org.apache.hudi.common.testutils.RawTripTestPayload.recordsToStrings
import org.apache.hudi.hadoop.fs.HadoopFSUtils
import org.apache.hudi.keygen.NonpartitionedKeyGenerator
import org.apache.hudi.testutils.HoodieClientTestUtils.createMetaClient
- import org.apache.hudi.{DataSourceReadOptions, HoodieSparkUtils}
+ import org.apache.hudi.DataSourceReadOptions
Contributor:

nit: keep import grouping

Author:

done

@@ -91,87 +90,101 @@ class TestSparkSqlCoreFlow extends HoodieSparkSqlTestBase {
val dataGen = new HoodieTestDataGenerator(HoodieTestDataGenerator.TRIP_NESTED_EXAMPLE_SCHEMA, 0xDEED)

//Bulk insert first set of records
- val inputDf0 = generateInserts(dataGen, "000", 100).cache()
+ val inputDf0 = generateInserts(dataGen, "000", 10).cache()
Contributor:

keep the record count the same as before

Contributor:

same for other places

Author:

done

* Returns the last successful write operation's completed instant.
*/
@PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
public static HoodieInstant latestCompletedCommitCompletionTime(FileSystem fs, String basePath) {
Contributor:

Suggested change:
- public static HoodieInstant latestCompletedCommitCompletionTime(FileSystem fs, String basePath) {
+ public static HoodieInstant latestCompletedCommit(FileSystem fs, String basePath) {

Author:

done

return timeline.lastInstant().get();
}

public static HoodieInstant latestCompletedCommitCompletionTime(HoodieStorage storage, String basePath) {
Contributor:

Suggested change:
- public static HoodieInstant latestCompletedCommitCompletionTime(HoodieStorage storage, String basePath) {
+ public static HoodieInstant latestCompletedCommit(HoodieStorage storage, String basePath) {

Author:

done

assertEquals(100, snapshotDf2.count())
compareUpdateDfWithHudiDf(updateDf, snapshotDf2, snapshotDf1)
snapshotDf2.unpersist(true)
val commitCompletedInstant2 = latestCompletedCommitCompletionTime(fs, tableBasePath)
Contributor:

Suggested change:
- val commitCompletedInstant2 = latestCompletedCommitCompletionTime(fs, tableBasePath)
+ val commitInstant2 = latestCompletedCommitCompletionTime(fs, tableBasePath)

Author:

done for all

@@ -91,87 +90,101 @@ class TestSparkSqlCoreFlow extends HoodieSparkSqlTestBase {

val uniqueKeyCnt2 = inputDf2.select("_row_key").distinct().count()
insertInto(tableName, tableBasePath, inputDf2, UPSERT, isMetadataEnabled, 3)
val commitInstantTime3 = latestCommit(fs, tableBasePath)
val commitCompletedInstant3 = latestCompletedCommitCompletionTime(fs, tableBasePath)
Contributor:

Suggested change:
- val commitCompletedInstant3 = latestCompletedCommitCompletionTime(fs, tableBasePath)
+ val commitInstant3 = latestCompletedCommitCompletionTime(fs, tableBasePath)

Author:

done


// Read Incremental Query, uses hudi_table_changes() table valued function for spark sql
// we have 2 commits, try pulling the first commit (which is not the latest)
//HUDI-5266
- val firstCommit = listCommitsSince(fs, tableBasePath, "000").get(0)
+ val firstCommitInstant = listCompletedInstantSince(fs, tableBasePath, "000").get(0)
+ val firstCommit = firstCommitInstant.getCompletionTime
Contributor:

Suggested change:
- val firstCommit = firstCommitInstant.getCompletionTime
+ val firstCommitCompletionTime = firstCommitInstant.getCompletionTime

Contributor:

Let's make sure all variables have consistent naming.

Author:

done

val beforeRowMap = beforeRows.map(row => getRowKey(row) -> row).toMap

// Check that all input rows exist in hudiRows
inputRows.foreach { inputRow =>
Contributor:

Could we just sort the list by record key and do row comparison? Will that code be easier to understand?

Author:

I was thinking the same, but it is not simpler in practice: we would need to keep three indices over the three arrays, search both inputRows and beforeRows for each row in hudiRows, and handle the various cases where a key cannot be found; in total it comes to ~100 lines of code.

I can do that if required, but the current version is the most concise (though not the most efficient; we are only handling a couple hundred rows).
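The map-based comparison the author defends can be sketched in a self-contained form. This is an illustration only: `Row` here is a hypothetical one-key, one-value record standing in for Spark's `Row`, and the method name is invented. The point is the pattern — index the actual rows by record key once, then look each expected row up by key, instead of walking sorted arrays with parallel indices.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class RowComparison {
    // Hypothetical stand-in for Spark's Row, for illustration only.
    record Row(String key, String value) {}

    static void assertAllInputRowsPresent(Row[] inputRows, Row[] hudiRows) {
        // Build the key -> row index once (O(n)), ...
        Map<String, Row> hudiByKey = Arrays.stream(hudiRows)
                .collect(Collectors.toMap(Row::key, Function.identity()));
        // ... then each expected row is a single map lookup.
        for (Row input : inputRows) {
            Row actual = hudiByKey.get(input.key());
            if (actual == null) {
                throw new AssertionError("row key not found: " + input.key());
            }
            if (!actual.value().equals(input.value())) {
                throw new AssertionError("value mismatch for key: " + input.key());
            }
        }
    }
}
```

The missing-key case is handled explicitly, which is the part that bloats the sorted-array alternative.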

val hudiDfToCompare = spark.sqlContext.sql("select " + colsToCompare + " from hudiTbl")
val inputDfToCompare = spark.sqlContext.sql("select " + colsToCompare + " from inputTbl")
val beforeDfToCompare = spark.sqlContext.sql("select " + colsToCompare + " from beforeTbl")
def compareUpdateDfWithHudiRows(inputRows: Array[Row], hudiRows: Array[Row], beforeRows: Array[Row]): Unit = {
Contributor:

Does this achieve almost the same functionality as compareUpdateRowsWithHudiRows? Could we keep only one of them?

Author:

I can do that; it will require refactoring other consumers as well.

@@ -331,6 +401,32 @@ class TestSparkSqlCoreFlow extends HoodieSparkSqlTestBase {
assertEquals(hudiDfToCompare.except(inputDfToCompare).count, 0)
}

private def compareEntireInputRowsWithHudiRows(snapshotDf2Rows: Array[Row], timeTravelDfRows: Array[Row]): Unit = {
Contributor:

Suggested change:
- private def compareEntireInputRowsWithHudiRows(snapshotDf2Rows: Array[Row], timeTravelDfRows: Array[Row]): Unit = {
+ private def compareEntireInputRowsWithHudiRows(expectedRows: Array[Row], actualRows: Array[Row]): Unit = {

Author:

done

}
}

def compareUpdateRowsWithHudiRows(inputRows: Array[Row], hudiRows: Array[Row], beforeRows: Array[Row]): Unit = {
Contributor:

Suggested change:
- def compareUpdateRowsWithHudiRows(inputRows: Array[Row], hudiRows: Array[Row], beforeRows: Array[Row]): Unit = {
+ def compareUpdateRowsWithHudiRows(expectedRows: Array[Row], actualUpdateRows: Array[Row], actualRows: Array[Row]): Unit = {

Contributor:

Could you name them properly based on how they are used?

Author:

done

@@ -331,6 +401,32 @@ class TestSparkSqlCoreFlow extends HoodieSparkSqlTestBase {
assertEquals(hudiDfToCompare.except(inputDfToCompare).count, 0)
}

private def compareEntireInputRowsWithHudiRows(snapshotDf2Rows: Array[Row], timeTravelDfRows: Array[Row]): Unit = {
Contributor:

Is it possible to consolidate this with compareUpdateRowsWithHudiRows as well?

Author:

done.

@Davis-Zhang-Onehouse force-pushed the HUDI-8664 branch 3 times, most recently from 3192ada to 4d43cbd on January 9, 2025 02:09
@hudi-bot commented Jan 9, 2025:

CI report:

@hudi-bot supports the following bot commands:
  • @hudi-bot run azure: re-run the last Azure build
