Unexpected behavior when executing CREATE OR REPLACE TABLE AS SELECT with REST Catalog #12738

Open
baotran306 opened this issue Apr 7, 2025 · 0 comments

Labels
bug Something isn't working
Apache Iceberg version

1.7.2

Query engine

Spark

Please describe the bug 🐞

Describe bug

When I run the CREATE OR REPLACE TABLE AS SELECT command, it works well in several cases, including (see the sketch after this list):

  • Replacing a table with its structure only (using LIMIT 0 in the SELECT statement)
  • Replacing a non-partitioned table with a partitioned table
  • Replacing a table partitioned by one identity column with a different identity partition column
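
For example, the structure-only case looks roughly like this (a minimal sketch; source_tbl and target_tbl are placeholder names, not the tables from the reproduction code below):

# Hypothetical structure-only replace: LIMIT 0 copies the schema but no rows
# (source_tbl / target_tbl are placeholder names, not the tables used in the repro below)
spark.sql("""
    CREATE OR REPLACE TABLE lakekeeper.risk_mon_out.target_tbl
    USING iceberg
    AS SELECT * FROM lakekeeper.risk_mon_out.source_tbl LIMIT 0
""")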

However, there is one case I tested that was unsuccessful: replacing a table that has a transform partition (bucket) with a non-partitioned one.

Steps to reproduce

First, I created an empty non-partitioned table using the CREATE TABLE command and inserted test data. Next, I created a partitioned table with a different structure in the same way.


Then I tried to replace the existing partitioned table with the contents of my non-partitioned table using the CREATE OR REPLACE TABLE AS SELECT command. I expected the table to be completely replaced, both schema and data. Instead, it failed with: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 24.0 failed 1 times, most recent failure: Lost task 0.0 in stage 24.0 (TID 50) (10.16.14.4 executor driver): org.apache.iceberg.exceptions.ValidationException: Cannot find source column for partition field: 1000: id_bucket: bucket[16](1)


So I tried one more time. Before replacing the table, I dropped all partition fields using ALTER TABLE ... DROP PARTITION FIELD. However, it still failed with the same error. I don't understand why it is still checking for partition fields when I had already dropped them before replacing the table. (I have also checked the metadata file, and there is no partition spec anymore after the drop.)


My code: Spark + REST Catalog (catalog_name = 'lakekeeper')

# Create a non-partitioned table and insert test data
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakekeeper.risk_mon_out.my_data_non_partitioned(
        vendor_id int,
        product string,
        price double,
        event_time timestamp      
    )
""")
spark.sql("""
    INSERT INTO lakekeeper.risk_mon_out.my_data_non_partitioned VALUES
          (1, "Car", 2.34, TIMESTAMP '2025-03-01 11:03:01'),
          (2, "Bike", 5.43, TIMESTAMP '2025-03-01 11:19:21'),
          (3, "House", 100.4, TIMESTAMP '2025-03-01 12:43:17'),
          (4, "Plane", 124.4, TIMESTAMP '2025-03-01 13:05:29')
""")
spark.sql("SELECT * FROM lakekeeper.risk_mon_out.my_data_non_partitioned").show()

# Create a partitioned table with a transform partition (bucket)
spark.sql("""
    CREATE TABLE lakekeeper.risk_mon_out.data_sample_test(
        id int,
        category string,
        event_time timestamp      
    )
    PARTITIONED BY (
        bucket(16, id),
        category
    )
""")
spark.sql("""
    INSERT INTO lakekeeper.risk_mon_out.data_sample_test VALUES
          (5, "HOVal", TIMESTAMP '2025-03-01 11:03:09'),
          (6, "NextMi", TIMESTAMP '2025-03-01 12:19:29'),
          (7, "OGel", TIMESTAMP '2025-01-09 06:43:17'),
          (8, "AAAA", TIMESTAMP '2025-01-09 06:05:29')
""")
spark.sql("SELECT * FROM lakekeeper.risk_mon_out.data_sample_test").show()

# Try to replace the partitioned table with the non-partitioned one
try:
    spark.sql("""
        CREATE OR REPLACE TABLE lakekeeper.risk_mon_out.data_sample_test
        USING iceberg
        AS SELECT * FROM lakekeeper.risk_mon_out.my_data_non_partitioned
    """)
except Exception as e:
    print(f"Error 111: {e}")

try:
    spark.sql("ALTER TABLE lakekeeper.risk_mon_out.data_sample_test DROP PARTITION FIELD id_bucket")
    spark.sql("ALTER TABLE lakekeeper.risk_mon_out.data_sample_test DROP PARTITION FIELD category")
    print("Drop done")
    spark.sql("""
        CREATE OR REPLACE TABLE lakekeeper.risk_mon_out.data_sample_test
        USING iceberg
        AS SELECT * FROM lakekeeper.risk_mon_out.my_data_non_partitioned
    """)
except Exception as e:
    print("=====")
    print(f"Error 222: {e}")

Expected Behavior

The table should be replaced entirely, including both the schema and the data.

Environment

Apache Iceberg Version: 1.7.2
Execution Engines Tested: Spark 3.5.2
Catalog Type: REST catalog
Storage: S3-compatible

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time