[SUPPORT] Difference in accuracy between the results from Spark and Hive #12370

Open
suxiaogang223 opened this issue Nov 29, 2024 · 6 comments
Labels
data-consistency phantoms, duplicates, write skew, inconsistent snapshot priority:critical production down; pipelines stalled; Need help asap. schema-and-data-types

Comments

suxiaogang223 commented Nov 29, 2024

Tips before filing an issue

  • Have you gone through our FAQs?
    The url is invalid and page not found :(

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

When I use Spark to write a table with a timestamp column, the timestamp precision differs between the result returned by Spark and the result returned by Hive.

create table test_timestamp(id int, time timestamp)using hudi;
insert into test_timestamp values(1,timestamp('2024-11-28 12:00:00.123456'));

select time from test_timestamp;
-- result
2024-11-28 12:00:00.123456

-- result from hive
2024-11-28 04:00:00.123

Is this behavior expected, and are there any plans to improve it in the future?

Environment Description

  • Hudi version : hudi-0.15

  • Spark version : 3.4.2

  • Hive version : 3.1.3

  • Hadoop version : 3.1

  • Storage (HDFS/S3/GCS..) :

  • Running on Docker? (yes/no) :


rangareddy commented Dec 2, 2024

Hi @suxiaogang223

Thanks for reporting this issue. I am able to reproduce it. Please allow some time to provide a solution.

Meanwhile, could you please check which timezone is set on the machine where you run the Hive shell?

timedatectl

When I check the Parquet file, the data is stored with microsecond precision.

_hoodie_commit_time: string
_hoodie_commit_seqno: string
_hoodie_record_key: string
_hoodie_partition_path: string
_hoodie_file_name: string
id: int32
time: timestamp[us, tz=UTC]
----
_hoodie_commit_time: [["20241202103846073"]]
_hoodie_commit_seqno: [["20241202103846073_0_0"]]
_hoodie_record_key: [["20241202103846073_0_0"]]
_hoodie_partition_path: [[""]]
_hoodie_file_name: [["730743c3-73b3-473b-82f8-fa242d7e78b4-0_0-13-56_20241202103846073.parquet"]]
id: [[1]]
time: [[2024-11-28 12:00:00.123456Z]]
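
As a minimal sketch of what a `timestamp[us, tz=UTC]` column implies, assume the value is held as a whole number of microseconds since the Unix epoch. The helper names `to_epoch_micros` and `from_epoch_micros` are hypothetical and are not part of Hudi, Parquet, or Hive. The sketch only illustrates that the stored value carries all six fractional digits, and that dropping the last three digits of the integer is enough to lose them.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware UTC instant.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(ts: datetime) -> int:
    """Encode an aware datetime as whole microseconds since the epoch."""
    # timedelta // timedelta is exact integer division, so no float rounding.
    return (ts - EPOCH) // timedelta(microseconds=1)


def from_epoch_micros(micros: int) -> datetime:
    """Decode microseconds since the epoch back into a UTC datetime."""
    return EPOCH + timedelta(microseconds=micros)


# The value shown in the file dump above.
stored = datetime(2024, 11, 28, 12, 0, 0, 123456, tzinfo=timezone.utc)
micros = to_epoch_micros(stored)

print(micros)
print(from_epoch_micros(micros).isoformat())
# Zeroing the last three digits models a reader that keeps only milliseconds.
print(from_epoch_micros(micros // 1000 * 1000).isoformat())
```

The round trip shows that nothing is lost at the storage level under this assumption, which is consistent with the precision loss appearing only on the read side.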

@ad1happy2go ad1happy2go added schema-and-data-types priority:critical production down; pipelines stalled; Need help asap. data-consistency phantoms, duplicates, write skew, inconsistent snapshot labels Dec 3, 2024
@github-project-automation github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Dec 3, 2024
suxiaogang223 (Author) commented

@rangareddy wrote: This inconsistency is caused by the Hive query engine.

Thanks for the reply.
The result is:

timedatectl
               Local time: Wed 2024-12-04 11:40:17 CST
           Universal time: Wed 2024-12-04 03:40:17 UTC
                 RTC time: Wed 2024-12-04 03:40:16
                Time zone: Asia/Shanghai (CST, +0800)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

The difference in the time value is expected behavior, because Spark does not parse the timestamp as UTC. My main concern is the difference in precision: is there a way to make Hive output accurate to microseconds?
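
The two differences can be reproduced with plain arithmetic. This is only a sketch, assuming the Spark session timezone was +08:00 (as the timedatectl output suggests) and that the Hive result corresponds to the UTC wall-clock value truncated to milliseconds. It does not model the actual Hive or Hudi read path, and the names `SESSION_TZ`, `format_micros`, and `format_millis` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the writer's session timezone is a fixed +08:00 offset.
SESSION_TZ = timezone(timedelta(hours=8))


def format_micros(ts: datetime) -> str:
    """Render a timestamp with six fractional digits."""
    return ts.strftime("%Y-%m-%d %H:%M:%S.%f")


def format_millis(ts: datetime) -> str:
    """Render a timestamp with three fractional digits (truncated, not rounded)."""
    return ts.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]


# The literal inserted through Spark, interpreted in the session timezone.
written = datetime(2024, 11, 28, 12, 0, 0, 123456, tzinfo=SESSION_TZ)

# The same instant expressed as a UTC wall-clock value.
utc_value = written.astimezone(timezone.utc)

print(format_micros(written))    # 2024-11-28 12:00:00.123456
print(format_millis(utc_value))  # 2024-11-28 04:00:00.123
```

The 8-hour shift and the missing last three digits are independent effects in this sketch: the first comes from the timezone interpretation, the second from millisecond truncation.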

rangareddy commented

Hi @suxiaogang223

The main issue is that Hive converts the timestamp value from UTC to your machine's timezone. Could you please set the timezone to UTC and see if it works?

export HS2_OPTS="-Duser.timezone=$HS2_USER_TZ -Dhive.local.time.zone=$HIVE_LOCAL_TZ"

https://community.cloudera.com/t5/Support-Questions/Can-we-change-default-hive-hbase-timestamp-from-UTC-to-other/m-p/336202

suxiaogang223 (Author) commented

Setting the timezone works, but the precision is still wrong. Is there a way to make Hive output accurate to microseconds?

rangareddy commented

Hi @suxiaogang223

If you create a plain Hive table with microsecond timestamps, it works, but it does not work for the Hudi table. I will discuss this with the team internally and create a bug report.

rangareddy commented

Created Hudi Jira - https://issues.apache.org/jira/browse/HUDI-8677

@ad1happy2go ad1happy2go moved this from ⏳ Awaiting Triage to 🏁 Triaged in Hudi Issue Support Dec 9, 2024