
[SUPPORT] Data loss in Flink CDC into Hudi without any exception or other information #10542

Open
xuzifu666 opened this issue Jan 22, 2024 · 18 comments
Labels
change-data-capture, data-loss, flink, function-regression

Comments

@xuzifu666
Member

xuzifu666 commented Jan 22, 2024


Describe the problem you faced

When we use Flink CDC to write into Hudi 0.14.0/0.14.1, data loss can occur: data that was deleted in MySQL can still be found in Hudi, or data inserted into MySQL cannot be found in Hudi. More importantly, no exception is produced that would help us dig into the root cause.
Scenarios:

  1. Flink write to Hudi without compaction
  2. Flink write to Hudi with online compaction
  3. Flink write to Hudi with offline compaction
    The problem can occur in all three of the above scenarios (the compaction options that differ between them are sketched below).
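
For context, a minimal Flink SQL sketch of how the three compaction setups differ in the Hudi sink options. The table name, columns, path and values are illustrative only (not our exact production configuration), and option names may vary slightly between Hudi versions:

```sql
-- Sketch of a Hudi MOR sink, highlighting the compaction options that differ
-- between the three scenarios (all names/values are illustrative).
CREATE TABLE hudi_sink_sketch (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi/hudi_sink_sketch',
  'table.type' = 'MERGE_ON_READ',

  -- Scenario 2 (online/async compaction inside the Flink job) is active here:
  'compaction.async.enabled' = 'true'

  -- Scenario 1 (no compaction): replace the option above with
  --   'compaction.schedule.enabled' = 'false',
  --   'compaction.async.enabled'    = 'false'

  -- Scenario 3 (offline compaction): the Flink job only schedules compaction
  -- plans and a separate offline job executes them; replace the option above with
  --   'compaction.schedule.enabled' = 'true',
  --   'compaction.async.enabled'    = 'false'
);
```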

To Reproduce

Steps to reproduce the behavior:

  1. Flink writes billions of records into Hudi (a sketch of the pipeline is shown below)
  2. then check the data against MySQL
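
A sketch of the kind of pipeline used in step 1, assuming the MySQL CDC connector and the Hudi Flink bundle are on the classpath; hostnames, credentials, columns and the target table are placeholders (the sink refers to the hypothetical hudi_sink_sketch table from the sketch above):

```sql
-- Hypothetical MySQL CDC source feeding the Hudi sink (placeholder names/values).
CREATE TABLE mysql_orders_cdc (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'user',
  'password' = 'pass',
  'database-name' = 'orders_db',
  'table-name' = 'orders'
);

-- Continuously write the change stream into the Hudi table.
INSERT INTO hudi_sink_sketch
SELECT id, name, ts FROM mysql_orders_cdc;
```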

Expected behavior

All data in Hudi should be consistent with MySQL.
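
A rough example of the consistency check implied here, with placeholder table and key names; in practice both tables would have to be visible from one query engine (e.g. via catalogs) and the Hudi side read as a snapshot:

```sql
-- Rows present in MySQL but missing from the Hudi snapshot (lost inserts/updates).
SELECT m.id
FROM mysql_orders m
LEFT JOIN hudi_orders h ON m.id = h.id
WHERE h.id IS NULL;

-- Rows still present in Hudi although they were deleted in MySQL (lost deletes).
SELECT h.id
FROM hudi_orders h
LEFT JOIN mysql_orders m ON h.id = m.id
WHERE m.id IS NULL;
```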

Environment Description

  • Flink version : 1.13.2

  • Hudi version : 0.14.0/0.14.1

  • Spark version : 3.2.0

  • Hive version : 1.1.0

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS..) : HDFS

  • Running on Docker? (yes/no) : no


Stacktrace

There is no stacktrace to dig into; both the Flink job and the compaction job were healthy at the time.

@danny0405
Contributor

Did you try 1.0? Does it fix the data loss?

@xuzifu666
Member Author

Did you try 1.0? Does it fix the data loss?

Currently we have only tested it on 0.14.0/0.14.1, because the 1.0 version still contains many problems, such as experimental configs that could regress functionality or performance before the GA release. Trying our best to fix it in the current version would be better. @danny0405

@danny0405 added the data-loss and change-data-capture labels Jan 22, 2024
github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Jan 22, 2024
@danny0405 added the flink label Jan 22, 2024
@danny0405
Contributor

Maybe relevant to this fix: #10398

@xuzifu666
Member Author

Maybe relevant to this fix: #10398

OK, we will try the PR. Thanks @danny0405

@codope moved this from ⏳ Awaiting Triage to 👤 User Action in Hudi Issue Support Jan 22, 2024
@ad1happy2go
Collaborator

@xuzifu666 Were you able to resolve this issue with that PR? Please share the insights.

@xuzifu666
Member Author

xuzifu666 commented Jan 31, 2024

@xuzifu666 Were you able to resolve this issue with that PR? Please share the insights.

Hi, not yet. Data loss also happens in 0.14.x; maybe we need to have a try with the 1.0 version. @ad1happy2go

@xuzifu666
Member Author

xuzifu666 commented Feb 1, 2024

@ad1happy2go According to the earlier feedback, was the data loss bug fixed in the 1.0 beta version? The bug is very serious: businesses that need 100% data quality, such as CDC scenarios, cannot run with it. cc @danny0405

@ad1happy2go
Collaborator

Thanks for the update @xuzifu666. But did you try with that PR (#10398) on top of 0.14.1 as Danny suggested?

@xuzifu666
Member Author

Thanks for the update @xuzifu666. But did you try with that PR (#10398) on top of 0.14.1 as Danny suggested?

We had tried it; data loss still occurs. @ad1happy2go

@ad1happy2go
Collaborator

@xuzifu666 Can you post the table configurations or the code snippet you are using to load the data? Did you try to reproduce the same with a small dataset, or are you seeing this behaviour only with billions of records? I will also try once in case I can reproduce it.

@xuzifu666
Member Author

@ad1happy2go @danny0405 Hi, recently we checked the data loss case on Hudi 1.0 beta and this problem has been fixed there, but the 0.1x versions still have the data loss case. Thanks for your attention.

@xuzifu666
Member Author

This has been resolved in Hudi 1.0 beta, so I am closing the issue.

github-project-automation bot moved this from 👤 User Action to ✅ Done in Hudi Issue Support Mar 4, 2024
@danny0405
Contributor

May be caused by this PR: #9545

@danny0405 reopened this Mar 5, 2024
github-project-automation bot moved this from ✅ Done to ⏳ Awaiting Triage in Hudi Issue Support Mar 5, 2024
@nsivabalan
Contributor

hey @xuzifu666:
do you happen to have the old data intact that had the data loss? We would like to root-cause this. The 0.x release line will be used by a lot of OSS users, so we really want to get to the bottom of it and fix it.

Would greatly appreciate it if you can help us triage this.

  • Do you happen to know when exactly the data loss happens? Do you see anything interesting in the timeline around the time the data loss happens?
  • Is it a single writer or multi-writer?
  • We do have some suspicion around log record reading that we are chasing. Ref ticket: [SUPPORT] Data loss due to incorrect selection of log file during compaction #10803. But I do not want to bias this one; let's get more info about when exactly the data loss is seen.
  • Are there any task retries in general? I am not familiar with Flink, but in Spark we might have Spark task retries. Are there any such things happening in your pipeline?
  • Is it happening across all pipelines occasionally, or only in very few pipelines? And if it's very few, are there any common characteristics like index type, metadata enabled, etc. compared to the other pipelines that do not have the data loss issue?
  • And can you confirm that these pipelines were running without any issues on older versions of Hudi?
  • Did you happen to reproduce this in a deterministic manner?

@xuzifu666
Member Author

xuzifu666 commented Mar 6, 2024

hey @xuzifu666: do you happen to have the old data intact that had the data loss? We would like to root-cause this. The 0.x release line will be used by a lot of OSS users, so we really want to get to the bottom of it and fix it.

Would greatly appreciate it if you can help us triage this.

  • Do you happen to know when exactly the data loss happens? Do you see anything interesting in the timeline around the time the data loss happens?
  • Is it a single writer or multi-writer?
  • We do have some suspicion around log record reading that we are chasing. Ref ticket: [SUPPORT] Data loss due to incorrect selection of log file during compaction #10803. But I do not want to bias this one; let's get more info about when exactly the data loss is seen.
  • Are there any task retries in general? I am not familiar with Flink, but in Spark we might have Spark task retries. Are there any such things happening in your pipeline?
  • Is it happening across all pipelines occasionally, or only in very few pipelines? And if it's very few, are there any common characteristics like index type, metadata enabled, etc. compared to the other pipelines that do not have the data loss issue?
  • And can you confirm that these pipelines were running without any issues on older versions of Hudi?
  • Did you happen to reproduce this in a deterministic manner?

Hi @nsivabalan, thanks for your attention. Going through the points you raised, I can answer as follows:

  1. From the timestamps of all the lost records, we confirmed the loss happens around Flink job checkpoint execution, but the job state is OK and there is no exception in the timeline. Because of this it is hard to pin down the root cause.
  2. In our case, the data loss happened in a single-writer job.
  3. We read [SUPPORT] Data loss due to incorrect selection of log file during compaction #10803 recently, but that issue only occurs in the compaction scenario. We have tested all the scenarios mentioned above: a. Flink job with online compaction; b. Flink job without compaction; c. Flink job with compaction executed by a separate Spark compaction job. Data loss can happen in all of these scenarios.
  4. The job was stable the whole time without any exception. There were no retries during the running time.
  5. There are about 4 or 5 pipelines; we do not use the metadata table, the table type is MOR, and the index type is bucket (a sketch of the corresponding table options is shown after this list).
  6. The Hudi version we use is 0.14.0.
  7. So far we have not found a deterministic way to reproduce it, because the job state looks healthy and the timeline state is OK.
    If you have any other questions, feel free to ask anytime.
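
A minimal sketch of Flink SQL sink options matching the setup described in point 5 (MOR table, bucket index, metadata table disabled). Table name, schema, path and bucket count are illustrative, not our exact production configuration, and option names may differ slightly across Hudi versions:

```sql
CREATE TABLE hudi_orders_sketch (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///warehouse/hudi_orders_sketch',
  'table.type' = 'MERGE_ON_READ',             -- MOR table
  'index.type' = 'BUCKET',                    -- bucket index
  'hoodie.bucket.index.num.buckets' = '256',  -- illustrative bucket count
  'metadata.enabled' = 'false',               -- metadata table not used
  'compaction.async.enabled' = 'true'         -- the online-compaction variant
);
```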

@danny0405
Contributor

Hmm, I think I have figured out why Flink gets data loss here: Flink can actually flush multiple times into one log file within one commit, and that would definitely cause data loss here. Can we disable this change for Flink?

@danny0405
Contributor

Can we revert the PR first, or make a quick fix behind a configuration flag that is disabled by default?

@xuzifu666
Member Author

Can we revert the PR first, or make a quick fix behind a configuration flag that is disabled by default?

OK, I will revert the PR. @danny0405
