
[SUPPORT] Data loss in Flink CDC into Hudi without any exception or other information #10542

Open
xuzifu666 opened this issue Jan 22, 2024 · 18 comments
Labels
change-data-capture, data-loss, flink, function-regression

Comments

@xuzifu666
Member

xuzifu666 commented Jan 22, 2024


Describe the problem you faced

When we use Flink CDC to write into Hudi 0.14.0/0.14.1, data loss can occur: data that was deleted in MySQL can still be found in Hudi, or data inserted into MySQL cannot be found in Hudi. More importantly, no exception is produced that would help us dig into the root cause.
Scenarios:

  1. Flink write to Hudi without compaction
  2. Flink write to Hudi with online compaction
  3. Flink write to Hudi with offline compaction
    The problem can occur in all three of the above scenarios (the compaction options that differ between them are sketched below).
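
For context, a minimal Flink SQL sketch of how the three compaction setups differ in the Hudi sink options. The table name, columns, path and values are illustrative only (not our exact production configuration), and option names may vary slightly between Hudi versions:

```sql
-- Sketch of a Hudi MOR sink, highlighting the compaction options that differ
-- between the three scenarios (all names/values are illustrative).
CREATE TABLE hudi_sink_sketch (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi/hudi_sink_sketch',
  'table.type' = 'MERGE_ON_READ',

  -- Scenario 2 (online/async compaction inside the Flink job) is active here:
  'compaction.async.enabled' = 'true'

  -- Scenario 1 (no compaction): replace the option above with
  --   'compaction.schedule.enabled' = 'false',
  --   'compaction.async.enabled'    = 'false'

  -- Scenario 3 (offline compaction): the Flink job only schedules compaction
  -- plans and a separate offline job executes them; replace the option above with
  --   'compaction.schedule.enabled' = 'true',
  --   'compaction.async.enabled'    = 'false'
);
```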

To Reproduce

Steps to reproduce the behavior:

  1. Flink writes billions of records into Hudi (a sketch of the pipeline is shown below)
  2. then check the data against MySQL
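
A sketch of the kind of pipeline used in step 1, assuming the MySQL CDC connector and the Hudi Flink bundle are on the classpath; hostnames, credentials, columns and the target table are placeholders (the sink refers to the hypothetical hudi_sink_sketch table from the sketch above):

```sql
-- Hypothetical MySQL CDC source feeding the Hudi sink (placeholder names/values).
CREATE TABLE mysql_orders_cdc (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'user',
  'password' = 'pass',
  'database-name' = 'orders_db',
  'table-name' = 'orders'
);

-- Continuously write the change stream into the Hudi table.
INSERT INTO hudi_sink_sketch
SELECT id, name, ts FROM mysql_orders_cdc;
```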

Expected behavior

All data in Hudi should be consistent with MySQL.
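
A rough example of the consistency check implied here, with placeholder table and key names; in practice both tables would have to be visible from one query engine (e.g. via catalogs) and the Hudi side read as a snapshot:

```sql
-- Rows present in MySQL but missing from the Hudi snapshot (lost inserts/updates).
SELECT m.id
FROM mysql_orders m
LEFT JOIN hudi_orders h ON m.id = h.id
WHERE h.id IS NULL;

-- Rows still present in Hudi although they were deleted in MySQL (lost deletes).
SELECT h.id
FROM hudi_orders h
LEFT JOIN mysql_orders m ON h.id = m.id
WHERE m.id IS NULL;
```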

Environment Description

  • Flink version : 1.13.2

  • Hudi version : 0.14.0/0.14.1

  • Spark version : 3.2.0

  • Hive version : 1.1.0

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS..) : HDFS

  • Running on Docker? (yes/no) : no


Stacktrace

There is no stacktrace to dig into; both the Flink job and the compaction job were healthy at the time.

@danny0405
Contributor

Did you try 1.0? Does it fix the data loss?

@xuzifu666
Member Author

Did you try 1.0? Does it fix the data loss?

Currently we have only tested it on 0.14.0/0.14.1, because the 1.0 version still contains many problems, such as experimental configs that could regress functionality or performance before the GA release. Trying our best to fix it in the current version would be better. @danny0405

@danny0405 added the data-loss and change-data-capture labels Jan 22, 2024
github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Jan 22, 2024
@danny0405 added the flink label Jan 22, 2024
@danny0405
Contributor

Maybe relevant to this fix: #10398

@xuzifu666
Member Author

Maybe relevant to this fix: #10398

OK, we will try the PR. Thanks @danny0405

@codope moved this from ⏳ Awaiting Triage to 👤 User Action in Hudi Issue Support Jan 22, 2024
@ad1happy2go
Collaborator

@xuzifu666 Were you able to resolve this issue with that PR? Please share the insights.

@xuzifu666
Member Author

xuzifu666 commented Jan 31, 2024

@xuzifu666 Were you able to resolve this issue with that PR? Please share the insights.

Hi, not yet. Data loss also happens in 0.14.x; maybe we need to have a try with the 1.0 version. @ad1happy2go

@xuzifu666
Member Author

xuzifu666 commented Feb 1, 2024

@ad1happy2go According to the earlier feedback, was the data loss bug fixed in the 1.0 beta version? The bug is very serious: businesses that need 100% data quality, such as CDC scenarios, cannot run with it. cc @danny0405

@ad1happy2go
Collaborator

Thanks for the update @xuzifu666. But did you try with that PR (#10398) on top of 0.14.1 as Danny suggested?

@xuzifu666
Member Author

Thanks for the update @xuzifu666. But did you try with that PR (#10398) on top of 0.14.1 as Danny suggested?

We had tried it; data loss still occurs. @ad1happy2go

@ad1happy2go
Collaborator

@xuzifu666 Can you post the table configurations or the code snippet you are using to load the data? Did you try to reproduce the same with a small dataset, or are you seeing this behaviour only with billions of records? I will also try once in case I can reproduce it.

@xuzifu666
Member Author

@ad1happy2go @danny0405 Hi, recently we checked the data loss case on Hudi 1.0 beta and this problem has been fixed there, but the 0.1x versions still have the data loss case. Thanks for your attention.

@xuzifu666
Member Author

This has been resolved in Hudi 1.0 beta, so I am closing the issue.

github-project-automation bot moved this from 👤 User Action to ✅ Done in Hudi Issue Support Mar 4, 2024
@danny0405
Contributor

May be caused by this PR: #9545

@danny0405 reopened this Mar 5, 2024
github-project-automation bot moved this from ✅ Done to ⏳ Awaiting Triage in Hudi Issue Support Mar 5, 2024
@nsivabalan
Contributor

hey @xuzifu666:
do you happen to have the old data intact that had the data loss? We would like to root-cause this. The 0.x release line will be used by a lot of OSS users, so we really want to get to the bottom of it and fix it.

Would greatly appreciate it if you can help us triage this.

  • Do you happen to know when exactly the data loss happens? Do you see anything interesting in the timeline around the time the data loss happens?
  • Is it a single writer or multi-writer?
  • We do have some suspicion around log record reading that we are chasing. Ref ticket: [SUPPORT] Data loss due to incorrect selection of log file during compaction #10803. But I do not want to bias this one; let's get more info about when exactly the data loss is seen.
  • Are there any task retries in general? I am not familiar with Flink, but in Spark we might have Spark task retries. Are there any such things happening in your pipeline?
  • Is it happening across all pipelines occasionally, or only in very few pipelines? And if it's very few, are there any common characteristics like index type, metadata enabled, etc. compared to the other pipelines that do not have the data loss issue?
  • And can you confirm that these pipelines were running without any issues on older versions of Hudi?
  • Did you happen to reproduce this in a deterministic manner?

@xuzifu666
Member Author

xuzifu666 commented Mar 6, 2024

hey @xuzifu666: do you happen to have the old data intact that had the data loss? We would like to root-cause this. The 0.x release line will be used by a lot of OSS users, so we really want to get to the bottom of it and fix it.

Would greatly appreciate it if you can help us triage this.

  • Do you happen to know when exactly the data loss happens? Do you see anything interesting in the timeline around the time the data loss happens?
  • Is it a single writer or multi-writer?
  • We do have some suspicion around log record reading that we are chasing. Ref ticket: [SUPPORT] Data loss due to incorrect selection of log file during compaction #10803. But I do not want to bias this one; let's get more info about when exactly the data loss is seen.
  • Are there any task retries in general? I am not familiar with Flink, but in Spark we might have Spark task retries. Are there any such things happening in your pipeline?
  • Is it happening across all pipelines occasionally, or only in very few pipelines? And if it's very few, are there any common characteristics like index type, metadata enabled, etc. compared to the other pipelines that do not have the data loss issue?
  • And can you confirm that these pipelines were running without any issues on older versions of Hudi?
  • Did you happen to reproduce this in a deterministic manner?

Hi @nsivabalan, thanks for your attention. Going through the points you raised, I can answer as follows:

  1. From the timestamps of all the lost records, we confirmed the loss happens around Flink job checkpoint execution, but the job state is OK and there is no exception in the timeline. Because of this it is hard to pin down the root cause.
  2. In our case, the data loss happened in a single-writer job.
  3. We read [SUPPORT] Data loss due to incorrect selection of log file during compaction #10803 recently, but that issue only occurs in the compaction scenario. We have tested all the scenarios mentioned above: a. Flink job with online compaction; b. Flink job without compaction; c. Flink job with compaction executed by a separate Spark compaction job. Data loss can happen in all of these scenarios.
  4. The job was stable the whole time without any exception. There were no retries during the running time.
  5. There are about 4 or 5 pipelines; we do not use the metadata table, the table type is MOR, and the index type is bucket (a sketch of the corresponding table options is shown after this list).
  6. The Hudi version we use is 0.14.0.
  7. So far we have not found a deterministic way to reproduce it, because the job state looks healthy and the timeline state is OK.
    If you have any other questions, feel free to ask anytime.
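
A minimal sketch of Flink SQL sink options matching the setup described in point 5 (MOR table, bucket index, metadata table disabled). Table name, schema, path and bucket count are illustrative, not our exact production configuration, and option names may differ slightly across Hudi versions:

```sql
CREATE TABLE hudi_orders_sketch (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///warehouse/hudi_orders_sketch',
  'table.type' = 'MERGE_ON_READ',             -- MOR table
  'index.type' = 'BUCKET',                    -- bucket index
  'hoodie.bucket.index.num.buckets' = '256',  -- illustrative bucket count
  'metadata.enabled' = 'false',               -- metadata table not used
  'compaction.async.enabled' = 'true'         -- the online-compaction variant
);
```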

@danny0405
Contributor

Hmm, I think I have figured out why Flink gets data loss here: Flink can actually flush multiple times into one log file within one commit, and that would definitely cause data loss here. Can we disable this change for Flink?

@danny0405
Contributor

Can we revert the PR first, or make a quick fix behind a configuration flag that is disabled by default?

@xuzifu666
Member Author

Can we revert the PR first, or make a quick fix behind a configuration flag that is disabled by default?

OK, I will revert the PR. @danny0405
