[SUPPORT] Data loss in Flink CDC into Hudi without any exception or other information #10542
Comments
Did you try 1.0? Does it fix the data loss?
Currently we have only tested it on 0.14.0/0.14.1, because the 1.0 version contains many problems, such as experimental configs that could regress functionality or performance before the GA release. Trying to fix it in the current version would be better. @danny0405
Maybe relevant to this fix: #10398
OK, we will try the PR. Thanks @danny0405
@xuzifu666 Were you able to resolve this issue with that PR? Please share the insights.
Hi, not yet. Data loss also occurs in 0.14.x; maybe we need to try the 1.0 version. @ad1happy2go
@ad1happy2go According to earlier feedback, was the data loss bug fixed in the 1.0 beta version? The bug is very serious: many businesses that require 100% data quality, such as CDC scenarios, cannot run with it. cc @danny0405
Thanks for the update @xuzifu666. But did you try that PR (#10398) on top of 0.14.1, as Danny suggested?
We had tried it; data loss still occurs. @ad1happy2go
@xuzifu666 Can you post your table configurations or the code snippet you are using to load the data? Did you try to reproduce this with a small dataset, or are you seeing this behaviour only with billions of records? I will also try, in case I can reproduce it.
@ad1happy2go @danny0405 Hi, recently we checked the data loss case on Hudi 1.0 beta: the problem has been fixed there, but the 0.1x versions still lose data. Thanks for your attention.
This has been resolved in Hudi 1.0 beta, so I am closing the issue.
May be caused by this PR: #9545
Hey @xuzifu666: Would greatly appreciate it if you can help us triage this.
Hi @nsivabalan, thanks for your attention. According to the conditions you raised, I list them as follows:
Hmm, I think I have figured out why Flink gets data loss here. Flink can actually flush one log file multiple times within a single commit, and that would definitely cause data loss here. Can we disable this PR for Flink?
Can we revert the PR first, or land a quick fix behind a configuration flag that is disabled by default?
OK, I will revert the PR. @danny0405
Tips before filing an issue
Describe the problem you faced
When we use Flink CDC into Hudi 0.14.0/0.14.1, data loss occurs: for example, data deleted in MySQL can still be found in Hudi, or data inserted into MySQL cannot be found in Hudi. More importantly, no exception is produced, so we have nothing to dig into for the root cause.
Scenarios:
All three conditions above can trigger the problem.
To Reproduce
Steps to reproduce the behavior:
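No reproduction snippet was attached to the report. For reference, a minimal Flink SQL pipeline of the kind described (MySQL CDC source streaming into a Hudi sink) might look like the sketch below. All table names, columns, paths, and connection values are placeholders, not the reporter's actual configuration, and the Hudi options shown are only a plausible subset.

```sql
-- Hypothetical MySQL CDC source (flink-sql-connector-mysql-cdc)
CREATE TABLE mysql_source (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',       -- placeholder
  'port' = '3306',
  'username' = 'user',            -- placeholder
  'password' = 'pass',            -- placeholder
  'database-name' = 'demo_db',    -- placeholder
  'table-name' = 'demo_table'     -- placeholder
);

-- Hypothetical Hudi sink table
CREATE TABLE hudi_sink (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_sink',   -- placeholder path
  'table.type' = 'MERGE_ON_READ'
);

-- Stream changes from MySQL into Hudi
INSERT INTO hudi_sink SELECT * FROM mysql_source;
```

Under this kind of setup, data loss would show up as rows in `hudi_sink` diverging from `mysql_source` after deletes or inserts on the MySQL side.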
Expected behavior
All data should be consistent with MySQL.
Environment Description
Flink version : 1.13.2
Hudi version : 0.14.0/0.14.1
Spark version : 3.2.0
Hive version : 1.1.0
Hadoop version : 3.2.1
Storage (HDFS/S3/GCS..) : HDFS
Running on Docker? (yes/no) :no
Additional context
Add any other context about the problem here.
Stacktrace
There is no stacktrace to dig into; the Flink and compaction jobs were fine at the time.