[Bug]: BigQuery IO Batch load using File_load causing the same job id ignoring inserts as the job_id is already completed #28219
Comments
We faced the same issue after switching to Dataflow Runner v2. A BigQuery load job with the same jobId is used for every batch write, so only the first batch is actually written. In addition to @yeshvantbhavnasi's logs, I have the following warnings:
Thanks for reporting this. It's clearly a bug and should be fixed.
This also suggests Runner v2 somehow generates smaller load jobs. Most of the time, a single window should only have one load job. Another issue to track.
I see, they are in different windows. We should also add the window to the job name.
There is a bug in Dataflow Runner V2 (see Line 254 in a636f25) causing duplicated BigQuery job ids. In Dataflow Runner V1, however, the pane index is increasing.
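To illustrate why a stuck pane index collides, here is a hedged sketch, loosely modeled on the job id format reported below; the method name and exact format are assumptions, not the actual Beam source:

```java
// Hypothetical sketch: a deterministic load-job id built from the job name, a hash
// of the destination, the partition number, and the pane index. If the runner always
// reports pane index 0 (and there is only one partition), every triggering produces
// the same id, and BigQuery treats the load job as already completed.
static String loadJobId(String jobName, String destinationHash,
                        int partition, long paneIndex) {
  return String.format("beam_bq_job_LOAD_%s_%s_%05d_%05d",
      jobName, destinationHash, partition, paneIndex);
}
```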
Either use the legacy runner (with streaming engine enabled), or stay on Dataflow Runner v2 and manually set sharding.
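A minimal sketch of the second workaround, assuming a streaming FILE_LOADS write (the input `rows` PCollection, table name, frequency, and shard count are placeholders):

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.joda.time.Duration;

// Manually setting sharding for a streaming FILE_LOADS write. When
// withTriggeringFrequency is used, the number of file shards (or auto-sharding)
// must be set explicitly; the values below are illustrative only.
rows.apply(
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")                 // placeholder table
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withTriggeringFrequency(Duration.standardMinutes(5)) // placeholder frequency
        .withNumFileShards(100)                               // manual sharding
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
```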
I did try using
You're right, sorry, I mixed up different test pipelines. I confirm that with withNumFileShards set the issue still exists. The problem does not need GroupIntoBatches to trigger.
The problem is that the downstream PCollection loses the pane info of the first GBK after a Reshuffle. The pipeline involved is: Trigger -> GBK -> Reshuffle -> Downstream. The Reshuffle happens here: Line 806 in 3ff66d3
After discussion we think the pane index here is either working as intended or undefined behavior. While a GBK will re-fire a pane, Reshuffle is implemented differently across runners. The current behavior is inconsistent among runners: is pane info preserved downstream for a pipeline with Trigger -> GBK -> Reshuffle -> Downstream?
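A minimal sketch of that question, assuming a simple keyed input (the `keyedInput` PCollection<KV<String, String>> and the trigger are illustrative): inspect the pane index downstream of the Reshuffle and see whether it still increases per firing.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Trigger -> GBK -> Reshuffle -> Downstream: does the downstream pane index
// still increase with each triggering, or is it reset/undefined after the Reshuffle?
keyedInput
    .apply(Window.<KV<String, String>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.ZERO))
    .apply(GroupByKey.create())
    .apply(Reshuffle.of())
    .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, Void>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // On Runner v1 this index increases per firing; on Runner v2 it was
        // observed stuck at 0, which is the inconsistency discussed above.
        System.out.println("pane index: " + c.pane().getIndex());
      }
    }));
```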
Opened #28272 for a fix: it adds randomness to the job id. The implication of a random job id is that if the bundle gets retried it will produce duplicates. Note that the Python SDK implementation did not consider pane info either, but it also uses a random job id.
I think there may be too much trial-and-error guessing here. We can reason about this and get it right. When you have an aggregation (either a GBK or a Combine), the elements coming out of that aggregation are uniquely identified by key + window + pane index. The pane index is how you can tell the difference between different triggerings. A Reshuffle has a trigger that always fires as fast as possible, so the pane index should increase with each element. But if it shuffles by a random key, the actual key that was shuffled by is dropped, so the pane index doesn't mean anything any more. The other problem is that Reshuffle also breaks the GBK result apart into individual elements again. There is no guarantee that these elements will always travel together in a bundle, so again the pane index is not a reliable thing.
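To make that identity concrete, here is a hedged sketch (the DoFn name and element types are assumptions) of deriving the key + window + pane index triple immediately after an aggregation, i.e. before any Reshuffle breaks the elements apart:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;

// Immediately after a GBK/Combine, each output element is uniquely identified by
// (key, window, pane index). This DoFn sketches deriving such an identifier; after
// a Reshuffle, the key and pane index are no longer reliable for this purpose.
class TriggeringIdFn extends DoFn<KV<String, Iterable<Long>>, String> {
  @ProcessElement
  public void process(ProcessContext c, BoundedWindow window) {
    String id = String.format("%s/%s/%d",
        c.element().getKey(), window, c.pane().getIndex());
    c.output(id);
  }
}
```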
Removed the 2.51.0 milestone as consensus is pending.
Question: if we do assume the behavior should be like I said in the design doc and like Reuven said - the exact identity function - would this change be needed or not? |
This change is then not needed.
Can we check if this can be prioritized?
For Dataflow v1 the behavior is already correct.
@kennknowles we cannot use Dataflow v1, as we use custom images for security, which are supported only on Dataflow v2.
Got it. This is a high priority. It is underway but takes some work. (Fixing it in the SDK does not necessarily fix it for all the runners, because they do customized things under the hood.)
To be clear: the proposed change would actually be a different behavior that we do not desire. The work here is to fix the Reshuffle implementations.
Hey @kennknowles, can you point us to the Reshuffle implementation in the codebase?
Ah, sorry I missed this message. This change is #28853, which is in Beam 2.54.0 and above.
Hi @kennknowles @Abacn, I am using a batch load job (moving data from GCS / Cloud Storage to BigQuery). The job shows as successfully run, but it writes only a small portion of the data to the BigQuery table, typically 2-10 rows out of the 1071 expected. Is this the same issue? Another version of this job with the exact same DynamicDestinations configuration runs successfully with a streaming setup (that pipeline moves data from Kafka to BigQuery).
Java SDK version:
Runner: tried to run with both of the below, but the job still writes only partial output to BQ
JobID looks like:
Also pasting the Dataflow job describe output here.
@pranjal5215 are you experiencing this both with and without Runner v2?
Is it possible to share the code here?
What happened?
We have the following code to set up the file_load (FILE_LOADS) operation for Dataflow streaming jobs, using GCS and file-notification based ingestion into BigQuery.
beam_bq_job_LOAD_defaultnetworkvpcflowlogslive_e78561b1b6e147e7abafe83b9314c7f1_41dd2dbb7ce5b6e717e5aadb4244b4e0_00001_00000-0
but after this, every time new files are added to the staging bucket, the new set of files has exactly 1 partition and c.pane.index appears to be 0, causing the same job_id and ignoring the file writes. See the screenshot: for different timestamps we have the same job_id, ignoring inserts.
Should we add a random UUID to the jobId logic to fix this issue here?
by changing this line to:
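The exact one-line change is not shown above and the change that actually landed in #28272 may differ; purely as an illustration of the idea, here is a hedged sketch with an assumed helper name:

```java
import java.util.UUID;

// Appending a random component so that repeated triggerings with the same partition
// and pane index no longer reuse an already-completed load job id. The trade-off
// noted in the discussion: if a bundle is retried, the retry gets a new id and can
// duplicate data.
static String randomizedLoadJobId(String baseJobId) {
  return baseJobId + "_" + UUID.randomUUID().toString().replace("-", "");
}
```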
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components