Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix upload_graph on v2 #32165

Merged
merged 7 commits into from
Aug 13, 2024
Merged

Fix upload_graph on v2 #32165

merged 7 commits into from
Aug 13, 2024

Conversation

damccorm
Copy link
Contributor

@damccorm damccorm commented Aug 13, 2024

Currently, if upload_graph is specified we error out when using runner v2. This is exacerbated because we auto-opt in large graphs to this option, so some pipelines can fail without a user even enabling this option.

This fixes the problem and does the following:

  1. Doesn't auto-opt in runner v2 pipelines
  2. Warns and removes the upload_graph option for runner v2 pipelines if it is explicitly set.

Fixes #32159


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@damccorm
Copy link
Contributor Author

Example run of real streaming jobs with this option - https://github.com/apache/beam/actions/runs/10371405701/job/28711629663 - I added it to the streaming test suite, ran the suite, then removed it after it was clear it worked.

@damccorm damccorm marked this pull request as ready for review August 13, 2024 15:13
@damccorm damccorm marked this pull request as draft August 13, 2024 15:13
@damccorm damccorm marked this pull request as ready for review August 13, 2024 15:24
@damccorm
Copy link
Contributor Author

R: @kennknowles @liferoad

Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@damccorm damccorm merged commit b2d26b6 into master Aug 13, 2024
18 checks passed
@damccorm damccorm deleted the users/damccorm/upload_graph_v2 branch August 13, 2024 15:54
@chamikaramj
Copy link
Contributor

I think there's a secondary fix needed (separate PR) where we just minimize the unnecessary steps submitted in the Dataflow job request when using Runner v2 so the the request does not go over the Dataflow RPC limits (Dataflow with Runner v2 do not use such steps anyways but we just submit them for historical reasons).

@damccorm
Copy link
Contributor Author

I think there's a secondary fix needed (separate PR) where we just minimize the unnecessary steps submitted in the Dataflow job request when using Runner v2 so the the request does not go over the Dataflow RPC limits (Dataflow with Runner v2 do not use such steps anyways but we just submit them for historical reasons).

I don't think we need this - upload_graph has never worked with Runner v2 and users don't seem to be running into this problem, so I don't think the problem persists in the v2 case.

There's also already some divergence here (e.g.

LOG.info("Skipping v1 transform replacements since job will run on v2.");
). There may be other places which need updating, but they at least don't seem problematic at this point

@kennknowles
Copy link
Member

I think these are the cases:

  • choose v2: we don't even do the v1 translation, and upload_graph is turned off / does nothing, portable pipeline staged to GCS
  • choose v1: upload_graph works and also we will set it automatically if necessary
  • leave up to the service: v1 graph follows the usual v1 path and portable pipeline staged to GCS

So there might be a corner case for a very large graph that does not choose v1 or v2. In this case, we will automatically turn on upload_graph and stage both the v1 dataflow graph and the portable pipeline to GCS. I am not certain if this works. What I said before would suggest it doesn't but I was probably wrong. Basically if the staging locations or service-side code paths have collisions then it won't work, otherwise it will.

@chamikaramj
Copy link
Contributor

stage both the v1 dataflow graph and the portable pipeline to GCS

Note that this will require service changes since I believe currently we use the same Dataflow job request property for both.

But agree that if we are not hitting request RPC errors with Runner v2, trimming steps in the the Dataflow job request for Runner v2 might not be a priority.

@kennknowles
Copy link
Member

If the user explicitly requests runner v2, we already clear the steps from the REST API request.

@kennknowles
Copy link
Member

(that was what caused the bug)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: upload_graph option does not work with Dataflow Java Runner v2
3 participants