
[CT-1015] job_retries not obeyed when job_execution_timeout_seconds is exceeded #263

Closed
jars opened this issue Aug 9, 2022 · 11 comments · Fixed by #1431
Labels: bug, help_wanted


jars commented Aug 9, 2022

Describe the bug

The job_retries BigQuery job configuration setting is not obeyed in conjunction with the job_execution_timeout_seconds setting. Instead of retrying a job that times out, dbt immediately fails with "Operation did not complete within the designated timeout".

Steps To Reproduce

First, use the official dbt Docker image and run dbt init to initialise a project:

docker build -t dbt-bigquery-issue-263 --no-cache -<<'EOF'
FROM ghcr.io/dbt-labs/dbt-bigquery:1.1.0
WORKDIR /tmp
RUN dbt init --skip-profile-setup issue_263
EOF

Make a profiles.yml file similar to this. Take special note of the job_execution_timeout_seconds value, which I've exaggerated to 1 second to force a query timeout to occur. Also note that job_retries is set to 5. Fill in the project and keyfile.

issue_263:
  target: default
  outputs:
    default:
      type: bigquery
      method: service-account
      project: '<your project>'
      dataset: 'issue_263'
      threads: 1
      keyfile: /root/keyfile.json
      job_execution_timeout_seconds: 1
      job_retry_deadline_seconds: 200
      job_creation_timeout_seconds: 30
      job_retries: 5
      priority: interactive
      location: US

Then run the built Docker image, executing dbt run on the example my_first_dbt_model model. Mount your profiles.yml and keyfile.json into the container:

docker run  \
-v `pwd`/profiles.yml:/root/.dbt/profiles.yml \
-v `pwd`/keyfile.json:/root/keyfile.json dbt-bigquery-issue-263 run \
--project-dir /tmp/issue_263 -m my_first_dbt_model

Expected behavior

The model my_first_dbt_model should fail on the timeout and be retried up to five times.

Screenshots and log output

After dbt errors/exits, the job continues running on BigQuery, and eventually succeeds:

(Screenshot: BigQuery console showing the job still running after dbt exits, and eventually completing successfully.)

Additional Context

The retries are obeyed in the event of other BigQuery failures. For example, if you introduce an intentional syntax error in your model, it will retry up to job_retries.

System information

The output of dbt --version:

Core:
  - installed: 1.1.0
  - latest:    1.2.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 1.1.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using: Using the official Docker Image...

root@7fbf851eb622:/# lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

The output of python --version: Python 3.10.3

@jars jars added bug Something isn't working triage labels Aug 9, 2022
@github-actions github-actions bot changed the title job_retries not obeyed when job_execution_timeout_seconds is exceeded [CT-1015] job_retries not obeyed when job_execution_timeout_seconds is exceeded Aug 9, 2022
@codigo-ergo-sum

We have been experiencing the same issue. It also affects dbt Cloud. Thank you @jars for doing the work to reproduce this issue!


jtcohen6 commented Aug 9, 2022

Thanks @jars!

There's a specific bug here: dbt should respect the TimeoutError when deciding whether to retry a job. If it times out, it should retry. That should be as simple as adding the TimeoutError class to the set of RETRYABLE_ERRORS here:

RETRYABLE_ERRORS = (
    google.cloud.exceptions.ServerError,
    google.cloud.exceptions.BadRequest,
    ConnectionResetError,
    ConnectionError,
)
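In other words, something like this (untested sketch; depending on which exception the client actually raises on timeout, this may need to be concurrent.futures.TimeoutError rather than the built-in TimeoutError):

RETRYABLE_ERRORS = (
    google.cloud.exceptions.ServerError,
    google.cloud.exceptions.BadRequest,
    ConnectionResetError,
    ConnectionError,
    TimeoutError,  # sketch: treat the job timeout as retryable
)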

At least, I think so — but then after thinking a bit about it, I'm less sure! (If it does time out, and the query keeps going, and actually succeeds — but then dbt retries — don't we risk running the same query / performing the same operation twice? Can we lean on dbt's idempotent materializations here? Not to mention the expense...)

Another option we could take here: if job_execution_timeout_seconds is reached, dbt polls the BigQuery API, checking on the job ID, to see whether the job is still running / still in progress, even though it has exceeded the user-configured job_execution_timeout_seconds. But then what? Should dbt actually go cancel the job? Just keep waiting up to job_retry_deadline_seconds (the total amount of time dbt should spend trying to run this model, inclusive of all retries)?
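For illustration only, polling by job ID after the timeout might look roughly like this (the helper name and wiring are hypothetical, not how the adapter is structured today):

from google.cloud import bigquery

def handle_timed_out_job(client: bigquery.Client, job_id: str, location: str = "US"):
    # Hypothetical sketch: after job_execution_timeout_seconds fires, look the
    # job up by ID to see whether it is still running server-side.
    job = client.get_job(job_id, location=location)
    if job.state == "DONE":
        # The query finished even though dbt stopped waiting; retrying would
        # run the same statement again.
        return job
    # Otherwise: keep waiting up to job_retry_deadline_seconds, or cancel the
    # job before retrying; that's the open question above.
    job.cancel()
    return job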

I'd prefer to avoid implementing more logic here than we need.

As a larger point, we are looking to replace a lot of our current timeout/retry code in dbt-bigquery (legacy + organically developed) with the higher-level retry logic that Google's client libraries make available. As a first example: #230, #231. If users provide granular configuration around timeout + retry, we should still do our best to respect it (by passing it into Google's decorator) — but we should be moving in the direction of handling much, much less of this logic ourselves.
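To give a flavour of that direction (a sketch only, with illustrative values; not what dbt-bigquery ships today), the client library's own retry primitive can carry both the predicate and the deadline:

from google.api_core import exceptions, retry

# Sketch: a retry policy built from google-api-core primitives. The predicate
# and deadline here are illustrative, mirroring the configs discussed in this issue.
RETRY_POLICY = retry.Retry(
    predicate=retry.if_exception_type(
        exceptions.ServerError,
        exceptions.BadRequest,
        ConnectionError,
    ),
    deadline=200,  # e.g. job_retry_deadline_seconds
)

# The policy can then be handed to the client when waiting on a job, e.g.:
#   query_job.result(retry=RETRY_POLICY, timeout=job_execution_timeout_seconds)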

@jtcohen6 jtcohen6 added help_wanted Extra attention is needed and removed triage labels Aug 9, 2022

jars commented Aug 9, 2022

Thanks @jtcohen6, you've given us some ideas to chew on for the next couple of days.

Do you have insight into the priority of #231, and whether it would address this specific issue? Or do you think #231 would swap out _retry_and_handle for the Retry decorator without behaviour changes... followed by a separate PR to deal with this timeout issue?


hui-zheng commented Aug 11, 2022

As a larger point, we are looking to replace a lot of our current timeout/retry code in dbt-bigquery (legacy + organically developed) with the higher-level retry logic that Google's client libraries make available. As a first example: #230, #231. If users provide granular configuration around timeout + retry, we should still do our best to respect it (by passing it into Google's decorator) — but we should be moving in the direction of handling much, much less of this logic ourselves.

@jtcohen6, I totally agree!
I think the solution for #231 could address this issue, and it's the ideal solution.

Do you know if anyone is working on #231, and its priority?

This is an important issue for us. I am happy to contribute to the implementation.

@lillikulak

Hi all! Wanted to check in on this issue: is a fix in the works? We're running into this problem as well -- so glad to see there's a thoughtful discussion about it!


Fleid commented Dec 7, 2022

@hui-zheng sorry for the delayed answer.

If you think you can contribute, please do so on #231, even if it's to scope work or a proof of concept.
We are slowly catching up on the backlog, as the team is ramping up, but any help is appreciated ;)


github-actions bot commented Jun 6, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@lillikulak

@Fleid how do we get this issue back onto the team's backlog? Can we reopen this one?

@cbyington

Flagging that we at Superhuman are continuing to encounter this issue as @lillikulak mentions above - any update on progress toward solving this problem?

@FaatzRicco

We are facing the same issue regularly. I would also be very happy if this problem is actually in scope.


FaatzRicco commented Dec 16, 2024

That's a nice Christmas present 🎄 Thanks!
