[BUG] [FlyteAdmin] [scheduledWorkflowExecutor] SQS subscriber client stopped working #198

Closed
rstanevich opened this issue Mar 9, 2020 · 8 comments
Labels: bug (Something isn't working), pri1

@rstanevich
Contributor

rstanevich commented Mar 9, 2020

Describe the bug

The FlyteAdmin scheduledWorkflowExecutor ran successfully for ~2 weeks with no redeployment.

But one day:

  • The SQS client shut down
  • The async scheduler stopped working
  • FlyteAdmin kept running without a scheduledWorkflowExecutor
  • CloudWatch Events kept pushing messages to the queue, so the queue kept growing

The log I got before the scheduler stopped working:

{"json":{"src":"base.go:138"},"level":"info","msg":"Successfully started running the scheduled workflow executor","ts":"2020-03-07T05:39:04Z"}

https://github.com/lyft/flyteadmin/blob/60b4c876ea105d4c79e3cad7d56fde6b9c208bcd/pkg/rpc/adminservice/base.go#L138L139
Normally, this log does not appear when FlyteAdmin starts. If I understand correctly, it should never appear, because that line should not be reachable.
Unfortunately, there are no other logs.

Expected behavior

Flyte component

  • [+] FlyteAdmin (Control Plane service) UPD: image: docker.io/lyft/flyteadmin:v0.2.1

To Reproduce
Steps to reproduce the behavior:

  1. Scheduler config in FlyteAdmin:
scheduler:
  eventScheduler:
    scheme: aws
    region: "us-east-1"
    scheduleRole: "arn:aws:iam::{{ AWS_ACCOUNT_ID }}:role/flyte_cron_scheduler_role"
    targetName: "arn:aws:sqs:us-east-1:{{ AWS_ACCOUNT_ID }}:flyte-cron-scheduler-queue"
  workflowExecutor:
    scheme: aws
    region: "us-east-1"
    scheduleQueueName: "flyte-cron-scheduler-queue"
    accountId: "{{ AWS_ACCOUNT_ID }}"  
  2. Run FlyteAdmin and wait for this message (I don't know what caused it or how to reproduce it) :)
    {"json":{"src":"base.go:138"},"level":"info","msg":"Successfully started running the scheduled workflow executor","ts":"2020-03-07T05:39:04Z"}

Environment
Flyte component

  • [+] Cloud hosted
    • [+] AWS

Others
Restarting the FlyteAdmin pod initialized a new scheduledWorkflowExecutor, and the queued SQS events were then executed.

Thank you!

@rstanevich rstanevich added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Mar 9, 2020
@kumare3 kumare3 added pri1 and removed untriaged This issues has not yet been looked at by the Maintainers labels Mar 10, 2020
@kumare3
Contributor

kumare3 commented Mar 10, 2020

Hi @rstanevich, thank you for the bug report. This is highly unexpected and undesirable behavior. What version of FlyteAdmin are you running?

@rstanevich
Contributor Author

Thanks @kumare3 for your response!
We use this Docker image: docker.io/lyft/flyteadmin:v0.2.1

@rstanevich
Contributor Author

Looks like this issue is a duplicate of #88.

@kumare3
Contributor

kumare3 commented Mar 11, 2020

That is right. @katrogan, did you get a chance to look at #88?

@kumare3 kumare3 assigned katrogan and unassigned kumare3 and anandswaminathan Mar 11, 2020
@anandswaminathan
Contributor

@kumare3 I looked into this a bit and am unable to reproduce it. It is hard for me to say which part is failing, as there are several integration points. I believe the only option is to add more debug logs and try to repro.

cc @katrogan if you were able to find the root cause.

@kumare3
Contributor

kumare3 commented Mar 11, 2020

Yes, @katrogan confirmed she is on it. I will close this one and keep the root issue open, @rstanevich.

@kumare3 kumare3 closed this as completed Mar 11, 2020
@katrogan
Contributor

It's on the backlog, but I'm not actively taking a look. @anandswaminathan I honestly don't remember anything from when I looked at this when l5 ran into it. Feel free to take this over if you want.

@rstanevich
Contributor Author

rstanevich commented Dec 9, 2020

I'd like to share our case, which is strange enough.

from the logs:

{"json":{"src":"workflow_executor.go:260"},"level":"error","msg":"Gizmo subscriber closed channel with err: [RequestError: send request failed\ncaused by: Post https://sqs.us-east-1.amazonaws.com/: read tcp 10.200.29.153:38402-\u003e52.46.130.43:443: read: connection reset by peer]","ts":"2020-12-07T22:51:52Z"}
{"json":{"src":"workflow_executor.go:178"},"level":"error","msg":"error with workflow executor err: [RequestError: send request failed\ncaused by: Post https://sqs.us-east-1.amazonaws.com/: read tcp 10.200.29.153:38402-\u003e52.46.130.43:443: read: connection reset by peer] ","ts":"2020-12-07T22:51:52Z"}
{"json":{"src":"workflow_executor.go:176"},"level":"warning","msg":"Starting workflow executor","ts":"2020-12-07T22:52:22Z"}

We can see that at 22:51 the connection to SQS was reset, and at 22:52 the next subscriber loop started (according to https://github.com/lyft/flyteadmin/blob/v0.3.7/pkg/async/schedule/aws/workflow_executor.go#L174-L183)

But 30 minutes later a new message appeared in SQS, and this subscriber did not handle it. Messages from the queue were processed only after the flyteadmin pod was restarted. See the screenshot:
[Screenshot: Screen Shot 2020-12-09 at 15 44 20]

Do you have any idea what could have happened to this client? I am also trying to figure out the reason. I am not familiar with the gizmo client; maybe the client has a stop method or something similar that was triggered?
WDYT? cc @katrogan, thank you in advance!
