Timed out waiting for UP message from ForkProcess #1337

Closed
gzhukov opened this issue Sep 22, 2023 · 7 comments · Fixed by #1345

@gzhukov

gzhukov commented Sep 22, 2023

Hello,
We tried to upgrade Querybook from 2.4.0 to 3.28.0. Everything looked good (no errors on the webserver, migrations ran, etc.), but all requests to the query engine got stuck and the following errors appeared on the worker:

[2023-09-22 11:09:31,258: INFO/MainProcess] celery@bc18e9efe089 ready.
[2023-09-22 11:09:31,262: INFO/MainProcess] Task tasks.run_query.run_query_task[01756288-9a00-4d16-b7e8-17c6fb897cb7] received
[2023-09-22 11:09:31,264: INFO/MainProcess] Task tasks.run_query.run_query_task[3d377c95-0717-415e-b3a5-779d6d0bf3a0] received
[2023-09-22 11:09:32,307: INFO/ForkPoolWorker-1] POST http://querybook-elasticsearch:9200/search_query_executions_v1/_update/87508 [status:201 request:0.927s]
[2023-09-22 11:09:32,315: INFO/MainProcess] Task tasks.log_query_per_table.log_query_per_table_task[3c1a2ecd-e52b-49fd-aee6-117e25363f05] received
[2023-09-22 11:09:32,316: INFO/ForkPoolWorker-1] Task tasks.run_query.run_query_task[01756288-9a00-4d16-b7e8-17c6fb897cb7] succeeded in 1.0514106303453445s: (3, 87508)
[2023-09-22 11:09:36,553: ERROR/MainProcess] Timed out waiting for UP message from <ForkProcess(ForkPoolWorker-151, started daemon)>
[2023-09-22 11:09:36,560: ERROR/MainProcess] Process 'ForkPoolWorker-151' pid:232 exited with 'signal 9 (SIGKILL)'
[2023-09-22 11:09:40,674: ERROR/MainProcess] Timed out waiting for UP message from <ForkProcess(ForkPoolWorker-152, started daemon)>
[2023-09-22 11:09:40,680: ERROR/MainProcess] Process 'ForkPoolWorker-152' pid:233 exited with 'signal 9 (SIGKILL)'

We tried changing the query engine and starting with an empty Redis and Elasticsearch, but without any results.
We can find the task_id in our Redis:

127.0.0.1:6379[12]> keys *
1) "celery-task-meta-01756288-9a00-4d16-b7e8-17c6fb897cb7"
2) "unacked"
3) "_kombu.binding.celeryev"
4) "celery-task-meta-dbfaa772-15ef-4d20-93b4-b9723564270a"
5) "_kombu.binding.celery"
6) "unacked_index"
7) "_kombu.binding.celery.pidbox"
127.0.0.1:6379[12]> get celery-task-meta-01756288-9a00-4d16-b7e8-17c6fb897cb7
"{\"status\": \"SUCCESS\", \"result\": [3, 87508], \"traceback\": null, \"children\": [[[\"3c1a2ecd-e52b-49fd-aee6-117e25363f05\", null], null]], \"date_done\": \"2023-09-22T08:09:32.315638\", \"task_id\": \"01756288-9a00-4d16-b7e8-17c6fb897cb7\"}"

Could you please give me a hint about this issue?

@mlivirov

mlivirov commented Sep 24, 2023

+1
I faced the same issue in a prod deployment, but on my local machine with a dev build it works as expected.

There is different behaviour in the worker startup script depending on the presence of the production flag.
https://github.com/pinterest/querybook/blob/master/querybook/server/tasks/all_tasks.py

So as a workaround I've added production=false to the worker env variables, which helped. I'm not sure what side effects it may have.
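
For reference, here is a minimal sketch of the kind of production-gated startup path being described. This is an illustration only, not the actual all_tasks.py code; the environment variable name and the cleanup task name are assumptions on my part.

# Hypothetical sketch of a production-gated startup path, NOT the actual
# Querybook all_tasks.py. Env variable name and task name are assumed.
import os

from celery import Celery

app = Celery("querybook")

# Only register the periodic query-execution cleanup when the production flag
# is set; with production=false this branch is skipped entirely.
if os.environ.get("production", "false").lower() == "true":
    app.conf.beat_schedule = {
        "clean-up-query-executions": {
            "task": "tasks.clean_up_query_execution",  # assumed task name
            "schedule": 60 * 60 * 24,  # once a day, for illustration
        }
    }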

@gzhukov
Author

gzhukov commented Sep 24, 2023

Thanks. I have a prod environment too.

@adamstruck
Contributor

I experienced the same issue. Adding kombu==5.3.1 to requirements/base.txt fixed the issue in my production deployment.
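
For anyone verifying the pin, here is a quick way to check which celery and kombu versions actually ended up inside the worker container (standard library only):

# Print the installed celery and kombu versions; after the pin, kombu should
# report 5.3.1.
from importlib.metadata import version

print("celery:", version("celery"))
print("kombu:", version("kombu"))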

@jczhong84
Collaborator

jczhong84 commented Sep 26, 2023

@mlivirov @adamstruck thanks for sharing findings.

Regarding the production=false flag in all_tasks.py, the only difference is that the query execution cleanup runs when production is true. This reminds me of what @baumandm mentioned in https://querybook.slack.com/archives/CHCNR2Y5B/p1695153621351919.

Regarding the kombu package, it seems to be a dependency of celery, and we did do a celery version upgrade. Does changing kombu to 5.3.1 work for other people?

@baumandm
Contributor

I ran into this issue running locally (via make) only with production=true set, while I was investigating the worker startup issue.

Fortunately we haven't seen it in our production instance, but we are building our own Docker image and it's possible that the requirements are slightly different.

@czgu
Collaborator

czgu commented Oct 11, 2023

Hey all, we faced a similar issue to this. It turns out that running celery with -P gevent resolves it. I am defaulting workers to the gevent pool to avoid this issue in the future.
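
For reference, a minimal sketch of switching the worker pool to gevent via Celery configuration, equivalent to passing -P gevent on the command line. The app name and broker URL below are placeholders rather than Querybook's actual settings, and the gevent package must be installed for this pool to be available.

# Equivalent of running the worker with `-P gevent`, expressed as configuration.
# App name and broker URL are placeholders.
from celery import Celery

app = Celery("querybook", broker="redis://localhost:6379/0")
app.conf.worker_pool = "gevent"  # use the gevent pool instead of prefork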

@czgu
Collaborator

czgu commented Oct 12, 2023

It turns out this should be the issue celery/kombu#1785.
We will pin it to 5.3.1 for now.
