fix(celery): Reset DB connection pools for forked worker processes #13350
Conversation
LGTM
Codecov Report

```
@@            Coverage Diff             @@
##           master   #13350      +/-   ##
==========================================
- Coverage   77.12%   77.10%   -0.02%
==========================================
  Files         881      881
  Lines       45502    45507       +5
  Branches     5447     5449       +2
==========================================
- Hits        35093    35090       -3
- Misses      10286    10293       +7
- Partials      123      124       +1
```
Nice! But does this happen even when using `NullPool` on the workers? Or do you think there are still uses of `QueuePool` created by Flask-SQLAlchemy at the worker level?
@dpgaspar the workers, when accessing the metadata DB, should use: https://github.com/apache/superset/blob/master/superset/utils/celery.py#L33 Let's talk about this?
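For context, the helper linked above opens a short-lived session for worker tasks, optionally backed by `NullPool` so that no connections are pooled at all. The sketch below is an illustrative approximation, not the actual Superset code — the database URI, function body, and error handling here are assumptions:

```python
from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.pool import NullPool

# Assumed URI for illustration; Superset reads SQLALCHEMY_DATABASE_URI
# from its configuration instead.
DB_URI = "sqlite://"


@contextmanager
def session_scope(nullpool):
    """Yield a session scoped to a block of work.

    With nullpool=True, connections are opened and closed per use rather
    than pooled, so a forked worker never reuses a connection (and its
    underlying file descriptor) inherited from the parent process.
    """
    if nullpool:
        engine = create_engine(DB_URI, poolclass=NullPool)
    else:
        engine = create_engine(DB_URI)
    session = sessionmaker(bind=engine)()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
```

A task would then wrap its metadata-DB access in `with session_scope(nullpool=True) as session:` so nothing pooled survives past the block.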
Talked with @dpgaspar and we agree this is a good first step, and more DB connection pool investigation is required. Merging.
…pache#13350)
* Reset sqlalchemy connection pool on celery process fork
* Fix race condition with async chart loading state
* pylint: ignore
* prettier
SUMMARY
Adds a listener for the `worker_process_init` Celery signal that disposes of and resets the SQLAlchemy connection pool inherited by the forked process. Resolves the intermittent `sqlalchemy.exc.NoSuchColumnError` reported in #10530 and #12766, and the error reported in #9860.

This fix is primarily related to the default `prefork` Celery execution pool, but was also tested with other pool invocations. This configuration was tested with async queries enabled to place load on the Celery workers, in both standalone and Docker-based workflows.
This PR also includes a fix for a client-side race condition in loading charts asynchronously (fixes #12913).
References:
https://docs.sqlalchemy.org/en/13/core/connections.html#engine-disposal
https://www.yangster.ca/post/not-the-same-pre-fork-worker-model/
TEST PLAN
Asynchronous tasks should run without `sqlalchemy.exc.NoSuchColumnError` when Celery is run in `prefork` mode. See #10530 and #12766 for reproducibility.

ADDITIONAL INFORMATION