Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Optimize fetching samples logic #25995

Merged

Conversation

john-bodley
Copy link
Member

@john-bodley john-bodley commented Nov 15, 2023

SUMMARY

At Airbnb (as mentioned previously) we have an uber wide dataset (circa thousands of columns) which is can be somewhat problematic for certain workflows. One such workflow is fetching sample data—which is used in explore and drill-to-detail.

The existing logic constructs and executes a query for fetching both the sample data and COUNT(*) where the former is rather complex—relying on a very expensive SQLAlchemy query to fetch all the columns—before processing the results.

This PR introduces an optimization which first constructs and executes the COUNT(*) query (which is both cheap to construct and execute) and only proceeds to the sample data if the first query succeeds. Given that (in our formulation) the COUNT(*) aggregate is not supported the endpoint returns in a few seconds as opposed from a few minutes. The TL;DR is we should fail early if possible.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

BEFORE

Screenshot 2023-11-15 at 2 16 57 PM

AFTER

Screenshot 2023-11-15 at 2 17 15 PM

TESTING INSTRUCTIONS

Existing unit tests.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@john-bodley john-bodley force-pushed the john-bodley--optimize-get-samples branch 2 times, most recently from 475da36 to f91be05 Compare November 15, 2023 22:41
sample_data = samples_instance.get_payload()["queries"][0]

if sample_data.get("status") == QueryStatus.FAILED:
QueryCacheManager.delete(sample_data.get("cache_key"), CacheRegion.DATA)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure why we only delete the sample data cache key, however I thought it was best to preserve the existing logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "count star" might be a heavy workload in some db, it trigger full table scan and aggregation, so just removed "sample_data" but keep "count star", it means that even though "sample_data" query is failed, count_star_data is cached still.

I think the new logic should remove cache key if the "count star" query is failed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and...another think is that we should customize count(*) in different db.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhaoyongjie for the comment. Thinking about this more, if a query fails then there’s shouldn’t be anything cached. Hence, in the new sequential formulation, if the COUNT(*) query succeeds but the sample data query fails we should actually be removing the cached result from the former.

Copy link
Member

@justinpark justinpark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@john-bodley john-bodley force-pushed the john-bodley--optimize-get-samples branch from f91be05 to 706d8ab Compare November 16, 2023 00:23
@john-bodley john-bodley merged commit 326ac4a into apache:master Nov 16, 2023
29 checks passed
@john-bodley john-bodley added the v3.0 Label added by the release manager to track PRs to be included in the 3.0 branch label Nov 16, 2023
@john-bodley john-bodley deleted the john-bodley--optimize-get-samples branch November 16, 2023 23:42
michael-s-molina pushed a commit that referenced this pull request Nov 20, 2023
saghatelian added a commit to 10webio/superset that referenced this pull request Nov 27, 2023
* fix(sqllab): reinstate "Force trino client async execution" (apache#25680)

* fix: remove unnecessary redirect (apache#25679)

(cherry picked from commit da42bf2)

* fix(chore): dashboard requests to database equal the number of slices it has (apache#24709)

(cherry picked from commit 75a7431)

* fix: bump to FAB 4.3.9 remove CSP exception (apache#25712)

(cherry picked from commit 8fb0c8d)

* fix(horizontal filter label): show full tooltip with ellipsis (apache#25732)

(cherry picked from commit e4173d9)

* fix: Revert "fix(Charts): Set max row limit + removed the option to use an empty row limit value" (apache#25753)

(cherry picked from commit e2fe967)

* fix: dataset update uniqueness (apache#25756)

(cherry picked from commit c7f8d11)

* fix(sqllab): slow pop datasource query (apache#25741)

(cherry picked from commit 2a2bc82)

* fix: allow for backward compatible errors (apache#25640)

* fix: DB-specific quoting in Jinja macro (apache#25779)

(cherry picked from commit 5659c87)

* fix: Revert "fix: Apply normalization to all dttm columns (apache#25147)" (apache#25801)

* fix: Resolve issue apache#24195 (apache#25804)

(cherry picked from commit 8737a8a)

* fix(SQL field in edit dataset modal): display full sql query (apache#25768)

(cherry picked from commit 1eba712)

* fix(sqllab): infinite fetching status after results are landed (apache#25814)

(cherry picked from commit 3f28eeb)

* fix: Fires onChange when clearing all values of single select (apache#25853)

(cherry picked from commit 8061d5c)

* fix: the temporal x-axis results in a none time_range. (apache#25429)

Co-authored-by: Elizabeth Thompson <eschutho@gmail.com>
(cherry picked from commit ae619b1)

* fix(table chart): Show Cell Bars correctly apache#25625 (apache#25707)

(cherry picked from commit 916f7bc)

* fix: remove `update_charts_owners` (apache#25843)

* fix(charts): Time grain is None when dataset uses Jinja (apache#25842)

(cherry picked from commit 7536dd1)

* fix: Saving Mixed Chart with dashboard filter applied breaks adhoc_filter_b (apache#25877)

(cherry picked from commit 268c1dc)

* fix: database version field (apache#25898)

(cherry picked from commit 06ffcd2)

* fix: trino cursor (apache#25897)

(cherry picked from commit cdb18e0)

* chore: Updates CHANGELOG.md for 3.0.2

* fix(trino): allow impersonate_user flag to be imported (apache#25872)

Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com>
(cherry picked from commit 458be8c)

* fix(table): Double percenting ad-hoc percentage metrics (apache#25857)

(cherry picked from commit 784a478)

* fix(sqllab): invalid sanitization on comparison symbol (apache#25903)

(cherry picked from commit 581d3c7)

* fix: update flask-caching to avoid breaking redis cache, solves apache#25339 (apache#25947)

Co-authored-by: Ville Brofeldt <33317356+villebro@users.noreply.github.com>

* fix: always denorm column value before querying values (apache#25919)

* chore(colors): Updating Airbnb brand colors (apache#23619)

(cherry picked from commit 6d8424c)

* fix: naming denomalized to denormalized in helpers.py (apache#25973)

(cherry picked from commit 5def416)

* fix(helm): Restart all related deployments when bootstrap script changed (apache#25703)

* fix(rls): Update text from tables to datasets in RLS modal (apache#25997)

(cherry picked from commit 210f1f8)

* fix: Make Select component fire onChange listener when a selection is pasted in (apache#25993)

(cherry picked from commit 5fccf67)

* fix(explore): redandant force param (apache#25985)

(cherry picked from commit e7a1876)

* chore: Optimize fetching samples logic (apache#25995)

(cherry picked from commit 326ac4a)

* fix(native filters): rendering performance improvement by reduce overrendering (apache#25901)

(cherry picked from commit e1d73d5)

* fix: update FAB to 4.3.10, Azure user info fix (apache#26037)

(cherry picked from commit 628cd34)

* chore: Updates CHANGELOG.md for 3.0.2 (rc2)

---------

Co-authored-by: Rob Moore <giftig@users.noreply.github.com>
Co-authored-by: Igor Khrol <igor.khrol@automattic.com>
Co-authored-by: Stepan <66589759+Always-prog@users.noreply.github.com>
Co-authored-by: Daniel Vaz Gaspar <danielvazgaspar@gmail.com>
Co-authored-by: Ross Mabbett <92495987+rtexelm@users.noreply.github.com>
Co-authored-by: Geido <60598000+geido@users.noreply.github.com>
Co-authored-by: Beto Dealmeida <roberto@dealmeida.net>
Co-authored-by: JUST.in DO IT <justin.park@airbnb.com>
Co-authored-by: Elizabeth Thompson <eschutho@gmail.com>
Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com>
Co-authored-by: Michael S. Molina <70410625+michael-s-molina@users.noreply.github.com>
Co-authored-by: mapledan <mapledan829@gmail.com>
Co-authored-by: Arko <90512504+SA-Ark@users.noreply.github.com>
Co-authored-by: Antonio Rivero <38889534+Antonio-RiveroMartnez@users.noreply.github.com>
Co-authored-by: Kamil Gabryjelski <kamil.gabryjelski@gmail.com>
Co-authored-by: Michael S. Molina <michael.s.molina@gmail.com>
Co-authored-by: FGrobelny <150029280+FGrobelny@users.noreply.github.com>
Co-authored-by: Giacomo Barone <46573388+ggbaro@users.noreply.github.com>
Co-authored-by: Ville Brofeldt <33317356+villebro@users.noreply.github.com>
Co-authored-by: Hugh A. Miles II <hughmil3s@gmail.com>
Co-authored-by: josedev-union <70741025+josedev-union@users.noreply.github.com>
Co-authored-by: yousoph <sophieyou12@gmail.com>
Co-authored-by: Jack Fragassi <jfragassi98@gmail.com>
josedev-union pushed a commit to Ortege-xyz/studio that referenced this pull request Jan 22, 2024
cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024
@mistercrunch mistercrunch added 🍒 3.0.2 🍒 3.0.3 🍒 3.0.4 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 3.1.0 labels Mar 8, 2024
sfirke pushed a commit to sfirke/superset that referenced this pull request Mar 22, 2024
Rozelin-dc pushed a commit to Rozelin-dc/superset that referenced this pull request May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/S v3.0 Label added by the release manager to track PRs to be included in the 3.0 branch 🍒 3.0.2 🍒 3.0.3 🍒 3.0.4 🚢 3.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants