chore: Optimize fetching samples logic #25995

john-bodley · 2023-11-15T22:37:27Z

SUMMARY

At Airbnb (as mentioned previously) we have an uber wide dataset (circa thousands of columns) which is can be somewhat problematic for certain workflows. One such workflow is fetching sample data—which is used in explore and drill-to-detail.

The existing logic constructs and executes a query for fetching both the sample data and COUNT(*) where the former is rather complex—relying on a very expensive SQLAlchemy query to fetch all the columns—before processing the results.

This PR introduces an optimization which first constructs and executes the COUNT(*) query (which is both cheap to construct and execute) and only proceeds to the sample data if the first query succeeds. Given that (in our formulation) the COUNT(*) aggregate is not supported the endpoint returns in a few seconds as opposed from a few minutes. The TL;DR is we should fail early if possible.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

BEFORE

AFTER

TESTING INSTRUCTIONS

Existing unit tests.

ADDITIONAL INFORMATION

Has associated issue:
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

john-bodley · 2023-11-15T22:42:49Z

superset/views/datasource/utils.py

+        sample_data = samples_instance.get_payload()["queries"][0]
+
+        if sample_data.get("status") == QueryStatus.FAILED:
+            QueryCacheManager.delete(sample_data.get("cache_key"), CacheRegion.DATA)


I'm not entirely sure why we only delete the sample data cache key, however I thought it was best to preserve the existing logic.

The "count star" might be a heavy workload in some db, it trigger full table scan and aggregation, so just removed "sample_data" but keep "count star", it means that even though "sample_data" query is failed, count_star_data is cached still.

I think the new logic should remove cache key if the "count star" query is failed.

and...another think is that we should customize count(*) in different db.

Thanks @zhaoyongjie for the comment. Thinking about this more, if a query fails then there’s shouldn’t be anything cached. Hence, in the new sequential formulation, if the COUNT(*) query succeeds but the sample data query fails we should actually be removing the cached result from the former.

justinpark

LGTM

(cherry picked from commit 326ac4a)

* fix(sqllab): reinstate "Force trino client async execution" (apache#25680) * fix: remove unnecessary redirect (apache#25679) (cherry picked from commit da42bf2) * fix(chore): dashboard requests to database equal the number of slices it has (apache#24709) (cherry picked from commit 75a7431) * fix: bump to FAB 4.3.9 remove CSP exception (apache#25712) (cherry picked from commit 8fb0c8d) * fix(horizontal filter label): show full tooltip with ellipsis (apache#25732) (cherry picked from commit e4173d9) * fix: Revert "fix(Charts): Set max row limit + removed the option to use an empty row limit value" (apache#25753) (cherry picked from commit e2fe967) * fix: dataset update uniqueness (apache#25756) (cherry picked from commit c7f8d11) * fix(sqllab): slow pop datasource query (apache#25741) (cherry picked from commit 2a2bc82) * fix: allow for backward compatible errors (apache#25640) * fix: DB-specific quoting in Jinja macro (apache#25779) (cherry picked from commit 5659c87) * fix: Revert "fix: Apply normalization to all dttm columns (apache#25147)" (apache#25801) * fix: Resolve issue apache#24195 (apache#25804) (cherry picked from commit 8737a8a) * fix(SQL field in edit dataset modal): display full sql query (apache#25768) (cherry picked from commit 1eba712) * fix(sqllab): infinite fetching status after results are landed (apache#25814) (cherry picked from commit 3f28eeb) * fix: Fires onChange when clearing all values of single select (apache#25853) (cherry picked from commit 8061d5c) * fix: the temporal x-axis results in a none time_range. (apache#25429) Co-authored-by: Elizabeth Thompson <eschutho@gmail.com> (cherry picked from commit ae619b1) * fix(table chart): Show Cell Bars correctly apache#25625 (apache#25707) (cherry picked from commit 916f7bc) * fix: remove `update_charts_owners` (apache#25843) * fix(charts): Time grain is None when dataset uses Jinja (apache#25842) (cherry picked from commit 7536dd1) * fix: Saving Mixed Chart with dashboard filter applied breaks adhoc_filter_b (apache#25877) (cherry picked from commit 268c1dc) * fix: database version field (apache#25898) (cherry picked from commit 06ffcd2) * fix: trino cursor (apache#25897) (cherry picked from commit cdb18e0) * chore: Updates CHANGELOG.md for 3.0.2 * fix(trino): allow impersonate_user flag to be imported (apache#25872) Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com> (cherry picked from commit 458be8c) * fix(table): Double percenting ad-hoc percentage metrics (apache#25857) (cherry picked from commit 784a478) * fix(sqllab): invalid sanitization on comparison symbol (apache#25903) (cherry picked from commit 581d3c7) * fix: update flask-caching to avoid breaking redis cache, solves apache#25339 (apache#25947) Co-authored-by: Ville Brofeldt <33317356+villebro@users.noreply.github.com> * fix: always denorm column value before querying values (apache#25919) * chore(colors): Updating Airbnb brand colors (apache#23619) (cherry picked from commit 6d8424c) * fix: naming denomalized to denormalized in helpers.py (apache#25973) (cherry picked from commit 5def416) * fix(helm): Restart all related deployments when bootstrap script changed (apache#25703) * fix(rls): Update text from tables to datasets in RLS modal (apache#25997) (cherry picked from commit 210f1f8) * fix: Make Select component fire onChange listener when a selection is pasted in (apache#25993) (cherry picked from commit 5fccf67) * fix(explore): redandant force param (apache#25985) (cherry picked from commit e7a1876) * chore: Optimize fetching samples logic (apache#25995) (cherry picked from commit 326ac4a) * fix(native filters): rendering performance improvement by reduce overrendering (apache#25901) (cherry picked from commit e1d73d5) * fix: update FAB to 4.3.10, Azure user info fix (apache#26037) (cherry picked from commit 628cd34) * chore: Updates CHANGELOG.md for 3.0.2 (rc2) --------- Co-authored-by: Rob Moore <giftig@users.noreply.github.com> Co-authored-by: Igor Khrol <igor.khrol@automattic.com> Co-authored-by: Stepan <66589759+Always-prog@users.noreply.github.com> Co-authored-by: Daniel Vaz Gaspar <danielvazgaspar@gmail.com> Co-authored-by: Ross Mabbett <92495987+rtexelm@users.noreply.github.com> Co-authored-by: Geido <60598000+geido@users.noreply.github.com> Co-authored-by: Beto Dealmeida <roberto@dealmeida.net> Co-authored-by: JUST.in DO IT <justin.park@airbnb.com> Co-authored-by: Elizabeth Thompson <eschutho@gmail.com> Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com> Co-authored-by: Michael S. Molina <70410625+michael-s-molina@users.noreply.github.com> Co-authored-by: mapledan <mapledan829@gmail.com> Co-authored-by: Arko <90512504+SA-Ark@users.noreply.github.com> Co-authored-by: Antonio Rivero <38889534+Antonio-RiveroMartnez@users.noreply.github.com> Co-authored-by: Kamil Gabryjelski <kamil.gabryjelski@gmail.com> Co-authored-by: Michael S. Molina <michael.s.molina@gmail.com> Co-authored-by: FGrobelny <150029280+FGrobelny@users.noreply.github.com> Co-authored-by: Giacomo Barone <46573388+ggbaro@users.noreply.github.com> Co-authored-by: Ville Brofeldt <33317356+villebro@users.noreply.github.com> Co-authored-by: Hugh A. Miles II <hughmil3s@gmail.com> Co-authored-by: josedev-union <70741025+josedev-union@users.noreply.github.com> Co-authored-by: yousoph <sophieyou12@gmail.com> Co-authored-by: Jack Fragassi <jfragassi98@gmail.com>

pull-request-size bot added the size/S label Nov 15, 2023

john-bodley force-pushed the john-bodley--optimize-get-samples branch 2 times, most recently from 475da36 to f91be05 Compare November 15, 2023 22:41

john-bodley commented Nov 15, 2023

View reviewed changes

john-bodley requested review from justinpark and michael-s-molina November 15, 2023 22:43

justinpark approved these changes Nov 15, 2023

View reviewed changes

chore: Optimize fetching samples logic

706d8ab

john-bodley force-pushed the john-bodley--optimize-get-samples branch from f91be05 to 706d8ab Compare November 16, 2023 00:23

john-bodley merged commit 326ac4a into apache:master Nov 16, 2023
29 checks passed

john-bodley added the v3.0 Label added by the release manager to track PRs to be included in the 3.0 branch label Nov 16, 2023

john-bodley deleted the john-bodley--optimize-get-samples branch November 16, 2023 23:42

michael-s-molina pushed a commit that referenced this pull request Nov 20, 2023

chore: Optimize fetching samples logic (#25995)

da06206

(cherry picked from commit 326ac4a)

john-bodley mentioned this pull request Nov 21, 2023

fix: Optimize fetching samples logic #26060

Merged

9 tasks

josedev-union pushed a commit to Ortege-xyz/studio that referenced this pull request Jan 22, 2024

chore: Optimize fetching samples logic (apache#25995)

7fa145a

cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024

chore: Optimize fetching samples logic (apache#25995)

7c404e7

mistercrunch added 🍒 3.0.2 🍒 3.0.3 🍒 3.0.4 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 3.1.0 labels Mar 8, 2024

sfirke pushed a commit to sfirke/superset that referenced this pull request Mar 22, 2024

chore: Optimize fetching samples logic (apache#25995)

d576467

Rozelin-dc pushed a commit to Rozelin-dc/superset that referenced this pull request May 9, 2024

chore: Optimize fetching samples logic (apache#25995)

0568102

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: Optimize fetching samples logic #25995

chore: Optimize fetching samples logic #25995

john-bodley commented Nov 15, 2023 •

edited

Loading

john-bodley Nov 15, 2023

zhaoyongjie Nov 19, 2023

zhaoyongjie Nov 19, 2023

john-bodley Nov 19, 2023

justinpark left a comment

chore: Optimize fetching samples logic #25995

chore: Optimize fetching samples logic #25995

Conversation

john-bodley commented Nov 15, 2023 • edited Loading

SUMMARY

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

BEFORE

AFTER

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

john-bodley Nov 15, 2023

Choose a reason for hiding this comment

zhaoyongjie Nov 19, 2023

Choose a reason for hiding this comment

zhaoyongjie Nov 19, 2023

Choose a reason for hiding this comment

john-bodley Nov 19, 2023

Choose a reason for hiding this comment

justinpark left a comment

Choose a reason for hiding this comment

john-bodley commented Nov 15, 2023 •

edited

Loading