-
Notifications
You must be signed in to change notification settings - Fork 13.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: Optimize fetching samples logic #25995
chore: Optimize fetching samples logic #25995
Conversation
475da36
to
f91be05
Compare
sample_data = samples_instance.get_payload()["queries"][0] | ||
|
||
if sample_data.get("status") == QueryStatus.FAILED: | ||
QueryCacheManager.delete(sample_data.get("cache_key"), CacheRegion.DATA) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure why we only delete the sample data cache key, however I thought it was best to preserve the existing logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "count star" might be a heavy workload in some db, it trigger full table scan and aggregation, so just removed "sample_data" but keep "count star", it means that even though "sample_data" query is failed, count_star_data is cached still.
I think the new logic should remove cache key if the "count star" query is failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and...another think is that we should customize count(*) in different db.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zhaoyongjie for the comment. Thinking about this more, if a query fails then there’s shouldn’t be anything cached. Hence, in the new sequential formulation, if the COUNT(*)
query succeeds but the sample data query fails we should actually be removing the cached result from the former.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
f91be05
to
706d8ab
Compare
(cherry picked from commit 326ac4a)
* fix(sqllab): reinstate "Force trino client async execution" (apache#25680) * fix: remove unnecessary redirect (apache#25679) (cherry picked from commit da42bf2) * fix(chore): dashboard requests to database equal the number of slices it has (apache#24709) (cherry picked from commit 75a7431) * fix: bump to FAB 4.3.9 remove CSP exception (apache#25712) (cherry picked from commit 8fb0c8d) * fix(horizontal filter label): show full tooltip with ellipsis (apache#25732) (cherry picked from commit e4173d9) * fix: Revert "fix(Charts): Set max row limit + removed the option to use an empty row limit value" (apache#25753) (cherry picked from commit e2fe967) * fix: dataset update uniqueness (apache#25756) (cherry picked from commit c7f8d11) * fix(sqllab): slow pop datasource query (apache#25741) (cherry picked from commit 2a2bc82) * fix: allow for backward compatible errors (apache#25640) * fix: DB-specific quoting in Jinja macro (apache#25779) (cherry picked from commit 5659c87) * fix: Revert "fix: Apply normalization to all dttm columns (apache#25147)" (apache#25801) * fix: Resolve issue apache#24195 (apache#25804) (cherry picked from commit 8737a8a) * fix(SQL field in edit dataset modal): display full sql query (apache#25768) (cherry picked from commit 1eba712) * fix(sqllab): infinite fetching status after results are landed (apache#25814) (cherry picked from commit 3f28eeb) * fix: Fires onChange when clearing all values of single select (apache#25853) (cherry picked from commit 8061d5c) * fix: the temporal x-axis results in a none time_range. (apache#25429) Co-authored-by: Elizabeth Thompson <eschutho@gmail.com> (cherry picked from commit ae619b1) * fix(table chart): Show Cell Bars correctly apache#25625 (apache#25707) (cherry picked from commit 916f7bc) * fix: remove `update_charts_owners` (apache#25843) * fix(charts): Time grain is None when dataset uses Jinja (apache#25842) (cherry picked from commit 7536dd1) * fix: Saving Mixed Chart with dashboard filter applied breaks adhoc_filter_b (apache#25877) (cherry picked from commit 268c1dc) * fix: database version field (apache#25898) (cherry picked from commit 06ffcd2) * fix: trino cursor (apache#25897) (cherry picked from commit cdb18e0) * chore: Updates CHANGELOG.md for 3.0.2 * fix(trino): allow impersonate_user flag to be imported (apache#25872) Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com> (cherry picked from commit 458be8c) * fix(table): Double percenting ad-hoc percentage metrics (apache#25857) (cherry picked from commit 784a478) * fix(sqllab): invalid sanitization on comparison symbol (apache#25903) (cherry picked from commit 581d3c7) * fix: update flask-caching to avoid breaking redis cache, solves apache#25339 (apache#25947) Co-authored-by: Ville Brofeldt <33317356+villebro@users.noreply.github.com> * fix: always denorm column value before querying values (apache#25919) * chore(colors): Updating Airbnb brand colors (apache#23619) (cherry picked from commit 6d8424c) * fix: naming denomalized to denormalized in helpers.py (apache#25973) (cherry picked from commit 5def416) * fix(helm): Restart all related deployments when bootstrap script changed (apache#25703) * fix(rls): Update text from tables to datasets in RLS modal (apache#25997) (cherry picked from commit 210f1f8) * fix: Make Select component fire onChange listener when a selection is pasted in (apache#25993) (cherry picked from commit 5fccf67) * fix(explore): redandant force param (apache#25985) (cherry picked from commit e7a1876) * chore: Optimize fetching samples logic (apache#25995) (cherry picked from commit 326ac4a) * fix(native filters): rendering performance improvement by reduce overrendering (apache#25901) (cherry picked from commit e1d73d5) * fix: update FAB to 4.3.10, Azure user info fix (apache#26037) (cherry picked from commit 628cd34) * chore: Updates CHANGELOG.md for 3.0.2 (rc2) --------- Co-authored-by: Rob Moore <giftig@users.noreply.github.com> Co-authored-by: Igor Khrol <igor.khrol@automattic.com> Co-authored-by: Stepan <66589759+Always-prog@users.noreply.github.com> Co-authored-by: Daniel Vaz Gaspar <danielvazgaspar@gmail.com> Co-authored-by: Ross Mabbett <92495987+rtexelm@users.noreply.github.com> Co-authored-by: Geido <60598000+geido@users.noreply.github.com> Co-authored-by: Beto Dealmeida <roberto@dealmeida.net> Co-authored-by: JUST.in DO IT <justin.park@airbnb.com> Co-authored-by: Elizabeth Thompson <eschutho@gmail.com> Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com> Co-authored-by: Michael S. Molina <70410625+michael-s-molina@users.noreply.github.com> Co-authored-by: mapledan <mapledan829@gmail.com> Co-authored-by: Arko <90512504+SA-Ark@users.noreply.github.com> Co-authored-by: Antonio Rivero <38889534+Antonio-RiveroMartnez@users.noreply.github.com> Co-authored-by: Kamil Gabryjelski <kamil.gabryjelski@gmail.com> Co-authored-by: Michael S. Molina <michael.s.molina@gmail.com> Co-authored-by: FGrobelny <150029280+FGrobelny@users.noreply.github.com> Co-authored-by: Giacomo Barone <46573388+ggbaro@users.noreply.github.com> Co-authored-by: Ville Brofeldt <33317356+villebro@users.noreply.github.com> Co-authored-by: Hugh A. Miles II <hughmil3s@gmail.com> Co-authored-by: josedev-union <70741025+josedev-union@users.noreply.github.com> Co-authored-by: yousoph <sophieyou12@gmail.com> Co-authored-by: Jack Fragassi <jfragassi98@gmail.com>
SUMMARY
At Airbnb (as mentioned previously) we have an uber wide dataset (circa thousands of columns) which is can be somewhat problematic for certain workflows. One such workflow is fetching sample data—which is used in explore and drill-to-detail.
The existing logic constructs and executes a query for fetching both the sample data and
COUNT(*)
where the former is rather complex—relying on a very expensive SQLAlchemy query to fetch all the columns—before processing the results.This PR introduces an optimization which first constructs and executes the
COUNT(*)
query (which is both cheap to construct and execute) and only proceeds to the sample data if the first query succeeds. Given that (in our formulation) theCOUNT(*)
aggregate is not supported the endpoint returns in a few seconds as opposed from a few minutes. The TL;DR is we should fail early if possible.BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
BEFORE
AFTER
TESTING INSTRUCTIONS
Existing unit tests.
ADDITIONAL INFORMATION