fix: add datasource.changed_on to cache_key #8901

villebro · 2019-12-29T11:37:59Z

SUMMARY

When changing the SQL query in a datasource, the cache key doesn't reflect this change, as only the datasource id is added to the cache key. The same problem also applies to metrics and expressions. This PR adds the changed_on property of the datasource to the dict on which cache_key is based, ensuring that any change to the datasource will invalidate cached results. While this might cause unnecessary cache misses if datasources are automatically refreshed periodically, this at least ensures that no updates to datasources (sql query, metrics, expressions) go unnoticed.

TEST PLAN

Tested locally by replicating the problem in the issue, and ensuring that the bug goes away with this diff. New unit tests added to ensure that the new QueryContext/QueryObject classes work as intended. A small bug was also fixed in QueryContext.

ADDITIONAL INFORMATION

Has associated issue: closes SQL Lab Query doesn't get reflected in chart #8898
Changes UI
Requires DB Migration.
Confirm DB Migration upgrade and downgrade tested.
Introduces new feature or API
Removes existing feature or API

REVIEWERS

@durchgedreht

codecov-io · 2019-12-29T11:46:06Z

Codecov Report

Merging #8901 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #8901   +/-   ##
=======================================
  Coverage   58.97%   58.97%           
=======================================
  Files         359      359           
  Lines       11333    11333           
  Branches     2787     2787           
=======================================
  Hits         6684     6684           
  Misses       4471     4471           
  Partials      178      178

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 191aca1...7713e08. Read the comment docs.

willbarrett · 2019-12-30T23:38:46Z

Thanks for this @villebro! My recommendation would be to go with the changed_on timestamp instead of the sql column. As you mention, Datasources define a number of different things that could affect cache behavior or results. Causing cache misses on every Datasource update feels more future-proof, as this logic won't need to change every time Datasource behavior changes. We may cause some unnecessary cache misses, but I think we should lean in the direction of ensuring correctness first and thinking about performance second.

villebro · 2019-12-31T06:46:11Z

Thanks for chiming in @willbarrett ; personally I also started leaning towards using changed_on instead of all potentially interesting attributes. I'll update the PR to reflect this. It would be valuable to hear if large deployments of Superset are using large scale automated metadata refreshes of existing datasources, and if this might cause unnecessary cache misses. @john-bodley @etr2460 @betodealmeida any feedback would be valuable to understand if this would be a blocker in your deployments?

villebro · 2020-01-08T19:33:23Z

As there doesn't seem to be any objections to adding changed_on to cache_key, I'll try to finalize this in the coming days.

villebro · 2020-01-13T07:01:41Z

FYI working on this fix+unit tests surfaced some other issues I will be addressing with this PR, i.e. will take slightly longer to complete than anticipated.

villebro · 2020-01-13T19:21:26Z

superset/common/query_context.py

+    def cache_key(self, query_obj: QueryObject, **kwargs) -> Optional[str]:
        extra_cache_keys = self.datasource.get_extra_cache_keys(query_obj.to_dict())
        cache_key = (
            query_obj.cache_key(
                datasource=self.datasource.uid,
                extra_cache_keys=extra_cache_keys,
+                changed_on=self.datasource.changed_on,
                **kwargs
            )
            if query_obj
            else None
        )
+        return cache_key


Cache key calculation logic was broken into a method of its own to enable easier unit testing.

villebro · 2020-01-13T19:22:43Z

superset/common/query_object.py

-            metric if "expressionType" in metric else metric["label"]  # type: ignore
-            for metric in metrics
-        ]
+        self.metrics = [utils.get_metric_name(metric) for metric in metrics]


During testing I noticed that the existing logic was incomplete; utils.get_metric_name on the other hand is used elsewhere and handles all metric types correctly (legacy and ad-hoc).

villebro · 2020-01-13T19:26:29Z

tests/core_tests.py

-        # TODO: update once get_data is implemented for QueryObject
-        with self.assertRaises(Exception):
-            self.get_resp("/api/v1/query/", {"query_context": data})
+        qc_dict = self._get_query_context_dict()
+        data = json.dumps(qc_dict)
+        resp = json.loads(self.get_resp("/api/v1/query/", {"query_context": data}))
+        self.assertEqual(resp[0]["rowcount"], 100)


The old unit test seemed to be incomplete, so fixed a few bugs in the body (limit -> row_limit and removed time_range) to make it work properly.

villebro · 2020-01-13T19:26:52Z

This is ready for review.

john-bodley · 2020-01-13T19:39:37Z

@villebro do you think there is merit in adding a line item to UPDATING.md to mention that this PR will invalidate the entire cache given that the keys used for hashing have changed?

villebro · 2020-01-13T19:43:50Z

@john-bodley oh absolutely, good idea. Will add a note in UPDATING.

willbarrett

Lookin' good!

(cherry picked from commit dc60db2)

john-bodley · 2020-01-29T18:02:00Z

superset/common/query_context.py

        extra_cache_keys = self.datasource.get_extra_cache_keys(query_obj.to_dict())
        cache_key = (
            query_obj.cache_key(
                datasource=self.datasource.uid,
                extra_cache_keys=extra_cache_keys,
+                changed_on=self.datasource.changed_on,


@villebro apologies for the late comment. In your PR description you mention that the cache key should be a function on whether the column or metric definitions associated with a datasource are changed.

On line #166 you merely use the datasource change_on and thus I was wondering whether a changes to the columns and/or metrics cascade, i.e., trigger an update to the datasource changed_on?

It is my understanding that changed_on is updated every time any change is applied, i.e. does not check if only relevant metrics/expressions have changed. While this can cause unnecessary cache misses, I felt the added complexity of checking only for relevant changes was not warranted unless the ultimately proposed simpler solution was found to be too generic (I tried to convey this in the unit test which only changed the description). If this does cause unacceptable amounts of cache misses I think we need to revisit this logic; until then I personally think this is a good compromise. However, I'm happy to open up the discussion again if there are opinions to the contrary.

pull-request-size bot added the size/XS label Dec 29, 2019

Add datasource.changed_on to cache_key and add+fix related unit tests

18d99df

villebro changed the title ~~[WIP] Add sql query to cache_key~~ Add datasource.changed_on to cache_key Jan 13, 2020

pull-request-size bot added size/M and removed size/XS labels Jan 13, 2020

villebro commented Jan 13, 2020

View reviewed changes

villebro requested review from john-bodley, mistercrunch, etr2460 and dpgaspar January 13, 2020 19:27

villebro changed the title ~~Add datasource.changed_on to cache_key~~ fix: add datasource.changed_on to cache_key Jan 13, 2020

villebro added 2 commits January 13, 2020 23:08

Add note to UPDATING.md

1454bf2

Remove redundant comment about metric names

80b8474

pull-request-size bot added size/L and removed size/M labels Jan 13, 2020

willbarrett approved these changes Jan 15, 2020

View reviewed changes

mistercrunch approved these changes Jan 16, 2020

View reviewed changes

mistercrunch merged commit c087a48 into apache:master Jan 16, 2020

john-bodley mentioned this pull request Jan 28, 2020

[fix] Reverting metic logic from #8901 #9030

Merged

12 tasks

john-bodley added a commit that referenced this pull request Jan 28, 2020

[fix] Reverting metic logic from #8901

1f60745

john-bodley added a commit that referenced this pull request Jan 28, 2020

[fix] Reverting metic logic from #8901 (#9030)

dc60db2

john-bodley added a commit to airbnb/superset-fork that referenced this pull request Jan 28, 2020

[fix] Reverting metic logic from apache#8901 (apache#9030)

73e0f1c

(cherry picked from commit dc60db2)

john-bodley reviewed Jan 29, 2020

View reviewed changes

etr2460 mentioned this pull request Jul 19, 2021

fix: Bust chart cache when metric/column is changed #15786

Merged

8 tasks

mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.36.0 labels Feb 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add datasource.changed_on to cache_key #8901

fix: add datasource.changed_on to cache_key #8901

villebro commented Dec 29, 2019 •

edited

Loading

codecov-io commented Dec 29, 2019

willbarrett commented Dec 30, 2019

villebro commented Dec 31, 2019

villebro commented Jan 8, 2020

villebro commented Jan 13, 2020

villebro Jan 13, 2020

villebro Jan 13, 2020

villebro Jan 13, 2020

villebro commented Jan 13, 2020

john-bodley commented Jan 13, 2020 •

edited

Loading

villebro commented Jan 13, 2020

willbarrett left a comment

john-bodley Jan 29, 2020

villebro Jan 29, 2020

fix: add datasource.changed_on to cache_key #8901

fix: add datasource.changed_on to cache_key #8901

Conversation

villebro commented Dec 29, 2019 • edited Loading

CATEGORY

SUMMARY

TEST PLAN

ADDITIONAL INFORMATION

REVIEWERS

codecov-io commented Dec 29, 2019

Codecov Report

willbarrett commented Dec 30, 2019

villebro commented Dec 31, 2019

villebro commented Jan 8, 2020

villebro commented Jan 13, 2020

villebro Jan 13, 2020

Choose a reason for hiding this comment

villebro Jan 13, 2020

Choose a reason for hiding this comment

villebro Jan 13, 2020

Choose a reason for hiding this comment

villebro commented Jan 13, 2020

john-bodley commented Jan 13, 2020 • edited Loading

villebro commented Jan 13, 2020

willbarrett left a comment

Choose a reason for hiding this comment

john-bodley Jan 29, 2020

Choose a reason for hiding this comment

villebro Jan 29, 2020

Choose a reason for hiding this comment

villebro commented Dec 29, 2019 •

edited

Loading

john-bodley commented Jan 13, 2020 •

edited

Loading