
Fetch charts with GET to benefit from browser cache and conditional requests #7032

Merged (91 commits into apache:lyftga, Apr 3, 2019)

Conversation

betodealmeida (Member) commented Mar 14, 2019

This is a small PR that does a lot. It changes the initial request for charts (in explore or dashboards) to be done through a GET request, greatly improving the loading speed of dashboards. It also moves the caching to the HTTP layer, allowing us to benefit from Expires and ETag headers for conditional requests.

The problem

This diagram compares the current flow ("before") with the one implemented by this PR ("after"):

[Diagram: "Cache" — before/after request flow]

Before

Let's assume Superset is configured with a 1 hour cache, and that the data changes less frequently (e.g., daily):

  1. User "A" requests a chart from Superset doing a POST request with the payload.
  2. Superset computes the query and sends it to the DB.
  3. DB returns a dataframe.
  4. Superset caches the dataframe.
  5. Superset serializes the payload and sends it back to user "A".
  6. User "A" refreshes the dashboard.
  7. Superset finds the dataframe cached.
  8. Superset serializes the payload and sends it back to user "A".
  9. Superset cache expires after 1 hour.
  10. User "A" refreshes the dashboard.
  11. Superset computes the query and sends it to the DB.
  12. DB returns the exact same dataframe.
  13. Superset caches the dataframe again.
  14. Superset serializes the payload and sends it back to user "A".

There are a few inefficiencies here:

  • The browser cache is never used, because it's doing POST requests.
  • Superset needs to serialize the payload even on a cache hit.
  • Data is transferred to the browser even if it hasn't changed.

After

  1. User "A" requests a chart from Superset doing a GET request with the chart id.
  2. Superset computes the query and sends it to the DB.
  3. DB returns a dataframe.
  4. Superset serializes the dataframe and caches the HTTP response.
  5. Superset sends the payload to user "A", with an Expires header of 1 hour, and an ETag header which is a hash of the payload.
  6. The browser stores the response in its native cache, and SupersetClient also caches it in the Cache interface.
  7. The user refreshes the dashboard.
  8. Because of the Expires header and the use of GET, the data is read directly from the native browser cache.
  9. Superset cache expires after 1 hour.
  10. User "A" refreshes the dashboard. The native cache is not used, since Expires is now in the past. SupersetClient looks for a cached response in the Cache interface, and if one is found, extracts its ETag.
  11. The browser requests the chart with an If-None-Match header, containing the hash of the cached response (its ETag).
  12. Superset computes the query and sends it to the DB.
  13. DB returns the exact same dataframe.
  14. Superset serializes the dataframe and caches the HTTP response.
  15. Superset sees that the ETag matches the If-None-Match header, returning a 304 Not Modified response (see the sketch after this list).
  16. Browser fetches the cached response from the Cache interface.
  17. Browser uses the response.
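
To make this concrete, here's a minimal sketch of the server side of steps 5 and 13–15, assuming Flask/Werkzeug (which Superset uses). render_chart_payload is a hypothetical stand-in for the real query-and-serialize path, not code from this PR:

import hashlib

from flask import Flask, Response, request

app = Flask(__name__)

def render_chart_payload(slice_id: int) -> str:
    # Hypothetical stand-in for the real query + serialization path.
    return '{"slice_id": %d}' % slice_id

@app.route("/chart/<int:slice_id>")
def chart(slice_id: int) -> Response:
    payload = render_chart_payload(slice_id)
    response = Response(payload, mimetype="application/json")
    # The ETag is a hash of the serialized payload (step 5).
    response.set_etag(hashlib.md5(payload.encode("utf-8")).hexdigest())
    # max-age lets the native browser cache answer repeat loads (steps 6-8).
    response.cache_control.max_age = 3600
    # Werkzeug turns this into a body-less 304 Not Modified when the
    # request's If-None-Match matches the ETag (step 15).
    return response.make_conditional(request)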

Notes

  • The GET request is done only the first time the chart is mounted. Forcing a refresh on a dashboard or clicking "Run Query" in the Explore view performs a POST request, which bypasses the cache and caches the new response. I tested the Explore view and dashboards with filters, and all further interactions are done with POSTs.

  • Since we're caching the HTTP response, we need to verify that the user has permission to read the cached response. This is done by passing a check_perms function to the decorator that caches the responses (see the sketch after this list).

  • The fetch API has no support for conditional requests with ETags. We need to add explicit support in SupersetClient. I have a separate PR for that (see "feat: add support for conditional requests", apache-superset/superset-ui#119).

  • There is one small downside to this approach. While Expires is still valid, the browser will not perform any requests for cached charts unless the user explicitly refreshes a dashboard or clicks "Run Query" in the Explore view. If the data is bad, they will see bad data until it expires or they purposefully refresh the chart. In the current workflow we can, in theory, purge the cache in this case, since it lives only on the server side. This is a hypothetical scenario, and we could work around it by sending a notification to dashboards that one or more charts have bad data and should be refreshed.
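
As a rough illustration of the permissions note above, a response-caching decorator that takes a check_perms callable could look something like the sketch below. This is not the PR's actual implementation: the module-level dict stands in for the Redis-backed cache, and timeout handling is elided.

import functools

from flask import request

_response_cache = {}  # stand-in for the Redis-backed response cache

def etag_cache(check_perms):
    def decorator(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            # The permission check must run before any cache read, since a
            # cached response would otherwise bypass the view's own checks.
            check_perms(*args, **kwargs)
            key = request.full_path
            response = _response_cache.get(key)
            if response is None:
                response = f(*args, **kwargs)
                _response_cache[key] = response
            # Werkzeug answers with a 304 when If-None-Match matches the ETag.
            return response.make_conditional(request)
        return wrapper
    return decorator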

khtruong and others added 4 commits on March 5, 2019:
* Exclude venv for python linter to ignore

* Fix NaN error
This PR sets the background-color css property on `.ace_scroller` instead of `.ace_content` to prevent the white background shown during resizing of the SQL editor before drag ends.
betodealmeida (Member, Author) commented Mar 14, 2019

👀 @DiggidyDave

graceguo-supercat commented Mar 14, 2019

Are you sure a GET request can handle this.props.formData? At Airbnb many, many charts' formData is longer than 4k chars :)

Why not use Redis to cache query results? At Airbnb the round trip to fetch data from Redis is only 600ms~800ms.

I think we should use ETags for dashboard metadata (like the dashboard layout, a huge json blob...)

betodealmeida (Member, Author), replying to @graceguo-supercat:

Are you sure a GET request can handle this.props.formData? At Airbnb many, many charts' formData is longer than 4k chars :)

Thanks, I'm aware of that problem. Here the GET request has only the chart ID, and the form data is read from the saved chart (the params column in the slices table is used).

Why not use Redis to cache query results? At Airbnb the round trip to fetch data from Redis is only 600ms~800ms.

The decorator is using Redis for the server-side caching, but it's caching the HTTP response instead of the dataframe (saving the time spent in serialization). But I'm also using the native browser cache (through the Expires header) and the new "Cache interface API" (in a separate PR that I'm finishing).

I think we should use ETags for dashboard metadata (like the dashboard layout, a huge json blob...)

Everywhere! :)

graceguo-supercat commented Mar 14, 2019

Here the GET request has only the chart ID, and the form data is read from the saved chart (the params column in the slices table is used).

What about a dashboard with a filter, where each chart's query is not the saved params? The actual query is the formData overwriting the saved chart's params.

Also, you still have formData in the GET request parameters, right (you can see the request parameters in the browser location bar)? This is the blocker that causes the issue.

betodealmeida (Member, Author):

What about a dashboard with a filter, where each chart's query is not the saved params? The actual query is the formData overwriting the saved chart's params.

Ah, you're right. The filter box works when changed, since it does a POST. But on the initial load it's not taken into account. Let me see how I can fix that.

Also, you still have formData in the GET request parameters, right (you can see the request parameters in the browser location bar)? This is the blocker that causes the issue.

Yes, but it has only the slice_id in it right now, e.g. form_data: {"slice_id":78}. I'll see if I can append any additional parameters that are set by filters; this should keep it small.

betodealmeida (Member, Author):

@graceguo-supercat I tested the interaction with filter boxes and it's not working. I'll work on fixing it.

williaster (Contributor):

@betodealmeida this seems pretty complicated on top of data requests that are already complicated and error prone 🙉

My first question to gauge whether it's worthwhile is: what are the speedup times you are seeing for existing approach vs your new approach? (ideally for multiple dashboards of varying size)

betodealmeida (Member, Author):

@williaster the speedup will greatly depend on how often the data changes, how big the payload is (bignum vs deck.gl varies significantly), the duration of the cache, and how slow the network is. I can run (and have run) tests against the example dashboards, but I don't think they would be significant, since they don't cover all the real-life use cases.

I think the question we should ask here is: "given that this is clearly an improvement, how can we make it bug free?"

john-bodley (Member):

@betodealmeida just to clarify step 4 (and per your diagram) you have:

  • Before: Superset caches the dataframe.
  • After: Superset serializes the dataframe and caches the HTTP response.

yet in the code it still seems like we're caching the result set from the database, so I wonder if the diagram and the After phase should mention that there's an additional entry in the server cache, i.e., after step (4) the server-side cache would contain:

  • The cached result set (as a dataframe).
  • The cached superset/explore_json HTTP response, which includes the server-side, visualization-specific Python mutations.

Note I'm not saying this is wrong, as I strongly believe that the database response should be cached given that it represents the bulk of the compute; I just wanted to get clarity on the logic.

williaster (Contributor):

@betodealmeida this is a big change that has the potential to impact many, many users. If you're unable to provide numbers that indicate that this is strictly an improvement (or at a minimum no regressions), I'm a bit reluctant to introduce this additional complexity.

I think it's part of the expected work of a feature like this to demonstrate the effects with real-life examples. It seems like you should be able to use real Lyft dashboards, dashboards from the "example datasets", etc., and you can throttle network speed using dev tools.

Another concern I have is the impact on the # of requests. We've needed to introduce domain sharding because of the large number of simultaneous requests made by larger dashboards, and this potentially DOUBLES that number, so I would want to see perf numbers that demonstrate no regressions for that case as well.

mistercrunch (Member):

Not directly related, but on the topic of optimization around caching, it would be an extra win to make the caching call async. Maybe a first use case for the async/await syntax in Superset.

williaster (Contributor):

@mistercrunch fetch (SupersetClient) is async? Do you just mean the async/await syntax?

mistercrunch (Member) commented Mar 15, 2019

@williaster I'm thinking about something else: on the server side, in Python, making the cache.set(...) call async, meaning that while a thread is pushing to the caching backend, the web server can start streaming the response to the client at the same time. There's no need to wait on the caching before starting to send the response back...
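
For what it's worth, a minimal sketch of that idea, assuming a Flask-Caching-style cache object and a plain worker thread (async/await would be the other option):

from concurrent.futures import ThreadPoolExecutor

# Shared worker pool; a sketch only, not how Superset is actually wired up.
_cache_writer = ThreadPoolExecutor(max_workers=4)

def cache_set_async(cache, key, value, timeout=3600):
    # Fire-and-forget write: the view can start streaming the response
    # while a worker thread pushes the payload to the caching backend.
    _cache_writer.submit(cache.set, key, value, timeout=timeout)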

betodealmeida (Member, Author) commented Mar 15, 2019

@williaster:

@betodealmeida this is a big change that has the potential to impact many, many users. If you're unable to provide numbers that indicate that this is strictly an improvement (or at a minimum no regressions), I'm a bit reluctant to introduce this additional complexity.

I think it's part of the expected work of a feature like this to demonstrate the effects with real-life examples. It seems like you should be able to use real Lyft dashboards, dashboards from the "example datasets", etc., and you can throttle network speed using dev tools.

Sure, I will give the numbers of the example dashboards and some of the Lyft dashboards. My point was that, considering that this is a strict improvement, defining a threshold to accept the changes seems arbitrary to me.

Of course I agree that if this causes a regression or no significant improvement we should not do it. Maybe it's not clear, but with this PR the client will always issue a smaller number of requests, and a percentage of those requests will receive body-less responses. Combined with the fact that we move the server cache closer to the user, there shouldn't be any regressions with this PR, unless I'm doing something stupid (which I've done in the past).

Also, keep in mind that this PR is against the lyftga branch, so we expect to test it in production before merging into master.

Another concern I have is the impact on the # of requests. We've needed to introduce domain sharding because of the large number of simultaneous requests made by larger dashboards, and this potentially DOUBLES that number, so I would want to see perf numbers that demonstrate no regressions for that case as well.

Sorry, why do you think this would double the number of requests? The number of requests should be strictly equal or smaller: resources within the lifetime of the cache are no longer requested because the browser reads directly from its cache, and conditional requests are still a single request. The server will either return a normal response (200 OK) or a 304 Not Modified without body. There's no pre-flight request.

betodealmeida (Member, Author):

@john-bodley not sure if I understand your question. Currently we cache the dataframe in viz.BaseViz.get_df_payload. I haven't touched that code, but I added an additional cache storing the Response object instead. This has a few benefits:

  • It skips unpickling the dataframe and serializing it to json.
  • We don't need to recompute the ETag every time we read from the cache, since it's stored in the cached response.

Eventually we can remove the dataframe caching, but I'd rather do that in a separate PR.
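
Illustrative only (the key names here are made up): after a cold request, the shared server-side cache conceptually holds two entries per chart:

server_cache = {
    "viz:<df-cache-key>": "<pickled dataframe>",          # BaseViz.get_df_payload (pre-existing)
    "etag:<request-key>": "<serialized HTTP response>",   # added by this PR
}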

betodealmeida (Member, Author):

@graceguo-supercat I changed the code so that in the GET request we pass:

  1. the slice_id
  2. any extra_filters

This way the GET request is still relatively small (and we can switch to Rison for a smaller URL if needed), and the caching mechanism works with dashboards that have filters saved.
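
For illustration, the GET parameters might be built like this (the exact filter structure is an assumption based on Superset's extra_filters format):

import json
from urllib.parse import urlencode

form_data = {
    "slice_id": 78,
    "extra_filters": [{"col": "state", "op": "in", "val": ["CA"]}],
}
# Compact enough to stay well under typical URL length limits.
url = "/superset/explore_json/?" + urlencode({"form_data": json.dumps(form_data)})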

conglei and others added 3 commits on March 15, 2019:
* added more functionalities for query context and object.

* fixed cache logic

* added default value for groupby

* updated comments and removed print

(cherry picked from commit d5b9795)
john-bodley (Member):

@betodealmeida my point was that I felt the picture and description don't accurately reflect what your code is actually doing, i.e., that there are actually two server-side cache keys associated with each chart. They're both using the same underlying cache, but the diagram and steps don't reflect this.

I agree with your logic but thought maybe the steps should be more explicit, e.g.,

After

  1. DB returns a dataframe.
  2. Superset caches the dataframe.
  3. Superset serializes the payload.
  4. Superset caches the HTTP response associated with the payload.
  5. Superset sends the payload to user "A", with an Expires header of 1 hour, and an ETag header which is a hash of the payload.

kristw added the enhancement:request and risk:breaking-change labels on Mar 15, 2019
betodealmeida (Member, Author):

@john-bodley you're right. I ended up describing in "After" the workflow we'll have once we remove the dataframe cache.

williaster (Contributor):

Thanks for the benchmarks!

Will let @john-bodley sign off since he had the last requested change.

betodealmeida merged commit 538776b into apache:lyftga on Apr 3, 2019
mistercrunch (Member):

I was working on this tricky bug and ended up here as the cause of it. The issue is around the fact that the formData may need to get sanitized by frontend-related logic.

The bug or symptom is the "Genders by State" example started showing an extra (3rd) metric (sum__num) on top of Boys and Girls.

There's a lot going on related to this, but a key point here is that we have control-related logic (defaults and other control-panel handling) that sanitizes formData. In that particular case, the example's form data is malformed, with both metrics and metric. In the explore view, or in a normal POST request, the formData gets sanitized (roughly paraphrased in the sketch after this list):

  • missing keys get filled with the control's default
  • extra keys get deleted
  • more stuff
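
A rough Python paraphrase of that sanitization (the real logic lives in the frontend; controls is an assumed mapping of control names to their defaults):

def sanitize_form_data(form_data, controls):
    # Missing keys get filled with the control's default; keys that don't
    # correspond to a known control are dropped.
    return {
        name: form_data.get(name, default)
        for name, default in controls.items()
    }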

Now another thing that compounds here is that viz.py blindly looks at all METRIC_KEYS to help craft the query_obj. So the combination of these things creates the issue:

  • the example is malformed; it should have only metrics and NO metric key (easy to fix)
  • in GET mode, the formData does not get "sanitized" by the frontend
  • viz.py could be more prescriptive about the metrics it builds into the query object

Now it's pretty clear that addressing any of these 3 things would fix my symptom, but point 2 on its own is worrisome. It can lead to intricate issues over time. Say I save a chart, and after that I add a new control to that vizType; in the context of the GET, it won't get the right default.

The good news is I'm working on a refactor (#7350) to help clean up all of the control/formData processing logic, which grew out of control over time. The assumption has been that all requests would make it through this logic, and I'm realizing that's not the case, at least since this PR.

Ideas?

@wraps(f)
def wrapper(*args, **kwargs):
    # check if the user can access the resource
    check_perms(*args, **kwargs)
Review comment (Member):

I was digging around to try and figure out where the datasource access permission check is done nowadays, and found it here in the etag_cache decorator. I feel like it's not the right place for it.

I understand that this needs to happen prior to reading from the cache, but maybe it should be done as a prior decorator, or maybe both of these routines should be done inside a method instead of decorators, to avoid calling get_viz twice.

form_data, slc = get_form_data(slice_id, use_slice_data=True)
datasource_type = slc.datasource.type
datasource_id = slc.datasource.id
viz_obj = get_viz(
Review comment (Member):

I think get_viz gets called at least twice now (here and in the view itself).

mistercrunch (Member):

Found some other issues here that I wanted to raise ^^^

Also, I noticed that the big "merge" on master of this and much more stuff was actually done as a single commit instead of a proper merge: 538776b

In the future, lyftga branches and the like should be merged, not squashed and merged, as we lose tons of history. For instance, if we wanted to revert this PR, there's no single commit in master we can address; we'd have to revert all of 538776b or get really creative...

DiggidyDave (Contributor) commented May 29, 2019

@mistercrunch agreed, and after that lyftga branch all of our commits were merged individually (from lyft-release-sp8). We have now switched to working on and continuously deploying from master, so this should no longer even be able to happen. ;-)
