[Discuss] Rendering partial results #55408
Comments
Pinging @elastic/kibana-app-arch (Team:AppArch) |
Zoomdata called this capability "data sharpening" and used it as a differentiator for doing BI against massive Hadoop data sets (leveraging Apache Impala). Nice little 3-minute overview: https://www.youtube.com/watch?v=zZs-SIkwJ-g |
I have 2 questions:
|
I think the challenge is that, as a user, it's not obvious how to look at an incremental visualization alone and judge whether the result is useful or not. Imagine a dashboard with some useful, converging visual results based on partial data and some not, and not being able to tell which is which. So unless you deeply understand the data and the computation being done (or can communicate the degree of uncertainty to the viewer), I think partial results in general are unhelpful at best and misleading at worst. That said, I think there are a few cases where partial results can help:
There are a lot of academic papers that investigate incremental/progressive visualization, particularly for exploratory analysis. It might be worth doing a review to see the challenges and findings, and how they might apply here. (Just one example: Trust Me, I’m Partially Right: Incremental Visualization Lets Analysts Explore Large Datasets Faster) |
Trust, but Verify: Optimistic Visualizations of Approximate Queries for Exploring Big Data from Dominik, and Danyel, Ding, Wang at MS is a favorite. There can be value in getting an approximate response even if the sample distribution is not representative of the population distribution: for example, it helps establish magnitudes, units of measure, and the approximate extent of the data. If we're lucky, or if we can tilt the odds in our favor, then even the distribution of the first 1% of the data will be similar to that of the rest. A related concept is the level of detail (LoD). For example, in the initial query, do super coarse binning or rely on an aggregate index even if it is not of the ideal resolution, as it may still give a decent histogram or heatmap; then evolve into a more granular histogram, or into a scatterplot, respectively. So the topic of "visualizing incomplete data" may cover not just missing documents, buckets or bucket contents, but also preemptive strategies, e.g. intentionally "missing out" on some level of detail in favor of retaining both low latency and representative results. Examples for enriching coarse visualizations:
Other known techniques, sometimes called adaptive sampling, weight bin sizes by importance: for example, bin the last 24 hours by minute, the prior 30 days by hour, the prior 12 months by day, etc. |
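To make the tiered binning idea concrete, here is a minimal sketch. The names `intervalForAge` and `bucketStart` are hypothetical, and treating everything older than the hourly tier as daily is an assumption for illustration, not anything Kibana or Elasticsearch does today.

```typescript
// Hypothetical sketch: pick a coarser bucket interval for older data.
type Interval = 'minute' | 'hour' | 'day';

const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;
const DAY_MS = 24 * HOUR_MS;

const INTERVAL_MS: Record<Interval, number> = {
  minute: MINUTE_MS,
  hour: HOUR_MS,
  day: DAY_MS,
};

// Tiers follow the example in the comment: last 24 hours by minute,
// the prior 30 days by hour, anything older by day (assumed).
function intervalForAge(ageMs: number): Interval {
  if (ageMs < 0) throw new RangeError('age must be non-negative');
  if (ageMs <= DAY_MS) return 'minute';
  if (ageMs <= 31 * DAY_MS) return 'hour';
  return 'day';
}

// Floor a timestamp to the start of the bucket its age tier calls for.
function bucketStart(
  timestampMs: number,
  nowMs: number
): { interval: Interval; start: number } {
  const interval = intervalForAge(nowMs - timestampMs);
  const size = INTERVAL_MS[interval];
  return { interval, start: Math.floor(timestampMs / size) * size };
}
```

Recent points land in fine-grained buckets and old points in coarse ones, so the total bucket count stays small while the most relevant range keeps its resolution.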
The partial results concern is something we discussed multiple times during the Make it Slow work, with and without Elasticsearch folks. The simple use case is Discover or any other application that shows raw documents as a result. In these cases, getting results streamed while the query is in progress, instead of waiting for the query to return the entire result set, is a better experience and can be useful on multiple occasions. When it comes to aggregations, I use the dashboard as an example. Please assume that we will provide a UI that clearly shows that the panel is still in progress (we will share here soon @mdefazio ). A lot of how this feature will be perceived is on us and on how we make it clear that these are partial results, which we intend to do. |
I share @peterschretlen's concern here a lot. I want to leave aside partial results for Discover for a while, because I think they might provide some advantage there, and focus purely on Visualizations and Dashboards, where they are, as Peter put it, unhelpful at best and misleading at worst. I think the effect gets worse for really long-running queries. Let me use a couple of examples to demonstrate this. Assume you create a line chart and it starts loading, and you sit there for a couple of minutes watching it evolve: Now it needs another 10 minutes to finish loading all the data (you might head for a small coffee). So what would the final chart look like? This is a totally valid and likely scenario. Until we have the full data, we're basically working in an uncertainty cone of 100%. Some aggregations might converge faster towards the final results; others can flip constantly. No matter how carefully we explain to the user that this is just partial data, I am pretty sure they'll build up expectations nevertheless about what the final chart will look like. So showing them the partial data has mainly done one thing: building up false expectations, without providing any value to them in that case, despite the fact that they see the chart is still loading (but that's no more information than any loading spinner can convey). There are more examples: when running a terms aggregation on the x-axis of a bar chart, the bars will constantly reorder and potentially disappear, since we don't have the final order until we have the final data. So a scenario like the following can be very likely: So the only thing we've done, besides creating some rendering artifacts and showing a "loading spinner" in the form of a chart, is mislead the user about the final results. 
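The reordering effect can be sketched with a toy model. This is a deliberate simplification: it assumes each shard returns plain term counts that are summed as they arrive, which is not how Elasticsearch actually merges terms aggregations, and `topTermsPerStep` is a hypothetical name.

```typescript
// Toy model: cumulative term counts as shard responses arrive one by one.
type TermCounts = Record<string, number>;

// Returns the top `size` terms (by running total) after each shard arrives.
function topTermsPerStep(shardResults: TermCounts[], size: number): string[][] {
  const totals: TermCounts = {};
  const steps: string[][] = [];
  for (const shard of shardResults) {
    for (const term of Object.keys(shard)) {
      totals[term] = (totals[term] || 0) + shard[term];
    }
    const ranked = Object.keys(totals)
      // Highest count first; ties broken alphabetically for stable output.
      .sort((a, b) => totals[b] - totals[a] || a.localeCompare(b))
      .slice(0, size);
    steps.push(ranked);
  }
  return steps;
}

const steps = topTermsPerStep(
  [
    { a: 5, b: 3 },
    { c: 10, b: 4 },
    { a: 1, d: 20 },
  ],
  2
);
// steps is [['a', 'b'], ['c', 'b'], ['d', 'c']]
```

With a top-2 chart, term `a` leads after the first shard and is gone after the second, and the final leader `d` only shows up with the last shard.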
There actually is one very special niche where partial results do make sense: when we know the actual buckets in advance, buckets only come in once they are complete, and their metrics won't change afterwards. This could potentially happen in two cases: a date histogram, and a histogram with a fixed min/max value. In those cases, if we can guarantee the metrics won't change anymore, we'll actually see proper progress in the chart. As of the discussion we had yesterday, this is not the behavior of ES at the moment, and the recommendation there would be not to use partial results loading for that, but to do multiple requests from Kibana spanning increasingly larger time ranges instead. Since the usefulness of loading partial results in general is limited to some very special use cases, I would highly recommend we not go for a generic solution that will convey misleading information for the sake of showing the user we're still loading data. Instead, we could consider building that specific solution for those narrow date_histogram and histogram use cases where we know we're not showing basically "random" data to the user (and until we have all the data, we don't know whether the deviation from the final result is actually smaller than that of random data we could show).
As shown above, I don't think that's entirely true, since I don't think we'll find a design that disables the pattern matching in people's brains so that the partial information doesn't mislead them. If we nevertheless want to go the route of partial results for generic cases, I would second Peter's suggestion here: we should have a couple of people with good data science experience work through some of the research on that topic and end up with a good recommendation on what we should address and how. |
@monfera I like the document you posted, but it goes beyond just showing the partial results Elasticsearch would return. If we had a way to show the uncertainty, that would be great, but I think that is a big project on its own. And as Tim mentions, with aggregations we have no way to measure the uncertainty, and it could theoretically be close to 100% until the last shard returns data. Maybe we should look in the direction of handling this with multiple queries to ES, as suggested by the ES team in yesterday's meeting (requesting last day, last 5, last 15, last 30), when it comes to aggregations. |
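A rough sketch of the multiple-queries approach, assuming the windows above are measured in days and that each range would be sent as its own complete request. `progressiveRanges` and `TimeRange` are hypothetical names, not existing Kibana APIs.

```typescript
// Hypothetical sketch: widening time ranges that all end at "now".
interface TimeRange {
  from: number; // epoch milliseconds, inclusive
  to: number; // epoch milliseconds
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Builds one range per window, smallest first, so the cheapest query
// can be issued and rendered before the more expensive ones.
function progressiveRanges(
  nowMs: number,
  windowsInDays: number[] = [1, 5, 15, 30]
): TimeRange[] {
  return windowsInDays
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a - b)
    .map((days) => ({ from: nowMs - days * MS_PER_DAY, to: nowMs }));
}
```

Because every request covers a complete (if shorter) time range, each intermediate chart is a final result for its own range rather than a partial result for the full one.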
Can we explicitly separate discussion of timeseries data from other types? I think there are clear useful ways to load incrementally or display partial data for time series. |
Great discussion. |
I don’t have a problem going ahead with design and engineering assuming we want to pass in-progress data through to the visualization. I assume the ability for a visualization to accept in-progress data vs how (or if) it renders that data can be treated separately? If that's true, a final decision of how in-progress results are shown to users doesn't need to be made now. We should move forward, but let’s revisit after thinking through the design, trying to account for concerns raised here. I do think it's time to stop discussion in the abstract - we surfaced some valid concerns but we're not going to progress much further in this issue without something concrete to discuss. |
Summary of where we are today: There seems to be consensus that:
And I think the concerns are:
|
Thanks for the summary Peter. |
As we progress with https://github.com/elastic/dev/issues/1209 and as queries potentially become longer, the importance of showing partial results is increasing. While, as @peterschretlen mentions in his summary, partial results might be misleading in some cases, in others, especially when looking at normally distributed or time-based data, partial results might be useful and allow users to optimize their workflow. The current focus is making sure that the UX of core applications (Discover, Visualizations, Lens, Dashboard) makes it very clear when a user is viewing partial results. |
Thank you for contributing to this issue, however, we are closing this issue due to inactivity as part of a backlog grooming effort. If you believe this feature/bug should still be considered, please reopen with a comment. |
This started to be discussed in #53336, but I decided to open a new ticket to discuss partial results in general, not the expression implementation.
Partial results is part of the Make it Slow effort but there has been a lot of discussion on whether it makes sense to show partial results when the information may not be a true reflection of the final results.
I personally think there is value in showing partial results even in the case of a pie chart with an average aggregation. I think we need to make it clear to the user that these are not the final results, like having the visualization be greyed out, but we should show these partial results and leave it up to the user to decide whether or not any information can be gleaned from these partial results.
I argue it is possible to gain valuable information from partial results even in the case of a pie chart and an aggregation like Sum or Avg, or Top hit, assuming the user knows something about their data. For instance if the data is numeric and always >= 0, then as soon as any slice has data in it, the user knows they found a hit where the number is > 0. With the SUM agg, they can know even more information because as long as the number is always positive, they know the numbers they are looking at might still grow higher, but they will never shrink.
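As a small illustration of the SUM argument, here is a sketch that assumes each shard reports a non-negative partial sum and a document count. The helper names and the sample numbers are made up, and the contrast with the average is added for comparison.

```typescript
// Hypothetical per-shard partial result for a single bucket.
interface ShardPartial {
  sum: number; // sum of a non-negative field over this shard's documents
  count: number; // number of documents on this shard
}

// Running total after each shard response arrives.
function runningSums(shards: ShardPartial[]): number[] {
  let total = 0;
  return shards.map((s) => (total += s.sum));
}

// Running average after each shard response arrives.
function runningAverages(shards: ShardPartial[]): number[] {
  let total = 0;
  let docs = 0;
  return shards.map((s) => {
    total += s.sum;
    docs += s.count;
    return total / docs;
  });
}

function isNonDecreasing(values: number[]): boolean {
  return values.every((v, i) => i === 0 || v >= values[i - 1]);
}

const shardPartials: ShardPartial[] = [
  { sum: 10, count: 2 },
  { sum: 90, count: 3 },
  { sum: 4, count: 4 },
];
const sums = runningSums(shardPartials); // [10, 100, 104]
const averages = runningAverages(shardPartials); // 5, then 20, then about 11.56
```

Every partial sum is a valid lower bound on the final sum, which is the guarantee described above. The running average rises and then falls, so it offers no such bound.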
Even if this is not a very common situation, I think it may be easier, technically, to always show partial results by default. The only time I think we need to be careful is when a secondary query is sent out based on data from the first. If that query is a slow query, we need to cancel it as soon as the first query sends us new values. Or perhaps detect this situation and not show partial results to avoid the extra querying overhead.
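One possible shape for the "cancel the secondary query when the first query sends new values" rule is sketched below. `SecondaryQueryCoordinator` and `Cancellable` are hypothetical names; in a browser the handle could wrap something like an `AbortController`, but that wiring is left out here.

```typescript
// Hypothetical handle for an in-flight request that can be cancelled.
interface Cancellable {
  cancel(): void;
}

// Keeps at most one secondary query in flight. Whenever the primary query
// delivers new partial values, the stale secondary query is cancelled and
// a fresh one is started from the new values.
class SecondaryQueryCoordinator<T> {
  private inFlight: Cancellable | null = null;

  constructor(
    private readonly startSecondary: (primaryValues: T) => Cancellable
  ) {}

  onPrimaryPartial(values: T): void {
    if (this.inFlight) {
      this.inFlight.cancel();
    }
    this.inFlight = this.startSecondary(values);
  }

  // Call when the whole search is abandoned, e.g. the panel is removed.
  dispose(): void {
    if (this.inFlight) {
      this.inFlight.cancel();
      this.inFlight = null;
    }
  }
}
```

The alternative mentioned above, detecting this situation and not showing partial results at all, would amount to calling `onPrimaryPartial` only once, with the final values.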
It was also my takeaway from the Make it Slow PR that we should be showing partial results for nearly all visualizations, but not everyone had that same takeaway, so let's use this issue to reach a documented decision.
cc @AlonaNadler @peterschretlen @ppisljar @timroes @alexh97