Support for streaming data #2011
Conversation
After a long discussion with @philippjfr I came around to this approach for now, even if we find a deeper way to integrate streamz support in the long term. There are a few changes that would help:
Force-pushed from 8f1e6d1 to 62a7dd1.
Ready to review once #2007 is merged.
Happy to hold off on merging until I've added a user guide entry and some examples.
Note that I don't think we should warn here; modifying the data can often be useful, e.g. when computing a histogram of the streaming window.
b2f8a26
to
20272f8
Compare
@jbednar Ready for you to make a pass over the user guide. If you wouldn't mind, could you also document the
a. These two statements together are pretty dense:
b. Reading this, "which will wait for 10 sets of stream updates to accumulate and then apply pd.concat to combine those updates," it sounds to me like it will send chunks 1-10 together, then send again with chunks 11-20, then again with chunks 21-30, etc. I suggest rewording "which will wait for 10 sets of stream updates to accumulate and then apply pd.concat to combine those updates" to make it clearer how it works. The current wording also doesn't make clear whether it continues past the 10th chunk. Everything else seems clear to me and is relatively easy to follow.
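To make the chunking semantics concrete, here is a minimal pure-Python sketch of the accumulate-then-combine behaviour being discussed. The `ChunkedBuffer` class and plain-list concatenation are illustrative stand-ins only; the real code accumulates DataFrame chunks and combines them with `pd.concat`.

```python
class ChunkedBuffer:
    """Accumulate incoming chunks and emit them combined once
    `chunksize` updates have arrived, then keep going with the
    next batch (chunks 1-10, then 11-20, and so on)."""

    def __init__(self, chunksize, on_flush):
        self.chunksize = chunksize
        self.on_flush = on_flush  # callback receiving the combined chunk
        self._pending = []

    def send(self, chunk):
        self._pending.append(chunk)
        if len(self._pending) >= self.chunksize:
            # Combine the accumulated chunks (pd.concat in the real code)
            combined = [row for c in self._pending for row in c]
            self._pending = []
            self.on_flush(combined)

emitted = []
buf = ChunkedBuffer(chunksize=3, on_flush=emitted.append)
for i in range(7):
    buf.send([i])
# Chunks 0-2 flush together, then 3-5; chunk 6 is still pending.
```

This also illustrates the answer to the question above: accumulation does continue past the first batch, restarting after every flush.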
holoviews/streams.py
Outdated

    """
    def __init__(self, sdf, backlog=1000, chunksize=None, **params):
        try:
I guess backlog and chunksize can't be Parameters because Streams use Parameters for their own special purposes?
That's right, although I could filter them out. @jlstevens might want to chime in.
They certainly seem like they ought to be Parameters, so that they can be set at the class or instance level as appropriate.
That's quite a long title; not sure if "Working with" adds anything? Maybe "Streaming Data"?
Sure.
Or maybe "Streaming Data Sources", if that helps avoid confusion with hv streams in general.
You could maybe warn in such a case, if the user hasn't passed a special argument writeable=True? Fine by me either way.
Force-pushed from bc73bf7 to ed92995.
It looks like this notebook needs to be run cell by cell for it to work well, which I don't usually like; I'd rather it just sit there consuming CPU so that I can read each bit as I wish, not necessarily one cell at a time in order. So it would be good to show how to have a stream that ends, but I don't think most of the examples should be like that; seems to me that most should just keep going indefinitely. In any case, particularly if single-cell running is encouraged, please change variables to be unique. E.g. source and sdf are defined differently in different cells, which causes confusion and apparent errors when run out of order.
In "Applying operations", instead of redefining
Apart from that, thanks, and it looks great! I'll push my changes to the notebook now.
What would you expect to happen? The axis has to follow the current window, so it's not clear to me how you could avoid setting the range when a new event comes in.
I'm not sure what you mean here either, the backlog determines the range of the axis not the other way around.
I'll update once #2023 is merged.
Sure, I'll add that if you haven't already.
On zooming in, I would expect the size of the viewport to change, even if the actual location of the viewport is tracking what comes in. I'm mainly thinking about the x axis here, where zooming amounts to changing the backlog. If the backlog is 1000 points and I zoom into the x axis 2X, I'd expect to see 500 points at a time from then on. From there, if I zoom back to 1.0X, I'd expect to see 1000 points again. If I instead zoom out, I would expect to see a half-empty plot, because I know the backlog isn't keeping that much data, but I might hope that the backlog will dynamically update based on my zooming so that I now see up to 2000 points if I wait for a while, because the backlog has now been updated to 2000 points. This all seems uncontroversial to me, which doesn't mean it's feasible, but I think it does mean it's reasonable to consider.

The y axis is significantly trickier. If I zoom out, I'd expect to see the same data as before, with a buffer around the top and bottom; probably not very useful, but intuitive. If I zoom in, I'd expect to see half the y range as before, but the question is which bit of the y range, since that keeps jumping around all the time. And I guess the answer is the middle bit of the range, if I haven't done any panning, such that if the auto-ranged Y is 0.3 to 0.6, I'd expect to see 0.4 to 0.5 after zooming y 2X. And then I'd expect panning to show me 0.5 to 0.6 or 0.3 to 0.4, depending on how I drag it.

To motivate all this, just imagine a streaming plot that has quite a bit of data that is arriving very slowly, with a long backlog. Anyone looking at it is going to want to be able to zoom in to a bit of it and check it out, and will be surprised and annoyed if such zooming suddenly gets overwritten every time an update appears.

This all also brings up the issue of wanting to be able to scrub back into some portion of the history, e.g. to have a backlog of 5000 points, showing only the last 1000 but allowing scrubbing (panning to the left, in this case) over the previous 4000. This isn't stuff that needs to be handled in this PR, just functionality that I think is clearly desirable without any weight for how feasible it is to implement.
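As a sketch of the x-axis behaviour proposed above, a hypothetical helper (not part of this PR; the name `adjusted_backlog` is illustrative) mapping a zoom factor onto the number of visible points might look like:

```python
def adjusted_backlog(base_backlog, zoom_factor):
    """Map an x-axis zoom factor to the number of points shown:
    zooming in 2X halves the visible window, returning to 1X
    restores it, and zooming out grows the backlog."""
    return max(1, int(round(base_backlog / zoom_factor)))

# Zoom in 2X on a 1000-point backlog: 500 points visible from then on.
half = adjusted_backlog(1000, 2.0)
# Zoom back to 1.0X: the full 1000 points again.
full = adjusted_backlog(1000, 1.0)
# Zoom out 2X: the backlog grows to 2000 (filling up as data arrives).
double = adjusted_backlog(1000, 0.5)
```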
Oh, and I didn't see your note about documenting the chunksize option for the DataFrameStream, so I don't think I focused on that explicitly.
Okay, I agree none of that is particularly controversial and you can achieve the scrubbing back portion of what you suggested already. The rest may be feasible but I don't think it'll be in this PR. Hunt's suggestion of having some button that lets you stop updates to a plot seems very reasonable here, I could imagine a button that temporarily pauses updates to a plot (while leaving the actual stream intact). That way you could pause the plot, zoom in and then start it again once you've looked at the feature you're interested in.
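A rough sketch of such a pause button, with a hypothetical `PausablePlot` class standing in for the real plot machinery (the stream keeps delivering updates; only the displayed view is frozen):

```python
class PausablePlot:
    """Sketch of the proposed pause behaviour: updates keep arriving
    on the stream, but the 'plot' stops refreshing while paused and
    rejoins the live data on resume."""

    def __init__(self):
        self.paused = False
        self.displayed = []   # what the plot has rendered
        self._latest = None   # most recent data seen on the stream

    def on_update(self, data):
        self._latest = data   # the stream itself is left intact
        if not self.paused:
            self.displayed.append(data)

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False
        if self._latest is not None:
            self.displayed.append(self._latest)  # catch up to the stream

plot = PausablePlot()
plot.on_update(1)
plot.pause()
plot.on_update(2)   # stream continues, plot stays frozen
plot.on_update(3)
plot.resume()       # plot rejoins with the latest data
```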
Oh, yes, I meant to suggest that too, a "go offline" mode, which works only on the data available at that instant, and then rejoins the stream when you click off of it. |
Force-pushed from bb27388 to abf6af2.
Should also address #1775 by adding a psutil-based example.
Hi. On Windows the psutil example fails because there is no active/inactive/wired memory:
Then it fails, or maybe I'm missing a patch on my side?
That's annoying, would you mind pasting what is available on Windows?
Yes, it seems you got an old version of this PR before I rebased fixes to Area.stack into it.
On Windows, there is:
Thanks, that's great! I'll update the example. |
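One defensive way to write such an example is to probe which memory fields the platform actually reports. The sketch below uses a namedtuple as a stand-in for the result of `psutil.virtual_memory()`, since field availability differs by platform (no active/inactive/wired on Windows, as reported above); the field list and numbers here are made up for illustration.

```python
from collections import namedtuple

# Stand-in for psutil.virtual_memory() on Windows; real field sets
# vary by platform, which is exactly the failure described above.
WindowsVMem = namedtuple('WindowsVMem',
                         ['total', 'available', 'percent', 'used', 'free'])
mem = WindowsVMem(total=8000, available=3000, percent=62.5,
                  used=5000, free=3000)

# Prefer the detailed macOS/Linux fields, fall back to used/free.
preferred = ['active', 'inactive', 'wired', 'used', 'free']
fields = [f for f in preferred if hasattr(mem, f)] or ['used', 'free']
sample = {f: getattr(mem, f) for f in fields}
```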
Force-pushed from f5ac900 to 835f1a0.
This is looking a lot better! To start off with, I would like to rework the start of the user guide a little. Currently the first sentence is:
This sentence bothers me as it needs qualification. I've rewritten the introduction in a way I think is clearer given the material in the earlier user guides:
Of course, you don't need to use this suggestion verbatim but I think it is important to explain how this user guide is different from the other ones involving DynamicMap and interactive updates.
I'll be making edits to this comment as I work through the user guide/code and I'll make a note at the end when I'm done! Note that I am now making these changes to the notebook as I go.

Comments (continued)

API comment: @philippjfr I think we should use
I'm happy with your new suggested intro, except that "streaming directly in HoloViews" doesn't mean anything to me; isn't it all streaming directly in HoloViews? Need to make what streamz achieves clearer. The GIF suggestion sounds promising, and it would be great to see a prototype.
@jbednar I have completed a full pass over the user guide making various edits and formatting fixes. I tried to clarify the vague bit of the introduction that Jim pointed out. I'm now very happy with this API, functionality and user guide! Of course I haven't looked closely at the code yet but I expect that it is ok. That said, @philippjfr my last main API gripe is the
holoviews/streams.py
Outdated

    events emitted by a streamz object.

    When streaming a DataFrame will use the DataFrame index by
    default, this may be disabled by setting index=False.
Use the DataFrame index how? As in simply retain it or something more sophisticated?
"Since ``Pipe`` is completely general the data can be any custom type it provides a completely general mechanism to stream structured or unstructured data. Due to this generality it cannot provide some of the more complex features provided by the ``Buffer`` stream."

"This approach of using an element constructor directly does not allow you to use anything other than the default key and value dimensions. One simple workaround this limitation is to use ``functools.partial`` as demonstrated in the **Controlling the backlog section** below."

"Since ``Pipe`` is completely general the data can be any custom type it provides a completely general mechanism to stream structured or unstructured data. Due to this generality it cannot provide some of the more complex features provided by the ``Buffer`` stream that we will now explore."
I can't parse this sentence.
Agreed - I assume you are referring to the last sentence above. I'll rewrite it.
Clarified the sentence in 881c70c
Second-to-last, starting with Since.
I'll edit now with some misc fixes and you can fix up this sentence afterwards if needed.
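For readers trying to follow the Pipe-vs-Buffer distinction that sentence is getting at, here is a toy pure-Python sketch. The class names mirror the discussion but this is not the actual HoloViews implementation or API: a Pipe-like stream forwards arbitrary data unchanged, while a Buffer-like stream additionally accumulates tabular updates up to a fixed backlog.

```python
class Pipe:
    """Toy sketch: forwards whatever it is sent, unmodified, to its
    subscribers, with no assumptions about the data's type."""
    def __init__(self):
        self.subscribers = []

    def send(self, data):
        for cb in self.subscribers:
            cb(data)

class Buffer(Pipe):
    """Toy sketch of the extra behaviour layered on top: accumulate
    row-like updates and trim to a fixed backlog before notifying."""
    def __init__(self, backlog=1000):
        super().__init__()
        self.backlog = backlog
        self.data = []

    def send(self, rows):
        self.data = (self.data + list(rows))[-self.backlog:]
        super().send(self.data)

received = []
buf = Buffer(backlog=3)
buf.subscribers.append(received.append)
buf.send([1, 2])
buf.send([3, 4])  # oldest row dropped to respect the backlog
```

The generality/feature trade-off follows directly: because Pipe makes no assumptions about the data, it cannot offer backlog trimming or accumulation, which only make sense for data that can be concatenated.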
I have now looked through the code and I am now happy with this PR, including the user guide. The only sticking point I would like addressed before merging is whether
I'm happy to have it merged with or without the change /s/backlog/size. Good job!
After a quick brainstorm with @philippjfr we decided. Happy to see this PR merged after that final change.
Maybe you should make that change quickly before I think of more things to mention! ;-p I'm just wondering if we should say something about the plotting optimization being disabled if the data in the element changes in the callback from what was supplied by the
Ready to merge once tests pass. |
Everything seems to be addressed after that last set of commits. Happy to merge once the tests pass. |
Tests are passing except for one build that was restarted due to a transient. Merging. |
This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
This PR builds on #2007, adding a `StreamDataFrame` stream, which lets you subscribe to updates from a `streamz.StreamingDataFrame`. The dependency is entirely optional, and with a bit of glue it's very easy to subscribe to generate a streaming plot this way. For now I have just monkey-patched streamz using this code:

The individual chunks emitted by the `StreamingDataFrame` are fed through a user-supplied callback and are then used to update the bokeh plot using the `CDS.stream` method, which sends just the most recent samples. It may also be worth considering an option where `StreamingDataFrame` itself accumulates data rather than just emitting chunks, which can also be very useful. The current approach makes something like this possible:
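A toy sketch of the two update modes mentioned above: emitting only the newest chunk (as `CDS.stream` sends just the most recent samples) versus accumulating chunks into a rolling window. The `StreamingSink` class and `accumulate` flag are hypothetical, for illustration only.

```python
from collections import deque

class StreamingSink:
    """Sketch of two update modes: pass each chunk through as-is,
    or accumulate chunks into a fixed-size rolling window."""
    def __init__(self, window=1000, accumulate=True):
        self.accumulate = accumulate
        self.window = deque(maxlen=window)  # drops oldest samples itself

    def emit(self, chunk):
        if not self.accumulate:
            return list(chunk)         # just the newest samples
        self.window.extend(chunk)      # rolling accumulated view
        return list(self.window)

sink = StreamingSink(window=4)
first = sink.emit([1, 2, 3])
out = sink.emit([4, 5])  # window keeps only the 4 most recent samples
```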