-
Notifications
You must be signed in to change notification settings - Fork 813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TINKERPOP-1822: Change default RepeatStep to DFS #838
Closed
Closed
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you given any thought to the memory requirements of
stashedStarts
? Seems like that approach could be really intensive for a large graph (i.e. a traversal that touches a lot of data)? any thoughts?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, things have gotten fairly busy for me. I'd given a little bit of thought around this, but not a whole lot. I don't think this needs to be a linked list, it could likely an ArrayList or an ArrayDeque and the entries would take up less memory. ArrayDeque may be preferable for large graphs since it'll resize less frequently than an ArrayList, but it will consume more memory than an ArrayList by default.
However, I think having a
stashedStarts
here is necessary from the view of "one piece of code serving 2 algorithms" If this were to be completely rewritten as DFS (I don't know how to do that at this point, but for the sake of argument), and there was a desire to use the same code to utilize BFS, there'd likely be a need to have stashed starts to achieve BFS.That's about as far as I've gotten in thinking about this for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An
ArrayList
sounds reasonable to me as far as the rightList
implementation for howstashedStarts
is being used. When I'd posed the question I was more thinking about the more general increased memory requirements for doing DFS in this fashion as we basically have to accumulate what could be a fairly large list in memory in order to perform this operation. I suppose that we do such things infold()
-ing steps but the user is explicitly aware of their choice to do that when they use such a step. In the case ofstashedStarts
the memory requirement for choosing DFS is somewhat hidden as it's not clear from the Gremlin they've written that an internal list is being built. Perhaps that's simply a side-effect of allowing this to work in the way that it does (as you alluded to in your comment).I'm still thinking that DFS will be something that users will invoke in specific use cases where they will be more aware of the consequences of what it is that they are doing. If that is the case, then this would perhaps just be one more consequence of making that choice to consider.
I'd be curious to see some JFRs around the different modes of execution that we now have. Perhaps some microbenchmarks are in order too. And then the fun part....does OLAP still work without any change? 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@spmallette Any thoughts on what might be some decent data/traversals for the JFRs/microbenchmarks around this?
As far as OLAP goes, what are the expectations there? I haven't made any code changes to the computer algorithm yet since I'm not terribly familiar with that side of TinkerPop, so I think things should still be working working as before. I guess the question here is code changes or documentation changes when it comes to OLAP?
I'm hopefully going to spend some time this weekend or the next around this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just start with the Grateful Dead dataset? I think it might be sufficiently complex to yield a good test of the different approaches we have now. If not, maybe we need to generate something artificial.
Personally, I'd love to see a JFR that executes the same traversal with each of the three configurations that we now have with a
Thread.sleep()
between them so that we can easily distinguish when one traversal stops and the next starts. Not sure what the traversal (or traversals) needs to be - I guess I'd just like to easily compare what happens from a processing/memory perspective with each of the configurations we've talked about and then true that up with the expectations that we have regarding each configuration that we have.I was just curious if it all the tests still pass there or not. I'd assume so given that you didn't make changes there, but I just wanted to be sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added some JFRs to the JIRA and a performance comparison:
https://issues.apache.org/jira/browse/TINKERPOP-1822?focusedCommentId=16530124&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16530124
The results have left me with questions........
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@spmallette Those results are definitely...interesting to say the least. I think the tests themselves are reasonable, though, as a comment, I'm not typically using a repeat that's going to be able to utilize the
RepeatUnrollStrategy
. At least not for what I originally started investigating this for.That said, I took a step back and worked on performance between the BFS and DFS, and have gotten them much closer. On my local machine that BFS test was returning 889 from the counter. With the latest commit I added, DFS is returning 758. That's obviously not coming close to the default "let the strategies do their thing" performance, but it's significantly better than the ops counts being in the teens for DFS. Given that I was expecting slightly degraded performance with DFS, I think this is in a much better place performance wise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this a bit more, it looks like I misread some of the test query results I had, and the new commit doesn't work to make the repeat step depth first, so ignore that last comment. I'm still working on trying to figure out a new approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to remove
RepeatUnrollStrategy
because it adds barriers occasionally as do other strategies - that really needs to be fixed as something separate.https://issues.apache.org/jira/browse/TINKERPOP-2004
I think that if you had a specific use case in mind when you started doing this it would be cool if you did a performance test on that and shared your results if possible.
I also think that the queries in my tests weren't really presenting scenarios where someone would want to do DFS. I'm imaging that the only time DFS will be used is when the user is knowledgeable and has advanced understanding of their data to know that DFS will out perform the default. I assume that your specific use case was falling into that scenario. As i said, it would be nice to see that in action.
So, that said, it would be great to see DFS perform more quickly for that case I presented for the JFRs, but that may not be an explicit requirement. It may be more important to simply demonstrate that DFS has a use case where it can shine.
I didn't study the JFRs for too long as the performance struck me as a point of discussion first. If you have any thoughts to share on those specifically, that would also be cool. Thanks again!