TINKERPOP-1822: Change default RepeatStep to DFS #838

Closed
@@ -19,6 +19,7 @@
package org.apache.tinkerpop.gremlin.process.traversal.step.branch;

import org.apache.tinkerpop.gremlin.process.traversal.Step;
import org.apache.tinkerpop.gremlin.process.traversal.step.Barrier;
import org.apache.tinkerpop.gremlin.process.traversal.Traversal;
import org.apache.tinkerpop.gremlin.process.traversal.Traverser;
import org.apache.tinkerpop.gremlin.process.traversal.step.TraversalParent;
@@ -31,6 +32,7 @@
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;
@@ -187,17 +189,40 @@ public int hashCode() {
        return result;
    }

    private final LinkedList<Traverser.Admin<S>> stashedStarts = new LinkedList<>();

    private Traverser.Admin<S> nextStart(final boolean useDfs) {
        if (!useDfs) {
            return this.starts.next();
        } else {
            if (this.starts.hasNext()) {
                return this.starts.next();
            } else {
                return this.stashedStarts.pop();
            }
        }
    }

    @Override
    protected Iterator<Traverser.Admin<S>> standardAlgorithm() throws NoSuchElementException {
        if (null == this.repeatTraversal)
            throw new IllegalStateException("The repeat()-traversal was not defined: " + this);

        final List<Step> steps = this.repeatTraversal.getSteps();
        final Step stepBeforeRepeatEndStep = steps.get(steps.size() - 2);
        final boolean useDfs = !(stepBeforeRepeatEndStep instanceof Barrier);
        while (true) {
            if (this.repeatTraversal.getEndStep().hasNext()) {
                return this.repeatTraversal.getEndStep();
            } else {
                // Replaces the removed line: final Traverser.Admin<S> start = this.starts.next();
                final Traverser.Admin<S> start = nextStart(useDfs);
                start.initialiseLoops(this.getId(), this.loopName);
                if (useDfs) {
                    final List<Traverser.Admin<S>> localStarts = new ArrayList<>();
                    while (this.starts.hasNext()) {
                        localStarts.add(this.starts.next());
                    }
                    stashedStarts.addAll(0, localStarts);
                }
                if (doUntil(start, true)) {
                    start.resetLoops();
                    return IteratorUtils.of(start);
@@ -240,10 +265,12 @@ protected Iterator<Traverser.Admin<S>> computerAlgorithm() throws NoSuchElementE

    public static <A, B, C extends Traversal<A, B>> C addRepeatToTraversal(final C traversal, final Traversal.Admin<B, B> repeatTraversal) {
        final Step<?, B> step = traversal.asAdmin().getEndStep();
        boolean setBfs = false;
        if (step instanceof RepeatStep && null == ((RepeatStep) step).repeatTraversal) {
            ((RepeatStep<B>) step).setRepeatTraversal(repeatTraversal);
        } else {
            final RepeatStep<B> repeatStep = new RepeatStep<>(traversal.asAdmin());
            List<Step> steps = repeatTraversal.getSteps();
            repeatStep.setRepeatTraversal(repeatTraversal);
            traversal.asAdmin().addStep(repeatStep);
        }
@@ -289,11 +316,41 @@ public RepeatEndStep(final Traversal.Admin traversal) {
            super(traversal);
        }

        final LinkedList<Traverser.Admin<S>> stashedStarts = new LinkedList<>();
Contributor:

Have you given any thought to the memory requirements of stashedStarts? That approach seems like it could be really memory-intensive for a large graph (i.e. a traversal that touches a lot of data). Any thoughts?

Contributor Author:

Sorry for the delay, things have gotten fairly busy for me. I'd given a little bit of thought to this, but not a whole lot. I don't think this needs to be a LinkedList; it could likely be an ArrayList or an ArrayDeque, and the entries would take up less memory. ArrayDeque may be preferable for large graphs since it'll resize less frequently than an ArrayList, but it will consume more memory than an ArrayList by default.

However, I think having stashedStarts here is necessary from the view of "one piece of code serving 2 algorithms". If this were to be completely rewritten as DFS (I don't know how to do that at this point, but for the sake of argument), and there was a desire to use the same code for BFS, there'd likely be a need for stashed starts to achieve BFS.

That's about as far as I've gotten in thinking about this for now.
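
To make the data-structure question concrete, here is a minimal sketch of the stash-at-the-front pattern backed by an ArrayDeque rather than a LinkedList. Plain strings stand in for `Traverser.Admin<S>`, and `StashDemo`, `stashInFront`, and `drain` are hypothetical names for illustration, not TinkerPop API. ArrayDeque has no `addAll(int, Collection)`, so the sketch pushes each batch in reverse to keep its order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the stashedStarts bookkeeping; strings replace traversers.
class StashDemo {
    private final Deque<String> stashedStarts = new ArrayDeque<>();

    // Mirrors LinkedList.addAll(0, batch): the batch lands ahead of everything
    // already stashed, and its internal order is preserved.
    void stashInFront(final List<String> batch) {
        for (int i = batch.size() - 1; i >= 0; i--) {
            stashedStarts.addFirst(batch.get(i));
        }
    }

    // Pops from the head until empty, as nextStart() does once starts runs dry.
    List<String> drain() {
        final List<String> out = new ArrayList<>();
        while (!stashedStarts.isEmpty()) {
            out.add(stashedStarts.pop());
        }
        return out;
    }

    public static void main(final String[] args) {
        final StashDemo demo = new StashDemo();
        demo.stashInFront(Arrays.asList("vadas", "lop"));  // siblings left over at depth 1
        demo.stashInFront(Arrays.asList("ripple", "lop")); // deeper traversers, stashed later
        System.out.println(demo.drain());                  // deeper traversers come out first
    }
}
```

An ArrayList would also work, but inserting at index 0 and removing from the head both shift the remaining elements, so a deque is the more natural fit for this access pattern.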

Contributor:

An ArrayList sounds reasonable to me as the List implementation for how stashedStarts is being used. When I posed the question I was thinking more about the general increase in memory requirements for doing DFS in this fashion, as we basically have to accumulate what could be a fairly large list in memory in order to perform this operation. I suppose that we do such things in fold()-ing steps, but the user is explicitly aware of their choice to do that when they use such a step. In the case of stashedStarts, the memory requirement for choosing DFS is somewhat hidden, as it's not clear from the Gremlin they've written that an internal list is being built. Perhaps that's simply a side-effect of allowing this to work in the way that it does (as you alluded to in your comment).

I'm still thinking that DFS will be something that users will invoke in specific use cases where they will be more aware of the consequences of what it is that they are doing. If that is the case, then this would perhaps just be one more consequence of making that choice to consider.

I'd be curious to see some JFRs around the different modes of execution that we now have. Perhaps some microbenchmarks are in order too. And then the fun part....does OLAP still work without any change? 😄
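
Before reaching for a profiler, a toy model can give a rough feel for how large an in-memory pending list might get. The sketch below walks a complete tree and records the peak number of pending nodes when children are placed at the front (depth-first) versus the back (breadth-first) of the list. It is an assumption-laden illustration: `FrontierDemo` and `peakPending` are hypothetical names, integers stand in for traversers, and nothing here models TinkerPop's lazy iterators, barriers, or strategies, so it says nothing about actual RepeatStep memory use.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: integers are node depths in a complete tree, not TinkerPop traversers.
class FrontierDemo {
    // Returns the largest number of pending nodes held at once while walking a
    // complete tree with the given branching factor down to maxDepth.
    static int peakPending(final int branching, final int maxDepth, final boolean dfs) {
        final Deque<Integer> pending = new ArrayDeque<>();
        pending.add(0); // root at depth 0
        int peak = 1;
        while (!pending.isEmpty()) {
            final int depth = pending.pollFirst();
            if (depth < maxDepth) {
                for (int i = 0; i < branching; i++) {
                    if (dfs) {
                        pending.addFirst(depth + 1); // children jump ahead of stashed siblings
                    } else {
                        pending.addLast(depth + 1);  // children wait behind the current level
                    }
                }
            }
            peak = Math.max(peak, pending.size());
        }
        return peak;
    }

    public static void main(final String[] args) {
        System.out.println("BFS peak: " + peakPending(3, 4, false));
        System.out.println("DFS peak: " + peakPending(3, 4, true));
    }
}
```

In this model the breadth-first peak grows with the width of the deepest level, while the depth-first peak grows with the number of stashed siblings along one path. Real graphs with high-degree vertices can make those sibling lists large, which is the hidden cost raised above.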

Contributor Author:

@spmallette Any thoughts on what might be some decent data/traversals for the JFRs/microbenchmarks around this?

As far as OLAP goes, what are the expectations there? I haven't made any code changes to the computer algorithm yet since I'm not terribly familiar with that side of TinkerPop, so I think things should still be working as before. I guess the question is whether OLAP needs code changes or documentation changes.

I'm hopefully going to spend some time this weekend or the next around this.

Contributor:

> Any thoughts on what might be some decent data/traversals for the JFRs/microbenchmarks around this?

Maybe just start with the Grateful Dead dataset? I think it might be sufficiently complex to yield a good test of the different approaches we have now. If not, maybe we need to generate something artificial.

Personally, I'd love to see a JFR that executes the same traversal with each of the three configurations that we now have with a Thread.sleep() between them so that we can easily distinguish when one traversal stops and the next starts. Not sure what the traversal (or traversals) needs to be - I guess I'd just like to easily compare what happens from a processing/memory perspective with each of the configurations we've talked about and then true that up with the expectations that we have regarding each configuration that we have.
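
One possible shape for such a run, as a sketch only: execute each configuration in turn with a pause in between, so the runs appear as separate bursts on the recording timeline. `ConfigHarness` and `runAll` are hypothetical names, the three workloads are empty placeholders for the same traversal under each configuration, and the JVM is assumed to be started with flight recording enabled (the exact flags depend on the JDK version).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical harness; the Runnables are placeholders for real traversals.
class ConfigHarness {
    // Runs each named workload in insertion order, sleeping between runs so
    // that each run is easy to tell apart in a profiler timeline.
    static Map<String, Long> runAll(final Map<String, Runnable> workloads, final long pauseMillis) {
        final Map<String, Long> elapsedNanos = new LinkedHashMap<>();
        boolean first = true;
        for (final Map.Entry<String, Runnable> entry : workloads.entrySet()) {
            if (!first) {
                try {
                    Thread.sleep(pauseMillis); // quiet gap between runs
                } catch (final InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted between runs", e);
                }
            }
            first = false;
            final long begin = System.nanoTime();
            entry.getValue().run();
            elapsedNanos.put(entry.getKey(), System.nanoTime() - begin);
        }
        return elapsedNanos;
    }

    public static void main(final String[] args) {
        final Map<String, Runnable> workloads = new LinkedHashMap<>();
        // Replace these no-op bodies with the same traversal under each configuration.
        workloads.put("default", () -> { });
        workloads.put("bfs", () -> { });
        workloads.put("dfs", () -> { });
        runAll(workloads, 5000L).forEach((name, nanos) ->
                System.out.println(name + ": " + nanos / 1_000_000 + " ms"));
    }
}
```

Wall-clock timings from a single pass like this are only a rough guide; a proper microbenchmark would need warm-up and repeated iterations.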

> As far as OLAP goes, what are the expectations there?

I was just curious whether all the tests still pass there or not. I'd assume so given that you didn't make changes there, but I just wanted to be sure.

Contributor Author:

@spmallette Those results are definitely...interesting to say the least. I think the tests themselves are reasonable, though, as a comment, I'm not typically using a repeat that's going to be able to utilize the RepeatUnrollStrategy. At least not for what I originally started investigating this for.

That said, I took a step back and worked on performance between the BFS and DFS, and have gotten them much closer. On my local machine that BFS test was returning 889 from the counter. With the latest commit I added, DFS is returning 758. That's obviously not coming close to the default "let the strategies do their thing" performance, but it's significantly better than the ops counts being in the teens for DFS. Given that I was expecting slightly degraded performance with DFS, I think this is in a much better place performance wise.

Contributor Author:

Looking at this a bit more, it looks like I misread some of the test query results I had, and the new commit doesn't work to make the repeat step depth first, so ignore that last comment. I'm still working on trying to figure out a new approach.

Contributor:

I had to remove RepeatUnrollStrategy because it adds barriers occasionally as do other strategies - that really needs to be fixed as something separate.

https://issues.apache.org/jira/browse/TINKERPOP-2004

I think that if you had a specific use case in mind when you started doing this it would be cool if you did a performance test on that and shared your results if possible.

I also think that the queries in my tests weren't really presenting scenarios where someone would want to do DFS. I'm imagining that the only time DFS will be used is when the user is knowledgeable and understands their data well enough to know that DFS will outperform the default. I assume that your specific use case falls into that scenario. As I said, it would be nice to see that in action.

So, that said, it would be great to see DFS perform more quickly for that case I presented for the JFRs, but that may not be an explicit requirement. It may be more important to simply demonstrate that DFS has a use case where it can shine.

I didn't study the JFRs for too long as the performance struck me as a point of discussion first. If you have any thoughts to share on those specifically, that would also be cool. Thanks again!
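
As one illustration of where depth-first order could pay off, consider a search that stops at the first match found at a known depth, roughly the shape of a repeat() that ends with a limit of one result. The toy model below counts how many nodes are taken off the pending list before that first match. `FirstHitDemo` and `visitedUntilFirstHit` are hypothetical names, the complete tree is an assumption chosen to make the arithmetic obvious, and the numbers say nothing about TinkerPop's actual performance.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: a complete tree where any node at targetDepth counts as a match.
class FirstHitDemo {
    // Counts how many nodes are taken off the pending list before the first
    // match is found, including the match itself.
    static int visitedUntilFirstHit(final int branching, final int targetDepth, final boolean dfs) {
        final Deque<Integer> pending = new ArrayDeque<>();
        pending.add(0); // root at depth 0
        int visited = 0;
        while (!pending.isEmpty()) {
            final int depth = pending.pollFirst();
            visited++;
            if (depth == targetDepth) {
                return visited; // first match: stop early
            }
            for (int i = 0; i < branching; i++) {
                if (dfs) {
                    pending.addFirst(depth + 1);
                } else {
                    pending.addLast(depth + 1);
                }
            }
        }
        return visited; // no node at targetDepth exists
    }

    public static void main(final String[] args) {
        System.out.println("BFS visited: " + visitedUntilFirstHit(3, 4, false));
        System.out.println("DFS visited: " + visitedUntilFirstHit(3, 4, true));
    }
}
```

Breadth-first order has to exhaust every shallower level before it reaches the target depth, whereas depth-first order walks one path straight down. A use case with deep matches and early termination is therefore a plausible candidate for the performance test suggested above.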


        private Traverser.Admin<S> nextStart(RepeatStep<S> repeatStep, boolean useDfs) {
            if (!useDfs) {
                return this.starts.next();
            } else {
                if (this.starts.hasNext()) {
                    return this.starts.next();
                } else {
                    return this.stashedStarts.pop();
                }
            }
        }

        @Override
        public boolean hasNext() {
            return super.hasNext() || !this.stashedStarts.isEmpty();
        }

        @Override
        protected Iterator<Traverser.Admin<S>> standardAlgorithm() throws NoSuchElementException {
            final RepeatStep<S> repeatStep = (RepeatStep<S>) this.getTraversal().getParent();
            final List<Step> steps = repeatStep.repeatTraversal.getSteps();
            final Step stepBeforeRepeatEndStep = steps.get(steps.size() - 2);
            final boolean useDfs = !(stepBeforeRepeatEndStep instanceof Barrier);
            while (true) {
                // Replaces the removed line: final Traverser.Admin<S> start = this.starts.next();
                final Traverser.Admin<S> start = nextStart(repeatStep, useDfs);
                start.incrLoops();
                if (useDfs) {
                    final List<Traverser.Admin<S>> localStarts = new ArrayList<>();
                    while (this.starts.hasNext()) {
                        localStarts.add(this.starts.next());
                    }
                    stashedStarts.addAll(0, localStarts);
                }
                start.incrLoops();
                if (repeatStep.doUntil(start, false)) {
                    start.resetLoops();

14 changes: 14 additions & 0 deletions gremlin-test/features/branch/Repeat.feature
@@ -354,3 +354,17 @@ Scenario: g_VX6X_repeatXa_bothXcreatedX_simplePathX_emitXrepeatXb_bothXknowsXX_u
    Then the result should be unordered
      | result |
      | josh |
  Scenario: g_V_hasXname_markoX_repeatXoutE_order_byXweight_decrX_inVX_emit
    Given the modern graph
    And the traversal of
      """
      g.V().has("name", "marko").repeat(__.outE().order().by("weight", Order.decr).inV()).emit()
      """
    When iterated to list
    Then the result should be ordered
      | result |
      | josh |
      | ripple |
      | lop |
      | vadas |
      | lop |

@@ -47,6 +47,7 @@
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.loops;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.out;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.outE;
import static org.apache.tinkerpop.gremlin.process.traversal.Order.decr;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.core.Is.is;
import static org.hamcrest.collection.IsIterableContainingInOrder.contains;
@@ -122,6 +123,8 @@ public abstract class RepeatTest extends AbstractGremlinProcessTest {

    public abstract Traversal<Vertex, String> get_g_VX6X_repeatXa_bothXcreatedX_simplePathX_emitXrepeatXb_bothXknowsXX_untilXloopsXbX_asXb_whereXloopsXaX_asXbX_hasXname_vadasXX_dedup_name(final Object v6Id);

    public abstract Traversal<Vertex, Vertex> get_g_V_hasXname_markoX_repeatXoutE_order_byXweight_decrX_inVX_emit();

    @Test
    @LoadGraphWith(MODERN)
    public void g_V_repeatXoutX_timesX2X_emit_path() {
@@ -455,6 +458,18 @@ public void g_VX6X_repeatXa_bothXcreatedX_simplePathX_emitXrepeatXb_bothXknowsXX
        assertFalse(traversal.hasNext());
    }

    @Test
    @LoadGraphWith(MODERN)
    public void g_V_hasXname_markoX_repeatXoutE_order_byXweight_decrX_inVX_emit() {
        final List<Vertex> vertices = get_g_V_hasXname_markoX_repeatXoutE_order_byXweight_decrX_inVX_emit().toList();
        assertEquals(5, vertices.size());
        assertEquals("josh", vertices.get(0).values("name").next());
        assertEquals("ripple", vertices.get(1).values("name").next());
        assertEquals("lop", vertices.get(2).values("name").next());
        assertEquals("vadas", vertices.get(3).values("name").next());
        assertEquals("lop", vertices.get(4).values("name").next());
    }

    public static class Traversals extends RepeatTest {

        @Override
@@ -577,5 +592,10 @@ public Traversal<Vertex, String> get_g_VX6X_repeatXa_bothXcreatedX_simplePathX_e
        public Traversal<Vertex, String> get_g_VX1X_repeatXrepeatXunionXout_uses_out_traversesXX_whereXloops_isX0X_timesX1X_timeX2X_name(final Object v1Id) {
            return g.V(v1Id).repeat(__.repeat(__.union(out("uses"), out("traverses")).where(__.loops().is(0))).times(1)).times(2).values("name");
        }

        @Override
        public Traversal<Vertex, Vertex> get_g_V_hasXname_markoX_repeatXoutE_order_byXweight_decrX_inVX_emit() {
            return g.V().has("name", "marko").repeat(__.outE().order().by("weight", decr).inV()).emit();
        }
    }
}
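
The ordering asserted in g_V_hasXname_markoX_repeatXoutE_order_byXweight_decrX_inVX_emit can be reproduced without TinkerPop as a sanity check. The sketch below hard-codes the out-edges reachable from marko in the modern toy graph, already sorted by descending weight, and emits vertices in depth-first and breadth-first order. `ModernOrderDemo` and `emitOrder` are hypothetical names, and the adjacency table is a hand-written stand-in for the graph rather than anything loaded through the TinkerPop API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hand-written stand-in for the part of the modern graph reachable from marko.
class ModernOrderDemo {
    // Out-neighbours per vertex, pre-sorted by descending edge weight.
    private static final Map<String, List<String>> OUT_BY_WEIGHT_DESC = new HashMap<>();
    static {
        OUT_BY_WEIGHT_DESC.put("marko", Arrays.asList("josh", "vadas", "lop")); // weights 1.0, 0.5, 0.4
        OUT_BY_WEIGHT_DESC.put("josh", Arrays.asList("ripple", "lop"));         // weights 1.0, 0.4
    }

    // Emits every vertex reached from start (the start itself is not emitted).
    static List<String> emitOrder(final String start, final boolean dfs) {
        final Deque<String> pending = new ArrayDeque<>();
        enqueue(pending, start, dfs);
        final List<String> emitted = new ArrayList<>();
        while (!pending.isEmpty()) {
            final String vertex = pending.pollFirst();
            emitted.add(vertex);
            enqueue(pending, vertex, dfs);
        }
        return emitted;
    }

    private static void enqueue(final Deque<String> pending, final String vertex, final boolean dfs) {
        final List<String> children = OUT_BY_WEIGHT_DESC.getOrDefault(vertex, Collections.emptyList());
        if (dfs) {
            // Push in reverse so the heaviest edge is taken first.
            for (int i = children.size() - 1; i >= 0; i--) {
                pending.addFirst(children.get(i));
            }
        } else {
            pending.addAll(children);
        }
    }

    public static void main(final String[] args) {
        System.out.println("DFS: " + emitOrder("marko", true));
        System.out.println("BFS: " + emitOrder("marko", false));
    }
}
```

The depth-first order matches the five results listed in the feature file, while a breadth-first walk of the same data would emit vadas and lop before ripple. That difference is what makes this traversal a useful regression test for the ordering.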