Enable multiQuery optimization for PropertyMapStep and ElementMapStep [cql-tests] [tp-tests] #3803
Conversation
Force-pushed f3cdce1 to 0588749.
Force-pushed 6912efd to 4ee27d0.
I've only reviewed the doc changes and a small part of the code changes so far. I will try to review the rest of the code changes next week, but I'm not sure yet whether I'll find enough time for that, so I wanted to already submit the only comment I have at the moment (meaning the inline comment, not this one ^^).
docs/configs/janusgraph-cfg.md (outdated)
@@ -372,6 +372,7 @@ behaves same as `all_properties`.<br>- `required_and_next_properties_or_all` Pre
`values`, `properties`, `valueMap`, `elementMap`, or `propertyMap` then acts like `all_properties`.<br>- `none` Skips `has` step batch properties pre-fetch optimization.<br> | String | required_and_next_properties | MASKABLE |
| query.batch.limited | Configure a maximum batch size for queries against the storage backend. This can be used to ensure responsiveness if batches tend to grow very large. The used batch size is equivalent to the barrier size of a preceding `barrier()` step. If a step has no preceding `barrier()`, the default barrier of TinkerPop will be inserted. This option only takes effect if `query.batch.enabled` is `true`. | Boolean | true | MASKABLE |
| query.batch.limited-size | Default batch size (`barrier()` step size) for queries. This size is applied only for cases where the `LazyBarrierStrategy` strategy didn't apply a `barrier` step and where the user didn't apply a barrier step either. This option is used only when `query.batch.limited` is `true`. Notice, the value `2147483647` is considered to be unlimited. | Integer | 2500 | MASKABLE |
| query.batch.properties-mode | Properties pre-fetching mode for `values`, `properties`, `valueMap`, `propertyMap`, `elementMap` steps. Used only when `query.batch.enabled` is `true`.<br>Supported modes:<br>- `all_properties` Pre-fetch all vertex properties on any property access (fetches all vertex properties in a single slice query)<br>- `required_properties_only` Pre-fetch necessary vertex properties only (uses a separate slice query per each required property)<br>- `none` Skips vertex properties pre-fetching optimization.<br> | String | required_properties_only | MASKABLE |
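As an illustration (option names are taken from the table above; the values shown are the documented defaults, except `properties-mode`, which is set explicitly here for emphasis), these settings could be combined in a graph configuration file like:

```properties
# Batched (multiQuery) execution must be enabled for the options below to apply.
query.batch.enabled = true
# Cap each batch at the barrier size of the preceding barrier() step.
query.batch.limited = true
# Default barrier size used when neither LazyBarrierStrategy nor the user added a barrier().
query.batch.limited-size = 2500
# Pre-fetch only the required properties (one slice query per property);
# switch to all_properties to fetch everything in a single slice query.
query.batch.properties-mode = required_properties_only
```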
I haven't looked into the implementation, so I'm only writing this after reading the description here:
If I use `all_properties`, then it fetches all vertex properties in a single slice query, but if I use the default of `required_properties_only`, then it uses a separate slice query per each required property.
So if my vertices all have the same 10 properties and I want all of them, then `all_properties` will just perform one backend query, but `required_properties_only` will perform 10?
This sounds to me like I'm getting better performance with `all_properties` in this example, but then I have to accept the downside that I will also fetch properties that I don't need in other traversals, for example when I only want one specific property there. Is that correct?
Wouldn't it also make sense to use just one query for `required_properties_only`?
> This sounds to me like I'm getting better performance with `all_properties` in this example, but then I have to accept the downside that I will also fetch properties for other traversals that I don't need if I, for example, only want one specific property there. Is that correct?
> Wouldn't it also make sense to use just one query for `required_properties_only`?
Disclaimer: right now, in the `master` branch, JanusGraph always fetches ALL vertex properties for `values`, `properties`, `valueMap`, `propertyMap`, `elementMap` (and other steps) because of the default setting `query.fast-property = true`. If we change it to `query.fast-property = false`, it fetches each vertex property in a separate slice query.
Of course, it would be really great to be able to fetch multiple selected properties using a single slice query, but that's not easy with the current JanusGraph data structure. At this moment JanusGraph can't retrieve two separate properties in a single slice query. JanusGraph knows how to retrieve a single specific property or all properties.
To explain why, let's imagine the following row:
propertyA: foo; propertyB: bar; propertyC: fooBar
Now, let's imagine the user wants to retrieve `propertyA` and `propertyC`. How is JanusGraph supposed to construct a single slice query that retrieves only those two properties without `propertyB`?
At this moment JanusGraph can't do that. I think we could implement something like a `multi-slice` query or a `batched-slice` query, so that JanusGraph can specify multiple ranges it is interested in.
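To illustrate the contiguity constraint with a toy model (plain Java, not JanusGraph internals; the row layout and names here are only illustrative):

```java
import java.util.Set;
import java.util.TreeMap;

// Toy model: a vertex row is a set of column keys sorted lexicographically,
// and a single slice query can only return one contiguous key range [start, end).
public class SliceDemo {

    // Returns the keys a single slice [start, end) would fetch from the row.
    static Set<String> slice(TreeMap<String, String> row, String start, String end) {
        return row.subMap(start, end).keySet();
    }

    static Set<String> demo() {
        TreeMap<String, String> row = new TreeMap<>();
        row.put("propertyA", "foo");
        row.put("propertyB", "bar");
        row.put("propertyC", "fooBar");
        // The smallest single slice covering both propertyA and propertyC
        // inevitably also returns propertyB, because the range is contiguous.
        return slice(row, "propertyA", "propertyC\u0000");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [propertyA, propertyB, propertyC]
    }
}
```

A `multi-slice` query as described above would correspond to requesting the two disjoint ranges `[propertyA, propertyA\u0000)` and `[propertyC, propertyC\u0000)` in one round trip.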
Nevertheless, this work on multi-property slice queries is beyond the current PR. I think if we implement it, then `query.fast-property` will become obsolete because users would always prefer retrieving only the necessary properties in a single query. Right now JanusGraph users need to make a trade-off between `query.fast-property = true` and `query.fast-property = false`. The same applies to these `batch` configurations: users need to trade off between retrieving only required properties (multiple backend requests but no redundant data fetched) and retrieving all properties (a single backend request but possibly redundant data fetched).
P.S. I have plans to cover this topic later in a more in-depth blog post.
IIRC `label` is not fetched in `all_properties` mode (or equivalently, when `fast-property=true`). IMO if the overhead is not large (in other words, if this does not lead to an extra backend call), we should also fetch `label`. The reason is:
Let's say the user has an index that only applies to the vertex label `god`:
`mgmt.buildIndex('byNameAndLabel', Vertex.class).addKey(name).indexOnly(god).buildCompositeIndex()`
When JanusGraph upserts the property `name` for ANY vertex, it has to check the label of that vertex. This OFTEN leads to a backend call because the label of that vertex is not cached, even if `fast-property=true` (again, I am not 100% sure, but that's what I remember).
This is not really part of this PR per se, but I just want to bring this issue to your attention since you are working in this area.
> IIRC `label` is not fetched in `all_properties` mode (or equivalently, when `fast-property=true`). [...] This is not really part of this PR per se, but I just want to bring this issue to your attention since you are working on this area.
As far as I can tell, the vertex label is stored as an edge and not as a property. I could be wrong, but whenever we call the `AbstractVertex.label` method it uses the following backend call:

```
queryBuilder.noPartitionRestriction().type(BaseLabel.VertexLabelEdge).direction(Direction.OUT);
```

As I understand it, this retrieves the OUT edge with the `VertexLabelEdge` label.
Thus, I'm not sure we can fetch properties and the label edge via a single slice query. Unless we are sure that `VertexLabelEdge` is placed near the properties in the row (not sure how to check that), we will need two slice queries to retrieve vertex properties and the vertex label.
I have added multi-query optimisation to retrieve vertex labels for the following steps: `hasLabel` (this was added in the previous PR), `valueMap().with(WithOptions.tokens, WithOptions.label)`, and `elementMap()`. These steps use an additional multi-query to retrieve vertex labels in batches.
I also have plans to add multi-query optimisation to the Gremlin `label` step (`g.V().label()`), but it has a small challenge because `LabelStep` in TinkerPop is `final`. Thus, we need to remove the `final` keyword first in TinkerPop itself. Nevertheless, it won't be a part of this PR, but I will create a separate ticket.
The issue you shared is not fully related to multi-query optimisation, but I totally agree that it makes sense to cache the vertex label instead of firing a new query each time.
For the `valueMap` and `elementMap` steps you may see that I evaluate vertex labels (when they are required) for a batch of vertices and then store all labels for the current batch in a `Map<JanusGraphVertex, String>`.
For foldable `has` steps the logic relies on the transaction cache instead. We use a multi-query (no bigger than the transaction cache) and immediately evaluate the result, which reuses transaction cache data. We always evaluate `hasId` first (no backend calls); if `hasId` passes, we then evaluate `hasLabel` (first multi-query); if `hasLabel` passes, we evaluate `has` property conditions (second multi-query). The result is cached for the current batch in a `Map<JanusGraphVertex, Boolean>`, where the Boolean indicates whether the vertex passed all conditions.
Thus, for the `hasLabel`, `valueMap`, and `elementMap` steps we are pre-fetching vertex labels, and no additional caching is needed for these steps.
Nevertheless, as I mentioned above, I totally agree that it makes sense to cache vertex labels, but I haven't looked in that direction yet. All I can tell is that the vertex-fetching slice query is going to be cached in the transaction cache, but if the transaction cache is small, the result might be evicted, which will result in a separate backend call whenever we need that vertex label again.
I also agree that we need to add multi-query vertex label pre-fetching to other steps. An obvious step is `label` (mentioned above), but another one could be the `property` step (`g.V().property("name", "foo")`), which applies to the upsert case you referred to above (when an upsert touches an index constrained to a vertex label).
I don't think improved vertex label caching or label multi-query optimisation for other steps (other than `valueMap` or `elementMap`) are going to be part of this PR, but I agree we need to take a closer look at them and implement those additional optimisations.
P.S. `fast-property=true` is applicable to vertex properties only, and not to the label. Thus, you are right that vertex labels are not fetched and/or cached with `fast-property=true`.
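The per-batch label caching described above can be sketched like this (a hypothetical model in plain Java; the class and method names are illustrative, not JanusGraph's actual API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the batching idea: resolve labels for a whole batch of vertices
// with one backend round trip and keep them in a Map for the current batch,
// instead of firing one backend query per vertex.
public class LabelBatchSketch {
    int backendCalls = 0;

    // Simulates one multi-query round trip that returns labels for all ids.
    Map<Integer, String> fetchLabels(List<Integer> vertexIds) {
        backendCalls++; // one call regardless of how many vertices are in the batch
        Map<Integer, String> labels = new HashMap<>();
        for (int id : vertexIds) {
            // Placeholder label assignment for the demo.
            labels.put(id, id % 2 == 0 ? "god" : "human");
        }
        return labels;
    }

    public static void main(String[] args) {
        LabelBatchSketch sketch = new LabelBatchSketch();
        List<Integer> batch = List.of(1, 2, 3, 4, 5);
        // One round trip for the whole batch instead of one per vertex.
        Map<Integer, String> cache = sketch.fetchLabels(batch);
        System.out.println(cache.get(2) + " / backend calls: " + sketch.backendCalls);
    }
}
```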
Thanks for the detailed explanation, @porunov! I agree that changing this is out of scope for this PR. This was mostly a question for my own understanding.
We could maybe add a short explanation to the docs section about batch processing to make sure that users are aware of the trade-off.
> We could maybe add a short explanation to the docs section about batch processing to make sure that users are aware of the trade-off.
The explanation doc makes sense, but I'm not sure that it should be part of the batch processing doc, because it relates more to JanusGraph's data structure and the logic applies with batch processing both enabled and disabled.
Nevertheless, I think it makes sense to open a separate docs improvement issue.
Docs improvement issue: #3814
Taking into consideration that the query batch modes are part of the `query.batch` configuration namespace, I'm now thinking it actually makes sense to add that explanation under the Batch processing documentation as @FlorianHockmann suggested.
I added it at the end of `batch-processing.md`.
> This sounds to me like I'm getting a better performance with all_properties in this example, but then I have to accept the downside that I will also fetch properties for other traversals that I don't need if I for example only want one specific property there. Is that correct?
It's correct as of now. I.e., when multi-query is enabled, then `g.V(v1,v2,v3,v4,v5).values("prop1", "prop2", "prop3")` will execute 3 separate queries for the provided vertices (`v1`, `v2`, `v3`, `v4`, `v5`). Moreover, the current multi-query cannot execute several slice queries together. Instead it executes them one by one (awaiting until the previous query is fully finished). The behavior is the same in the `master` branch and in this PR for `required_properties_only` mode. However, this PR also adds `all_properties`, which fetches all vertex properties using a single slice query.
As of now I can say: you are correct. You will get much better performance with `all_properties` instead of `required_properties_only` (not applicable to the `master` branch because the `master` branch currently always works in `required_properties_only` mode).
However, you will get worse performance not because you need to execute 10 CQL queries instead of 1, but because those 10 queries are executed sequentially. If they were executed in parallel, the performance difference between executing 1 CQL query or 100 CQL queries would most likely be negligible (as long as the connection channels are not overloaded).
Thus, I started working on #3825. This will allow storage implementations to execute the slice queries of a multi-query however they like. I currently added a CQL storage driver optimization to execute all slice queries of a multi-query in parallel instead of sequentially. This should significantly improve performance in cases when multiple slice queries are needed to evaluate a multi-query (like the example above, `g.V(v1,v2,v3,v4,v5).values("prop1", "prop2", "prop3")`).
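The sequential-versus-parallel difference can be sketched in plain Java (this is an illustrative model, not the actual CQL driver code; with real network latency the parallel variant costs roughly one round trip instead of N):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Sketch: executing N independent slice queries one by one vs. issuing
// them all first and awaiting them together.
public class ParallelSlices {

    // Stand-in for one backend slice query.
    static String runSlice(String property) {
        return property + "-value";
    }

    // Sequential execution: each query waits for the previous one to finish.
    static List<String> fetchSequential(List<String> props) {
        return props.stream().map(ParallelSlices::runSlice).collect(Collectors.toList());
    }

    // Parallel execution: issue all queries first, then await all results.
    static List<String> fetchParallel(List<String> props) {
        List<CompletableFuture<String>> futures = props.stream()
                .map(p -> CompletableFuture.supplyAsync(() -> runSlice(p)))
                .collect(Collectors.toList());
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> props = List.of("prop1", "prop2", "prop3");
        // Both strategies return the same results; only the latency profile differs.
        System.out.println(fetchSequential(props).equals(fetchParallel(props))); // true
    }
}
```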
After #3825, the situation will change and `required_properties_only` might actually work faster than `all_properties` in many cases.
Moreover, after we implement #3816, `required_properties_only` will probably be the better choice in most cases. After that, `all_properties` will be beneficial only in some very specific cases when pre-fetching all properties into the transaction cache helps the evaluation of the next steps.
> Thus, I'm not sure we can fetch properties and the label edge via a single slice query. Unless we are sure that VertexLabelEdge is placed near the properties in the row (not sure how to check that) then we will need two slice queries to retrieve vertex properties and the vertex label.
Agreed, this is out of the scope of this PR. In order to check that, you can call `.label.profile()` and see the backend call. A property falls in the range [0x40, 0x60) while an edge falls in the range [0x60, 0x80). I don't remember what `0x60` stands for, but if it happens to be the label, then we could include it when fetching all properties.
UPDATE: I take that back, the profiler does not show the backend call for the label step.
UPDATE: I took a quick look and it seems the query for `label` is `[0x24, 0x25)`, so it seems okay to fetch the label together with the properties.
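The key-range reasoning above can be sketched as a small classifier (the byte ranges are taken directly from this comment thread and are not verified against JanusGraph's internals; the class and method names are illustrative):

```java
// Classify a column key's first byte to see which kind of data a slice
// covering it would touch, using the ranges quoted in the discussion:
// label at 0x24, properties in [0x40, 0x60), edges in [0x60, 0x80).
public class KeyRangeCheck {
    static String classify(int firstByte) {
        if (firstByte == 0x24) return "vertex label";
        if (firstByte >= 0x40 && firstByte < 0x60) return "property";
        if (firstByte >= 0x60 && firstByte < 0x80) return "edge";
        return "other";
    }

    public static void main(String[] args) {
        // The label column (0x24) sits below the property range [0x40, 0x60),
        // so one slice covering both would have to span [0x24, 0x60).
        System.out.println(classify(0x24) + ", " + classify(0x45) + ", " + classify(0x61));
    }
}
```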
Force-pushed 4ee27d0 to a3bb658.
@JanusGraph/committers In case anyone wants to have an introduction call for this PR (or an explanation of #3244), so that it would be easier to review, please ping me here or in Discord to set up a call. I think I could walk over the PR quickly to explain this optimization.
Benchmark results: Master branch (the same …); Current PR (…).
Conclusion: What is missed in the above benchmarks? I realized that I wanted to show the performance difference for cases when the user wants to retrieve multiple properties (like ~5 properties) with different modes.
Force-pushed a3bb658 to 5240e3b.
New benchmark of this PR. This benchmark is similar to the one above, but with the following changes: …
Force-pushed 5240e3b to c7423c6.
Refactored. The current benchmark is below: …
@FlorianHockmann @li-boxuan please let me know if you have plans to review this PR, or whether you would be OK with lazy consensus for this PR.
This is (another) great piece of work! I only have some nitpicks, mostly related to the doc.
One thing I am curious about: why did you put these two significant efforts:
Adds possibility to fetch properties and labels of vertices using valueMap, elementMap, propertyMap steps.
and
Adds fetching modes to properties, values, valueMap, elementMap, propertyMap steps to be able to preFetch all properties (single slice query) or only required properties (separate slice query per each requested property).
into the same PR? Is that just a coincidence?
.../main/java/org/janusgraph/graphdb/tinkerpop/optimize/step/fetcher/LabelStepBatchFetcher.java
Thank you for the review!
Do you mean why I didn't split the work into two separate PRs? If so: just to be clear, I will comment on both points:
Currently in …
Currently in …
… [cql-tests] [tp-tests] Adds possibility to fetch properties and labels of vertices using valueMap, elementMap, propertyMap steps. Adds fetching modes to properties, values, valueMap, elementMap, propertyMap steps to be able to preFetch all properties (single slice query) or only required properties (separate slice query per each requested property). Fixes JanusGraph#2444 Fixes JanusGraph#3814 Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
Force-pushed c7423c6 to 813da90.
LGTM, I only found 3 really tiny linguistic / style nits in the docs. You can also just address those while merging.
Adds possibility to fetch properties and labels of vertices using valueMap, elementMap, propertyMap steps.
Adds fetching modes to properties, values, valueMap, elementMap, propertyMap steps to be able to preFetch all properties (single slice query) or only required properties (separate slice query per each requested property).
Fixes #2444
Fixes #3814
Thank you for contributing to JanusGraph!
In order to streamline the review of the contribution, we ask you to ensure the following steps have been taken: