Nice write-up with a sound analysis! You would have to check in the JanusGraph code whether this behaviour is due to JanusGraph or due to Lucene. In the former case I would count it as a bug, and you can file an issue. Maybe a JanusGraph committer recognizes the "begins with" functionality. I would expect this to be possible only with a wildcard search, where JanusGraph inserts the wildcard (apparently for the first token of the query only).
-
Thank you very much
Thanks! :-) Followed your advice and looked in the code. It has to be somewhere here: https://github.com/JanusGraph/janusgraph/blob/master/janusgraph-lucene/src/main/java/org/janusgraph/diskstorage/lucene/LuceneIndex.java I didn't get into the gory details, but I saw that the raw Lucene queries are logged at DEBUG log level, and voila:
It is even "worse" than your assumption: for single-token query strings the wildcard is added, but as soon as there are multiple query tokens, no wildcard is added for any of them. However, "worse" applies only with respect to the general (desired) logic I am assuming above. So the main question, IMHO, has to be settled before doing anything else: how is textContainsPrefix supposed to behave if the query string contains multiple tokens?
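To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is only a model of what the DEBUG log suggests, not JanusGraph or Lucene code. The function names `observed_terms` and `expected_terms` are hypothetical, the tokenization (lowercasing and splitting on whitespace) is an assumption, and `*` stands for a Lucene-style trailing wildcard.

```python
def observed_terms(query: str) -> list[str]:
    """Model of the observed behaviour (assumption based on the DEBUG log):
    a wildcard is appended only when the query has exactly one token."""
    tokens = query.lower().split()  # assumed tokenization: lowercase + whitespace
    if len(tokens) == 1:
        return [tokens[0] + "*"]
    return tokens  # multiple tokens: no wildcard on any of them


def expected_terms(query: str) -> list[str]:
    """Model of the logic I would expect: every token is treated as a prefix."""
    return [token + "*" for token in query.lower().split()]


if __name__ == "__main__":
    # "other" is a made-up second token, used for illustration only.
    for q in ("sing", "sing other"):
        print(q, observed_terms(q), expected_terms(q))
```

The sketch deliberately says nothing about how the resulting terms are combined (AND vs. OR), since the log output alone does not settle that.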
-
Consider a simple setup with a Lucene search backend containing a mixed index with a TEXT property (example in JRuby; the only dependency is the lock_jar gem, and the complete minimal example is also attached as a zip):
The graph now contains two vertices as follows, plus a helper function that searches the indexed property and prints the number of results found:
Documentation of text search states:
and further:
This works wonderfully if the query string contains only one token; everything behaves as expected (output as comment):
So only prefixes of the property value "singletoken" are found.
The problem is that the definition in the documentation, according to which a result is found
is not applicable to the case where the query string contains multiple words/tokens. If that definition held, no result would ever be found for multi-token query strings, as it is impossible for a single word (in the text string) to begin with multiple words/tokens (of the query string).
Hence my general question is: how does textContainsPrefix behave if the query string contains multiple tokens? In other words, what is the higher-level logic in this case?
Here are a few observations which seem to imply the following logic: for each token in the query string, at least one token must be present in the text string such that the query token is a prefix of that text token:
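Spelled out as code, the logic I am hypothesizing would look roughly like the following Python sketch. This is only my assumed semantics, not what JanusGraph or Lucene actually implement. The names `tokenize` and `contains_prefix_all` are hypothetical, and the tokenizer (lowercasing and splitting on whitespace) is a simplifying assumption.

```python
def tokenize(text: str) -> list[str]:
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def contains_prefix_all(text: str, query: str) -> bool:
    """Hypothesized multi-token semantics: every query token must be a
    prefix of at least one token in the text."""
    text_tokens = tokenize(text)
    return all(
        any(text_token.startswith(query_token) for text_token in text_tokens)
        for query_token in tokenize(query)
    )


if __name__ == "__main__":
    # Single token: only prefixes of "singletoken" match.
    print(contains_prefix_all("singletoken", "single"))  # True
    print(contains_prefix_all("singletoken", "token"))   # False
    # Multiple tokens (made-up text): each query token needs its own match.
    print(contains_prefix_all("some multi token text", "mul tok"))  # True
    print(contains_prefix_all("some multi token text", "mul xyz"))  # False
```

Under this reading, a query with one incomplete token plus further prefixes should still match, as long as each query token is a prefix of some text token.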
However, as soon as the query string contains one incomplete token plus additional prefixes, the results become unexpected:
Here, I would expect both cases to return a result.
I would even assume that this is a bug. But as the documentation of textContainsPrefix is incomplete on this point, I am of course not sure. In any case, the current logic does not make sense to me.
We would be very happy to get some insight into this issue.
best
query_sample.zip