Management of 'empty' index query #35
Could you please give an example, for better understanding?
On a KeywordIndex ('tags') that stores, for example, tags ;-).
I guess you'll need a TreeSet of rids for non-indexed docs in the index, as is done for example in the hypatia FieldIndex (used by substanced).
You can test this branch https://github.com/zopefoundation/Products.ZCatalog/tree/not-parm-patch Queries on [] should work. If that's what you needed, I'll create a PR.
@sgeulette Were you able to try the branch @andbag suggested?
I couldn't test it in Plone.
It would be difficult to implement this on the index level: the index by itself knows only about the documents it has indexed (itself), not about the documents known by the catalog. What you are calling for are the documents known by the catalog and not known by the index.
|
@d-maurer by the way, you identified a bug. The same argument applies to the 'pure not' operation. The index can currently only return documents that the index knows. Objects without a value (== None) belong to the result set of a 'pure not' operation.
Andreas Gabriel wrote at 2019-4-13 12:34 -0700:
@d-maurer by the way, you identified a bug. The same argument applies to the 'pure not' operation. The index can currently only return documents that the index knows. Objects without a value (== None) belong to the result set of a 'pure not' operation.
One could say that this is a feature (not a bug).
At least I have seen in the implementation of one index
(likely `UnIndex`) that an explicit
intersection with the set of indexed objects has been performed --
even with an incoming "resultset".
Without such an implicit restriction, an index alone cannot
evaluate a pure not -- because by itself it does not know
the set of all catalogued objects. It would need assumptions
about its integration (e.g. that its acquisition parent provides
this set) to implement a "true not" -- something conceptually not nice.
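The limitation described above can be illustrated with a toy model. This is only a sketch: plain Python sets stand in for the BTrees structures, and `catalog_rids`, `forward`, and both function names are hypothetical, not part of `Products.ZCatalog`.

```python
# Toy model: plain Python sets stand in for the real BTrees structures.
catalog_rids = {1, 2, 3, 4}  # every rid the catalog knows (assumed)

# Forward index of a hypothetical 'tags' index: keyword -> set of rids.
# Document 4 has no value, so the index never learns about it.
forward = {"a": {1, 2}, "b": {2, 3}}


def indexed_rids(forward):
    """Return all rids the index itself knows about."""
    result = set()
    for rids in forward.values():
        result |= rids
    return result


def pure_not_index_only(forward, excluded):
    """'Pure not' using only index-local knowledge."""
    return indexed_rids(forward) - forward.get(excluded, set())


def pure_not_with_catalog(forward, excluded, all_rids):
    """'True not': requires the catalog-wide set of rids."""
    return all_rids - forward.get(excluded, set())


print(pure_not_index_only(forward, "a"))
print(pure_not_with_catalog(forward, "a", catalog_rids))
```

In this model the index-only variant can never return document 4, because nothing inside the index refers to it.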
|
I think it's more of a bug than a feature, because 'not' support was added on 25 Mar 2012, and indexing of objects with an empty value was disabled due to a BTrees 4.0+ compatibility problem on 2 Nov 2014 (two years later). There must have been a phase in between in which BTrees (<4.0) were used; otherwise the fix would surely have been added earlier. If a unit test had existed for this case, the bug would have been exposed. Since no unit test exists for this case, the bug has not been detected. @d-maurer as vincentfretin suggests, you can collect the document ids with empty values separately in the index. My current incomplete branch follows this idea. What's your opinion?
Andreas Gabriel wrote at 2019-4-14 14:10 -0700:
...
@d-maurer as vincentfretin suggests, you can collect the document ids with empty values separately in the index. My [current incomplete branch](1eb9502) follows this idea. What's your opinion?
If you do this, then there should be a search which excludes those
artificial index entries: i.e. it should be possible to query
for those documents which have a meaningful value for a given index
(the `Indexed` query of `Products.AdvancedQuery`).
Currently, "_unindex.keys()" is this set (and the `BooleanIndex`
makes use of this). If "_unindex" knows all objects, whether or not
the index has a meaningful value for it, there should be some
other means to restrict to documents with meaningful value.
You speak of "empty value". This suggests that you speak in fact of
a `KeywordIndex`. A search for "empty value" is on a
different level than the "normal" `KeywordIndex` searches:
in the first case, you query the complete set of keywords (and that
this set should be empty); in the second case that a given keyword is
in this set.
|
Andreas Gabriel wrote at 2019-4-14 14:10 -0700:
...
@d-maurer as vincentfretin suggests, you can collect the document ids with empty values separately in the index. My [current incomplete branch](1eb9502) follows this idea. What's your opinion?
Should your concern really be `KeywordIndex`, then there
is a simpler fix:
`UnIndex` currently implements the "pure not" via
`record.keys = [k for k in index.keys() if k not in not_parm]`
(and then treats it as an `or` with `not`).
If you change this to
`return difference(self._unindex.keys(), self._apply_not(not_parm, resultset))`,
then the case "empty keywords" is handled correctly (because
documents with an empty keyword set are in `_unindex`).
That said: I agree now with you that the current handling
of the pure not is buggy: it should handle "empty keywords" correctly.
However, I maintain that the cases "empty keywords" and
"keywords not applicable" should remain distinct.
|
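To make the difference between the two "pure not" strategies discussed above concrete, here is a toy comparison. It is a sketch under stated assumptions: `forward` and `unindex` are plain Python stand-ins for the index's `_index` and `_unindex`, and the function names are hypothetical.

```python
# Toy stand-ins for a KeywordIndex (assumed data, not real BTrees):
forward = {"a": {1}, "b": {2}}           # keyword -> rids
unindex = {1: ["a"], 2: ["b"], 3: []}    # rid -> keywords; rid 3 is "empty"


def excluded_rids(forward, not_parm):
    """Union of all rids indexed under any excluded keyword."""
    out = set()
    for key in not_parm:
        out |= forward.get(key, set())
    return out


def pure_not_current(forward, not_parm):
    """Models the existing approach: 'or' over all remaining keys."""
    keys = [k for k in forward if k not in not_parm]
    hits = set()
    for key in keys:
        hits |= forward[key]
    return hits - excluded_rids(forward, not_parm)


def pure_not_proposed(forward, unindex, not_parm):
    """Models the suggested fix: start from everything in the unindex."""
    return set(unindex) - excluded_rids(forward, not_parm)


print(pure_not_current(forward, ["a"]))
print(pure_not_proposed(forward, unindex, ["a"]))
```

Only the second variant returns rid 3, the document whose keyword set is empty, because that document appears in the reverse mapping but under no keyword in the forward mapping.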
@d-maurer I suggest that UnIndex should generally support this feature, with a way to enable or disable it. Then we should decide which indexes derived from UnIndex should support the feature by default. Even for debugging, it would be helpful if you could use the catalog to quickly identify those objects that have no value set for an index. However, I currently have no idea how to name the query option that enables or disables the feature.
@andbag I suggest to model this not via a query option but via a special (search) "term" (aka "key"). This way, it could be combined with "normal" "term"s via
Indexes which support the new feature could be marked with an interface, maybe
|
@d-maurer I am currently experimenting with the IIndexingMissingValue interface. If KeywordIndex implements this interface, should the index consider MissingValue implicitly or explicitly? There are some special queries for which I would like to know what results are expected from the search, e.g.
Should the results here contain implicitly items with MissingValue? Or do I have to explicitly specify the term 'MissingValue' in the query?
|
Hi, I think q3 should return MissingValue brains too, like the "pure not" mentioned before. Regards
Andreas Gabriel wrote at 2019-4-16 13:08 -0700:
@d-maurer I am currently experimenting with the IIndexingMissingValue interface. If KeywordIndex implements this interface, should the index consider MissingValue implicitly or explicitly? There are some special queries for which I would like to know what results are expected from the search, e.g.
`q1 = {'query': ['f'], 'not': ['f']}`
This query requests objects indexed under 'f' and not indexed under 'f'.
No object satisfies these restrictions.
`q2 = {'query': ['f', 'g'], 'not': ['f']}`
This query requests objects indexed under 'f' or 'g'
but not indexed under 'f'. Due to the (non empty) positive part ('f' or 'g'),
`MissingValue` is excluded.
`q3 = {'not': ['f']}`
This query is interpreted as the "pure not" (side note: this
means that the "operator" is silently interpreted as "and"
because an "empty or" has an empty result while an "empty and" does
not make any restriction).
It includes `MissingValue` automatically.
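The three answers above can be replayed on a toy model. This is a sketch of the interpretation described in this reply, not the catalog's actual code: `forward`, `missing`, and `search` are hypothetical names, and plain sets replace the BTrees structures.

```python
# Assumed toy data: keyword -> rids, plus the rids without any value.
forward = {"f": {1, 2}, "g": {2, 3}}
missing = {4}  # the "MissingValue" group


def search(query):
    """Evaluate a query dict with optional 'query' and 'not' parts."""
    positive = query.get("query", [])
    negative = query.get("not", [])
    excluded = set().union(*(forward.get(k, set()) for k in negative))
    if positive:
        # Non-empty positive part: 'or' over the given keys only,
        # so documents without a value can never match.
        hits = set().union(*(forward.get(k, set()) for k in positive))
    else:
        # "Pure not": no positive restriction, so start from everything,
        # including the documents without a value.
        hits = set().union(*forward.values()) | missing
    return hits - excluded


q1 = {"query": ["f"], "not": ["f"]}
q2 = {"query": ["f", "g"], "not": ["f"]}
q3 = {"not": ["f"]}
for q in (q1, q2, q3):
    print(q, "->", search(q))
```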
I noticed recently, that for `KeywordIndex` the distinction
between "empty value" and "missing value" could be interesting.
"empty value" would mean that the object supports keyword assignment
but its keyword set is empty; "missing value" would mean, that
the object does not support keyword assignment at all.
For "empty value", the object is (currently) in "_unindex" -- with an
empty list as value; in the "missing value" case, it is not in "_unindex".
Searches for "empty value" could be supported without a change
in the index structure. It would be sufficient
to implement the "pure not" via `difference(index._unindex, excludes)`
rather than `difference(multiunion([k for k in index._index]), excludes)`.
Note: while former `BTrees` versions supported set operations
between inhomogeneous objects provided they have identical
key structure, the current `BTrees` version no longer does.
Therefore, `difference(index._unindex, excludes)` will fail and would need to
be replaced by `difference(IISet(index._unindex), excludes)`.
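The distinction between "empty value" and "missing value" drawn above can be sketched as follows. This is a minimal illustration with assumed names: `unindex` is a plain dict standing in for `_unindex`, and `catalog_rids` and `classify` are hypothetical. With real `BTrees` objects, the note above about wrapping `_unindex` in `IISet` before calling `difference` would apply.

```python
# Assumed toy data: rid -> list of keywords, as stored in the reverse mapping.
unindex = {1: ["a"], 2: []}
catalog_rids = {1, 2, 3}  # rid 3 does not support keywords at all


def classify(rid, unindex):
    """Classify a rid as 'indexed', 'empty' or 'missing' for this index."""
    if rid not in unindex:
        return "missing"   # object has no keyword attribute at all
    if not unindex[rid]:
        return "empty"     # object supports keywords but has none
    return "indexed"


for rid in sorted(catalog_rids):
    print(rid, classify(rid, unindex))
```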
|
@d-maurer I still have questions. How should the search for the empty set be defined? And which variations should be supported? Which results are expected? Examples:
I'm not sure if this is the right syntax to query for items with an empty set.
Andreas Gabriel wrote at 2019-5-2 04:27 -0700:
@d-maurer I still have questions. How should the search for the empty set be defined? And which variations should be supported? Which results are expected? Examples:
`q1 = {'query': []}`
Should return all items with empty sets, but not items with MissingValue.
`q2 = {'query': [[], 'a']}`
Should return all items with empty sets and items with keyword 'a'.
`q3 = {'not': []}`
Should return all items with sets (length greater than 0) and consequently items with MissingValue.
`q4 = {'not': [[], MissingValue]}`
Should return all items with sets (length greater than 0) but not items with MissingValue.
I would introduce another "pseudo key" `EmptyValue` (beside `MissingValue`)
to explicitly represent the case that the object has an empty
value for this index. This would be relevant only for `KeywordIndex`
like indexes.
In some of your examples above, you use `[]` as representation of the concept
`EmptyValue`. My proposal is very similar but uses a descriptive name
instead of a symbol.
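One way to picture the two pseudo keys is the sketch below. Everything in it is an assumption made for illustration: the sentinel class, the `empty` and `missing` sets, and the `search` function are hypothetical and do not reflect an existing `Products.ZCatalog` API.

```python
class _PseudoKey:
    """Hypothetical sentinel type for special search terms."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


EmptyValue = _PseudoKey("EmptyValue")
MissingValue = _PseudoKey("MissingValue")

# Assumed toy data for a keyword-like index.
forward = {"a": {1}, "b": {2}}  # keyword -> rids
empty = {3}                     # rids with an empty keyword set
missing = {4}                   # rids without any value


def rids_for(term):
    """Resolve a search term, including the two pseudo keys."""
    if term is EmptyValue:
        return empty
    if term is MissingValue:
        return missing
    return forward.get(term, set())


def search(query):
    """Evaluate 'query' (or) and 'not' parts over normal and pseudo keys."""
    positive = query.get("query", [])
    negative = query.get("not", [])
    excluded = set().union(*(rids_for(t) for t in negative))
    if positive:
        hits = set().union(*(rids_for(t) for t in positive))
    else:
        hits = set().union(*forward.values()) | empty | missing
    return hits - excluded


print(search({"query": [EmptyValue]}))
print(search({"query": [EmptyValue, "a"]}))
print(search({"not": [EmptyValue]}))
print(search({"not": [EmptyValue, MissingValue]}))
```

The four calls correspond to q1 through q4 above, with `EmptyValue` written in place of `[]`.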
|
Hi,
Since version 3.0, it is possible to search on 'not': cool! Empty index elements are not included in the results.
It would be nice to also be able to query on empty index values in the catalog.
And then combine those 2 different queries if needed.
What do you think about this improvement ?
Regards