Management of 'empty' index query #35

sgeulette · 2018-03-21T08:24:25Z

Hi,
Since version 3.0, it is possible to search on 'not': cool! Empty index elements are not included in the results.
It would be nice to also be able to query on empty index values in the catalog,
and then combine those two kinds of queries if needed.

What do you think about this improvement?
Regards

andbag · 2018-03-21T08:33:38Z

Could you please define an example for a better understanding?

sgeulette · 2018-03-21T08:57:29Z

Take a KeywordIndex ('tags') that stores, for example, tags ;-).
If I want to find brains without any tag, I cannot query on [] or None.
One workaround is to store an artificial value such as ['no_tag'] in the index and query on that special value.
I think it would be better if the catalog handled this itself and supported searching for an empty value.
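
A minimal sketch of the 'no_tag' workaround described above, using a plain dictionary as a stand-in for the keyword index. The names `NO_TAG`, `tags_for_indexing` and `build_keyword_index` are hypothetical and not part of ZCatalog.

```python
NO_TAG = "no_tag"  # hypothetical sentinel keyword, not part of ZCatalog


def tags_for_indexing(tags):
    """Return the keywords to index, or the sentinel if there are none."""
    cleaned = [tag for tag in (tags or []) if tag]
    return cleaned or [NO_TAG]


def build_keyword_index(docs):
    """Build a toy forward index: keyword -> set of record ids."""
    index = {}
    for rid, tags in docs.items():
        for keyword in tags_for_indexing(tags):
            index.setdefault(keyword, set()).add(rid)
    return index


docs = {1: ["a", "b"], 2: [], 3: None, 4: ["b"]}
index = build_keyword_index(docs)

# Documents without tags are found by querying the sentinel keyword.
untagged = sorted(index[NO_TAG])
print(untagged)
```

The drawback is the one raised above: every indexer and every query has to know about the sentinel.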

vincentfretin · 2018-03-21T09:00:24Z

I guess you'll need a TreeSet of rids for non indexed docs in the index, like it is done for example in hypatia FieldIndex (used by substanced)
https://github.com/Pylons/hypatia/blob/master/hypatia/field/__init__.py#L94
See my example here:
Pylons/hypatia#9 (comment)
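
A rough sketch of that idea, assuming a toy index built on plain Python sets instead of BTrees. `ToyFieldIndex` and its attribute names are hypothetical; this is not hypatia's or ZCatalog's actual code.

```python
class ToyFieldIndex:
    """Toy index that remembers the rids it was asked to index without a value."""

    def __init__(self):
        self.fwd = {}             # value -> set of rids
        self.rev = {}             # rid -> value
        self.not_indexed = set()  # rids seen by the index but without a value

    def index_doc(self, rid, value):
        self.unindex_doc(rid)  # drop any previous state for this rid
        if value is None:
            self.not_indexed.add(rid)
            return
        self.fwd.setdefault(value, set()).add(rid)
        self.rev[rid] = value

    def unindex_doc(self, rid):
        self.not_indexed.discard(rid)
        if rid in self.rev:
            value = self.rev.pop(rid)
            rids = self.fwd[value]
            rids.discard(rid)
            if not rids:
                del self.fwd[value]


idx = ToyFieldIndex()
idx.index_doc(1, "x")
idx.index_doc(2, None)
without_value = set(idx.not_indexed)

idx.index_doc(2, "y")  # rid 2 now has a value and leaves the set
after_reindex = set(idx.not_indexed)
```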

andbag · 2018-03-21T10:54:49Z

You can test this branch

https://github.com/zopefoundation/Products.ZCatalog/tree/not-parm-patch

Queries on [] should work. If that's what you needed, I'll create a PR.

icemac · 2018-06-07T06:30:24Z

@sgeulette Were you able to try the branch @andbag suggested?

sgeulette · 2018-06-07T07:05:19Z

I couldn't test it in Plone.

d-maurer · 2019-03-08T07:39:00Z

> It would be nice to also be able to query on empty index values in the catalog.

It would be difficult to implement this on the index level: the index by itself knows only about the documents it has indexed (itself), not about the documents known by the catalog. What you are calling for are the documents known by the catalog and not known by the index.

Products.AdvancedQuery allows you to formulate queries like this via `~ Indexed(index)` (i.e. search for the documents not indexed by index). As ZCatalog does not have a general "not", you would need a new index query parameter telling the index to look into the enclosing catalog and determine the set of its known objects. Making assumptions about the enclosing catalog (and how to determine its known objects) is not nice -- at least on the conceptual level.
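
To illustrate the point with plain sets (a sketch only; `not_indexed_by` is a hypothetical helper and the record ids are made up): the complement can only be computed by something that knows the catalog's full set of record ids.

```python
def not_indexed_by(catalog_rids, indexed_rids):
    """Record ids known to the catalog but absent from the index."""
    return set(catalog_rids) - set(indexed_rids)


catalog_rids = {1, 2, 3, 4}      # what the catalog knows
unindex = {1: ["a"], 3: ["b"]}   # what the index knows: rid -> indexed value

# Iterating a dict yields its keys, i.e. the rids the index has seen.
missing = not_indexed_by(catalog_rids, unindex)
print(sorted(missing))
```

The index on its own only has `unindex`; without `catalog_rids` it has no way to produce rids 2 and 4.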

andbag · 2019-04-13T19:34:29Z

@d-maurer by the way, you identified a bug. The same argument applies to the 'pure not' operation. The index can currently only return documents that the index knows. Objects without a value (== None) belong to the result set of a 'pure not' operation.

d-maurer · 2019-04-13T19:58:04Z

Andreas Gabriel wrote at 2019-4-13 12:34 -0700:

> @d-maurer by the way, you identified a bug. The same argument applies to the 'pure not' operation. The index can currently only return documents that the index knows. Objects without a value (== None) belong to the result set of a 'pure not' operation.

One could say that this is a feature (not a bug). At least I have seen in the implementation of one index (likely `UnIndex`) that an explicit intersection with the set of indexed objects is performed -- even with an incoming "resultset". Without such an implicit restriction, an index alone cannot evaluate a pure not -- because by itself it does not know the set of all catalogued objects. It would need assumptions about its integration (e.g. that its acquisition parent provides this set) to implement a "true not" -- something conceptually not nice.

andbag · 2019-04-14T21:10:34Z

I think it's more of a bug than a feature. 'not' support was added on 25 Mar 2012, and indexing of objects with an empty value was disabled due to a BTrees 4.0+ compatibility problem on 2 Nov 2014 (two years later). There must have been a phase in between in which BTrees (<4.0) were used; otherwise the fix would surely have been added earlier. If a unit test had existed for this case, the bug would have been exposed. Since no such test exists, the bug has not been detected.

@d-maurer as vincentfretin suggests, you can collect the document ids with empty values separately in the index. My current incomplete branch follows this idea. What's your opinion?

d-maurer · 2019-04-14T22:08:04Z

Andreas Gabriel wrote at 2019-4-14 14:10 -0700:

> ... @d-maurer as vincentfretin suggests, you can collect the document ids with empty values separately in the index. My [current incomplete branch](1eb9502) follows this idea. What's your opinion?

If you do this, then there should be a search which excludes those artificial index entries: i.e. it should be possible to query for those documents which have a meaningful value for a given index (the `Indexed` query of `Products.AdvancedQuery`). Currently, "_unindex.keys()" is this set (and the `BooleanIndex` makes use of this). If "_unindex" knows all objects, whether or not the index has a meaningful value for them, there should be some other means to restrict to documents with a meaningful value.

You speak of "empty value". This suggests that you are in fact speaking of a `KeywordIndex`. A search for "empty value" is on a different level than the "normal" `KeywordIndex` searches: in the first case, you query the complete set of keywords (and require that this set be empty); in the second case, you ask whether a given keyword is in this set.

d-maurer · 2019-04-15T05:14:26Z

Andreas Gabriel wrote at 2019-4-14 14:10 -0700:

> ... @d-maurer as vincentfretin suggests, you can collect the document ids with empty values separately in the index. My [current incomplete branch](1eb9502) follows this idea. What's your opinion?

Should your concern really be `KeywordIndex`, then there is a simpler fix: `UnIndex` currently implements the "pure not" via `record.keys = [k for k in index.keys() if k not in not_parm]` (and then treats it as an `or` with `not`). If you change this to `return difference(self._unindex.keys(), self._apply_not(not_parm, resultset))`, then the case "empty keywords" is handled correctly (because documents with an empty keyword set are in `_unindex`).

That said: I now agree with you that the current handling of the pure not is buggy: it should handle "empty keywords" correctly. However, I maintain that the cases "empty keywords" and "keywords not applicable" should remain distinct.
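
The difference between the two strategies can be sketched with plain dictionaries and sets. This is a toy model, not the `UnIndex` code: `index` stands in for the forward index, `unindex` for `_unindex`, and the helper names are hypothetical.

```python
def union_of(index, keys):
    """Union of the rid sets stored under the given keys."""
    result = set()
    for key in keys:
        result |= index.get(key, set())
    return result


def pure_not_from_forward_index(index, not_parm):
    """'or' over every key not excluded, then subtract the excluded rids."""
    keys = [k for k in index if k not in not_parm]
    return union_of(index, keys) - union_of(index, not_parm)


def pure_not_from_unindex(index, unindex, not_parm):
    """Start from every rid the index has seen, then subtract the excluded rids."""
    return set(unindex) - union_of(index, not_parm)


index = {"f": {1, 3}, "g": {2, 3}}                    # keyword -> rids
unindex = {1: ["f"], 2: ["g"], 3: ["f", "g"], 4: []}  # rid -> keywords

via_forward = pure_not_from_forward_index(index, ["f"])
via_unindex = pure_not_from_unindex(index, unindex, ["f"])
```

In this toy data, rid 4 has an empty keyword list: it never appears under any forward key, so only the second variant returns it.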

andbag · 2019-04-15T11:42:02Z

@d-maurer I suggest that UnIndex should generally support this feature, which can be disabled or enabled. Then we should decide which indexes whose parent class is UnIndex should support the feature by default. Even for debugging, it would be helpful if you could use the catalog to quickly identify those objects that have no value set for an index. However, I have currently no idea how to name the query option to disable or enable the feature.

d-maurer · 2019-04-15T17:11:52Z

> However, I have currently no idea how to name the query option to disable or enable the feature.

@andbag I suggest modelling this not via a query option but via a special (search) "term" (aka "key"). This way, it could be combined with "normal" "term"s via and, or and not. This also reflects the behaviour of some indexes in their [un]index_object: they use a "_marker" to represent the case that an object has no value for a given index. This "_marker" could become global and part of the official interface to represent "the index has no value for the object".
Drawback: the feature would be available only to Python code, not directly to through-the-web queries (as the special term has no natural textual representation which could be used easily in a web form).

None could be a natural choice to represent the case "no value for this index" (I have chosen this for Products.ManageableIndex). However, there might be indexes around which use None as a meaningful object value - and those may break if we used None for the new purpose. Therefore, I suggest introducing a special marker object.

Indexes which support the new feature could be marked with an interface, maybe IIndexingMissingValue,
in Products.PluginIndexes.interfaces:

...
MissingValue = object()  # can be used as query "term" to query for objects the index does not have a value for.

class IIndexingMissingValue(Interface):
    """Marker interface to mark indexes which support the `MissingValue` query term."""
...
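
A self-contained sketch of how such a marker could behave as a query term, assuming a toy index based on plain sets. `ToyIndex` and its methods are hypothetical and do not reflect the real `Products.PluginIndexes` API; only the name `MissingValue` comes from the proposal above.

```python
MissingValue = object()  # marker for "the index has no value for the object"


class ToyIndex:
    def __init__(self):
        self._index = {}       # value -> set of rids
        self._missing = set()  # rids indexed without a value

    def index_object(self, rid, value=MissingValue):
        if value is MissingValue:
            self._missing.add(rid)
        else:
            self._index.setdefault(value, set()).add(rid)

    def query(self, terms):
        """'or' search; MissingValue is resolved like any other term."""
        result = set()
        for term in terms:
            if term is MissingValue:  # identity check, never equality
                result |= self._missing
            else:
                result |= self._index.get(term, set())
        return result


idx = ToyIndex()
idx.index_object(1, "red")
idx.index_object(2)        # no value for this index
idx.index_object(3, None)  # None used as a meaningful value

only_missing = idx.query([MissingValue])
red_or_missing = idx.query(["red", MissingValue])
none_valued = idx.query([None])
```

Because the marker is a dedicated object, rid 3 (indexed under `None`) stays distinct from rid 2 (no value at all), which is the reason given above for not reusing `None`.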

andbag · 2019-04-16T20:08:03Z

@d-maurer I am currently experimenting with the IIndexingMissingValue interface. If KeywordIndex implements this interface, should the index consider MissingValue implicit or explicit? For some special queries I would like to know which results are expected from the search, e.g.

`q1 = {'query': ['f'], 'not': ['f']}`

`q2 = {'query': ['f', 'g'], 'not': ['f']}`

q3 = {'not': ['f']}

Should the results here contain implicitly items with MissingValue? Or do I have to explicitly specify the term 'MissingValue' in the query?

q1 = {'query': [MissingValue]} etc.

sgeulette · 2019-04-16T22:02:44Z

hi,

I think q3 should return MissingValue brains too, like "pure not" mentioned before.
{'not': ['f', MissingValue]} should explicitly be used if MissingValue is not desired.

regards

d-maurer · 2019-04-17T05:45:44Z

Andreas Gabriel wrote at 2019-4-16 13:08 -0700:

> @d-maurer I am currently experimenting with the IIndexingMissingValue interface. If KeywordIndex implements this interface, should the index consider MissingValue implicit or explicit? For some special queries I would like to know which results are expected from the search, e.g. `q1 = {'query': ['f'], 'not': ['f']}`

This query requests objects indexed under 'f' and not indexed under 'f'. No object satisfies these restrictions.

> `q2 = {'query': ['f', 'g'], 'not': ['f']}`

This query requests objects indexed under 'f' or 'g' but not indexed under 'f'. Due to the (non empty) positive part ('f' or 'g'), `MissingValue` is excluded.

> `q3 = {'not': ['f']}`

This query is interpreted as the "pure not" (side note: this means that the "operator" is silently interpreted as "and" because an "empty or" has an empty result while an "empty and" does not make any restriction). It includes `MissingValue` automatically.

I noticed recently that for `KeywordIndex` the distinction between "empty value" and "missing value" could be interesting. "empty value" would mean that the object supports keyword assignment but its keyword set is empty; "missing value" would mean that the object does not support keyword assignment at all. For "empty value", the object is (currently) in "_unindex" -- with an empty list as value; in the "missing value" case, it is not in "_unindex".

Searches for "empty value" could be supported without a change in the index structure. It would be sufficient to implement the "pure not" via `difference(index._unindex, excludes)` rather than `difference(multiunion([k for k in index._index]), excludes)`.

Note: while former `BTrees` versions supported set operations between inhomogeneous objects provided they have identical key structure, the current `BTrees` version no longer does. Therefore, `difference(index._unindex, excludes)` will fail and would need to be replaced by `difference(IISet(index._unindex), excludes)`.
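
The expected results for q1 to q3 can be checked against a toy model built on plain sets. This is a sketch under assumptions: the `search` function and the separate `missing` set are hypothetical, and excluding missing values via `'not': ['f', MissingValue]` follows sgeulette's suggestion earlier in the thread rather than any existing implementation.

```python
MissingValue = object()


def union_of(index, keys):
    result = set()
    for key in keys:
        result |= index.get(key, set())
    return result


def search(index, unindex, missing, query=(), not_=()):
    """Toy keyword search with an optional positive part and a 'not' part."""
    excluded = union_of(index, [k for k in not_ if k is not MissingValue])
    if any(k is MissingValue for k in not_):
        excluded |= missing
    if query:
        candidates = union_of(index, query)  # positive part: 'or' over terms
    else:
        candidates = set(unindex) | missing  # pure not: every known rid
    return candidates - excluded


index = {"f": {1, 3}, "g": {2, 3}}
unindex = {1: ["f"], 2: ["g"], 3: ["f", "g"], 4: []}  # rid 4: empty value
missing = {5}                                         # rid 5: missing value

q1 = search(index, unindex, missing, query=["f"], not_=["f"])
q2 = search(index, unindex, missing, query=["f", "g"], not_=["f"])
q3 = search(index, unindex, missing, not_=["f"])
q3_strict = search(index, unindex, missing, not_=["f", MissingValue])
```

Here q1 is empty, q2 contains only rid 2, and q3 also picks up the empty-valued rid 4 and the missing-valued rid 5; `q3_strict` drops rid 5 again.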

andbag · 2019-05-02T11:26:58Z

@d-maurer I still have questions. How should the search for the empty set be defined? And which variations should be supported? Which results are expected? Examples:

q1 = {'query': []}
Should return all items with empty sets, but not items with MissingValue.

q2 = {'query': [(), 'a']}
Should return all items with empty sets and items with keyword 'a'.

q3 = {'not': []}
Should return all items with sets (length greater than 0) and consequently items with MissingValue.

q4 = {'not': [(), MissingValue]}
Should return all items with sets (length greater than 0) but not items with MissingValue.

I'm not sure if this is the right syntax to query for items with empty set.

d-maurer · 2019-05-02T18:03:38Z

Andreas Gabriel wrote at 2019-5-2 04:27 -0700:

> @d-maurer I still have questions. How should the search for the empty set be defined? And which variations should be supported? Which results are expected? Examples:
>
> `q1 = {'query': []}` Should return all items with empty sets, but not items with MissingValue.
>
> `q2 = {'query': [[], 'a']}` Should return all items with empty sets and items with keyword 'a'.
>
> `q3 = {'not': []}` Should return all items with sets (length greater than 0) and consequently items with MissingValue.
>
> `q4 = {'not': [[], MissingValue]}` Should return all items with sets (length greater than 0) but not items with MissingValue.

I would introduce another "pseudo key" `EmptyValue` (besides `MissingValue`) to explicitly represent the case that the object has an empty value for this index. This would be relevant only for `KeywordIndex`-like indexes. In some of your examples above, you use `[]` as the representation of the concept `EmptyValue`. My proposal is very similar but uses a descriptive name instead of a symbol.
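
As an illustration, the four example queries can be rewritten with the two pseudo keys and evaluated against toy data. This is only a sketch: `resolve`, the separate `missing` set and the sample record ids are assumptions, not code from any branch.

```python
MissingValue = object()  # no value at all for this index
EmptyValue = object()    # a value exists but it is an empty keyword set


def resolve(terms, index, unindex, missing):
    """Union of the record ids matched by each term (an 'or' search)."""
    result = set()
    for term in terms:
        if term is MissingValue:
            result |= missing
        elif term is EmptyValue:
            result |= {rid for rid, keywords in unindex.items() if not keywords}
        else:
            result |= index.get(term, set())
    return result


index = {"a": {1}, "b": {2}}
unindex = {1: ["a"], 2: ["b"], 3: []}  # rid 3 has an empty keyword set
missing = {4}                          # rid 4 has no value at all
known = set(unindex) | missing

q1 = resolve([EmptyValue], index, unindex, missing)                        # {'query': [EmptyValue]}
q2 = resolve([EmptyValue, "a"], index, unindex, missing)                   # {'query': [EmptyValue, 'a']}
q3 = known - resolve([EmptyValue], index, unindex, missing)                # {'not': [EmptyValue]}
q4 = known - resolve([EmptyValue, MissingValue], index, unindex, missing)  # {'not': [EmptyValue, MissingValue]}
```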

andbag mentioned this issue May 8, 2019

Not indexed value support (MissingValue, EmptyValue) #74

Open

mamico mentioned this issue Jan 28, 2024

Intermittent error when using "not" when searching with index Subject plone/Products.CMFPlone#3895

Open
