Not indexed value support (MissingValue, EmptyValue) #74

andbag · 2019-05-08T13:13:11Z

As discussed in issue #35, UnIndex can now support queries on MissingValue and EmptyValue. KeywordIndex currently implements the new feature. I hope for active feedback.

andbag · 2019-05-08T13:14:47Z

@icemac unfortunately, CI for python3.8-dev is broken.

dataflake · 2019-05-08T13:17:24Z

That's clearly an issue with the build environment, not with your code...

d-maurer

KeywordIndex line 76: insertNotIndexed should not have the argument newKeywords or should have a different name (such as insertSpecialIndexEntry).

KeywordIndex line 75: you may need to remove not indexed entries if oldkeywords is None.

KeywordIndex lines 81, 88: you want oldkeywords to be an OOSet but you store it as a list.

KeywordIndex line 84: I expect an insertNotIndex(...) somewhere in this else block (as there was one in the then block). In addition, it might be necessary to remove a potential MissingValue from the special index.

KeywordIndex line 120: not sure that an exception from the call should be silently swallowed. I suggest to at least log an entry.

KeywordIndex lines 149, 73: inconsistent check for missing _unindex entry (_marker versus None).

KeywordIndex: I suggest to rename ...NotIndexed to ...SpeciallyIndexed (or something similar) - as the document is indexed, just not in the "normal" way.

KeywordIndex, line 154: maybe you do not need this line, in case _unindex is set to [] for an empty value (as is the case for the old KeywordIndex).

The names MissingValue and EmptyValue are in "CamelCase" which indicates a class. Maybe, we should avoid "CamelCasing" for them to be more conformant with PEP8.

interfaces line 288: NotIndexedValue should not be used by the application; maybe, we want to indicate this by prefixing the name with _.

KeywordIndex - "pure not": I have not seen a "pure not" special handling for KeywordIndex. However, if we include MissingValue in a "pure not" for "UnIndex", we must as well include EmptyValue for KeywordIndex -- this could be done in UnIndex (to avoid code duplication in KeywordIndex).

UnIndex, line 565: the "pure not" should likely be implemented via _unindex rather than an enumeration of all keys (which may cause a huge multiunion to be executed). In addition: KeywordIndex might need that documents indexed under EmptyValue are included (by default) in a "pure not" result.

andbag · 2019-05-08T20:57:20Z

@d-maurer thanks for the helpful comments. However, the old implementation of KeywordIndex does not keep empty values like '()' in _unindex.

>>> from Products.PluginIndexes.KeywordIndex.KeywordIndex import KeywordIndex
>>> index = KeywordIndex('foo')
>>> class Dummy: pass
>>> obj1 = Dummy()
>>> obj1.foo = ('a','b')
>>> index.index_object(1,obj1)
True
>>> tuple(index._unindex.keys())
(1,)
>>> obj2 = Dummy()
>>> obj2.foo = ()
>>> index.index_object(2,obj2)
True
>>> tuple(index._unindex.keys())  # expected (1, 2), but:
(1,)

If we want to use _unindex for 'pure not' queries, then _unindex should also collect the special values.

d-maurer · 2019-05-09T05:09:04Z

Andreas Gabriel wrote at 2019-5-8 13:57 -0700:

> @d-maurer thanks for the helpful comments. However, the old implementation of KeywordIndex does not keep empty values like '()' in _unindex.

You are right.

> If we want to use _unindex for 'pure not' queries, then `_unindex` should also collect the special values.

You can do it -- but this is not a must: the index knows what is in its `_unindex` and can compensate as necessary. E.g. if it knows that `_unindex` does not contain empty values, it could implement the "pure not" based on "union(IISet(self._unindex), self._empty_...)".
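
The compensation described above can be sketched with plain Python containers. This is only an illustration under stated assumptions, not the UnIndex code: the built-in `dict` and `set` stand in for the BTrees structures, and the names `pure_not`, `unindex`, `empty_docs` and `excluded` are hypothetical.

```python
def pure_not(unindex, empty_docs, excluded):
    """Return the ids of all known documents that are not in *excluded*.

    unindex    -- mapping doc id -> indexed keywords (stand-in for _unindex)
    empty_docs -- ids indexed under the special "empty" value; this sketch
                  assumes they are NOT present in *unindex*
    excluded   -- ids matched by the negated query terms
    """
    # Build the universe from document ids instead of enumerating
    # every key of the forward index.
    all_docs = set(unindex) | set(empty_docs)
    return all_docs - set(excluded)


unindex = {1: ["a", "b"], 3: ["c"]}
empty_docs = {2}
result = pure_not(unindex, empty_docs, excluded={1})
```

Here document 2 has an empty value and would be lost if the universe were taken from `unindex` alone.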

d-maurer · 2019-05-11T06:23:58Z

@andbag
ATTENTION: When we now index MissingValue, we may get into trouble with the strange behaviour described in #64, i.e. when the index indexes more than a single attribute. We need at least tests for this case.
The current behaviour is to iterate over all "indexed attributes" and give each attribute a chance to modify the index according to its value. This effectively means that the last attribute with a value succeeds. When we index even "missing value", then the last attribute will effectively always win, whether it has a value or not. I am quite sure that this would be unexpected.

As I wrote in #64, I believe that the current behaviour is not what was really intended: it would be much more natural if the first rather than the last attribute with a value succeeds. Maybe we use the opportunity to document what it should mean that an index indexes several attributes, and maybe we change the order in the process.

Whether or not we do something about the documentation for the "several indexed attributes" case (or even change the order), we must ensure that an attribute with a value has precedence over one without a value. We can distinguish both cases by checking the return value of _index_object.

I am unsure how to handle the case "empty value" (in contrast to "missing value"): should an attribute with a non empty value have precedence over one with an empty value? This question is relevant only for KeywordIndex like indexes. Should we say that "empty value" must be handled differently from "missing value", then potentially, we must change _index_object as well to differentiate both cases.
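
The precedence rule proposed above (the first attribute with a value succeeds, and an attribute with a value always beats one without) could look roughly like the following sketch. The names `MISSING` and `first_value` are hypothetical, `None` and an absent attribute are both assumed to mean "no value", and empty values are treated as ordinary values, so the open question about "empty value" is left untouched.

```python
MISSING = object()  # hypothetical sentinel: no attribute had a value


def first_value(obj, attrs):
    """Return the value of the first attribute in *attrs* that has one."""
    for name in attrs:
        value = getattr(obj, name, None)
        if value is not None:
            return value
    return MISSING


class DummyContent(object):
    def __init__(self, **kw):
        for name, value in kw.items():
            setattr(self, name, value)


with_value = first_value(DummyContent(foo=["NO"], bar=None), ["foo", "bar"])
without_value = first_value(DummyContent(bar=None), ["foo", "bar"])
```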

andbag · 2019-05-14T12:37:10Z

Unfortunately, there are no tests yet that check the current behavior for multiple indexed attributes. I will submit a new PR for these tests so that we don't lose track if we change the current behavior.

andbag · 2019-05-14T15:04:44Z

@d-maurer My observations show that the last attribute always wins, regardless of whether the value of the last indexed attribute is set or not. The same applies to the existence of the last indexed attribute. The following test is based on the code of the master branch:

>>> from Products.PluginIndexes.KeywordIndex.tests import TestKeywordIndex
>>> test=TestKeywordIndex()
>>> index = test._makeOne('foobar', extra={'indexed_attrs': 'foo, bar'})
>>> class DummyContent(object):
...    def __init__(self, **kw):
...       for k in kw.keys():
...          setattr(self, k, kw.get(k))
... 
>>> index.index_object(0, DummyContent(foo=['NO']))
True
>>> index.index_object(1, DummyContent(foo=['NO'], bar=None))
True
>>> index.index_object(2, DummyContent(foo=['NO'], bar=''))
True
>>> tuple(index._index)
()
>>> tuple(index._unindex)
()

If the last attribute has a value, it is stored in the index.

>>> index.index_object(3, DummyContent(foo=['NO'], bar='YES'))
True
>>> tuple(index._index)
('YES',)
>>> tuple(index._unindex)
(3,)

In this regard, the option "indexed attributes" has no effect :(. That's why I don't think anyone's using the feature.

d-maurer · 2019-05-14T16:32:37Z

Andreas Gabriel wrote at 2019-5-14 08:04 -0700:

> @d-maurer My observations show that the last attribute always wins, regardless of whether the value of last indexed attribute is set or not. ...

Apparently, it is indeed worse than I had thought. Obviously a bug. It makes no sense to let the name indicate that an index can index several attributes while the index actually indexes just the last one specified.

> ... In this regard, the option "indexed attributes" has no effect :(. That's why I don't think anyone's using the feature.

You might be right. I see two options:

* we raise an exception when more than a single attribute is indexed
* we document the feature "indexed attributes" and ensure that the implementation follows the documentation -- at least for "our own" indexes.

andbag · 2019-05-15T15:18:06Z

@d-maurer

> You might be right. I see two options:
>
> * we raise an exception when more than a single attribute is indexed
> * we document the feature "indexed attributes" and ensure that the implementation follows the documentation -- at least for "our own" indexes.

I prefer option one, because I don't want to implement features that nobody apparently requires. This feature can be much better implemented using a method that is executed by calling the single "indexed attribute". Which error fits best? TypeError or NotImplementedError?

d-maurer · 2019-05-15T16:03:11Z

Andreas Gabriel wrote at 2019-5-15 08:18 -0700:

> ... Which error fits best? TypeError or NotImplementedError?

I would opt for `NotImplementedError` (with a good description what is not implemented).
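
A check along these lines might look as follows. It is a sketch only: `parse_indexed_attrs` is a hypothetical helper, the comma-separated string mirrors the `indexed_attrs` value used in the test session earlier in this thread, and falling back to the index id when nothing is configured is an assumption.

```python
def parse_indexed_attrs(index_id, indexed_attrs):
    """Return the single attribute name configured for an index."""
    names = [name.strip() for name in indexed_attrs.split(",") if name.strip()]
    if len(names) > 1:
        # Say clearly what is not implemented, and for which index.
        raise NotImplementedError(
            "Index %r: indexing more than one attribute is not "
            "implemented (got: %s)" % (index_id, ", ".join(names)))
    # Assumption: with no explicit attribute, the index id is used.
    return names[0] if names else index_id
```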

d-maurer · 2019-05-27T16:14:09Z

Andreas Gabriel wrote at 2019-5-27 08:43 -0700:

> @d-maurer
>
> > * `None` value is interpreted as "missing" (maybe, this should indeed be the case;
> >   it should then be documented, e.g. in the `IIndexingMissingValue` interface)
>
> I have no particular preference. In the old implementation, `None` was not indexed either. In contrast, an empty string was indexed. If FieldIndex implements `IIndexingEmptyValue`, empty string is currently interpreted as special value `empty`.

I recommend to treat the empty string as a "normal" (not a special) value. I recommend to treat `None` as a missing value (and document this).

> Currently `_get_object_datum` returns `missing`, if the calculation results in `NoneType`, `AttributeError` or `TypeError` for ***@***.***(IIndexingMissingValue)`. Otherwise, `empty` is returned when the truth value test returns `False` for ***@***.***(IIndexingEmptyValue)`.

That definitely is not good enough: `0` (= zero) is a perfectly "normal" value for an index with integer values. And, in my view, `''` is a "normal" value for an index with `str` values. There is no need to introduce a special value, if you can search for the value in the "normal" way. We have introduced the special values, because they allow us to search for things not searchable in the "normal" way.
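
The difference between the two mappings can be made concrete with a small sketch. `MISSING`, `map_by_truthiness` and `map_none_only` are hypothetical names used only for this illustration.

```python
MISSING = object()  # hypothetical stand-in for the special "missing" value


def map_by_truthiness(value):
    """Problematic: every falsy value is turned into a special value."""
    return value if value else MISSING


def map_none_only(value):
    """Only None is treated as missing; 0 and '' stay normally searchable."""
    return MISSING if value is None else value


falsy_inputs = [0, "", None]
by_truthiness = [map_by_truthiness(v) is MISSING for v in falsy_inputs]
none_only = [map_none_only(v) is MISSING for v in falsy_inputs]
```

With the truth-value test, all three inputs become special values; with the explicit `None` check, only `None` does.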

andbag · 2019-05-29T09:45:38Z

@d-maurer I've corrected the code and generalized it a bit. Before I improve the code, it would be nice if you could have a look at my changes. In particular, the special value mapping can now be configured, and its purpose is documented in interfaces.py.

Products.ZCatalog/src/Products/PluginIndexes/interfaces.py

Lines 306 to 308 in 6d5fa3e

    
    special_values = Attribute('A dict which maps not regularly indexable '
                               'values or errors on value calculation to '
                               'a special value')

The implementation looks like this

Products.ZCatalog/src/Products/PluginIndexes/KeywordIndex/KeywordIndex.py

Lines 51 to 54 in 6d5fa3e

    
    special_values = {TypeError: missing,
                      AttributeError: missing,
                      None: missing,
                      (): empty}

d-maurer · 2019-05-30T05:51:43Z

@andbag

> @d-maurer ... Before I improve the code, it would be nice if you could have a look at my changes.

I find the idea good but have a few suggestions:

* special_values could get a better name and description. Let's start with "description":

  A dict mapping "exceptional" object values to a special value.

  When the index indexes an object, it derives an index specific value from the object, the so called "object value" (relative to this index). This process can result in an exception or produce a value which the index cannot index in the normal way.

  The attribute controls what should happen in such a case. It maps exceptions or values to one of the special values. If an exception not mapped occurs, it is reraised; if an object value is not mapped, it is indexed normally.

  A name like map_to_special_value would fit quite well with this description.

* KeywordIndex maps () to empty. However, a KeywordIndex related object value could be any sequence, not just tuple. It might be better to replace the dict by methods (e.g. map_value ("map value to a special value, if necessary") and map_exception_to_special_value).
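
The limitation of the dict for sequence values can be seen in a small sketch. `EMPTY`, `SPECIAL_VALUES`, `map_with_dict` and `map_with_method` are hypothetical names, and the set of types accepted as "sequence" is an assumption made for the illustration.

```python
EMPTY = object()  # hypothetical stand-in for the special "empty" value
SPECIAL_VALUES = {(): EMPTY}


def map_with_dict(value):
    """Dict lookup: matches only hashable values equal to ()."""
    try:
        return SPECIAL_VALUES.get(value, value)
    except TypeError:
        # Unhashable values such as lists cannot be looked up at all.
        return value


def map_with_method(value):
    """Method: any empty collection (but not a string) maps to EMPTY."""
    if isinstance(value, (list, tuple, set, frozenset)) and len(value) == 0:
        return EMPTY
    return value
```

The dict recognizes the empty tuple but lets an empty list pass as a normal value, while the method treats both alike.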

d-maurer · 2019-05-30T06:08:26Z

src/Products/PluginIndexes/KeywordIndex/KeywordIndex.py

+                    try:
+                        self.insertForwardIndexEntry(kw, documentId)
+                        keys.append(kw)
+                    except TypeError:


I doubt that this exception handling is right: it does not index the object if one key cannot be indexed - and the problem is only reported via a log entry.
In my opinion, other alternatives would be better:

log then ignore keys not indexable

do not catch the exception (and let the whole operation fail)

handle this TypeError in the same way as if it had occurred during the object value determination (e.g. map to missing).

In any case, the logic is at the wrong place. One would need similar logic for "update existing index info" and it should not be duplicated.
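
The first alternative ("log then ignore keys not indexable") could be centralized in one helper, roughly as sketched here. The sorted list maintained with `bisect` is only a stand-in for the ordered forward index, and `index_keywords`, `sorted_keys` and the logger name are hypothetical. Under Python 3, comparing keys of different types raises `TypeError`, which is what the helper catches.

```python
import bisect
import logging

logger = logging.getLogger("keyword_index_sketch")


def index_keywords(sorted_keys, doc_id, keywords):
    """Insert *keywords* into *sorted_keys*; return those actually indexed."""
    indexed = []
    for kw in keywords:
        try:
            # Raises TypeError if kw cannot be ordered against existing keys.
            pos = bisect.bisect_left(sorted_keys, kw)
        except TypeError:
            logger.warning("doc %r: keyword %r is not indexable, ignored",
                           doc_id, kw)
            continue
        if pos == len(sorted_keys) or sorted_keys[pos] != kw:
            sorted_keys.insert(pos, kw)
        indexed.append(kw)
    return indexed


keys = []
indexed = index_keywords(keys, 1, ["a", 1, "b"])
```

The returned list contains only the keywords that were really indexed, so it is the value that would be safe to store for the document in the reverse mapping.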

In the original version there was a bug that could lead to inconsistencies in the index. Also, the problem was not logged. In order not to have to abolish the old behavior completely, I would prefer the first variant. Consequently, _unindex is only allowed to store indexable keywords.

Correction: Since the type OOSet is forced for keywords in the meantime, a TypeError can also be raised under python3 e.g. in the method map_value. For consistency reasons, TypeError is now always handled in the same way when determining the attribute value.

@d-maurer I'm beginning to wonder if it wouldn't be more sensible to escalate TypeError when a value in the keyword list is incompatible with the already indexed values. Otherwise the new values would have to be pre-validated before being indexed.

d-maurer · 2019-05-30T06:11:32Z

src/Products/PluginIndexes/KeywordIndex/KeywordIndex.py

+
+                newKeywords = OOSet(newKeywords)
+
+            self._unindex[documentId] = newKeywords


The _unindex update could be done together with the "update existing index info" case.

d-maurer · 2019-05-30T06:21:20Z

src/Products/PluginIndexes/KeywordIndex/KeywordIndex.py

+
+        # normalize datum
+        if isinstance(newKeywords, basestring):
+            newKeywords = (newKeywords,)
        else:
            try:
                # unique
                newKeywords = set(newKeywords)


At another place, the keywords are collected in an OOSet. Using different set types increases the constraints placed on the usable types for keywords: OOSet requires orderability (as the BTrees package as a whole); set requires hashability. I recommend to use OOSet uniformly (and avoid the tuple recasting).
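
The two constraints can be illustrated without BTrees. In this sketch `sorted()` stands in for the ordering an `OOSet` needs, and `accepted_by_set` and `accepted_by_ordered` are hypothetical helper names; the behaviour shown for mixed types is that of Python 3.

```python
def accepted_by_set(values):
    """True if the built-in set can hold *values* (needs hashability)."""
    try:
        set(values)
    except TypeError:
        return False
    return True


def accepted_by_ordered(values):
    """True if *values* can be ordered (stand-in for an OOSet)."""
    try:
        sorted(values)
    except TypeError:
        return False
    return True


lists_as_keywords = [["a"], ["b"]]   # orderable but not hashable
mixed_keywords = ["a", 1]            # hashable but not orderable
```

Values that pass through both set types must satisfy both constraints, which is why using one set type throughout is less restrictive.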

andbag · 2019-05-31T09:43:06Z

> KeywordIndex maps () to empty. However, a KeywordIndex related
> object value could be any sequence, not just tuple. It might be better to
> replace the dict by methods (e.g. map_value ("map value to a special value,
> if necessary") and map_exception_to_special_value).

@d-maurer I just can't imagine how I can implement such a method generically. In the end, the _get_object_datum method in combination with map_exception_to_special_value already serves the purpose, doesn't it?
I've thought about it again. I'll program a proposal. However, the methods could take shorter names and look better in camel case notation (e.g. mapValue and mapException).

d-maurer · 2019-05-31T15:05:02Z

Andreas Gabriel wrote at 2019-5-31 02:43 -0700:

> > * `KeywordIndex` maps `()` to `empty`. However, a `KeywordIndex` related
> >   object value could be any sequence, not just `tuple`. It might be better to
> >   replace the `dict` by methods (e.g. `map_value` ("map _value_ to a special value,
> >   if necessary") and `map_exception_to_special_value`).
>
> @d-maurer I just can't imagine how I can implement such a method generically. In the end, the `_get_object_datum` method in combination with `map_exception_to_special_value` already serves the purpose, doesn't it?

@andbag Likely, you cannot have a fully generic implementation. Just a sensible default which a derived class can override if the default does not fit.

All is a question of class architecture. In your architecture, we have an `_index_object` and a `_get_object_datum` (which you name `_get_object_keywords` for `KeywordIndex`). `_index_object` essentially gets an attribute name. Its task is to determine the object's value from this attribute and then update the index accordingly. It uses `_get_object_datum` for the first subtask. The task of `_get_object_datum` is to produce a value the index can handle, either a "normal" or a "special" value. Currently, you define a local function `_getSpecialValueFor` inside `_get_object_{datum,keywords}` (you definitely want to put this on the index level) which "generically" recognizes cases which require the mapping to a special value (using the dict `special_values`).

I suggest a slightly different architecture. It has methods `_index_object`, `_get_object_datum` (which likely should become public rather than private) and `map_value` and a tuple attribute `exceptions_treated_as_missing`. In this architecture, `_get_object_datum` could look like:

```
exceptions_treated_as_missing = AttributeError, TypeError,

def _get_object_datum(self, obj, attr):
    try:
        datum = getattr(obj, attr)
        if safe_callable(datum):
            datum = datum()
        return self.map_value(datum)
    except self.exceptions_treated_as_missing:
        return missing
```

The default implementation for `map_value` could be:

```
def map_value(self, value):
    return missing if value is None else value
```

`KeywordIndex` could override `map_value` as

```
def map_value(self, value):
    value = super(KeywordIndex, self).map_value(value)
    if value is not missing:
        ... handle atomic value ...
        # at this place, *value* is expected to be a sequence
        value = empty if not value else OOSet(value)
    return value
```

With Python 3, a new limitation arises: a `BTree` can no longer have keys of different types (with the exception of `None`). `TypeError`s are raised when the condition is violated. The implementation draft above would result in a `missing` when an object has keywords of different type. Maybe, we do not want this behaviour.
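
The consequence mentioned in the last paragraph can be tried out with a self-contained variant of the draft. Everything here is a stand-in chosen for the illustration: `missing` and `empty` are plain sentinels, the built-in `callable` replaces `safe_callable`, a sorted tuple replaces the `OOSet`, and the class names are hypothetical. The behaviour shown for mixed types is that of Python 3.

```python
missing = object()  # hypothetical sentinels for the special values
empty = object()


class BaseIndexSketch(object):
    exceptions_treated_as_missing = (AttributeError, TypeError)

    def map_value(self, value):
        return missing if value is None else value

    def get_object_datum(self, obj, attr):
        try:
            datum = getattr(obj, attr)
            if callable(datum):  # the draft uses safe_callable here
                datum = datum()
            return self.map_value(datum)
        except self.exceptions_treated_as_missing:
            return missing


class KeywordIndexSketch(BaseIndexSketch):
    def map_value(self, value):
        value = super(KeywordIndexSketch, self).map_value(value)
        if value is missing:
            return value
        if isinstance(value, str):
            value = (value,)  # an atomic value becomes a one-element sequence
        if not value:
            return empty
        # sorted() raises TypeError for mixed types, like an ordered set would
        return tuple(sorted(set(value)))


class Doc(object):
    def __init__(self, **kw):
        self.__dict__.update(kw)


index = KeywordIndexSketch()
normal = index.get_object_datum(Doc(foo=["b", "a"]), "foo")
no_attr = index.get_object_datum(Doc(), "foo")
no_keywords = index.get_object_datum(Doc(foo=()), "foo")
mixed = index.get_object_datum(Doc(foo=["a", 1]), "foo")
```

In this variant the document with keywords of different types ends up as `missing`, because the `TypeError` raised while ordering the keywords is caught by the same handler as a missing attribute.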

icemac · 2019-06-07T05:53:31Z

What a pity that Python 3.8 segfaults when starting the test. I cleaned the caches and tried to restart the Python 3.8 job.

icemac · 2019-06-07T05:56:26Z

Cool, cleaning the TravisCI cache seems to do the trick.

d-maurer · 2019-06-12T05:07:56Z

src/Products/PluginIndexes/CompositeIndex/CompositeIndex.py

 from Products.PluginIndexes.KeywordIndex.KeywordIndex import KeywordIndex
 from Products.PluginIndexes.unindex import _marker
 from Products.ZCatalog.query import IndexQuery

+try:


Having used similar code for Python 2/3 compatibility, I have been directed to use six instead. Consistently using six for Python 2/3 compatibility will facilitate code cleanup once Python 2 support is dropped.

d-maurer · 2019-06-12T05:12:17Z

src/Products/PluginIndexes/CompositeIndex/CompositeIndex.py

-        return tuple(pkl)
+        return OOSet(pkl)
+
+    def _get_component_datum(self, obj, attr):


This almost looks like get_object_datum. Are you sure you need this special definition?

d-maurer · 2019-06-12T05:30:05Z

src/Products/PluginIndexes/KeywordIndex/KeywordIndex.py

+            else:
+                try:
+                    self.index_objectKeywords(documentId, newKeywords)
+                except self.exceptions_treated_as_missing:


The logic here does not yet seem correct: assume newKeywords is a special value - but not one we want to support. It then goes into index_objectKeywords (which will fail because it is a special value).

d-maurer · 2019-06-12T05:44:20Z

src/Products/PluginIndexes/KeywordIndex/KeywordIndex.py

+                                  index=self.id))
+                    if self.providesSpecialIndex(missing):
+                        newKeywords = missing
+                        self.insertSpecialIndexEntry(missing, documentId)


The logic here does not yet seem correct: assume the keywords "a", 1, "b". index_objectKeywords will have failed after it has indexed "a", and the document is then added to the missing index as well. First, while the keywords "a" and "b" are similar, they are not treated similarly; second, it may be surprising to have an object both in a "normal" index and in the "missing" index.

Despite the appearance, the logic could be right: you may already have ensured at a different place that newKeywords contains only keywords of the same type. In this case, if index_objectKeywords fails at all, it will fail with the first keyword. I suggest adding a corresponding comment in this case.
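
A check of this kind, run before the index is touched, might look like the following sketch. `common_keyword_type` is a hypothetical helper; it only verifies that the new keywords agree with each other, not that they are compatible with values already in the index.

```python
def common_keyword_type(keywords):
    """Return the single type shared by all *keywords* (None if empty)."""
    types = set(type(kw) for kw in keywords)
    if len(types) > 1:
        names = sorted(t.__name__ for t in types)
        raise TypeError("keywords of mixed types: %s" % ", ".join(names))
    return types.pop() if types else None
```

Because the check happens before any forward index entry is written, a failure cannot leave the document half indexed.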

d-maurer · 2019-06-12T05:49:43Z

src/Products/PluginIndexes/KeywordIndex/KeywordIndex.py

+                              doc_id=documentId,
+                              index=self.id))
+
+                if self.providesSpecialIndex(missing):


I have already seen this quite complex logic before. I would centralize it (maybe in a locally defined function) to have it in a single place.

d-maurer · 2019-06-12T05:55:06Z

src/Products/PluginIndexes/KeywordIndex/KeywordIndex.py

+        return value
+
+    def index_objectKeywords(self, documentId, keywords):
+        """ carefully index keywords of object with integer id 'documentId'


You are no longer "careful" here. Likely, there is no longer any need because you have ensured elsewhere that keywords is homogeneous and the indexing will fail with the first element if it fails at all.

d-maurer · 2019-06-12T06:14:24Z

src/Products/PluginIndexes/KeywordIndex/KeywordIndex.py

+                newSet = newKeywords = OOSet(newKeywords)
+
+            try:
+                fdiff = difference(oldSet, newSet)


This will fail if the keywords change type - and will leave your index in a strange state. Assume `oldSet` to be `OOSet(['a', 'b'])` and `newSet` to be `OOSet([1, 2])`. Then under Python 3, the `difference` will result in a `TypeError`, which lets your document remain indexed under `oldSet` and get newly indexed under `missing`.

Under Python 3, all indexed values must have a common type. Changing the keyword type will therefore not work (apart from constructed cases). Therefore, you should let an exception from the difference calls propagate (maybe log and provide a more specific error message) and not turn it into missing.

d-maurer · 2019-06-12T06:16:04Z

src/Products/PluginIndexes/KeywordIndex/KeywordIndex.py


    def unindex_objectKeywords(self, documentId, keywords):
-        """ carefully unindex the object with integer id 'documentId'"""
+        """ carefully unindex keywords of object with integer id 'documentId'


You are not "careful" here (drop the word).

d-maurer · 2019-06-12T06:31:09Z

src/Products/PluginIndexes/interfaces.py

+    special value query term."""
+
+    def map_value(value):
+        """ Map value, which is typically not generically indexable,


The "is typically not" is wrong.

I recommend:

    def map_value(value):
        """Map (original) value to the value that should get indexed.

        The (original) value obtained from the object might not be
        indexable in the normal way. `map_value` gives you the chance
        to map it to a different, usually a special value in this case.
        """

d-maurer · 2019-06-26T16:29:03Z

Andreas Gabriel wrote at 2019-6-26 00:34 -0700:

> andbag commented on this pull request.
>
> >                  for kw in newKeywords:
> > -                    self.insertForwardIndexEntry(kw, documentId)
> > -                if newKeywords:
> > -                    self._unindex[documentId] = list(newKeywords)
> > -            except TypeError:
> > -                return 0
> > +                    try:
> > +                        self.insertForwardIndexEntry(kw, documentId)
> > +                        keys.append(kw)
> > +                    except TypeError:
>
> @d-maurer I'm beginning to wonder if it wouldn't be more sensible to escalate TypeError when a value in the keyword list is incompatible with the already indexed values. Otherwise the new values would have to be pre-validated before being indexed.

I am with you in this regard.

andbag added 9 commits May 7, 2019 11:31

Initial not indexed value support (MissingValue, EmptyValue)

2f56826

Reorganize NotIndexedValue classes

29b0bf3

Fix BTree TypeError

ce61b42

Add not indexed value tests

05c8079

Fix get_object_datum for not indexed value support

2de46a3

Add additional tests for not indexed value support

c61afae

Fix EmptyValue support

a47ebc7

Disable Missing/EmptyValue interface of CompositeIndex

d140ffb

Fix flake8

7f246be

andbag requested a review from d-maurer May 8, 2019 13:13

d-maurer requested changes May 8, 2019

Michael Howitz added 2 commits May 9, 2019 11:38

Merge branch 'master' into notindexed_value_support

a0c8546

Merge branch 'master' into notindexed_value_support

0f6a2ce

Merge branch 'master' into notindexed_value_support

7c9cf99

andbag added 6 commits May 14, 2019 15:44

New naming of methods and variables

a2a1ee7

Store special values in _unindex

6ceb6dc

Fix KeywordIndex tests

7b9313c

Fix unindex and refactor KeywordIndex

0911a0c

Return False for SpecialValues on truth value testing

47d5b95

Fix flake8

e305aca

andbag added 3 commits May 27, 2019 23:23

Add definition map for special values

7dd55b7

Refinement of special value support

0d58199

Fix test for empty value

6d5fa3e

d-maurer requested changes May 30, 2019

andbag added 10 commits June 4, 2019 13:12

Log then ignore keys not indexable

4110f65

Consolidate code

2815927

Avoid obsolete type casting

2a45691

Consolidate special value handling step one

6c5e52c

Consolidate special value handling step two

fd0a9d2

Consolidate code of CompositeIndex

225df14

Completion of interfaces

9bbfdfb

Code further generalized

906a858

flake8

ab99560

Continue cleaning

e595088

andbag added 3 commits June 11, 2019 08:09

Code further generalized II

d059521

Reorganize code and complete tests

9883352

Fix for py2 backward compatibility

62321b9

andbag requested a review from d-maurer June 11, 2019 14:08

d-maurer requested changes Jun 12, 2019

Merge branch 'master' into notindexed_value_support

3913a78

d-maurer mentioned this pull request May 6, 2020

FieldIndex value of None should be unindexed, not just ignored #100

Closed

mamico mentioned this pull request Jan 28, 2024

Intermittent error when using "not" when searching with index Subject plone/Products.CMFPlone#3895

Open

