-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Not indexed value support (MissingValue, EmptyValue) #74
base: master
Are you sure you want to change the base?
Conversation
@icemac unfortunately, CI for python3.8-dev is broken. |
That's clearly an issue with the build environment, not with your code... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KeywordIndex
line 76: insertNotIndexed
should not have the argument newKeywords
or should have a different name (such as insertSpecialIndexEntry
).
KeywordIndex
line 75: you may need to remove not indexed entries if oldkeywords
is None
.
KeywordIndex
lines 81, 88: you want oldkeywords
to be an OOSet
but you store it as a list
.
KeywordIndex
line 84: I expect an insertNotIndex(...)
somewhere in this else
block (as there was one in the then
block). In addition, it might be necessary to remove a potential MissingValue
from the special index.
KeywordIndex
line 120: not sure that an exception from the call should be silently swallowed. I suggest to at least log an entry.
KeywordIndex
lines 149, 73: inconsistent check for missing _unindex
entry (_marker
versus None
).
KeywordIndex
: I suggest to rename ...NotIndexed
to ...SpeciallyIndexed
(or something similar) - as the document is indexed, just not in the "normal" way.
KeywordIndex
, line 154: maybe, you do not need this line (in case, that _unindex
is set to [] for an empty value (as this is the case for the old KeywordIndex
).
The names MissingValue
and EmptyValue
are in "CamelCase" which indicates a class. Maybe, we should avoid "CamelCasing" for them to be more conformant with PEP8.
interfaces
line 288: NotIndexedValue
should not be used by the application; maybe, we want to indicate this by prefixing the name with _
.
KeywordIndex
- "pure not": I have not seen a "pure not" special handling for KeywordIndex
. However, if we include MissingValue
in a "pure not" for "UnIndex", we must as well include EmptyValue
for KeywordIndex
-- this could be done in UnIndex
(to avoid code duplication in KeywordIndex
.
UnIndex
, line 565: the "pure not" should likely be implemented via _unindex
rather than an enumeration of all keys (which may cause a huge multiunion
to be executed). In addition: KeywordIndex
might need that documents indexed under EmptyValue
are included (by default) in a "pure not" result.
@d-maurer thanks for the helpful comments. However, the old implementation of KeywordIndex does not keep empty values like '()' in _unindex.
If we want to use _unindex for 'pure not' queries, then |
Andreas Gabriel wrote at 2019-5-8 13:57 -0700:
@d-maurer thanks for the helpful comments. However, the old implementation of KeywordIndex does not keep empty values like '()' in _unindex.
You are right.
If we want to use _unindex for 'pure not' queries, then `_unindex` should also collect the special values.
You can do it -- but this is not a must:
the index knows what is in its `_unindex` and can compensate
as necessary. E.g. if it knows that `_unindex` does not contain
empty values, it could implement the "pure not" based on
"union(IISet(self._unindex), self._empty_...)".
|
@andbag As I wrote in #64, I believe that the current behaviour is not what was really intended: it would be much more natural if the first rather than the last attribute with a value succeeds. Maybe, we use the opportunity to document what it should mean the an index indexes several attributes, and maybe, we change the order in the process. Whether or not we do something about the documentation for the "several indexed attributes" case (or even change the order), we must ensure that an attribute with a value has precedence over one without a value. We can distinguish both cases by checking the return value of I am unsure how to handle the case "empty value" (in contrast to "missing value"): should an attribute with a non empty value have precedence over one with an empty value? This question is relevant only for |
Unfortunately, there are no tests yet that check the current behavior for multiple indexed attributes. I will submit a new PR for these tests so that we don't lose track if we change the current behavior. |
@d-maurer My observations show that the last attribute always wins, regardless of whether the value of last indexed attribute is set or not. The same applies to the existence of the last indexed attribute. Following test is based on code of master branch:
If the last attribute has a value, it is stored in the index.
In this regard, the option "indexed attributes" has no effect :(. That's why I don't think anyone's using the feature. |
Andreas Gabriel wrote at 2019-5-14 08:04 -0700:
@d-maurer My observations show that the last attribute always wins, regardless of whether the value of last indexed attribute is set or not.
...
Apparently, it is indeed worse than I had thought.
Obviously a bug. It makes no sense to let the name indicate
that an index can index several attributes while the index actually
indexes just the last one specified.
...
In this regard, the option "indexed attributes" has no effect :(. That's why I don't think anyone's using the feature.
You might be right.
I see two options:
* we raise an exception when more than a single attribute is indexed
* we document the feature "indexed attributes" and ensure that
the implementation follows the documentation -- at least
for "our own" indexes.
|
I prefer option one, because I don't want to implement features that nobody apparently requires. This feature can be much better implemented using a method that is executed by calling the single "indexed attribute". Which error fits best? TypeError or NotImplementedError? |
Andreas Gabriel wrote at 2019-5-15 08:18 -0700:
...
Which error fits best? TypeError or NotImplementedError?
I would opt for `NotImplementedError` (with a good description
what is not implemented).
|
Andreas Gabriel wrote at 2019-5-27 08:43 -0700:
@d-maurer
> * `None` value is interpreted as "missing" (maybe, this should indeed be the case;
> it should then be documented, e.g. in the `IIndexingMissingValue` interface)
I have no particular preference. In the old implementation, `None` was not indexed either. In contrast, an empty string was indexed. If FieldIndex implements `IIndexingEmptyValue`, empty string is currently interpreted as special value `empty`.
I recommend to treat the empty string as a "normal" (not a special)
value.
I recommend to treat `None` as a missing value (and document this).
Currently `_get_object_datum` returns `missing`, if the calculation results in `NoneType`, `AttributeError` or `TypeError` for ***@***.***(IIndexingMissingValue)`. Otherwise, `empty` is returned when the truth value test returns `False` for ***@***.***(IIndexingEmptyValue)`.
That definitely is not good enough: `0` (= zero) is a perfectly
"normal" value for an index with integer values.
And, in my view, `''` is a "normal" value for an index with `str` values.
There is no need to introduce a special value, if you can search
for the value in the "normal" way.
We have introduced the special values, because they allow us to
search for things not searchable in the "normal" way.
|
@d-maurer I've corrected the code and generalized it a bit. Before I improve the code, it would be nice if you could have a look at my changes. Especially the mapping of the special value mapping can now be configured and the purpose is documented in Products.ZCatalog/src/Products/PluginIndexes/interfaces.py Lines 306 to 308 in 6d5fa3e
The implementation looks like this
|
I find the idea good but have a few suggestions:
|
try: | ||
self.insertForwardIndexEntry(kw, documentId) | ||
keys.append(kw) | ||
except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt that this exception handling is right: it does not index the object if one key cannot be indexed - and the problem is only reported via a log entry.
In my opinion, other alternatives would be better:
- log then ignore keys not indexable
- do not catch the exception (and let the whole operation fail)
- handle this
TypeError
in the same way as if it had occurred during the object value determination (e.g. map tomissing
).
In any case, the logic is at the wrong place. One would need similar logic for "update existing index info" and it should not be duplicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the original version there was a bug that could lead to inconsistencies in the index. Also, the problem was not logged. In order not to have to abolish the old behavior completely, I would prefer the first variant. Consequently, _unindex
is only allowed to store indexable keywords.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction: Since the type OOSet is forced for keywords in the meantime, a TypeError can also be raised under python3 e.g. in the method map_value
. For consistency reasons, TypeError is now always handled in the same way when determining the attribute value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@d-maurer I'm beginning to wonder if it wouldn't be more sensible to escalate TypeError when a value in the keyword list is incompatible with the already indexed values. Otherwise the new values would have to be pre-validated before being indexed.
|
||
newKeywords = OOSet(newKeywords) | ||
|
||
self._unindex[documentId] = newKeywords |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _unindex
update could be done together with the "update existing index info" case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay
|
||
# normalize datum | ||
if isinstance(newKeywords, basestring): | ||
newKeywords = (newKeywords,) | ||
else: | ||
try: | ||
# unique | ||
newKeywords = set(newKeywords) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At another place, the keywords are collected in an OOSet
. Using different set types increases the constraints placed on the usable types for keywords: OOSet
requires orderability (as the BTrees
package as a whole); set
requires hashability. I recommend to use OOSet
uniformly (and avoid the tuple
recasting).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay
@d-maurer I just can't imagine how I can implement such a method generically. In the end, the |
Andreas Gabriel wrote at 2019-5-31 02:43 -0700:
> * `KeywordIndex` maps `()` to `empty`. However, a `KeywordIndex` related
> object value could be any sequence, not just `tuple`. It might be better to
> replace the `dict` by methods (e.g. `map_value` ("map _value_ to a special value,
> if necessary") and `map_exception_to_special_value`).
@d-maurer I just can't imagine how I can implement such a method generically. In the end, the `_get_object_datum` method in combination with `map_exception_to_special_value` already serves the purpose, doesn't it?
@andbag Likely, you cannot have a fully generic implementation. Just a
sensible default which a derived class can override if the default
does not fit.
All is a question of class architecture.
In your architecture, we have an `_index_object` and a `_get_object_datum`
(which you name `_get_object_keywords` for `KeywordIndex`).
`_index_object` essentially gets an attribute name. Its task is to determine
the object's value from this attribute and then updates the index accordingly.
It uses `_get_object_datum` for the first subtask.
The task of `_get_object_datum` is to produce a value the index can
handle, either a "normal" or a "special" value.
Currently, you define a local function `_getSpecialValueFor`
inside `_get_object_{datum,keywords}` (you definitely want to put this
on the index level) which "generically" recognizes cases which require
the mapping to a special value (using the dict `special_values`).
I suggest a slightly different architecture. It has methods
`_index_object`, `_get_object_datum` (which likely should get official
rather than private) and `map_value` and
a tuple attribute `exceptions_treated_as_missing`.
In this architecture, `_get_object_datum` could look like:
```
exceptions_treated_as_missing = AttributeError, TypeError,
def _get_object_datum(self, obj, attr):
try:
datum = getattr(obj, attr)
if safe_callable(datum):
datum = datum()
return self.map_value(datum)
except self.exceptions_treated_as_missing:
return missing
```
The default implementation for `map_value` could be:
```
def map_value(self, value):
return missing if value is None else value
```
`KeywordIndex` could override `map_value` as
```
def map_value(self, value):
value = super(KeywordIndex,self).map_value(value)
if value is not missing:
... handle atomic value ...
# at this place, *value* is expected to be a sequence
value = empty if not value else OOSet(value)
return value
```
With Python 3, a new limitation arises: a `BTree` can no longer
have keys of different types (exception `None`).
`TypeError`'s are raised when the condition is violated.
The implementation draft above would result in a `missing`
when an object has keywords of different type.
Maybe, we do not want this behaviour.
|
What a pity that Python 3.8 segfaults when starting the test. I cleaned the caches and tried to restart the Python 3.8 job. |
Cool, cleaning the TravisCI cache seems to do the trick. |
from Products.PluginIndexes.KeywordIndex.KeywordIndex import KeywordIndex | ||
from Products.PluginIndexes.unindex import _marker | ||
from Products.ZCatalog.query import IndexQuery | ||
|
||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having used similar code for Python 2/3 compatibility, I have been directed to use six
instead. Consistently using six
for Python 2/3 compatibility will facilitate code cleanup once Python 2 support is dropped.
return tuple(pkl) | ||
return OOSet(pkl) | ||
|
||
def _get_component_datum(self, obj, attr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This almost looks like get_object_datum
. Are you sure you need this special definition?
else: | ||
try: | ||
self.index_objectKeywords(documentId, newKeywords) | ||
except self.exceptions_treated_as_missing: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here does not yet seem correct: assume newKeywords
is a special value - but not one we want to support. It then goes into index_objectKeywords
(which will fail because it is a special value).
index=self.id)) | ||
if self.providesSpecialIndex(missing): | ||
newKeywords = missing | ||
self.insertSpecialIndexEntry(missing, documentId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here seems not yet correct: assume keywords "a", 1, "b". The index_objectKeywords
will have failed after it has indexed "a"
and add the document to the missing
index as well. First while keywords "a"
and "b"
are similar, they are not treated similar; second it may surprise to have an object both in a "normal" index as well as the "missing" index.
Despite the appearance, the logic could be right: you may already have ensured at a different place that newkeywords
contains only keywords of the same type. In this case, if index_objjectKeywords
fails at all, it will fail with the first keyword. I suggest to add a corresponding comment in this case.
doc_id=documentId, | ||
index=self.id)) | ||
|
||
if self.providesSpecialIndex(missing): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have already seen this quite complex logic before. I would centralize it (maybe in a locally defined function) to have it in a single place.
return value | ||
|
||
def index_objectKeywords(self, documentId, keywords): | ||
""" carefully index keywords of object with integer id 'documentId' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are no longer "carefull" here. Likely, there is no longer any need because you have ensured elsewhere that keywords is homogenous and the indexing will fail with the first element if it fails at all.
newSet = newKeywords = OOSet(newKeywords) | ||
|
||
try: | ||
fdiff = difference(oldSet, newSet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail if the keywords change type - and will let your index in a strange state. Assume "oldSet" to be OOSet(['a', 'b']), "newSet" to be
OOSet([1, 2]). Then under Python3, the
differencewill result in a
TypeErrorwhich lets your document remain indexed under
oldSetand gets newly indexed under
missing`.
Under Python 3, all indexed values must have a common type. Changing the keyword type will therefore not work (apart from constructed cases). Therefore, you should let an exception from the difference
calls propagate (maybe log and provide a more specific error message) and not turn it into missing
.
|
||
def unindex_objectKeywords(self, documentId, keywords): | ||
""" carefully unindex the object with integer id 'documentId'""" | ||
""" carefully unindex keywords of object with integer id 'documentId' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are not "carefull" here (drop the word).
special value query term.""" | ||
|
||
def map_value(value): | ||
""" Map value, which is typically not generically indexable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "is typically not" is wrong.
I recommend:
def map_value(value):
"""Map (original) value to the value that should get indexed.
The (original) value obtained from the object might not be indexable in the normal way.
`map_value` gives you the chance to map it to a different, usually a special value in this case.
"""
Andreas Gabriel wrote at 2019-6-26 00:34 -0700:
andbag commented on this pull request.
> for kw in newKeywords:
- self.insertForwardIndexEntry(kw, documentId)
- if newKeywords:
- self._unindex[documentId] = list(newKeywords)
- except TypeError:
- return 0
+ try:
+ self.insertForwardIndexEntry(kw, documentId)
+ keys.append(kw)
+ except TypeError:
@d-maurer I'm beginning to wonder if it wouldn't be more sensible to escalate TypeError when a value in the keyword list is incompatible with the already indexed values. Otherwise the new values would have to be pre-validated before being indexed.
I am with you in this regard.
|
As discussed in issue #35 UnIndex can now support queries on MissingValue and EmptyValue. KeywordIndex implements currently the the new feature. I hope for active feedback.