[DRAFT] Rebase keyed JSON ordinals to start from zero. #41220
Conversation
Pinging @elastic/es-search
@elasticmachine run elasticsearch-ci/bwc
Maybe another option would be to keep a
I think we should. It makes the implementation a bit more complex, but it also removes the risk that consumers accidentally try to access ordinals that conceptually belong to another field?
This sounds totally reasonable to me. +1 to this if it makes things simpler.
Currently we use the `OrdinalMap` in 3 different places:
- For parent/child, where the Lucene `OrdinalMap` is explicitly needed.
- For the low cardinality terms aggregator.
- To handle missing values in the terms aggregator.

We don't need the first one for the json field, and the other two don't require an explicit `OrdinalMap`: a `LongUnaryOperator` would be enough. I think we should hide the `OrdinalMap` behind a simple interface (`LongUnaryOperator`) and only let the parent/child query access it. For instance, we could add a `LongUnaryOperator getGlobalMapping()` to `AtomicOrdinalsFieldData` that would use the ordinal map if the shard has multiple segments, and the identity operator if there is a single segment. Because of the latter case, I also think that we need to rebase the segment and global ordinals, since we sometimes use segment ordinals as if they were global ordinals (the single-segment case). However, it should be easier to implement if we rely on a `LongUnaryOperator` that is built at the leaf level.
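The `getGlobalMapping()` idea above could be sketched roughly as follows. This is a hypothetical illustration, not the actual Elasticsearch API: the `long[]` lookup table stands in for a real Lucene `OrdinalMap`, and the method name mirrors the one proposed in the comment.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical sketch: hide the OrdinalMap behind a LongUnaryOperator so
// consumers never touch the map directly. The long[] array stands in for
// OrdinalMap#getGlobalOrds(segmentIndex) on a real shard.
public class GlobalMappingSketch {

    static LongUnaryOperator getGlobalMapping(long[] segmentToGlobal, int segmentCount) {
        if (segmentCount == 1) {
            // Single segment: segment ordinals already are global ordinals,
            // so the identity operator suffices (the case discussed above).
            return LongUnaryOperator.identity();
        }
        // Multiple segments: translate a segment ordinal to its global ordinal.
        return segmentOrd -> segmentToGlobal[(int) segmentOrd];
    }

    public static void main(String[] args) {
        long[] mapping = {3, 5, 9}; // segment ords 0..2 map to global ords 3, 5, 9
        LongUnaryOperator multi = getGlobalMapping(mapping, 2);
        LongUnaryOperator single = getGlobalMapping(mapping, 1);
        System.out.println(multi.applyAsLong(1));  // 5
        System.out.println(single.applyAsLong(1)); // 1
    }
}
```

Only the parent/child query would keep access to the underlying `OrdinalMap`; everything else would see just the operator.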
I like the approach both of you suggested around creating a new method on
Right, I had forgotten that because
I thought that this segment to global ordinal mapping was only used in the low cardinality optimization. Sorry if I'm missing something, it looks like the only non-recursive use of
There is another call in
I just meant that
Right, it's only used for the LowCardinality aggregator, so it's irrelevant here, sorry for the confusion.
That works for me! To check I understand, we can do the following:
+1 for the plan. We also need to think of a way to make the terms aggregator aware that the low cardinality aggregator is not an option for the json field. Just disabling access to the ordinal map is not enough, since we pick the best execution mode based on the cardinality of the field.
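The point above can be sketched as follows. This is illustrative only: the names `ExecutionMode`, `choose`, `supportsGlobalOrdinalMap`, and the cardinality threshold are all hypothetical, not the actual Elasticsearch code, but they show why hiding the ordinal map alone is insufficient; the mode chooser itself must know the optimization is unavailable.

```java
// Hypothetical sketch: execution mode is chosen from the field's cardinality,
// so the chooser must be told the low-cardinality mode is off the table for
// keyed json fields, even when the cardinality would otherwise qualify.
public class ExecutionModeChooser {

    enum ExecutionMode { MAP, GLOBAL_ORDINALS, GLOBAL_ORDINALS_LOW_CARDINALITY }

    static ExecutionMode choose(long cardinality, long maxOrd, boolean supportsGlobalOrdinalMap) {
        // Illustrative threshold; the real heuristic differs.
        if (supportsGlobalOrdinalMap && cardinality <= maxOrd / 100) {
            // The low-cardinality optimization needs the OrdinalMap, so it
            // must be skipped when the field cannot expose one.
            return ExecutionMode.GLOBAL_ORDINALS_LOW_CARDINALITY;
        }
        return ExecutionMode.GLOBAL_ORDINALS;
    }

    public static void main(String[] args) {
        System.out.println(choose(10, 10_000, true));  // GLOBAL_ORDINALS_LOW_CARDINALITY
        System.out.println(choose(10, 10_000, false)); // GLOBAL_ORDINALS
    }
}
```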
Makes sense to me, I will try to draft something up in a new PR and we can discuss there. Thanks both of you for taking a look at this. @jimczi and I discussed this offline, but I found this code difficult to understand and work with -- separate from this work, it would be nice to brainstorm some ideas about how it could be refactored.
This PR updates `KeyedJsonAtomicFieldData` to always return ordinals in the range `[0, (maxOrd - minOrd)]`, which is necessary for certain aggregations and sorting options to be supported. As discussed in #41220, I opted not to support `KeyedIndexFieldData#getOrdinalMap`, as it would add substantial complexity. The one place this affects is the 'low cardinality' optimization for terms aggregations, which now needs to be disabled for keyed JSON fields. It was fairly difficult to incorporate this change, and I have a couple follow-up refactors in mind to help simplify the global ordinals code. (I will likely wait until this feature branch is merged though before opening PRs on master).
I opened this PR to give a sense of the difficulties involved in rebasing keyed JSON ordinals to lie in the range `[0, (maxOrd - minOrd)]`. It's not intended as an actual proposed change, as it has some hacky elements and is not well-tested.

The main complexity comes from the fact that in addition to rebasing the atomic field data, the underlying `OrdinalMap` must be rebased as well. There are two main issues:

- The `OrdinalMap` is currently exposed through `IndexOrdinalsFieldData`. So in order to rebase it, the min and max ords would need to be available from this top-level index field data. However, at this level there is not an easy way to look up global ords in order to compute the min and max. It might seem like it should be easy to look up global ords, as all the right information is available in `GlobalOrdinalsFieldData`. But the index field data could instead be of the form of `SortedSetDVOrdinalsIndexFieldData`, where it is harder to add such a method. Instead of doing a large refactor to address that issue, this PR pushes the method `getOrdinalMap` down into `AtomicOrdinalsFieldData`, where the min and max ords are readily available.
- The `OrdinalMap` translates from global to segment ordinals. We always rebase global ordinals, but there is a question of whether segment ordinals should be rebased as well. The implementation of `getOrdinalMap` is certainly simplest if the segment ordinals are not rebased, since it doesn't need to keep track of the min ords for each segment. However, it adds some complexity in that we only sometimes rebase the keyed JSON ordinals. This PR explores that approach, and adds a flag `rebaseOrdinals` to `KeyedJsonDocValues`.

I was hoping to get your thoughts on whether this approach is worth pursuing. To me it feels quite complicated, and I'm wondering if we should consider a simpler compromise. For example, we could still rebase, but disallow the underlying `OrdinalMap` from being accessed for keyed JSON fields (and turn off the optimizations that rely on it being available). I am also still fairly new to this area of the code, so I wanted to check if I was misunderstanding something and accidentally overcomplicating the approach.
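The rebasing itself is a simple shift. A minimal sketch, with hypothetical names (`RebaseSketch`, `rebased`) standing in for the actual field-data classes: a keyed field's global ordinals occupy `[minOrd, maxOrd]` in the shared ordinal space, and the doc-values view shifts them to start at zero. The boolean flag mirrors the `rebaseOrdinals` idea, where segment ordinals may be left un-rebased while global ordinals are always shifted.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical sketch of the ordinal rebasing described above. Shifting by
// minOrd maps the keyed field's global ordinals [minOrd, maxOrd] onto
// [0, maxOrd - minOrd]; the flag models only sometimes rebasing.
public class RebaseSketch {

    static LongUnaryOperator rebased(long minOrd, boolean rebaseOrdinals) {
        return rebaseOrdinals ? ord -> ord - minOrd : LongUnaryOperator.identity();
    }

    public static void main(String[] args) {
        long minOrd = 42, maxOrd = 45;
        LongUnaryOperator global = rebased(minOrd, true);
        // Global ords 42..45 become 0..3, i.e. the range [0, maxOrd - minOrd].
        System.out.println(global.applyAsLong(42)); // 0
        System.out.println(global.applyAsLong(45)); // 3
        // With rebasing disabled, ordinals pass through unchanged.
        System.out.println(rebased(minOrd, false).applyAsLong(42)); // 42
    }
}
```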
Relates to #40069 (comment).