Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRAFT] Rebase keyed JSON ordinals to start from zero. #41220

Closed

Conversation

jtibshirani
Copy link
Contributor

@jtibshirani jtibshirani commented Apr 15, 2019

I opened this PR to give a sense of the difficulties involved in rebasing keyed JSON ordinals to lie in the range [0, (maxOrd - minOrd)]. It's not intended as an actual proposed change, as it has some hacky elements and is not well-tested.

The main complexity comes from the fact that in addition to rebasing the atomic field data, the underlying OrdinalMap must be rebased as well. There are two main issues:

  • With the current class structure, the ordinal map is accessed through IndexOrdinalsFieldData. So in order to rebase it, the min and max ords would need to be available from this top-level index field data. However, at this level, there is not an easy way to lookup global ords in order to compute the min and max. It might seem like it should be easy to lookup global ords, as all the right information is available in GlobalOrdinalsFieldData. But the index field data could instead be of the form of SortedSetDVOrdinalsIndexFieldData, where it is harder to add such a method. Instead of doing a large refactor to address that issue, this PR pushes the method getOrdinalMap down into AtomicOrdinalsFieldData, where the min and max ords are readily available.
  • For a particular segment, OrdinalMap translates from global to segment ordinals. We always rebase global ordinals, but there is a question of whether segment ordinals should be rebased as well. The implementation of getOrdinalMap is certainly simplest if the segment ordinals are not rebased, since it doesn't need to keep track of the min ords for each segment. However, it adds some complexity in that only sometimes rebase the keyed JSON ordinals. This PR explores that approach, and adds a flag rebaseOrdinals to KeyedJsonDocValues.

I was hoping to get your thoughts on whether this approach is worth pursuing. To me it feels quite complicated and I'm wondering if we should consider a simpler compromise. For example, we could still rebase, but disallow the underlying OrdinalMap from being accessed for keyed JSON fields (and turn off the optimizations that rely on it being available).

I am also still fairly new to this area of the code, so I wanted to check if I was misunderstanding something and accidentally overcomplicating the approach.

Relates to #40069 (comment).

@jtibshirani jtibshirani added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types labels Apr 15, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@jtibshirani
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/bwc
@elasticmachine run elasticsearch-ci/default-distro

@jtibshirani jtibshirani requested review from jimczi and jpountz April 15, 2019 23:36
@jpountz
Copy link
Contributor

jpountz commented Apr 16, 2019

With the current class structure, the ordinal map is accessed through IndexOrdinalsFieldData. So in order to rebase it, the min and max ords would need to be available from this top-level index field data. However, at this level, there is not an easy way to lookup global ords in order to compute the min and max.

Maybe another option would be to keep a getOrdinalMap() on IndexOrdinalsFieldData like today and introduce a way to map segment ordinals to global ordinals to AtomicOrdinalsFieldData, which seems to be the only operation we need?

there is a question of whether segment ordinals should be rebased as well

I think we should. It makes the implementation a bit more complex, but it also removes the risk that consumers accidentally try to access ordinals that conceptually belong to another field?

For example, we could still rebase, but disallow the underlying OrdinalMap from being accessed for keyed JSON fields (and turn off the optimizations that rely on it being available).

This sounds totally reasonable to me. +1 to this if it makes things simpler.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we use the OrdinalMap in 3 different places:

  • For parent/child where the Lucene OrdinalMap is explicitly needed.
  • For the low cardinality terms aggregator
  • To handle missing values in the terms aggregator

We don't need the first one for the json field and the other two don't require to have an explicit OrdinalMap, a LongUnaryOperator would be enough. I think we should hide the OrdinalMap behind a simple interface (LongUnaryOperator) and only let the parent/child query access it. For instance we could add a LongUnaryOperator getGlobalMapping() to AtomicOrdinalsFieldData that would use the ordinal map if the shard has multiple segments and the identity operator if there is a single segment. Because of the latter case I also think that we need to rebase the segment and global ordinals since we sometimes use segment ordinal as if they were global ordinals (single segment case). However it should be easier to implement if we rely on a LongUnaryOperator that is built at the leaf level.

@jtibshirani
Copy link
Contributor Author

I like the approach both of you suggested around creating a new method on AtomicOrdinalsFieldData that simply returns LongUnaryOperator (and avoiding touching IndexOrdinalsFieldData#getOrdinalMap). It certainly seems like this would help simplify things.

Because of the latter case I also think that we need to rebase the segment and global ordinals since we sometimes use segment ordinal as if they were global ordinals (single segment case)

Right, I had forgotten that because SortedSetDVOrdinalsIndexFieldData can serve as a global ordinals implementation, we certainly need to rebase segment ordinals! I don't see a very good way to do this yet with the approach above, as now the LongUnaryOperator will have to have knowledge of both global and segment ordinal offsets... I will look into it further.

To handle missing values in the terms aggregator

I thought that this segment to global ordinal mapping was only used in the low cardinality optimization. Sorry if I'm missing something, it looks like the only non-recursive use of WithOrdinals#globalOrdinalsMapping is in LowCardinality.

@jimczi
Copy link
Contributor

jimczi commented Apr 16, 2019

I thought that this segment to global ordinal mapping was only used in the low cardinality optimization. Sorry if I'm missing something, it looks like the only non-recursive use of WithOrdinals#globalOrdinalsMapping is in LowCardinality.

There is another call in MissingValues#replaceMissing but I don't understand what you mean by non-recursive?

@jtibshirani
Copy link
Contributor Author

jtibshirani commented Apr 16, 2019

I just meant that MissingValues#replaceMissing calls globalOrdinalsMapping only to implement its version of globalOrdinalsMapping.

@jimczi
Copy link
Contributor

jimczi commented Apr 16, 2019

Right, it's only used for the LowCardinality aggregator so it's irrelevant here, sorry for the confusion.
So to simplify things we could simply disallow accessing the OrdinalMap in the json field data and just remaps the segment ordinals to handle cases where there is a single segment ?

@jtibshirani
Copy link
Contributor Author

jtibshirani commented Apr 16, 2019

That works for me! To check I understand, we can do the following:

  • Disallow accessing getOrdinalMap for keyed JSON fields. This requires disabling the low cardinality optimization, but everything else should work.
  • Always rebase ordinals to lie in the range [0, (maxOrd - minOrd)], including both global + segment ords.

@jimczi
Copy link
Contributor

jimczi commented Apr 16, 2019

+1 for the plan, we also need to think of a way to make the terms aggregator aware that the low cardinality aggregator is not an option for the json field. Just disabling the access of the ordinal map is not enough since we pick the best execution mode based the cardinality of the field.

@jtibshirani
Copy link
Contributor Author

jtibshirani commented Apr 16, 2019

Makes sense to me, I will try to draft something up in a new PR and we can discuss there.

Thanks both of you for taking a look at this. @jimczi and I discussed this offline, but I found this code difficult to understand + work with -- separate from this work, it would be nice to brainstorm some ideas about how it could be refactored.

@jtibshirani jtibshirani deleted the rebase-global-ords branch April 16, 2019 23:59
@jtibshirani jtibshirani changed the title Rebase keyed JSON ordinals to start from zero. [DRAFT] Rebase keyed JSON ordinals to start from zero. Apr 17, 2019
jtibshirani added a commit that referenced this pull request Apr 17, 2019
This PR updates `KeyedJsonAtomicFieldData` to always return ordinals in the
range `[0, (maxOrd - minOrd)]`, which is necessary for certain aggregations and
sorting options to be supported.

As discussed in #41220, I opted not to support
`KeyedIndexFieldData#getOrdinalMap`, as it would add substantial complexity.
The one place this affects is the 'low cardinality' optimization for terms
aggregations, which now needs to be disabled for keyed JSON fields.

It was fairly difficult to incorporate this change, and I have a couple
follow-up refactors in mind to help simplify the global ordinals code. (I will
likely wait until this feature branch is merged though before opening PRs on
master).
jtibshirani added a commit that referenced this pull request May 1, 2019
This PR updates `KeyedJsonAtomicFieldData` to always return ordinals in the
range `[0, (maxOrd - minOrd)]`, which is necessary for certain aggregations and
sorting options to be supported.

As discussed in #41220, I opted not to support
`KeyedIndexFieldData#getOrdinalMap`, as it would add substantial complexity.
The one place this affects is the 'low cardinality' optimization for terms
aggregations, which now needs to be disabled for keyed JSON fields.

It was fairly difficult to incorporate this change, and I have a couple
follow-up refactors in mind to help simplify the global ordinals code. (I will
likely wait until this feature branch is merged though before opening PRs on
master).
jtibshirani added a commit that referenced this pull request May 24, 2019
This PR updates `KeyedJsonAtomicFieldData` to always return ordinals in the
range `[0, (maxOrd - minOrd)]`, which is necessary for certain aggregations and
sorting options to be supported.

As discussed in #41220, I opted not to support
`KeyedIndexFieldData#getOrdinalMap`, as it would add substantial complexity.
The one place this affects is the 'low cardinality' optimization for terms
aggregations, which now needs to be disabled for keyed JSON fields.

It was fairly difficult to incorporate this change, and I have a couple
follow-up refactors in mind to help simplify the global ordinals code. (I will
likely wait until this feature branch is merged though before opening PRs on
master).
jtibshirani added a commit that referenced this pull request May 29, 2019
This PR updates `KeyedJsonAtomicFieldData` to always return ordinals in the
range `[0, (maxOrd - minOrd)]`, which is necessary for certain aggregations and
sorting options to be supported.

As discussed in #41220, I opted not to support
`KeyedIndexFieldData#getOrdinalMap`, as it would add substantial complexity.
The one place this affects is the 'low cardinality' optimization for terms
aggregations, which now needs to be disabled for keyed JSON fields.

It was fairly difficult to incorporate this change, and I have a couple
follow-up refactors in mind to help simplify the global ordinals code. (I will
likely wait until this feature branch is merged though before opening PRs on
master).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants