[DRAFT] Rebase keyed JSON ordinals to start from zero. #41220

jtibshirani · 2019-04-15T20:47:19Z

I opened this PR to give a sense of the difficulties involved in rebasing keyed JSON ordinals to lie in the range [0, (maxOrd - minOrd)]. It's not intended as an actual proposed change, as it has some hacky elements and is not well-tested.

The main complexity comes from the fact that in addition to rebasing the atomic field data, the underlying OrdinalMap must be rebased as well. There are two main issues:

With the current class structure, the ordinal map is accessed through IndexOrdinalsFieldData. So in order to rebase it, the min and max ords would need to be available from this top-level index field data. However, at this level, there is no easy way to look up global ords in order to compute the min and max. It might seem like it should be easy to look up global ords, as all the right information is available in GlobalOrdinalsFieldData. But the index field data could instead be of the form of SortedSetDVOrdinalsIndexFieldData, where it is harder to add such a method. Instead of doing a large refactor to address that issue, this PR pushes the method getOrdinalMap down into AtomicOrdinalsFieldData, where the min and max ords are readily available.
For a particular segment, OrdinalMap translates from segment to global ordinals. We always rebase global ordinals, but there is a question of whether segment ordinals should be rebased as well. The implementation of getOrdinalMap is certainly simplest if the segment ordinals are not rebased, since it doesn't need to keep track of the min ords for each segment. However, it adds some complexity in that we only sometimes rebase the keyed JSON ordinals. This PR explores that approach, and adds a flag rebaseOrdinals to KeyedJsonDocValues.
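
For readers new to the idea, the rebasing itself can be sketched in a few lines. This is only an illustration, not the code in this PR: the class name `RebasedOrdinals` and its methods are hypothetical, and it assumes the ordinals belonging to one key occupy a contiguous range [minOrd, maxOrd] in the underlying field.

```java
/**
 * Hypothetical helper (not the PR's actual class): shifts the ordinals of a
 * single key so that they start from zero.
 * Assumption: the key's ordinals form the contiguous range [minOrd, maxOrd].
 */
final class RebasedOrdinals {
    private final long minOrd; // smallest underlying ordinal that belongs to the key
    private final long maxOrd; // largest underlying ordinal that belongs to the key

    RebasedOrdinals(long minOrd, long maxOrd) {
        if (minOrd < 0 || maxOrd < minOrd) {
            throw new IllegalArgumentException("invalid ordinal range");
        }
        this.minOrd = minOrd;
        this.maxOrd = maxOrd;
    }

    /** Number of ordinals visible for the key. */
    long getValueCount() {
        return maxOrd - minOrd + 1;
    }

    /** Maps an underlying ordinal to the rebased range [0, maxOrd - minOrd]. */
    long rebase(long underlyingOrd) {
        if (underlyingOrd < minOrd || underlyingOrd > maxOrd) {
            throw new IllegalArgumentException("ordinal belongs to another key");
        }
        return underlyingOrd - minOrd;
    }

    /** Maps a rebased ordinal back to the underlying one, e.g. to look up its term. */
    long unrebase(long rebasedOrd) {
        if (rebasedOrd < 0 || rebasedOrd > maxOrd - minOrd) {
            throw new IllegalArgumentException("rebased ordinal out of range");
        }
        return rebasedOrd + minOrd;
    }
}
```

The difficulty described above is not this arithmetic, but the fact that any segment-to-global mapping has to apply the same shift consistently.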

I was hoping to get your thoughts on whether this approach is worth pursuing. To me it feels quite complicated and I'm wondering if we should consider a simpler compromise. For example, we could still rebase, but disallow the underlying OrdinalMap from being accessed for keyed JSON fields (and turn off the optimizations that rely on it being available).

I am also still fairly new to this area of the code, so I wanted to check if I was misunderstanding something and accidentally overcomplicating the approach.

Relates to #40069 (comment).

elasticmachine · 2019-04-15T20:47:21Z

Pinging @elastic/es-search

jtibshirani · 2019-04-15T21:47:15Z

@elasticmachine run elasticsearch-ci/bwc
@elasticmachine run elasticsearch-ci/default-distro

jpountz · 2019-04-16T07:33:57Z

> With the current class structure, the ordinal map is accessed through IndexOrdinalsFieldData. So in order to rebase it, the min and max ords would need to be available from this top-level index field data. However, at this level, there is not an easy way to lookup global ords in order to compute the min and max.

Maybe another option would be to keep a getOrdinalMap() on IndexOrdinalsFieldData like today and add a way to map segment ordinals to global ordinals on AtomicOrdinalsFieldData, which seems to be the only operation we need?

> there is a question of whether segment ordinals should be rebased as well

I think we should. It makes the implementation a bit more complex, but it also removes the risk that consumers accidentally try to access ordinals that conceptually belong to another field.

> For example, we could still rebase, but disallow the underlying OrdinalMap from being accessed for keyed JSON fields (and turn off the optimizations that rely on it being available).

This sounds totally reasonable to me. +1 to this if it makes things simpler.

jimczi

Currently we use the OrdinalMap in 3 different places:

For parent/child where the Lucene OrdinalMap is explicitly needed.
For the low cardinality terms aggregator
To handle missing values in the terms aggregator

We don't need the first one for the json field, and the other two don't require an explicit OrdinalMap; a LongUnaryOperator would be enough. I think we should hide the OrdinalMap behind a simple interface (LongUnaryOperator) and only let the parent/child query access it. For instance, we could add a LongUnaryOperator getGlobalMapping() to AtomicOrdinalsFieldData that would use the ordinal map if the shard has multiple segments and the identity operator if there is a single segment. Because of the latter case, I also think that we need to rebase both the segment and global ordinals, since we sometimes use segment ordinals as if they were global ordinals (single segment case). However, it should be easier to implement if we rely on a LongUnaryOperator that is built at the leaf level.
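
A rough sketch of what such a getGlobalMapping() could look like. To keep it self-contained, a plain `long[]` stands in for the per-segment view of Lucene's OrdinalMap; the class name `GlobalMappingSketch` and the use of `null` to signal the single-segment case are assumptions made for illustration only.

```java
import java.util.function.LongUnaryOperator;

/** Illustrative only: hides the segment-to-global lookup behind a LongUnaryOperator. */
final class GlobalMappingSketch {

    /**
     * @param segmentToGlobal stand-in for the ordinal map of one segment
     *        (index = segment ordinal, value = global ordinal); null is used
     *        here to mean "the shard has a single segment" (an assumption)
     */
    static LongUnaryOperator getGlobalMapping(long[] segmentToGlobal) {
        if (segmentToGlobal == null) {
            // Single segment: segment ordinals already are global ordinals.
            return LongUnaryOperator.identity();
        }
        // Multiple segments: translate through the per-segment table.
        return segmentOrd -> segmentToGlobal[Math.toIntExact(segmentOrd)];
    }
}
```

With this shape, consumers such as the terms aggregator would only see a LongUnaryOperator and never the OrdinalMap itself.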

jtibshirani · 2019-04-16T17:42:48Z

I like the approach both of you suggested around creating a new method on AtomicOrdinalsFieldData that simply returns LongUnaryOperator (and avoiding touching IndexOrdinalsFieldData#getOrdinalMap). It certainly seems like this would help simplify things.

> Because of the latter case I also think that we need to rebase the segment and global ordinals since we sometimes use segment ordinal as if they were global ordinals (single segment case)

Right, I had forgotten that because SortedSetDVOrdinalsIndexFieldData can serve as a global ordinals implementation, we certainly need to rebase segment ordinals! I don't see a very good way to do this yet with the approach above, as now the LongUnaryOperator will have to have knowledge of both global and segment ordinal offsets... I will look into it further.
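
To make the concern concrete, here is one way the two offsets could be combined. It is a sketch under assumptions, not a worked-out design: `RebasedGlobalMapping`, `segmentMinOrd`, and `globalMinOrd` are hypothetical names for the wrapper and for the smallest segment and global ordinals belonging to the key.

```java
import java.util.function.LongUnaryOperator;

/** Illustrative only: wraps an un-rebased segment-to-global mapping with both offsets. */
final class RebasedGlobalMapping {

    /**
     * @param segmentToGlobal mapping over the underlying (un-rebased) ordinals
     * @param segmentMinOrd   smallest segment ordinal of the key (hypothetical parameter)
     * @param globalMinOrd    smallest global ordinal of the key (hypothetical parameter)
     */
    static LongUnaryOperator rebase(LongUnaryOperator segmentToGlobal,
                                    long segmentMinOrd,
                                    long globalMinOrd) {
        return rebasedSegmentOrd -> {
            // Undo the segment-level shift to get back to the underlying ordinal.
            long segmentOrd = rebasedSegmentOrd + segmentMinOrd;
            // Translate to the underlying global ordinal.
            long globalOrd = segmentToGlobal.applyAsLong(segmentOrd);
            // Apply the global-level shift so the result starts from zero.
            return globalOrd - globalMinOrd;
        };
    }
}
```

In the single-segment case the wrapped mapping is the identity and the two offsets are equal, so the composed operator is also the identity on rebased ordinals.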

> To handle missing values in the terms aggregator

I thought that this segment to global ordinal mapping was only used in the low cardinality optimization. Sorry if I'm missing something, it looks like the only non-recursive use of WithOrdinals#globalOrdinalsMapping is in LowCardinality.

jimczi · 2019-04-16T17:54:52Z

> I thought that this segment to global ordinal mapping was only used in the low cardinality optimization. Sorry if I'm missing something, it looks like the only non-recursive use of WithOrdinals#globalOrdinalsMapping is in LowCardinality.

There is another call in MissingValues#replaceMissing but I don't understand what you mean by non-recursive?

jtibshirani · 2019-04-16T18:03:23Z

I just meant that MissingValues#replaceMissing calls globalOrdinalsMapping only to implement its version of globalOrdinalsMapping.

jimczi · 2019-04-16T18:11:47Z

Right, it's only used for the LowCardinality aggregator so it's irrelevant here, sorry for the confusion.
So to simplify things, we could simply disallow accessing the OrdinalMap in the json field data and just remap the segment ordinals to handle cases where there is a single segment?

jtibshirani · 2019-04-16T18:15:50Z

That works for me! To check I understand, we can do the following:

Disallow accessing getOrdinalMap for keyed JSON fields. This requires disabling the low cardinality optimization, but everything else should work.
Always rebase ordinals to lie in the range [0, (maxOrd - minOrd)], including both global + segment ords.

jimczi · 2019-04-16T19:56:57Z

+1 for the plan. We also need to think of a way to make the terms aggregator aware that the low cardinality aggregator is not an option for the json field. Just disabling access to the ordinal map is not enough, since we pick the best execution mode based on the cardinality of the field.
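
As a sketch of the kind of check this implies: the field data would need to expose whether a segment-to-global mapping is available, and the mode selection would consult that before looking at cardinality. Everything here is hypothetical, including the class name, the `supportsGlobalOrdinalsMapping` flag, and the threshold value; it is not how the terms aggregator is actually written.

```java
/** Illustrative only: picks an execution mode without assuming the ordinal map exists. */
final class ExecutionModeSketch {

    enum Mode { GLOBAL_ORDINALS, GLOBAL_ORDINALS_LOW_CARDINALITY }

    // Hypothetical cut-off, chosen only so the example has something to compare against.
    static final long LOW_CARDINALITY_THRESHOLD = 1024;

    /**
     * @param maxOrd number of distinct ordinals for the field
     * @param supportsGlobalOrdinalsMapping hypothetical flag; false for keyed JSON fields
     */
    static Mode choose(long maxOrd, boolean supportsGlobalOrdinalsMapping) {
        if (supportsGlobalOrdinalsMapping && maxOrd <= LOW_CARDINALITY_THRESHOLD) {
            return Mode.GLOBAL_ORDINALS_LOW_CARDINALITY;
        }
        // Fall back to the regular mode when the mapping is unavailable,
        // even if the field has few distinct values.
        return Mode.GLOBAL_ORDINALS;
    }
}
```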

jtibshirani · 2019-04-16T20:08:36Z

Makes sense to me, I will try to draft something up in a new PR and we can discuss there.

Thanks both of you for taking a look at this. @jimczi and I discussed this offline, but I found this code difficult to understand + work with -- separate from this work, it would be nice to brainstorm some ideas about how it could be refactored.

This PR updates `KeyedJsonAtomicFieldData` to always return ordinals in the range `[0, (maxOrd - minOrd)]`, which is necessary for certain aggregations and sorting options to be supported. As discussed in #41220, I opted not to support `KeyedIndexFieldData#getOrdinalMap`, as it would add substantial complexity. The one place this affects is the 'low cardinality' optimization for terms aggregations, which now needs to be disabled for keyed JSON fields. It was fairly difficult to incorporate this change, and I have a couple follow-up refactors in mind to help simplify the global ordinals code. (I will likely wait until this feature branch is merged though before opening PRs on master).

jtibshirani added 4 commits April 15, 2019 12:35

Push getOrdinalMap down into AtomicOrdinalsFieldData.

9afed4e

Introduce a wrapper class for OrdinalMap, along with a rebased version.

88bb82c

Rebase global ordinals to lie in the range [0, (maxOrd-minOrd)].

1a7c496

Add unit tests that exercise ordinal rebasing.

3cee100

jtibshirani added the >enhancement and :Search Foundations/Mapping labels Apr 15, 2019

jtibshirani requested review from jimczi and jpountz April 15, 2019 23:36

jimczi reviewed Apr 16, 2019

jtibshirani closed this Apr 16, 2019

jtibshirani deleted the rebase-global-ords branch April 16, 2019 23:59

jtibshirani mentioned this pull request Apr 17, 2019

Rebase keyed JSON ordinals to start from zero. #41282

Merged

jtibshirani changed the title ~~Rebase keyed JSON ordinals to start from zero.~~ [DRAFT] Rebase keyed JSON ordinals to start from zero. Apr 17, 2019
