Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some queries for runtime fields #58940

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Jul 2, 2020

This implements a few queries for runtime fields. It doesn't wire them into the
runtime fields infrastructure yet because, well, we haven't committed that. But
the queries are there and they work. As do the doc values implementations.

You may wonder, Nik, why are you reimplementing all of these queries? Lucene
has tons of queries, some of them even against doc values. You make doc values,
why not just query the doc values you make with the Lucene queries? That'd be
a good question! It turns out that Lucene's doc_values queries are quite tied
up with Lucene's doc values implementations. They compare values using the
binary format that lucene stores the doc values in. Which is a great idea! But
it means that they only really work on Lucene's doc values implementations.

It also tries to only call each script once, even asserting that in most tests.
This isn't strictly required but I think it is useful and something that
would be difficult to design "after the fact". Still, this behavior shouldn't
block implementing the feature. We'll keep it if we can. But if we can't, well,
we'll drop and it and come back to it later.

relates to #59332

javanna and others added 24 commits July 1, 2020 17:22
Put together some example script contexts for scripted fields. This is
almost certainly wrong, but it gets us something to think about.
This reworks the script contexts to support multiple values for each
hit. Just like before, these are almost certainly not the final
implementation, but they give us something to iterate on.
Painless extensions work differently now.
Uses a temporary hack to allow us to use painless in unit tests.
@nik9000 nik9000 requested a review from javanna July 2, 2020 16:15
@nik9000 nik9000 marked this pull request as ready for review July 2, 2020 16:26
@Override
public float matchCost() {
// TODO we have no idea what this should be and no real way to get one
return 1000f;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the cost can be for now the approximation.cost() , it is proportional to maxDocs if I remember correctly

@@ -21,6 +21,8 @@

import org.elasticsearch.search.lookup.SearchLookup;

import java.io.IOException;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused import

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

if (ifd instanceof SearchLookupAware) {
((SearchLookupAware) ifd).setSearchLookup(searchLookup);
}
return ifd.load(reader).getScriptValues();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not foresee the need for these changes but I see how they are caused by me not changing fielddataBuilder and rather hacking query shard context. Just to double check: this is to support runtime fields that refer to other runtime fields, otherwise they have no search lookup set?

To keep this contained and have the hack in a single place, would it work to rather modify QueryShardContext#lookup to do the following? It may have weird consequences but in our branch with a big TODO to revert it it may be ok? Also let's mention in a comment specifically why this is needed?

public SearchLookup lookup() {
    if (lookup == null) {
        lookup = new SearchLookup(getMapperService(), this::getForField);
    }
    return lookup;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is what my TODO was about - maybe this should go through getForField. I can look into making that change in master. I think it is good regardless. And, if it has consequences that we didn't think of we may as well see them earlier rather than later...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this makes sense on master and what difference it would make, but I think it does make sense in our feature branch to isolate the hack around augmenting the fielddata impl.


@Override
public float matchCost() {
// TODO we don't have a good way of estimating the complexity of the script so we just go with 9000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the constant, what value does it add? I think that the important part is that a script needs to be run for each document, which is the cost approximation. Then one script can be heavier than another, but I wonder if that is negligible at this stage, unless we can calculate the approximate cost of a script.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the javadoc, I think it should actually just be some constant - matchCost says it is only for a single document.

If I just returned 1 here then I think Lucene'd reason about the query as though it were match_all, which is probably bad

return DocValues::new;
}

private class DocValues extends SortedBinaryDocValues {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is effectively the docvalues implementation that replaces the one that I had written which you deleted?

I was wondering, and not even sure if it matters, queries don't need sorted doc_values right? But we always implement the interface that we need for fielddata to simplify things? I know that I had implemented the lucene one for query but it made things more complex for almost no reason, I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is effectively the docvalues implementation that replaces the one that I had written which you deleted?

Yeah. I had mine sitting around and it had unit tests and stuff. Its the same deal.

I was wondering, and not even sure if it matters, queries don't need sorted doc_values right? But we always implement the interface that we need for fielddata to simplify things? I know that I had implemented the lucene one for query but it made things more complex for almost no reason, I think.

In the patch as it stands now the queries don't use this and look at the array of values directly. And they don't sort them unless they build this thing. When I was writing the queries it felt simpler not to target the doc values implementation - partially because most queries don't care about the values being sorted. And, if they did care about the values being sorted, then they'd want a sorted array. Which is what we use to build the doc values implementation anyway.

@nik9000 nik9000 force-pushed the runtime_field_string_term_query branch from be83b55 to e17fef8 Compare July 8, 2020 15:51
@@ -297,8 +297,7 @@ MappedFieldType failIfFieldMappingNotFound(String name, MappedFieldType fieldMap

public SearchLookup lookup() {
if (lookup == null) {
lookup = new SearchLookup(getMapperService(),
mappedFieldType -> indexFieldDataService.apply(mappedFieldType, fullyQualifiedIndex.getName()));
lookup = new SearchLookup(getMapperService(), this::getForField);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a TODO here, to remind us that we need to figure out if this should be merged upstream or not? once the searchlookupaware hack is removed and fielddatabuilder is adapted, we will not need this change anyways?

@@ -45,8 +45,7 @@

private int docId = -1;

LeafDocLookup(MapperService mapperService, Function<MappedFieldType, IndexFieldData<?>> fieldDataLookup,
LeafReaderContext reader) {
LeafDocLookup(MapperService mapperService, Function<MappedFieldType, IndexFieldData<?>> fieldDataLookup, LeafReaderContext reader) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you revert please? Trying to trim down changes to stuff that's outside of the runtime fields plugin

docLookup = new LeafDocLookup(mapperService,
ignored -> fieldData,
null);
docLookup = new LeafDocLookup(mapperService, ignored -> fieldData, null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, let's revert?

* <p>
* Inspired by the ForceNoBulkScoringQuery in Lucene's monitor project.
*/
class ForceNoBulkScoringQuery extends Query {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove this right?

@@ -27,7 +35,7 @@
private final StringScriptFieldScript.Factory scriptFactory;

RuntimeKeywordMappedFieldType(String name, Script script, StringScriptFieldScript.Factory scriptFactory, Map<String, String> meta) {
super(name, false, false, TextSearchInfo.NONE, meta);
super(name, false, false, TextSearchInfo.SIMPLE_MATCH_ONLY, meta);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why SIMPLE_MATCH_ONLY instead of NONE? I think we end up creating a lucene field type here which should not be needed?

public final class DoubleRuntimeFieldHelper {
@FunctionalInterface
public interface NewLeafLoader {
IntConsumer leafLoader(LeafReaderContext ctx, DoubleConsumer sync) throws IOException;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some of my confusion is also around where the consumer is placed I think, which forces these indirections. Ideally, the values consumer would be provided anew for each document, either as an execute argument or while we setDocument? Is there some technical reason why it needs to provided with the leaf reader context? Ideally, we would be able to abstract this away and even have in the script base class the collection of values, so that every script allows to simply retrieve the result by calling getResult(docId) ? Do you think that would be possible?

return new Values().new TermQuery(fieldName, value);
}

public Query termsQuery(String fieldName, double... value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I struggle to see value in these helper classes, but maybe this is due to the confusion I mentioned in other comments. I would have expected these methods implemented in the mapped field type, creating the queries there directly, and have top-level classes like ScriptTermQuery etc. maybe that is because I am hoping that the collection of values can be abstracted and exposed as part of the script object itself or some wrapper around it.

If wonder if this Values and abstractions are some reminiscence of the previous implementation that was trying to cache values.


@Override
protected void sort() {
Arrays.sort(values, 0, count);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorting is always done in the same way regardless of the data type right? I mean calling Arrays.sort

}

private class Values extends AbstractRuntimeValues {
private String[] values = new String[1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just have the collection of values shared between the different types and use a typed ArrayList? Would that be a tragedy in terms of performance?

return DocValues::new;
}

private class DocValues extends SortedBinaryDocValues {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like I said in other comments, I think that you have all the bits that are needed, I only would turn around a bit the dependencies/flow. I would keep the existing doc_values impl, which is only used in aggs/sorting (and it is totally fine that it's not used in queries), and I would make it lighter so that all it does is really returning the result of the script for the current doc, by calling that script.getResult(docId) method I mentioned above. I think that would clarify who does what. At the moment it is hard to follow who needs doc_values who needs them sorted etc.

@javanna javanna added the :Search/Search Search-related issues that do not fall into other categories label Jul 14, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jul 14, 2020
@javanna
Copy link
Member

javanna commented Jul 14, 2020

shall we close this @nik9000 ? I think that it was a good experiment and we can now proceed getting in what we need step by step.

@nik9000
Copy link
Member Author

nik9000 commented Jul 15, 2020

We got this in with #59630, #59527, and #59523.

@nik9000 nik9000 closed this Jul 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants