Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE][ML] Fetch from source when fields are more then docvalue limit #43204

Conversation

dimitris-athanasiou
Copy link
Contributor

There is an index level setting called index.max_docvalue_fields_search which limits the number of doc_value fields that can be returned from a search. This commit changes behaviour of the extractor so that it's now reading the value of that setting and if there are more fields we switch to fetching the fields from _source.

@dimitris-athanasiou dimitris-athanasiou added the :ml Machine learning label Jun 13, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this adds bugs for TimeField and GeoPointField extractors.

@@ -86,6 +88,10 @@ public ExtractedFields detect() {
if (extractedFields.getAllFields().isEmpty()) {
throw ExceptionsHelper.badRequestException("No compatible fields could be detected in index [{}]", index);
}
if (extractedFields.getAllFields().size() > docValueFieldsLimit) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be extractedFields.getDocValueFields().size()?

Copy link
Contributor Author

@dimitris-athanasiou dimitris-athanasiou Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few lines above we filter only doc_value fields, so it makes no difference. But I see how it'd be clearer if I made this change. I shall!

@@ -86,6 +88,10 @@ public ExtractedFields detect() {
if (extractedFields.getAllFields().isEmpty()) {
throw ExceptionsHelper.badRequestException("No compatible fields could be detected in index [{}]", index);
}
if (extractedFields.getAllFields().size() > docValueFieldsLimit) {
extractedFields = new ExtractedFields(extractedFields.getAllFields().stream().map(ExtractedField::newFromSource)
.collect(Collectors.toList()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that we only really need to transform as many as necessary to drop below the docValueFieldsLimit, and then only those that that supportsFromSource().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we only really need to transform as many as necessary to drop below the docValueFieldsLimit

As long as we have to touch the source, we might as well fetch them all from there. Performance-wise it's going to be better.

and then only those that that supportsFromSource()

But this is a very good poing. In case we have fields that don't support from-source, we should insist taking them from doc_values. Then we'll also need a check to see if we're still over the limit.


@Override
protected boolean supportsFromSource() {
return getExtractionMethod() == ExtractionMethod.DOC_VALUE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we are assuming that anything that is FromFields that has ExtractionMethod.DOC_VALUE also supports extraction via _source? This seems OK to me for the most part, except for GeoPointField, which may need to override supportsFromSource() to always return false.

I think we may want to do something similar with TimeField

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Will adjust.

@dimitris-athanasiou dimitris-athanasiou force-pushed the fetch-from-source-when-too-many-fields branch from 02c056e to a3f79e0 Compare June 14, 2019 11:38
@dimitris-athanasiou dimitris-athanasiou force-pushed the fetch-from-source-when-too-many-fields branch from a3f79e0 to 5c5819e Compare June 14, 2019 12:19
Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🏑 🌱
:shipit:

@dimitris-athanasiou dimitris-athanasiou merged commit eced353 into elastic:feature-ml-data-frame-analytics Jun 14, 2019
@dimitris-athanasiou dimitris-athanasiou deleted the fetch-from-source-when-too-many-fields branch June 14, 2019 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants