Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New field type=alias including support for querying, aggs, suggestions + more read ops #30230

Closed
wants to merge 1 commit into from

Conversation

mP1
Copy link

@mP1 mP1 commented Apr 28, 2018

  • Field aliases #23714
  • attempts to index an alias field fail.
  • introduced MappedFieldType.nameForMessages(), nameForIndex(), fieldTypeForIndex(),
    in the right places these are used rather than just a "raw" field or MappedFieldType.name()

Goals and non goals.

  • Attempts to index into the alias field would result in an exception
    • it is read only
  • Queries
  • aggregations
    • "field" substituted with alias target.
    • "filter" fields substituted.
  • suggestions
  • scripts (using doc[]) (Not Done)
  • highlighting
  • fielddata_fields (deprecated/ Not Done)
  • docvalue_fields,
  • stored_fields would just get the data (and mapping) from the specified path
  • Source filtering would not work with the aliased field (Not done)
  • _source field when returned remains unmodified.
  • Have you signed the contributor license agreement?
    YES

  • Have you followed the contributor guidelines?
    YES

  • If submitting code, have you built your formula locally prior to submission with gradle check?
    YES

  • If submitting code, is your pull request against master? Unless there is a good reason otherwise, we prefer pull requests against master and will backport as needed.
    YES

  • If submitting code, have you checked that your submission is for an OS that we support?
    OSX(no platform dep code should work everywhere just fine)

  • If you are submitting this code for a class then read our policy for that.
    N/A

@karmi
Copy link
Contributor

karmi commented Apr 28, 2018

Hi @mP1, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in yout Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

@mP1
Copy link
Author

mP1 commented Apr 28, 2018

Im going to guess because this is not a trivial change this review will potentially be quite involved, due to me not being accustomed with the elastic way of doing things etc.

As a convenience in addition to the updated tests im including a fairly large list of operations to be used in kibana/curl to perform a quick smoke test of everything i "claim" has been done in my PR description.

PUT /index-xyz
{
	"mappings": {
		"document": {
			"properties": {
				"integer1": {
					"type": "integer"
				},
				"integer2": {
					"type": "integer"
				},
				"alias3": {
					"type": "alias",
					"path": "integer2"
				},
				"text4" : {
					"type": "text",
					"fielddata": true,
					"term_vector": "with_positions_offsets",
					"store": true
				},
				"alias5": {
				  "type": "alias",
				  "path": "text4"
				},
				"text6" : {
					"type": "text",
					"fielddata": true,
					"term_vector": "with_positions_offsets",
					"store": true
				}
			}
		}
	}
}

PUT /index-xyz/document/1
{
    "integer1": 11,
    "integer2": 101,
    "text4": "abc",
    "text6": "qqq"
}

PUT /index-xyz/document/2
{
    "integer1": 21,
    "integer2": 201,
    "text4": "def",
    "text6": "rrr"
}

PUT /index-xyz/document/3
{
    "integer1": 39,
    "integer2": 309,
    "text4": "abc",
    "text6": "sss"
}

PUT /index-xyz/document/4
{
    "text4": "abc",
    "text6": "sss"
}

GET /index-xyz

GET /index-xyz/_search


GET index-xyz/_search
{
    "query" : {
        "term" : { "alias5" : "abc" }
    }
}

GET /index-xyz/_search?q=alias5:abc

POST /index-xyz/_search?size=0
{
   "aggs": {
      "average-Integer1" : {
         "avg": { "field": "integer2" }
      }
   }
}

POST /index-xyz/_search?
{
   "aggs": {
      "average-Integer1" : {
         "avg": { "field": "alias3" }
      }
   }
}

POST /index-xyz/_search?
{
   "aggs": {
      "missing-Integer2" : {
         "missing" : { "field" : "integer2" }
      }
   }
}

POST /index-xyz/_search?
{
   "aggs": {
      "missing-Integer2" : {
         "missing" : { "field" : "alias3" }
      }
   }
}

POST /index-xyz/_search?
{
    "aggs" : {
        "integer2_ranges" : {
            "range" : {
                "field" : "integer2",
                "ranges" : [
                    { "from" : 1, "to" : 999 }
                ]
            }
        }
    }
}

POST /index-xyz/_search?
{
    "aggs" : {
        "integer2_ranges" : {
            "range" : {
                "field" : "alias3",
                "ranges" : [
                    { "from" : 1, "to" : 999 }
                ]
            }
        }
    }
}

POST /index-xyz/_search?size=0
{
    "aggs" : {
        "abc" : {
            "filter" : {
              "term": {
                "text4": "abc"
              }
            },
            "aggs" : {
                "avg_integer2": {
                  "avg" : { "field" : "integer2" }
                }
            }
        }
    }
}

POST /index-xyz/_search?size=0
{
    "aggs" : {
        "abc" : {
            "filter" : {
              "term": {
                "alias5": "abc"
              }
            },
            "aggs" : {
                "avg_alias3": {
                  "avg" : { "field" : "integer2" }
                }
            }
        }
    }
}

GET index-xyz/_search
{
  "aggs" : {
    "texttt" : {
      "filters" : {
        "filters" : {
          "text44" :   { "match" : { "text4" : "abc"   }},
          "text66" : { "match" : { "text6" : "qqq" }}
        }
      }
    }
  }
}

GET index-xyz/_search
{
  "aggs" : {
    "texttt" : {
      "filters" : {
        "filters" : {
          "text44" :   { "match" : { "alias5" : "abc"   }}
        }
      }
    }
  }
}

GET /index-xyz/_search
{
    "query" : {
        "match_all": {}
    },
    "docvalue_fields" : ["integer2", "alias3", "text4"]
}

GET /index-xyz/_search
{
    "query" : {
        "match_all": {}
    },
    "docvalue_fields" : ["alias3"]
}

GET /index-xyz/_search
{
    "query" : {
        "match_all": {}
    },
    "stored_fields" : ["integer2", "alias3", "text4"]
}

GET /index-xyz/_search
{
    "query" : {
        "match_all": {}
    },
    "stored_fields" : ["text4111"]
}

GET /index-xyz/_search
{
    "query" : {
        "match_all": {}
    },
    "stored_fields" : ["alias5"]
}

GET /index-xyz/_search
{
    "query" : {
        "match_all": {}
    },
    "stored_fields" : 123
}

POST /index-xyz/_search
{
  "suggest": {
    "text4-suggest" : {
      "text" : "abcd",
      "term" : {
        "field" : "text4"
      }
    },
    "alias5-to-text4-suggest" : {
      "text" : "abcd",
      "term" : {
        "field" : "alias5"
      }
    }
  }
}

POST /index-xyz/_search
{
  "suggest": {
    "my-suggest-1" : {
      "text" : "abcd",
      "term" : {
        "field" : "alias5"
      }
    }
  }
}

GET /index-xyz/_search
{
  "query": {
    "match": {
      "_all": "a" 
    }
  },
  "highlight": {
    "fields": {
      "text4": { 
        "require_field_match": false  
      }
    }
  }
}

GET /index-xyz/_search
{
  "query": {
    "match": {
      "alias5": "abc" 
    }
  },
  "highlight": {
    "fields": {
      "text4": { 
        "require_field_match": false,
        "type": "fvh"
      }
    }
  }
}

GET /index-xyz/_search
{
  "query": {
    "match": {
      "text4": "abc" 
    }
  },
  "highlight": {
    "fields": {
      "alias5": { 
        "require_field_match": false,
        "type": "fvh"
      }
    }
  }
}

GET /index-xyz/_search
{
  "query": {
    "match": {
      "alias5": "abc" 
    }
  },
  "highlight": {
    "fields": {
      "text4": { 
        "require_field_match": false,
        "type": "plain"
      }
    }
  }
}

GET /index-xyz/_search
{
  "query": {
    "match": {
      "text4": "abc" 
    }
  },
  "highlight": {
    "fields": {
      "alias5": { 
        "require_field_match": false,
        "type": "plain"
      }
    }
  }
}

GET /index-xyz/_search
{
  "query": {
    "match": {
      "alias5": "abc" 
    }
  },
  "highlight": {
    "fields": {
      "text4": { 
        "require_field_match": false,
        "type": "unified"
      }
    }
  }
}

GET /index-xyz/_search
{
  "query": {
    "match": {
      "text4": "abc" 
    }
  },
  "highlight": {
    "fields": {
      "alias5": { 
        "require_field_match": false,
        "type": "unified"
      }
    }
  }
}

@mP1
Copy link
Author

mP1 commented Apr 28, 2018

This comment will attempt to describe the "idea" in my approach to the support this new feature.

Ive introduced a new alias field mapping and type. The mapping performs various checks at the mapping phase and reports errors to the user. The core of the functionality is the new AliasFieldType which knows holds both the alias details and alias target, and delegates most methods to the target. New methods were also defined on MappingFieldType, with defaults that use the this.name(), except for AliasFieldType which does the right thing for an alias.

MappingFieldType.nameForIndex
This new method is ONLY overridden in AliasFieldType, the default for all other types is to return this.name. AFT fetches the nameForIndex of the "target" it is an alias for, theres obviously no point using the alias field when performing lucene operations as the alias field never exists there.

I have made changes to the system wherever it uses the "field" or "MappingFieldType" to build a lucene query, fetch a mapping and so on.

MappingFieldType.nameForMessages.
This new method is ONLY overridden in AliasFieldType, the default for all other types is to return this.name. THe purpose of this method is to produce error messages for the user, about a specific field.

Thus if the user was attempting an index update for a field for an alias, this is an error, and we want to tell them that the alias field name along with the error message. As a convenience in AFT, i return both the alias and the target field. This would look something like this

[alias] -> [target field name].

The default for MappingFieldType would just return the field name, eg

[this.name()]

Ive updated numerous places that build error messages removing the left/right bracket in their big string concat, and replaced this.name with this.nameForString() etc.

I always try and pass the AliasFieldType around and resolve to the target as late as possible which enables me hopefully to keep both the alias and target for messages, building responses and doing index operations. By introducing the above and a few other methods i can let polymorphism do the right thing bet it an index or response or message.

@@ -111,7 +111,7 @@ public synchronized void clearField(final String fieldName) {

@SuppressWarnings("unchecked")
public <IFD extends IndexFieldData<?>> IFD getForField(MappedFieldType fieldType, String fullyQualifiedIndexName) {
final String fieldName = fieldType.name();
final String fieldName = fieldType.nameForIndex();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an example where i always try and pass the real MappedFieldType even if its an AliasFieldType and let polymorphism pick the right "name" for the index operation.

if (format != null) {
throw new IllegalArgumentException("Field [" + name() + "] of type [" + typeName() + "] does not support custom formats");
this.failUnsupportedFeature("custom formats");
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a lot of places had the same "copy/paste" of this line so i extracted it into the a base class.

* @param root The enclosing {@link RootObjectMapper}
*/
protected void rootObjectMapper(final RootObjectMapper root) {
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This callback happens after all fields are processed and known to the context, and only then can we ask the alias to do its checks, does the target field exist etc.

*/
public Mapper mapperForPath(final String path) {
return this.pathToMapper.get(path);
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only really used by the alias to (validate) and fetch the target field its aliased too etc.

package org.elasticsearch.index.mapper;

public class AliasFieldTypeBooleanFieldTypeTests extends BooleanFieldTypeTests {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basically for all XXXFieldType tests i wanted to repeat them but this time using an alias to the "XXXFieldType". This way i can leverage all the BooleanFieldTypeTests, but this time both via the boolean field name and an alias.

@@ -24,6 +24,9 @@
import org.junit.Before;

public class BooleanFieldTypeTests extends FieldTypeTestCase {

protected static final String BOOLEANFIELD = "booleanField1";
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the BFTT uses this so its clear the field is pointing to a boolean field. This is particular useful for AliasFieldTypeBooleanFieldTypeTests so we can clearly see in the mapping/error messages whether we are seeing the name of the boolean or the alias.

THis same "refactoring" has been applied to all XXXFieldTypeTests and the matching sub class that uses aliases.

String date = "2015-10-12T14:10:55";
long instant = DateFieldMapper.DEFAULT_DATE_TIME_FORMATTER.parser().parseDateTime(date).getMillis();
ft.setIndexOptions(IndexOptions.DOCS);
Query expected = new IndexOrDocValuesQuery(
LongPoint.newRangeQuery("field", instant, instant + 999),
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a horrible name, it doesnt mean anything especially when our tests get a bit more complicated than just a mapping with one field, eg like my new alias sub classes of this test.

.endObject().endArray()
.endObject());
public void testFieldName() throws Exception {
checkFieldNameAndValidation("root", "branch", "leaf");
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THis testsuite had a lot of copy paste code (horrible) and i wanted to repeat the same tests but with aliases. To avoid copying basically the entire class i extracted the common stuff so all these tests simply pass parameters (the different stuff).


public abstract class FieldMapperTestCase extends ESSingleNodeTestCase {

public static String fieldPath(final String field, String...fields){
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THeres a lot of copy paste in all FieldMapperTests they really should be using a base class like this to avoid all that duplication and boilerplate noise.

package org.elasticsearch.index.mapper;

public final class MapperTesting {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introduced this so the ctor below can remain package private, but for the tests that need to create a AliasFieldType, they can call this "test only" factory.

throw new IllegalArgumentException(pathStartOrEndingWithDotAmbiguous(makeFullPath(path, context)));
}

return new String[] {path};
}
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pkg private static methods are used by some of my tests so they dont have to hard code literals in their assertions.

- #23714
- attempts to index an alias field fail.
- introduced MappedFieldType.nameForMessages(), nameForIndex(), fieldTypeForIndex(),
      in the right places these are used rather than just a "raw" field or MappedFieldType.name()

    Goals and non goals.
- Attempts to index into the alias field would result in an exception
  - it is read only
- Queries
- aggregations
  - "field" substituted with alias target.
  - "filter" fields substituted.
- suggestions
- scripts (using doc[]) (Not Done)
- highlighting
- fielddata_fields (deprecated/ Not Done)
- docvalue_fields,
- stored_fields would just get the data (and mapping) from the specified path
- Source filtering would not work with the aliased field (Not done)
- _source field when returned remains unmodified.
@@ -135,7 +135,7 @@ public Query existsQuery(QueryShardContext context) {

@Override
public Query termQuery(Object value, QueryShardContext context) {
throw new QueryShardException(context, "Murmur3 fields are not searchable: [" + name() + "]");
throw new QueryShardException(context, "Murmur3 fields are not searchable: " + this.nameForMessages());
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.nameForMessages auto includes the surround brackets. See other comments where i explain WHY this method was introduced.

}

// intended to only be overridden by AliasFieldMapper
void init()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introduced this method because AliasFieldType doesnt want to do the setTokenised and other defaults, so this is my way (via polymorphism) of stopping these defaults.

return (IFD) indexFieldDataService.apply(fieldType, fullyQualifiedIndexName);
Objects.requireNonNull(fieldType, "fieldType is null");

return (IFD) indexFieldDataService.apply(fieldType.fieldTypeForIndex(), fullyQualifiedIndexName);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a non alias field type will simply return its name, for aliases it will return the name of the alias target, and the rest will just work!

// we know its low level reader, and matching docId, since that's how we call the highlighter with
SourceLookup sourceLookup = searchContext.lookup().source();
sourceLookup.setSegmentAndDocument((LeafReaderContext) reader.getContext(), docId);

List<Object> values = sourceLookup.extractRawValues(mapper.fieldType().name());
final String fieldInIndex = mapper.fieldType().nameForIndex();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the right thing if an alias...

@colings86 colings86 added >feature :Search Foundations/Mapping Index mappings, including merging and defining field types labels May 1, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@mP1 mP1 changed the title New field type=alias including support for querying and most reading. New field type=alias including support for querying, aggs, suggestions + more read ops May 2, 2018
Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mP1, thank you so much for your contribution! As you mentioned, this is quite a substantial feature.

I’m sorry it took so long to review. Before jumping into details, I have a few high-level comments about the PR that should be addressed:

  • I agree that we should create something like an AliasFieldMapper to parse the mappings for the alias type, but I don’t think it makes sense to have an AliasFieldType — instead aliases should be linked to an existing concrete MappedFieldType. Conceptually, there isn’t really such a thing as an ‘alias field type’ , and creating the class AliasFieldType has led to some complexity, like having to pull out an init method, and introducing a MappedFieldType#fieldTypeForIndex method that consumers must remember to call.
  • The current way in which aliases are linked to their field types is a bit messy, with everything happening in the RootObjectMapper constructor. I also don’t think this approach lets you add a field mapping, then later add an alias to it in a separate mappings update. I think this linking needs to be pulled up closer to the MapperService level. I did a quick prototype where I instead made AliasFieldMapper extend Mapper directly, and pulled up the aliasing logic into FieldTypeLookup — this direction may be worth trying out.
  • I think that creating a nameForMessages vs. nameForIndex distinction adds more complexity than it’s worth. Let’s stick with a single field type name for now, until we see a strong reason to hold onto the alias name in terms of debuggability, etc.
  • It would be good to add unit/ integration tests that cover all of the cases in your test script. We can likely add cases that focus on field aliases to existing tests. I don’t think the current set of tests you added covers important changes around suggestions, sorting, etc.
  • The PR includes several refactors that aren’t core to the feature, like pulling out a FieldMapperTests base class and the changes to HighlighterSearchIT. To keep this (already substantial) PR well-scoped and possible to review, it would be good to remove the non-core changes, and potentially save those refactors for a different PR.

@mP1
Copy link
Author

mP1 commented May 24, 2018

I agree that we should create something like an AliasFieldMapper to parse the mappings for the alias type, but I don’t think it makes sense to have an AliasFieldType — instead aliases should be directly linked to an existing MappedFieldType.

The reason about AFM isnt about mappings, its about polymorphism to get the "right" field name when building queries, aggregations, error messages, processing results etc.

Conceptually, there isn’t really such a thing as an ‘alias field type’ ,

Sorry you are wrong, this entire ticket refers to alias field type as a concept.

and creating the class AliasFieldType has led to some complexity,

No, thats called object orientated programming. Just like all the other field type sub classes, im creating a new sub class so i can hide custom behaviour based on the type.

like having to pull out an init method,

The comment explains why this had to be done. The base class made some bad assumptions that arent always going to be true, such as introducing an AliasFieldType which doesnt want to clone those extra fields.

and introducing a MappedFieldType#fieldTypeForIndex method that consumers must remember to call.

You are looking at this all wrong. You are forgetting that there are dozens of places that need to pick the right name when they work with the index, creating a query, aggregate etc. All those dozens of places no longer care if they have an alias reference, they just call nameForIndex() and it does the right thing regardless whether its an alias or not. Same goes for the dozens of places building error messages, they dont need to check, they call the nameForMessages() and they get the right answer and their error messages are correct from the users perspective.

Any other solution is going to involve a lot more code, copy and pasted dozens of times.

@mP1
Copy link
Author

mP1 commented May 24, 2018

The current way in which aliases are linked to their field types is a bit messy, with everything happening in the RootObjectMapper constructor. I also don’t think this approach lets you add a field, then later add an alias to it in a mappings update.

It works, my sample of kibana operations shows this.

I did a quick prototype where I instead made AliasFieldMapper extend Mapper directly, and pulled up the aliasing logic into FieldTypeLookup — this direction may be worth trying out.

Bit hard without any code for me to see to know exactly what you mean...

I dont think you have looked hard at the problem of using the right "name" depending on whether the field is a simple field reference of an alias. If you followed it thru for all the use cases, its not as simple as what you think.

@jpountz
Copy link
Contributor

jpountz commented May 24, 2018

Many good things in this PR, I like the proposed API and existing restrictions such as requiring concrete fields (no objects) that fields already exist in the mappings, failing documents that have a value for index fields, etc.

There is one additional restriction that I would like to add which is to prevent source and targets of aliases to have different nested scopes. For instance it shouldn't be possible to alias a root field to the sub field of an object that uses a nested mapping or vice-versa. We already have such validation for copy_to, maybe we can reuse.

If someone creates mappings with an alias and then updates the mapping of the target field without re-submitting the alias field, will it be reflected on the alias field? I suspect not based on the diff, but maybe I missed the place where this is handled. I suspect we might need to introduce new cases to the merge logic of MapperService specifically for this.

I understand your motivation to do proper OO programming with the field type wrapper, but I'm afraid this will cause issues given the number of places where we do instanceof checks to make sure that a field is of the expected type. See for instance GeoBoundingBoxQueryBuilder which makes sure that the field type is a GeoPointFieldType or QueryStringQueryParser which checks whether a field type implementes StringFieldType to know whether prefix queries make sense. We could remove these instanceof checks in favor of new methods on MappedFieldType but at the same time this is the first time that we might need to wrap field types, and these instanceof calls tend to help keep the MappedFieldType API reasonable, maybe @jtibshirani 's idea of pointing directly to the target field type would help in this regard.

I agree with @jtibshirani that we might want to stick with a single name for now, which I think should be the name of the target field. My reasoning is that things like queries and aggregations already know the name that was used to adress the field, which is usually a member of the QueryBuilder or AggregationBuilder, so they don't need the MappedFieldType to provide it for them. It might need some tweaks to work correctly, such as catching exceptions that occur on the lower layers to add a new exception message that includes the name of the alias, but I think it's fine to do this on a best-effort basis: it doesn't sound wrong to me that a query on an alias complains about some configuration issue on the target of the alias.

Sorry you are wrong
No, thats called object orientated programming.

These two sentences look slightly aggressive. Written communication is hard, as the lack of tone to clarify your intent makes it easy for some statements to be misunderstood. Please feel free to state that your disagree, but in a way that targets a statement rather than someone (eg. *this* is wrong rather than *you* are wrong) and that explains the trade-off instead of trying to kill an argument (eg. In my opinion, following this object oriented programming principle makes the code easier to maintain in spite of verbosity).

@mP1
Copy link
Author

mP1 commented May 24, 2018

@jpountz

There is one additional restriction that I would like to add which is to prevent source and targets of aliases to have different nested scopes. For instance it shouldn't be possible to alias a root field to the sub field of an object that uses a nested mapping or vice-versa. We already have such validation for copy_to, maybe we can reuse.

One problem, the only example given by a ES team member, is exactly the opposite of what you say...

The very first example given in the ticket seems to contradict what you ask...the "alias" is hanging off the root to a target field deep inside a nested objectg.

clintongormley commented on 24 Mar 2017

We can introduce a new field type called alias which simply points to another field, eg:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "host": {
          "properties": {
            "source_ip": {
              "type": "ip"
            }
          }
        },
        "sourceIP": {
          "type": "alias",
          "path": "host.source_ip"
        }
      }
    }
  }
}

@mP1
Copy link
Author

mP1 commented May 24, 2018

@jpountz

If someone creates mappings with an alias and then updates the mapping of the target field without re-submitting the alias field, will it be reflected on the alias field? I suspect not based on the diff, but maybe I missed the place where this is handled. I suspect we might need to introduce new cases to the merge logic of MapperService specifically for this

Will have to add tests to verify...

Will do.

@mP1
Copy link
Author

mP1 commented May 24, 2018

@jpountz

I understand your motivation to do proper OO programming with the field type wrapper, but I'm afraid this will cause issues given the number of places where we do instanceof checks to make sure that a field is of the expected type. See for instance GeoBoundingBoxQueryBuilder which makes sure that the field type is a GeoPointFieldType or QueryStringQueryParser which checks whether a field type implementes StringFieldType to know whether prefix queries make sense.

Instance of tests in java code are a bad code smell, pretty sure its mentioned multiple times in Effective java etc. All those instanceof checks should be replaced by calls too methods isGeo(). Thats the proper OO way to know if something is a type.

Can fix.

@mP1
Copy link
Author

mP1 commented May 24, 2018

@jpountz

I agree with @jtibshirani that we might want to stick with a single name for now, which I think should be the name of the target field. My reasoning is that things like queries and aggregations already know the name that was used to adress the field, which is usually a member of the QueryBuilder or AggregationBuilder, so they don't need the MappedFieldType to provide it for them.

Please read my previous response...

The system needs to know both the alias AND the target field....

We want to return error messages with the alias name if the original request used the alias name, otherwise the user will be completely confused. If they do a search on a date field using an alias, and the format is invalid, how can the error message complaining about the format being invalid know what name to use in the text message if we dont pass it ? If we replace aliases with target very early on and work with that, our error messages will never have the alias when they need to be created.

My code does the right thing, error messages use aliases if thats what was used for problem field.

Same goes for aggregations, if we replace the alias with the target, our "answer" will not know the original alias because all it will have is the target.

There are other examples.

@mP1
Copy link
Author

mP1 commented May 24, 2018

@jpountz

Please feel free to state that your disagree, but in a way that targets a statement rather than someone (eg. this is wrong rather than you are wrong) and that explains the trade-off instead of trying to kill an argument (eg. In my opinion, following this object oriented programming principle makes the code easier to maintain in spite of verbosity).

Sorry i was trying to keep this brief, as i had previously written up a screenful of text explaining the approach and the benefits it gives to this enhancement, and addressing the very subject of the comment.

#30230 (comment)

@mP1
Copy link
Author

mP1 commented May 24, 2018

@jpountz

We could remove these instanceof checks in favor of new methods on MappedFieldType but at the same time this is the first time that we might need to wrap field types,

Unfortunately we need to wrap here, because we need to hold the alias and target field names for the reasons i mention in my large comment that accompanies the PR.

-> #30230 (comment)

and these instanceof calls tend to help keep the MappedFieldType API reasonable, maybe @jtibshirani 's idea of pointing directly to the target field type would help in this regard.

Exactly we need to enhance MFT to support alias fields.

@jpountz
Copy link
Contributor

jpountz commented May 24, 2018

The very first example given in the ticket seems to contradict what you ask...the "alias" is hanging off the root to a target field deep inside a nested objectg.

I don't think I disagree with Clinton. My point was only about objects that have a nested mapping, which are different since they are their own Lucene document, require different strategies at query time, aggregation time, etc. I don't mind aliasing a to b.c in { b : { c: 42 }} as long as b is mapped as object and not nested.

Instance of tests in java code are a bad code smell, pretty sure its mentioned multiple times in Effective java etc. All those instanceof checks should be replaced by calls too methods isGeo(). Thats the proper OO way to know if something is a type.

I would agree in general, but sometimes things are a bit more complicated. This would require to add many more methods to MappedFieldType. I'm not saying we won't end up doing this in the end, but I need to better weigh the pros and cons first.

We want to return error messages with the alias name if the original request used the alias name, otherwise the user will be completely confused. If they do a search on a date field using an alias, and the format is invalid, how can the error message complaining about the format being invalid know what name to use in the text message if we dont pass it ?

Maybe we can do differently. To reuse your example, let's assume a range query on a date field. The creation of the Lucene query object happens in DateFieldMapper.doToQuery where there is the following bit of code:

MappedFieldType mapper = context.fieldMapper(this.fieldName);

If MappedFieldType.name() remains the name of the field in the index, then at this point we have access to both the name of the field in the index via mapper.name() and the name of the alias via this.fieldName. So we could do something like this to expose the name of the alias in error messages:

MappedFieldType mapper = context.fieldMapper(this.fieldName);
try {
  Query query = mapper.rangeQuery(
                    from, to, includeLower, includeUpper,
                    relation, timeZone, forcedDateParser, context);
} catch (IllegalArgumentException e) {
  // The error message of this exception will refer to the target field.
  // Make sure to expose the name of the field that was used for querying in the error message.
  throw new IllegalArgumentException("Failed to create range query on field [" + this.fieldName + "]", e);
}

Sorry i was trying to keep this brief, as i had previously written up a screenful of text explaining the approach

No worries. I read your explanations, I think there is just disagreement about trade-offs, eg. you seem to care about following oo programming principles, while I'm ok with not following them when it allows to keep APIs simpler.

@mP1
Copy link
Author

mP1 commented May 25, 2018

@jpountz

I don't think I disagree with Clinton. My point was only about objects that have a nested mapping, which are different since they are their own Lucene document, require different strategies at query time, aggregation time, etc. I don't mind aliasing a to b.c in { b : { c: 42 }} as long as b is mapped as object and not nested.

I will get back to these outstanding issues, im just trying to resolve the advantages of my introducing AliasFieldType, because the outcome of that changes many things.

@mP1
Copy link
Author

mP1 commented May 25, 2018

Maybe we can do differently. To reuse your example, let's assume a range query on a date field. The creation of the Lucene query object happens in DateFieldMapper.doToQuery where there is the following bit of code:
MappedFieldType mapper = context.fieldMapper(this.fieldName);
If MappedFieldType.name() remains the name of the field in the index, then at this point we have access to both the name of the field in the index via mapper.name() and the name of the alias via this.fieldName. So we could do something like this to expose the name of the alias in error messages:
This approach doesn’t not work at all and my approach does. I will create a simple example, again using our example of a date field search which has an invalid format, and a “error message” needs to be reported.
Imagine the format is invalid and the throws is about to happen, note by this stage as you noted above it has the “index name”.
throw new IllegalArgumentException("Failed to create range query on field [" + this.fieldName + "]", e);
If the user searched with $alias, the field name here is actually the $indexName. If the user searched with the $indexName then here its also the $indexName.

The code context at this stage has no way of knowing which field name appeared in the original query. Our error message is always going to be wrong half the time. In aggregation results its also going to be wrong because it doesn’t know the original field name, its “assuming” … because it doesn’t know.

This is the very reason I introduced AliasFieldType so I don’t have both can I can pick the right “name” based on context, error messages, aggs, etc use FieldType.nameForMessages and “others” use FieldType.nameForIndex. There is never a “guess” or “mistake”, the code know exactly what the search query had etc.

You can verify this by putting breakpoints at a few places where my code is calling FieldType.nameForMessage() and when the breakpoint hits, try the execute expression, there is no way of knowing whether the original query had the alias or the target for a field with both.

Theres also another problem which leads to a LOT of changes.

MappedFieldType mapper = context.fieldMapper(this.fieldName);

The context object is not always available, please search my pull request for “nameForMessages” and check the method they exist. You will find a context is not present, and to “pass” on to that very spot would require many many changes along the code path. There are plenty more, maybe one or two are easy to fix, but there are STILL many that would require dozens ofchanges to pass the context in.

https://github.com/mP1/elasticsearch/blob/ebf1c8a1f545ce5f091d4b21f9b64816dd4bc871/server/src/main/java/org/elasticsearch/index/fielddata/IndexFieldDataService.java

https://github.com/mP1/elasticsearch/blob/ebf1c8a1f545ce5f091d4b21f9b64816dd4bc871/plugins/mapper-murmur3/src/main/java/org/elasticsearch/index/mapper/murmur3/Murmur3FieldMapper.java

@jpountz
Copy link
Contributor

jpountz commented May 25, 2018

Don't get me wrong, I understand your concerns and agree that displaying the right name may improve the user-experience. But on the other hand, this solution requires to add two new methods to MappedFieldType, and all implementations but one would just delegate to the name. This may look like a minor change, but we keep adding lots of features and if we had let them all add new APIs without questioning them, this software would be much harder to maintain than it is now.

@mP1
Copy link
Author

mP1 commented May 25, 2018

@jpountz

I understand your concerns and agree that displaying the right name may improve the user-experience.

No its far more important than that, aggregations and other features that require the field name like highlights etc will ALL be wrong in the same ways i described above for the error message case.

But on the other hand, this solution requires to add two new methods to MappedFieldType, and all implementations but one would just delegate to the name

I fail to see the problem of adding two methods, they actually read well and are self explaining. Javadoc could be added to summarise the problem they solve just to be sure.

Im sorry but all the suggestions presented are far more confusing, and whats more important they are ALL broken for the use case i have finished explaining in my last comment.

There is no other way to make this use case work.

This may look like a minor change, but we keep adding lots of features and if we had let them all add new APIs without questioning them, this software would be much harder to maintain than it is now.

So question away.

However your approach doesnt work at all, and would require LOTS of changes to pass the context around and that is only the start. We pass A FieldType instance for all fields, so it makes sense to let the instance know both the what i call nameForMessages and nameForIndex.

Theres nothing wrong with introducing a new type AliasFieldType and let polymorphism help is in giving answers about the field name for messages(and results like agg) and indexs (creating queries etc).

I believe you guys are trying to hard to introduce AliasFieldType and thats just creating more and more hacks all the place and in the end its fundamentally broken. If we followed what you suggest as a foundation and try to make it work the diff will be at least 5x larger, with code scattered in many many places. My diff however is pretty simple all things considered.

Basically the core of the code is in AliasFieldType, a few places use nameForMessage and nameForIndex and one or two other code blocks.

I know there are a few small outstanding things ive skipped for a few comments, but they are relatively easy to fix, once my approach is accepted.

@jpountz
Copy link
Contributor

jpountz commented May 25, 2018

I am aware of the points that you mentioned. To me the problem is that if the only way to implement this feature properly requires to add so much complexity to MappedFieldType, then maybe we shouldn't implement it. So I want to think about the other options that we have.

@jtibshirani
Copy link
Contributor

jtibshirani commented May 28, 2018

I wanted to clarify one of my original comments:

The current way in which aliases are linked to their field types is a bit messy, with everything happening in the RootObjectMapper constructor. I also don’t think this approach lets you add a field mapping, then later add an alias to it in a separate mappings update.

Here is a short reproduction which currently results in the error "Property [path] is not mapped to a field path=[old_field]":

PUT index
{
    "mappings" : {
        "type" : {
            "properties" : {
                "old_field" : { "type" : "keyword", "store": true, "index": false }
            }
        }
    }
}

PUT index/_mapping/type
{
    "properties" : {
        "new_field" : { "type" : "alias", "path": "old_field"}
    }
}

Ideally this sequence of commands would work, as put mappings requests should be allowed to reference other field mappings not in the request. One example where this pattern currently shows up is in field mappings with copy_to.

Finally, I'll plan to clean up the prototype I mentioned above and link it here so that we can get a better sense of our options + the associated trade-offs.

@jtibshirani
Copy link
Contributor

jtibshirani commented May 29, 2018

Here is a prototype that explores the idea of creating a top-level Mapper type for field alias mappers: jtibshirani#1. More details about the advantages + disadvantages of the approach can be found on the PR itself -- it would be great if you took a look.

@jtibshirani
Copy link
Contributor

Given the advantages of the prototype linked above, and that we feel we can adequately address its disadvantages, I would vote that we instead proceed with the approach there. @jpountz @mP1 let me know what you think in terms of moving forward.

@jpountz
Copy link
Contributor

jpountz commented May 31, 2018

Sounds good to me. Let's try to fix highligting to work on field types rather than mappers first maybe? Also how do we want aliases to interact with wildcards eventually?

@jtibshirani
Copy link
Contributor

jtibshirani commented May 31, 2018

I will take a crack at the highlighter fix in a separate PR. Regardless of what happens with field aliases, moving highlighters to use MappedFieldType instead of FieldMapper seems like a good refactor.

I think that ideally we'd support matches on aliases in all places where field wildcards are supported (and aliases are supported). Both approaches proposed should support field wildcards in queries and field capabilities. Like I mentioned on the prototype PR, there is some trickiness around supporting wildcards when fetching fields -- I'll take a closer look at this as well.

@jtibshirani
Copy link
Contributor

At this point, I think we should open a new PR with the alternative approach. @mP1 would you be interested in taking the approach in the prototype and trying to move it to completion (writing tests, adding mapping validation, etc.)? If not, I am happy to take that up.

@mP1
Copy link
Author

mP1 commented Jun 5, 2018

@jtibshirani I think it would be best if you complete the tests, given you know exactly from a conceptual viewpoint what needs to be done.

@jtibshirani
Copy link
Contributor

Sounds good. Development has now been moved to the branch jtibshirani:feature/field-aliases -- will open a PR soon.

@mP1
Copy link
Author

mP1 commented Jun 5, 2018

@jtibshirani
May i suggest that you update the field type and similar tests to use proper names, rather than "foo" and "bar" etc. In tests involving aliases and the actual target, calling them foo and bar doesnt make for very readable messages, at least if the fields were called after their type and the same for aliases with a number its instantly clear what the message is talking about. Currenlty any message telling me foo is not found or similar, isnt really clear whether its complaining about the "target" or the "alias".

@jtibshirani
Copy link
Contributor

Closing this PR in favor of #31287, which follows the alternative design we discussed. Thank you @mP1 for your work + suggestions on this issue, it really helped in thinking through how we should approach it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search Foundations/Mapping Index mappings, including merging and defining field types
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants