Multiplexing token filter #31208
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,113 @@ | ||
[[analysis-multiplexer-tokenfilter]] | ||
=== Multiplexer Token Filter | ||
|
||
A token filter of type `multiplexer` will emit multiple tokens at the same position, | ||
each version of the token having been run through a different filter. | ||
|
||
[float] | ||
=== Options | ||
[horizontal] | ||
filters:: a list of token filters to apply to incoming tokens. These can be any | ||
token filters defined elsewhere in the index mappings. Filters can be chained | ||
using a comma-delimited string, so for example `"lowercase, porter_stem"` would | ||
apply the `lowercase` filter and then the `porter_stem` filter to a single token. | ||
WARNING: Shingle or multi-word synonym token filters will not function normally | ||
when they are declared in the filters array because they read ahead internally | ||
which is unsupported by the multiplexer | ||
preserve_original:: if `true` (the default) then emit the original token in | ||
addition to the filtered tokens | ||
|
||
[float] | ||
=== Settings example | ||
|
||
You can set it up like: | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
PUT /multiplexer_example | ||
{ | ||
"settings" : { | ||
"analysis" : { | ||
"analyzer" : { | ||
"my_analyzer" : { | ||
"tokenizer" : "standard", | ||
"filter" : [ "my_multiplexer" ] | ||
} | ||
}, | ||
"filter" : { | ||
"my_multiplexer" : { | ||
"type" : "multiplexer", | ||
"filters" : [ "lowercase", "lowercase, porter_stem" ] | ||
} | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// CONSOLE | ||
|
||
And test it like: | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
POST /multiplexer_example/_analyze | ||
{ | ||
"analyzer" : "my_analyzer", | ||
"text" : "Going HOME" | ||
} | ||
-------------------------------------------------- | ||
// CONSOLE | ||
// TEST[continued] | ||
|
||
And it'd respond: | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
{ | ||
"tokens": [ | ||
{ | ||
"token": "Going", | ||
"start_offset": 0, | ||
"end_offset": 5, | ||
"type": "<WORD>", | ||
"position": 1 | ||
}, | ||
{ | ||
"token": "going", | ||
"start_offset": 0, | ||
"end_offset": 5, | ||
"type": "<WORD>", | ||
"position": 1 | ||
}, | ||
{ | ||
"token": "go", | ||
"start_offset": 0, | ||
"end_offset": 5, | ||
"type": "<WORD>", | ||
"position": 1 | ||
}, | ||
{ | ||
"token": "HOME", | ||
"start_offset": 6, | ||
"end_offset": 10, | ||
"type": "<WORD>", | ||
"position": 2 | ||
}, | ||
{ | ||
"token": "home", | ||
Hmmmm. It might be nice to add a callout (looks like
I was going to add a deduplication step as well to remove this confusion, but then I noticed that we don't seem to have the deduplication filter exposed in ES anywhere. I'll open another issue for that, as I think it will be very useful combined with a multiplexer.
I wonder whether we might want to remove duplicates by default (or even enforce it). Otherwise, e.g., terms that are not modified through lowercasing or stemming will artificially get higher term freqs?
I'd prefer to document the duplicates and have folks use the deduplicating filter that @romseygeek proposed today. I like documenting it so folks understand what costs they are paying. Also, if the token stream comes with duplicates on the way into this token filter, then adding a deduplicating filter by default would deduplicate the existing duplicates as a side effect, right?
Sorry, I just saw this message. I don't think the documentation path gives the best user experience, as I can't think of a use case for retaining duplicates if multiple filters produce the same token.
That said, I agree that simply adding a deduplicating token filter feels wrong if the original stream has duplicates, so maybe this is something that needs to be implemented directly in this new token filter. |
||
"start_offset": 6, | ||
"end_offset": 10, | ||
"type": "<WORD>", | ||
"position": 2 | ||
}, | ||
{ | ||
"token": "home", | ||
"start_offset": 6, | ||
"end_offset": 10, | ||
"type": "<WORD>", | ||
"position": 2 | ||
} | ||
] | ||
} | ||
-------------------------------------------------- | ||
// TESTRESPONSE |
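To make the duplicate `home` tokens discussed in the review thread concrete, here is a minimal, self-contained Java sketch. It is not the PR's code and does not use Lucene: terms are plain strings, `toyStem` is a hypothetical stand-in for `porter_stem`, and `dedupe` only illustrates what removing repeated terms at a single position could look like.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class MultiplexDedupSketch {

    // Hypothetical toy stemmer: strips a trailing "ing". This is NOT porter_stem.
    static String toyStem(String term) {
        return term.endsWith("ing") ? term.substring(0, term.length() - 3) : term;
    }

    // Branches mirroring the docs example: "lowercase" and "lowercase, porter_stem"
    // (with the toy stemmer standing in for porter_stem).
    static List<UnaryOperator<String>> defaultBranches() {
        return List.of(
            t -> t.toLowerCase(Locale.ROOT),
            t -> toyStem(t.toLowerCase(Locale.ROOT)));
    }

    // Emits the original term first, then one variant per branch, in branch order.
    static List<String> multiplex(String term, List<UnaryOperator<String>> branches) {
        List<String> out = new ArrayList<>();
        out.add(term); // corresponds to preserving the original token
        for (UnaryOperator<String> branch : branches) {
            out.add(branch.apply(term));
        }
        return out;
    }

    // Drops repeated terms emitted for one position, keeping first-seen order.
    static List<String> dedupe(List<String> termsAtOnePosition) {
        return new ArrayList<>(new LinkedHashSet<>(termsAtOnePosition));
    }

    public static void main(String[] args) {
        for (String term : List.of("Going", "HOME")) {
            List<String> raw = multiplex(term, defaultBranches());
            System.out.println(raw + " -> " + dedupe(raw));
        }
    }
}
```

In this sketch `HOME` yields `HOME`, `home`, `home` before deduplication, because both branches lowercase the term and the toy stemmer leaves it unchanged. Deduplicating per position only touches what the multiplexer itself emitted, which is the distinction raised in the thread about duplicates already present in the incoming stream.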
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,200 @@ | ||
/* | ||
* Licensed to Elasticsearch under one or more contributor | ||
* license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright | ||
* ownership. Elasticsearch licenses this file to you under | ||
* the Apache License, Version 2.0 (the "License"); you may | ||
* not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, | ||
* software distributed under the License is distributed on an | ||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
* KIND, either express or implied. See the License for the | ||
* specific language governing permissions and limitations | ||
* under the License. | ||
*/ | ||
|
||
package org.elasticsearch.analysis.common; | ||
|
||
import org.apache.lucene.analysis.TokenFilter; | ||
import org.apache.lucene.analysis.TokenStream; | ||
import org.apache.lucene.analysis.miscellaneous.ConditionalTokenFilter; | ||
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; | ||
import org.elasticsearch.common.Strings; | ||
import org.elasticsearch.common.settings.Settings; | ||
import org.elasticsearch.env.Environment; | ||
import org.elasticsearch.index.IndexSettings; | ||
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory; | ||
import org.elasticsearch.index.analysis.ReferringFilterFactory; | ||
import org.elasticsearch.index.analysis.TokenFilterFactory; | ||
import org.elasticsearch.indices.analysis.AnalysisModule; | ||
|
||
import java.io.IOException; | ||
import java.io.UncheckedIOException; | ||
import java.util.ArrayList; | ||
import java.util.List; | ||
import java.util.Map; | ||
import java.util.function.Function; | ||
import java.util.function.Supplier; | ||
import java.util.stream.Collectors; | ||
|
||
public class MultiplexingTokenFilterFactory extends AbstractTokenFilterFactory implements ReferringFilterFactory { | ||
|
||
private List<TokenFilterFactory> filters; | ||
private List<String> filterNames; | ||
private final boolean preserveOriginal; | ||
|
||
private static final TokenFilterFactory IDENTITY_FACTORY = new TokenFilterFactory() { | ||
@Override | ||
public String name() { | ||
return "identity"; | ||
} | ||
|
||
@Override | ||
public TokenStream create(TokenStream tokenStream) { | ||
return tokenStream; | ||
} | ||
}; | ||
|
||
public MultiplexingTokenFilterFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) throws IOException { | ||
super(indexSettings, name, settings); | ||
this.filterNames = settings.getAsList("filters"); | ||
this.preserveOriginal = settings.getAsBoolean("preserveOriginal", true); | ||
use underscore case instead? |
||
} | ||
|
||
@Override | ||
public TokenStream create(TokenStream tokenStream) { | ||
List<Function<TokenStream, TokenStream>> functions = new ArrayList<>(); | ||
for (TokenFilterFactory tff : filters) { | ||
functions.add(tff::create); | ||
} | ||
return new MultiplexTokenFilter(tokenStream, functions); | ||
} | ||
|
||
@Override | ||
public void setReferences(Map<String, TokenFilterFactory> factories) { | ||
I wonder if It'd be cleaner if
Maybe a thing to do in a followup because it's large but almost entirely mechanical.
It would also be backwards breaking, so would have to be a master-only change, I think? TokenFilterFactory is an API you're encouraged to use via analysis plugins.
We have a special tag for things that break the plugin or java client APIs: "breaking-java". We allow ourselves to use it much more liberally. One day we will have a properly semver-ed plugin API but we don't have one now. |
||
filters = new ArrayList<>(); | ||
if (preserveOriginal) { | ||
filters.add(IDENTITY_FACTORY); | ||
} | ||
for (String filter : filterNames) { | ||
String[] parts = Strings.tokenizeToStringArray(filter, ","); | ||
if (parts.length == 1) { | ||
filters.add(resolveFilterFactory(factories, parts[0])); | ||
} | ||
else { | ||
List<TokenFilterFactory> chain = new ArrayList<>(); | ||
for (String subfilter : parts) { | ||
chain.add(resolveFilterFactory(factories, subfilter)); | ||
} | ||
filters.add(chainFilters(filter, chain)); | ||
} | ||
} | ||
} | ||
|
||
private TokenFilterFactory chainFilters(String name, List<TokenFilterFactory> filters) { | ||
return new TokenFilterFactory() { | ||
@Override | ||
public String name() { | ||
return name; | ||
} | ||
|
||
@Override | ||
public TokenStream create(TokenStream tokenStream) { | ||
for (TokenFilterFactory tff : filters) { | ||
tokenStream = tff.create(tokenStream); | ||
} | ||
return tokenStream; | ||
} | ||
}; | ||
} | ||
|
||
private TokenFilterFactory resolveFilterFactory(Map<String, TokenFilterFactory> factories, String name) { | ||
if (factories.containsKey(name) == false) { | ||
throw new IllegalArgumentException("Multiplexing filter [" + name() + "] refers to undefined tokenfilter [" + name + "]"); | ||
} | ||
else { | ||
return factories.get(name); | ||
} | ||
} | ||
|
||
private final class MultiplexTokenFilter extends TokenFilter { | ||
|
||
private final TokenStream source; | ||
private final int filterCount; | ||
|
||
private int selector; | ||
|
||
/** | ||
* Creates a MultiplexTokenFilter on the given input with a set of filters | ||
*/ | ||
MultiplexTokenFilter(TokenStream input, List<Function<TokenStream, TokenStream>> filters) { | ||
super(input); | ||
TokenStream source = new MultiplexerFilter(input); | ||
for (int i = 0; i < filters.size(); i++) { | ||
final int slot = i; | ||
source = new ConditionalTokenFilter(source, filters.get(i)) { | ||
@Override | ||
protected boolean shouldFilter() { | ||
return slot == selector; | ||
} | ||
}; | ||
} | ||
this.source = source; | ||
this.filterCount = filters.size(); | ||
this.selector = filterCount - 1; | ||
} | ||
|
||
@Override | ||
public boolean incrementToken() throws IOException { | ||
return source.incrementToken(); | ||
} | ||
|
||
@Override | ||
public void end() throws IOException { | ||
source.end(); | ||
} | ||
|
||
@Override | ||
public void reset() throws IOException { | ||
source.reset(); | ||
} | ||
|
||
private final class MultiplexerFilter extends TokenFilter { | ||
|
||
State state; | ||
PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class); | ||
|
||
private MultiplexerFilter(TokenStream input) { | ||
super(input); | ||
} | ||
|
||
@Override | ||
public boolean incrementToken() throws IOException { | ||
if (selector >= filterCount - 1) { | ||
selector = 0; | ||
if (input.incrementToken() == false) { | ||
return false; | ||
} | ||
state = captureState(); | ||
return true; | ||
} | ||
restoreState(state); | ||
posIncAtt.setPositionIncrement(0); | ||
selector++; | ||
return true; | ||
} | ||
|
||
@Override | ||
public void reset() throws IOException { | ||
super.reset(); | ||
selector = filterCount - 1; | ||
this.state = null; | ||
} | ||
} | ||
|
||
} | ||
} |
This is seriously one of my favorite APIs in Elasticsearch.
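The comma-delimited chain handling in `setReferences` and `chainFilters` above can be illustrated in isolation. The sketch below is a simplification under stated assumptions, not the PR's implementation: filters are modelled as `UnaryOperator<String>` rather than `TokenFilterFactory`, and `toyRegistry`, `toy_stem`, `resolve`, and `parseChain` are hypothetical names introduced only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class FilterChainSketch {

    // Looks up a filter by name, failing loudly on an undefined reference,
    // in the same spirit as resolveFilterFactory in the diff.
    static UnaryOperator<String> resolve(Map<String, UnaryOperator<String>> registry, String name) {
        UnaryOperator<String> filter = registry.get(name);
        if (filter == null) {
            throw new IllegalArgumentException("refers to undefined filter [" + name + "]");
        }
        return filter;
    }

    // Splits a spec such as "lowercase, toy_stem" on commas, trims each name,
    // and composes the resolved filters left to right.
    static UnaryOperator<String> parseChain(Map<String, UnaryOperator<String>> registry, String spec) {
        UnaryOperator<String> chain = UnaryOperator.identity();
        for (String part : spec.split(",")) {
            String name = part.trim();
            if (name.isEmpty()) {
                continue; // ignore empty segments such as a trailing comma
            }
            UnaryOperator<String> next = resolve(registry, name);
            UnaryOperator<String> prev = chain;
            chain = term -> next.apply(prev.apply(term));
        }
        return chain;
    }

    // Hypothetical registry; "toy_stem" is a stand-in and not a real Elasticsearch filter.
    static Map<String, UnaryOperator<String>> toyRegistry() {
        Map<String, UnaryOperator<String>> registry = new LinkedHashMap<>();
        registry.put("lowercase", t -> t.toLowerCase(Locale.ROOT));
        registry.put("toy_stem", t -> t.endsWith("ing") ? t.substring(0, t.length() - 3) : t);
        return registry;
    }

    public static void main(String[] args) {
        UnaryOperator<String> chain = parseChain(toyRegistry(), "lowercase, toy_stem");
        System.out.println(chain.apply("Going"));
    }
}
```

Order matters in the composition: the toy stemmer only matches a lowercase `ing` suffix, so it has to run after `lowercase`, just as the docs example lists `lowercase` before `porter_stem`.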