Add support for inlined user dictionary in Nori #36123

Merged (12 commits) on Dec 7, 2018
2 changes: 1 addition & 1 deletion docs/plugins/analysis-nori.asciidoc
@@ -173,7 +173,7 @@ PUT nori_sample
"nori_user_dict": {
"type": "nori_tokenizer",
"decompound_mode": "mixed",
"user_dictionary_rules": ["c++", "C샤프", "세종", "세종시", "세종", "시"]
"user_dictionary_rules": ["c++", "C샤프", "세종", "세종시 세종 시"]
}
},
"analyzer": {
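The inlined rule above uses the Nori user-dictionary line format: the first whitespace-separated token is the surface form ("세종시"), and any remaining tokens are its decompounded segmentation ("세종", "시"). A minimal standalone sketch of that split (`RuleFormat` is an illustrative name, not plugin code):

```java
import java.util.Arrays;

// Illustrative parser for a Nori user-dictionary rule such as "세종시 세종 시":
// the first token is the surface form, the rest is its segmentation.
public class RuleFormat {
    static String surface(String rule) {
        return rule.trim().split("\\s+")[0];
    }

    static String[] segmentation(String rule) {
        String[] tokens = rule.trim().split("\\s+");
        return Arrays.copyOfRange(tokens, 1, tokens.length);
    }

    public static void main(String[] args) {
        String rule = "세종시 세종 시";
        System.out.println(surface(rule));                       // 세종시
        System.out.println(Arrays.toString(segmentation(rule))); // [세종, 시]
    }
}
```

A rule with no segmentation, such as "c++", simply yields the surface form and an empty segmentation array.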
@@ -31,8 +31,10 @@
import java.io.Reader;
import java.io.StringReader;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class NoriTokenizerFactory extends AbstractTokenizerFactory {
private static final String USER_DICT_PATH_OPTION = "user_dictionary";
@@ -48,32 +50,24 @@ public NoriTokenizerFactory(IndexSettings indexSettings, Environment env, String
}

public static UserDictionary getUserDictionary(Environment env, Settings settings) {
if (settings.get(USER_DICT_PATH_OPTION) != null && settings.get(USER_DICT_RULES_OPTION) != null) {
throw new ElasticsearchException("It is not allowed to use [" + USER_DICT_PATH_OPTION + "] in conjunction" +
" with [" + USER_DICT_RULES_OPTION + "]");
Member: What is the reason you decided to not use this check anymore? I cannot find it in the refactored method.

Contributor Author: Good catch, thanks. I restored this check in 5fcfad4.


List<String> ruleList = Analysis.getWordList(env, settings, USER_DICT_PATH_OPTION, USER_DICT_RULES_OPTION);
StringBuilder sb = new StringBuilder();
if (ruleList == null || ruleList.isEmpty()) {
return null;
}
String path = settings.get(USER_DICT_PATH_OPTION);
if (path != null) {
try (Reader rulesReader = Analysis.getReaderFromFile(env, settings, USER_DICT_PATH_OPTION)) {
return rulesReader == null ? null : UserDictionary.open(rulesReader);
} catch (IOException e) {
throw new ElasticsearchException("failed to load nori user dictionary", e);
}
} else {
List<String> rulesList = settings.getAsList(USER_DICT_RULES_OPTION, Collections.emptyList(), false);
if (rulesList == null || rulesList.size() == 0) {
return null;
}
StringBuilder sb = new StringBuilder();
for (String line : rulesList) {
sb.append(line).append(System.lineSeparator());
}
try (Reader rulesReader = new StringReader(sb.toString())) {
return UserDictionary.open(rulesReader);
} catch (IOException e) {
throw new ElasticsearchException("failed to load nori user dictionary", e);
// check for duplicate terms
Member: Where does the requirement for checking for duplicates come from? Would it make sense to simply ignore them if they happen, or to log it instead of throwing an error? I can imagine people compiling larger dictionaries where duplicates might sneak in over time; maybe this shouldn't block loading the whole analyzer?

Contributor Author: I wanted to align Kuromoji and Nori regarding duplicates in the user dictionary. The Japanese tokenizer doesn't accept duplicates (#36100) while Nori just ignores them. However, I agree that this is not in the scope of this PR, so I removed the duplicate detection and will open a new PR as a follow-up.

Set<String> terms = new HashSet<>();
for (String line : ruleList) {
String[] split = line.split("\\s+");
if (terms.add(split[0]) == false) {
throw new IllegalArgumentException("Found duplicate term: [" + split[0] + "] in user dictionary. ");
Member: If this stays an error, or maybe gets converted to a log message, the line number in the file would be helpful debugging information for the user.

Contributor Author: Agreed, I'll make sure we report the line number in the follow-up PR.

}
sb.append(line).append(System.lineSeparator());
}
try (Reader rulesReader = new StringReader(sb.toString())) {
return UserDictionary.open(rulesReader);
} catch (IOException e) {
throw new ElasticsearchException("failed to load nori user dictionary", e);
}
}
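As a standalone illustration of the duplicate check above, combined with the reviewer's suggestion to report the line number, here is a hedged sketch (`DuplicateCheck` and the exact message format are assumptions for illustration, not the shipped code; the PR itself deferred the line-number reporting to a follow-up):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative duplicate scan over user-dictionary rules that also reports
// the 1-based line number of the offending rule, as suggested in the review.
public class DuplicateCheck {
    static void checkDuplicates(List<String> rules) {
        Set<String> terms = new HashSet<>();
        for (int i = 0; i < rules.size(); i++) {
            // The first whitespace-separated token of a rule is the term.
            String term = rules.get(i).split("\\s+")[0];
            if (terms.add(term) == false) {
                throw new IllegalArgumentException("Found duplicate term [" + term
                    + "] in user dictionary at line [" + (i + 1) + "]");
            }
        }
    }

    public static void main(String[] args) {
        checkDuplicates(List.of("c++", "C샤프", "세종시 세종 시")); // no duplicates, passes
        try {
            checkDuplicates(List.of("세종", "C샤프", "세종"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // reports the duplicate at line 3
        }
    }
}
```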

Expand Down
@@ -38,6 +38,7 @@
import java.nio.file.Files;
import java.nio.file.Path;

import static org.hamcrest.Matchers.containsString;
import static org.hamcrest.Matchers.instanceOf;

public class NoriAnalysisTests extends ESTokenStreamTestCase {
@@ -82,15 +83,25 @@ public void testNoriAnalyzerUserDict() throws Exception {
.build();
TestAnalysis analysis = createTestAnalysis(settings);
Analyzer analyzer = analysis.indexAnalyzers.get("my_analyzer");
try (TokenStream stream = analyzer.tokenStream("", "세종시" )) {
assertTokenStreamContents(stream, new String[] {"세종", "시"});
try (TokenStream stream = analyzer.tokenStream("", "세종시")) {
assertTokenStreamContents(stream, new String[]{"세종", "시"});
}

try (TokenStream stream = analyzer.tokenStream("", "c++world")) {
assertTokenStreamContents(stream, new String[] {"c++", "world"});
assertTokenStreamContents(stream, new String[]{"c++", "world"});
}
}

public void testNoriAnalyzerUserDictWithDuplicates() throws Exception {
Settings settings = Settings.builder()
.put("index.analysis.analyzer.my_analyzer.type", "nori")
.putList("index.analysis.analyzer.my_analyzer.user_dictionary_rules", "세종", "C샤프", "세종", "세종 세 종")
.build();
IllegalArgumentException exc = expectThrows(IllegalArgumentException.class, () -> createTestAnalysis(settings));
assertThat(exc.getMessage(), containsString("Found duplicate term: [세종]"));

}

public void testNoriAnalyzerUserDictPath() throws Exception {
Settings settings = Settings.builder()
.put("index.analysis.analyzer.my_analyzer.type", "nori")
Expand Up @@ -221,10 +221,21 @@ public static CharArraySet getWordSet(Environment env, Settings settings, String
* If the word list cannot be found at either key.
*/
public static List<String> getWordList(Environment env, Settings settings, String settingPrefix) {
String wordListPath = settings.get(settingPrefix + "_path", null);
return getWordList(env, settings, settingPrefix + "_path", settingPrefix);
}

/**
* Fetches a list of words from the specified settings file. The list should either be available at the key
* specified by <code>settingList</code> or in a file specified by <code>settingPath</code>.
*
* @throws IllegalArgumentException
* If the word list cannot be found at either key.
*/
public static List<String> getWordList(Environment env, Settings settings, String settingPath, String settingList) {
String wordListPath = settings.get(settingPath, null);

if (wordListPath == null) {
List<String> explicitWordList = settings.getAsList(settingPrefix, null);
List<String> explicitWordList = settings.getAsList(settingList, null);
if (explicitWordList == null) {
return null;
} else {
@@ -239,10 +250,10 @@ public static List<String> getWordList(Environment env, Settings settings, Strin
} catch (CharacterCodingException ex) {
String message = String.format(Locale.ROOT,
"Unsupported character encoding detected while reading %s_path: %s - files must be UTF-8 encoded",
settingPrefix, path.toString());
settingPath, path.toString());
throw new IllegalArgumentException(message, ex);
} catch (IOException ioe) {
String message = String.format(Locale.ROOT, "IOException while reading %s_path: %s", settingPrefix, path.toString());
String message = String.format(Locale.ROOT, "IOException while reading %s_path: %s", settingPath, path.toString());
throw new IllegalArgumentException(message, ioe);
}
}
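The control flow of the generalized `getWordList` above — prefer the file at the path key, otherwise fall back to the inline list, and reject setting both — can be sketched outside Elasticsearch with a plain map standing in for `Settings` (`WordListSettings` and the key names are assumptions for illustration; the real code wraps I/O failures in `ElasticsearchException` rather than `UncheckedIOException`):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the getWordList fallback: a file path setting wins,
// otherwise the inline word list is used; setting both is rejected, as in the PR.
public class WordListSettings {
    @SuppressWarnings("unchecked")
    static List<String> getWordList(Map<String, Object> settings, String pathKey, String listKey) {
        Object path = settings.get(pathKey);
        if (path != null && settings.get(listKey) != null) {
            throw new IllegalArgumentException("It is not allowed to use [" + pathKey
                + "] in conjunction with [" + listKey + "]");
        }
        if (path == null) {
            // May be null when neither key is set; callers treat that as "no dictionary".
            return (List<String>) settings.get(listKey);
        }
        try {
            return Files.readAllLines(Path.of((String) path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```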