I am implementing the PatternCaptureGroupTokenFilter in my code to generate tokens based on multiple regular expressions, with the goal of highlighting any matches found within the string. I am using one of the latest Lucene jars (9.11.1) for PatternCaptureGroupTokenFilter, but the token positions are not as expected, making it difficult to accurately highlight the search results: the generated offsets are not usable for highlighting the searched query.
To generate different offsets, I tried using org.apache.lucene.analysis.tokenattributes.OffsetAttribute from the Lucene package. This gives me different offsets, but then I encounter another error while indexing the document.
I am using the Java code below.
```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.CharsRefBuilder;

public final class PatternCaptureGroupTokenFilter extends TokenFilter {

  private final CharTermAttribute charTermAttr = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posAttr = addAttribute(PositionIncrementAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final TypeAttribute typeAttribute = addAttribute(TypeAttribute.class);

  private State state;
  private final Matcher[] matchers;
  private final CharsRefBuilder spare = new CharsRefBuilder();
  private final int[] groupCounts;
  private final boolean preserveOriginal;
  private int[] currentGroup;
  private int currentMatcher;
  private int mainTokenStart; // start offset of the current parent token
  private int mainTokenEnd;   // end offset of the current parent token

  public PatternCaptureGroupTokenFilter(TokenStream input,
      boolean preserveOriginal, Pattern... patterns) {
    super(input);
    this.preserveOriginal = preserveOriginal;
    this.matchers = new Matcher[patterns.length];
    this.groupCounts = new int[patterns.length];
    this.currentGroup = new int[patterns.length];
    for (int i = 0; i < patterns.length; i++) {
      this.matchers[i] = patterns[i].matcher("");
      this.groupCounts[i] = this.matchers[i].groupCount();
      this.currentGroup[i] = -1;
    }
  }

  /** Advances to the next capture group across all matchers, lowest start offset first. */
  private boolean nextCapture() {
    int minOffset = Integer.MAX_VALUE;
    currentMatcher = -1;
    Matcher matcher;
    for (int i = 0; i < matchers.length; i++) {
      matcher = matchers[i];
      if (currentGroup[i] == -1) {
        currentGroup[i] = matcher.find() ? 1 : 0;
      }
      if (currentGroup[i] != 0) {
        while (currentGroup[i] < groupCounts[i] + 1) {
          final int start = matcher.start(currentGroup[i]);
          final int end = matcher.end(currentGroup[i]);
          if (start == end || preserveOriginal && start == 0
              && spare.length() == end) {
            currentGroup[i]++;
            continue;
          }
          if (start < minOffset) {
            minOffset = start;
            currentMatcher = i;
          }
          break;
        }
        if (currentGroup[i] == groupCounts[i] + 1) {
          currentGroup[i] = -1;
          i--;
        }
      }
    }
    return currentMatcher != -1;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (currentMatcher != -1 && nextCapture()) {
      assert state != null;
      clearAttributes();
      restoreState(state);
      final int start = matchers[currentMatcher]
          .start(currentGroup[currentMatcher]);
      final int end = matchers[currentMatcher]
          .end(currentGroup[currentMatcher]);

      // modified code starts: shift the group-relative offsets by the
      // parent token's start offset so they point into the full field
      mainTokenStart = offsetAtt.startOffset();
      mainTokenEnd = offsetAtt.endOffset();
      final int newStart = start + mainTokenStart;
      final int newEnd = end + mainTokenStart;
      offsetAtt.setOffset(newStart, newEnd);
      // modified code ends

      posAttr.setPositionIncrement(0);
      charTermAttr.copyBuffer(spare.chars(), start, end - start);
      currentGroup[currentMatcher]++;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    char[] buffer = charTermAttr.buffer();
    int length = charTermAttr.length();
    spare.copyChars(buffer, 0, length);
    state = captureState();
    for (int i = 0; i < matchers.length; i++) {
      matchers[i].reset(spare.get());
      currentGroup[i] = -1;
    }
    if (preserveOriginal) {
      currentMatcher = 0;
    } else if (nextCapture()) {
      final int start = matchers[currentMatcher]
          .start(currentGroup[currentMatcher]);
      final int end = matchers[currentMatcher]
          .end(currentGroup[currentMatcher]);
      // if we start at 0 we can simply set the length and save the copy
      if (start == 0) {
        charTermAttr.setLength(end);
      } else {
        charTermAttr.copyBuffer(spare.chars(), start, end - start);
      }
      currentGroup[currentMatcher]++;
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    state = null;
    currentMatcher = -1;
  }
}
```
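As a side note, the offset arithmetic in the "modified code" section above can be reproduced with plain java.util.regex, outside of Lucene. The sketch below (the patterns, input string, and class/method names are my own illustration, not the original analyzer configuration) shifts each capture group's offsets by the parent token's start offset, then checks the invariant Lucene's indexing chain enforces on offsets: startOffset must never decrease from one emitted token to the next ("offsets must not go backwards"), which is a plausible source of the indexing error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetSketch {

    /**
     * Emulates the modified filter's offset arithmetic for one parent token:
     * emits the parent span first, then every capture group's span shifted by
     * the parent token's start offset. Returns true if any startOffset
     * decreases relative to the previously emitted token.
     */
    static boolean offsetsGoBackwards(String token, int mainTokenStart, Pattern... patterns) {
        List<int[]> offsets = new ArrayList<>();
        // The original token is emitted first with its full span.
        offsets.add(new int[] { mainTokenStart, mainTokenStart + token.length() });

        for (Pattern p : patterns) {
            Matcher m = p.matcher(token);
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    // Same arithmetic as the modified code:
                    // group-relative offsets shifted by the parent token's start.
                    offsets.add(new int[] { m.start(g) + mainTokenStart, m.end(g) + mainTokenStart });
                }
            }
        }

        // Lucene's IndexWriter rejects a stream whose startOffset decreases.
        for (int i = 1; i < offsets.size(); i++) {
            if (offsets.get(i)[0] < offsets.get(i - 1)[0]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two hypothetical patterns that together yield "test", ":data" and "test:".
        Pattern p1 = Pattern.compile("(\\w+)(:\\w+)");
        Pattern p2 = Pattern.compile("(\\w+:)");
        // ":data" starts at offset 4, but "test:" then starts at 0 again.
        System.out.println(offsetsGoBackwards("test:data", 0, p1, p2)); // prints "true"
    }
}
```

With a single pattern whose groups appear left to right, the same check passes, which suggests the per-token offset shift is only safe when the emitted group spans stay ordered.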
Here are the tokens generated for the string:

```json
{
  "tokens": [
    { "token": "test:data", "start_offset": 0,  "end_offset": 9,  "type": "word", "position": 0 },
    { "token": "test",      "start_offset": 0,  "end_offset": 9,  "type": "word", "position": 0 },
    { "token": ":data",     "start_offset": 0,  "end_offset": 9,  "type": "word", "position": 0 },
    { "token": "test:",     "start_offset": 0,  "end_offset": 9,  "type": "word", "position": 0 },
    { "token": "test",      "start_offset": 10, "end_offset": 14, "type": "word", "position": 1 }
  ]
}
```

Every capture-group token inherits the parent token's offsets, so the generated offsets are not usable for highlighting the searched query.

I came across this GitHub issue for reference: #9820. However, I couldn't find a relevant solution in it, as it is still marked open.

Is this achievable? Please let me know if anyone has any suggestions.

Version and environment details

I am using Lucene version 9.6.0.
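For context on why the offsets matter: a highlighter slices the original field value by each token's startOffset/endOffset, so when every capture-group token reports the parent token's full span (0..9 above), a match on any sub-token highlights all of "test:data" instead of just the matched group. A minimal, Lucene-free sketch of that slicing (the class name, method name, and tag choice are my own illustration):

```java
public class HighlightSketch {

    /** Wraps the [start, end) slice of text in <em> tags, as a highlighter would. */
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
                + "<em>" + text.substring(start, end) + "</em>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String field = "test:data test";
        // With correct offsets for the second "test" token (10..14):
        System.out.println(highlight(field, 10, 14)); // test:data <em>test</em>
        // With the parent token's offsets (0..9), the wrong span is marked:
        System.out.println(highlight(field, 0, 9));   // <em>test:data</em> test
    }
}
```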