feat: implement text chunking processor with fixed token length and delimiter algorithm #607
For now, this PR is a POC for the RFC. I will mark it as ready once we finalize the high-level design and add the corresponding unit tests and integration tests.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

Coverage diff, main vs. #607:

Metric       main     #607     +/-
Coverage     82.62%   84.19%   +1.56%
Complexity   666      743      +77
Files        52       59       +7
Lines        2072     2309     +237
Branches     334      370      +36
Hits         1712     1944     +232
Misses       212      214      +2
Partials     148      151      +3
Hi @zane-neo! I have modified the PR according to your comments. Feel free to review my code.
Thank you for the draft, @yuye-aws. I would like us to follow the upcoming new feature release process:
- Let's make sure all feature spec feedback is collected in the RFC: [RFC] Text chunking design #548
- Let's create a meta issue with the design (I can create one and link it)
- We will then move forward with the changes
Do you mean the high-level design of the document chunking processor? Is the Interface Design section in the RFC what you are looking for?
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
src/main/java/org/opensearch/neuralsearch/processor/TextChunkingProcessor.java
private static final Set<String> WORD_TOKENIZERS = Set.of(
    "standard",
    "letter",
    "lowercase",
    "whitespace",
    "uax_url_email",
    "classic",
    "thai"
);
For now, let's not support any custom tokenizers here, to avoid tokenizers that produce overlapping tokens. We can add a smarter tokenizer check later.
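A minimal sketch of how such an allow-list check could look, assuming the tokenizer name arrives as a plain string parameter. The class name `TokenizerCheckSketch`, the helper `validateTokenizer`, and the error wording are hypothetical and not taken from the PR; only the set of tokenizer names comes from the snippet above.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class TokenizerCheckSketch {
    // Word tokenizers copied from the snippet under review.
    private static final Set<String> WORD_TOKENIZERS = Set.of(
        "standard", "letter", "lowercase", "whitespace", "uax_url_email", "classic", "thai"
    );

    // Hypothetical helper: returns the name unchanged if it is on the allow-list,
    // otherwise throws with a message that lists the supported tokenizers.
    static String validateTokenizer(String tokenizer) {
        // Set.of(...) rejects null lookups with a NullPointerException, so check null first.
        if (tokenizer == null || !WORD_TOKENIZERS.contains(tokenizer)) {
            throw new IllegalArgumentException(
                String.format(
                    Locale.ROOT,
                    "Tokenizer [%s] is not supported. Supported tokenizers: %s",
                    tokenizer,
                    new TreeSet<>(WORD_TOKENIZERS) // sorted for a stable message
                )
            );
        }
        return tokenizer;
    }

    public static void main(String[] args) {
        System.out.println(validateTokenizer("standard"));
        try {
            validateTokenizer("ngram");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Rejecting unknown names at parameter-parsing time keeps the failure close to the misconfiguration instead of surfacing it later during ingestion.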
throw new IllegalStateException(
    String.format(Locale.ROOT, "%s algorithm encounters exception in tokenization: %s", ALGORITHM_NAME, e.getMessage()),
It is ok to include the original message, but the wording is too simple. We need to explain why this is happening.
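One possible shape for a more explanatory message, as a sketch only. The helper `tokenizationFailure`, the `tokenizer` argument, the suggested remediation text, and the `ALGORITHM_NAME` value are assumptions for illustration; the PR's final wording may differ.

```java
import java.io.IOException;
import java.util.Locale;

final class TokenizationErrorSketch {
    // Assumed value; the snippet above only shows the constant's name.
    static final String ALGORITHM_NAME = "fixed_token_length";

    // Hypothetical helper: says what failed, hints at likely reasons,
    // and keeps the original exception as the cause so the stack trace survives.
    static IllegalStateException tokenizationFailure(String tokenizer, Exception cause) {
        String message = String.format(
            Locale.ROOT,
            "%s algorithm failed to tokenize the input with tokenizer [%s]. "
                + "Check that the tokenizer is supported and that the field contains valid text. "
                + "Original error: %s",
            ALGORITHM_NAME,
            tokenizer,
            cause.getMessage()
        );
        return new IllegalStateException(message, cause);
    }

    public static void main(String[] args) {
        IllegalStateException e = tokenizationFailure("standard", new IOException("stream closed"));
        System.out.println(e.getMessage());
    }
}
```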
…elimiter algorithm (#607) * implement chunking processor and fixed token length Signed-off-by: yuye-aws <yuyezhu@amazon.com> * initialize node client for document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * initialize document chunking processor with analysis registry Signed-off-by: yuye-aws <yuyezhu@amazon.com> * chunker factory create with analysis registry Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement tokenizer in fixed token length algorithm with analysis registry Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add max token count parsing logic Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix for non-existing index Signed-off-by: yuye-aws <yuyezhu@amazon.com> * change error log Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement evenly chunk Signed-off-by: yuye-aws <yuyezhu@amazon.com> * unit tests for chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * unit tests for chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add error message for chunker factory tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * Revert "implement evenly chunk" This reverts commit 93dd2f4. 
Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add default value logic back Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement unit test for fixed token length chunker Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add test cases in unit test for fixed token length chunker Signed-off-by: yuye-aws <yuyezhu@amazon.com> * support map type as an input Signed-off-by: yuye-aws <yuyezhu@amazon.com> * support map type as an input Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix for map type Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix for map type Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix for map type in document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove system out println Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add delimiter chunker Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add UT for delimiter chunker Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add delimiter chunker processor Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add more UTs Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add more UTs Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * basic unit tests for document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix tests for getProcessors in neural search Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit tests with string, map and nested map type for document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit tests for parameter valdiation in document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add back deleted xml file Signed-off-by: yuye-aws <yuyezhu@amazon.com> * restore xml file Signed-off-by: yuye-aws <yuyezhu@amazon.com> * integration tests for document chunking processor Signed-off-by: yuye-aws 
<yuyezhu@amazon.com> * add back Run_Neural_Search.xml Signed-off-by: yuye-aws <yuyezhu@amazon.com> * restore Run_Neural_Search.xml Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add changelog Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update integration test for cascade processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add max chunk limit Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove useless and apply spotless Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update error message Signed-off-by: yuye-aws <yuyezhu@amazon.com> * change field UT Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove useless and apply spotless Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * change logic of max chunk number Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add max chunk limit into fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * Support list<list<string>> type in embedding and extract validation logic to common class Signed-off-by: zane-neo <zaniu@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix unit tests for inference processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement unit tests for unit tests with max_chunk_limit in fixed token length Signed-off-by: yuye-aws <yuyezhu@amazon.com> * constructor for inference processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * use inference processor Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * draft code for extending inference processor with document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * api refactor for document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove nested list key for chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove 
unused function Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove processor validator Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove processor validator Signed-off-by: yuye-aws <yuyezhu@amazon.com> * Revert InferenceProcessor.java Signed-off-by: Yuye Zhu <yuyezhu@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * revert changes in text embedding and sparse encoding processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement chunk with map in document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add default delimiter value Signed-off-by: Lu <xinyual@88665a36eec8.ant.amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement max chunk logic in document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add initial value for max chunk limit in document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix in chunking processor: allow 0 max_chunk_limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement overlap rate with big decimal Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update max chunk limit in delimiter Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update parameter setting for fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update max chunk limit implementation in chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix unit tests for fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * spotless apply for document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * initialize current chunk count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * parameter validation for max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix integration tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix current UT Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * change delimiter UT Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws 
<yuyezhu@amazon.com> * remove delimiter useless code Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add more UT Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add UT for list inside map Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add UT for list inside map Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update unit tests for chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add more unit tests for chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix import order Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix java doc error Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix update ut for fixed token length chunker Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement chunk count wrapper for max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * rename variable end to nextDelimiterPosition Signed-off-by: yuye-aws <yuyezhu@amazon.com> * adjust method place Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc for fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * reanme interface name and fixed token length algorithm name Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update fixed token length 
algorithm configuration for integration tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * make delimiter member variables static Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove redundant set field value in execute method Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add integration tests with more tokenizers Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix: unit test failure due to invalid tokenizer Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix: token concatenation in fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update chunker interface Signed-off-by: yuye-aws <yuyezhu@amazon.com> * track chunkCount within function Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix: allow white space as the delimiter Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix fixed length chunker Signed-off-by: xinyual <xinyual@amazon.com> * fix delimiter chunker Signed-off-by: xinyual <xinyual@amazon.com> * fix chunker factory Signed-off-by: xinyual <xinyual@amazon.com> * fix UTs Signed-off-by: xinyual <xinyual@amazon.com> * fix UT and chunker factory Signed-off-by: xinyual <xinyual@amazon.com> * move analysis_registry to non-runtime parameters Signed-off-by: xinyual <xinyual@amazon.com> * fix Uts Signed-off-by: xinyual <xinyual@amazon.com> * avoid java doc change Signed-off-by: xinyual <xinyual@amazon.com> * move validate to commonUtlis Signed-off-by: xinyual <xinyual@amazon.com> * remove useless function Signed-off-by: xinyual <xinyual@amazon.com> * change java doc Signed-off-by: xinyual <xinyual@amazon.com> * fix Document process ut Signed-off-by: xinyual <xinyual@amazon.com> * fixed token length: re-implement with start and end offset Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update exception message Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix document chunking processor IT Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix: adjust 
start, end content position in fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update changelog for 2.x release Signed-off-by: yuye-aws <yuyezhu@amazon.com> * rename processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update default delimiter to be \n\n Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove change log in 3.0 unreleased Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix IT failure due to chunking processor rename Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update javadoc for text chunking processor factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * adjust functions in chunker interface Signed-off-by: yuye-aws <yuyezhu@amazon.com> * move algorithm name definition to concrete chunker class Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update string formatted message for text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update string formatted message for chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update string formatted message for chunker parameter validator Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc for delimiter algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * support range double in chunker parameter validator Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update string formatted message for fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update sneaky throw with text chunking processor it Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add word tokenizer restriction for fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update error message for multiple algorithms in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add comment in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * validate max chunk limit with util parameter class Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update 
comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * make parameter final Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement a map from chunker name to constuctor function in chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix in chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove get all chunkers in chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove type check for parameter check for max token count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove type check for parameter check for analysis registry Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement parser and validator Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * provide fixed token length as the default algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * adjust exception message Signed-off-by: yuye-aws <yuyezhu@amazon.com> * adjust exception message Signed-off-by: yuye-aws <yuyezhu@amazon.com> * use object nonnull and require nonnull Signed-off-by: yuye-aws <yuyezhu@amazon.com> * apply final to ingest document and chunk count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * merge parameter validator into the parser Signed-off-by: yuye-aws <yuyezhu@amazon.com> * assign positive default value for max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * validate supported chunker algorithm in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update parameter setting of max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit test with non list of string Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit test with null input Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit test for tokenization excpetion in fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune method name 
in text chunking processor unit test Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune method name in delimiter algorithm unit test Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit test for overlap rate too small in fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune method modifier for all classes Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune code Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune code Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune exception type in parameter parser Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * include max chunk limit in both algorithms Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * allow 0 for max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update runtime max chunk limit in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune code for chunker Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement test for multiple field max chunk limit exceed Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune methods name in text chunking proceesor unit tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit tests for both algorithms with max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * optimize code Signed-off-by: yuye-aws <yuyezhu@amazon.com> * extract max chunk limit check to util class Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix unit tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix: only update runtime max chunk limit when enabled Signed-off-by: yuye-aws <yuyezhu@amazon.com> --------- Signed-off-by: yuye-aws <yuyezhu@amazon.com> Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: zane-neo <zaniu@amazon.com> Signed-off-by: Yuye Zhu <yuyezhu@amazon.com> 
Signed-off-by: Lu <xinyual@88665a36eec8.ant.amazon.com> Co-authored-by: xinyual <xinyual@amazon.com> Co-authored-by: zane-neo <zaniu@amazon.com> Co-authored-by: Lu <xinyual@88665a36eec8.ant.amazon.com> (cherry picked from commit eea53aa)
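The two algorithms named in the commit message can be illustrated with a simplified sketch. This is not the PR's implementation: it splits on whitespace instead of using an analysis-registry tokenizer, it does not model the max chunk limit, and the choice to keep the delimiter at the end of each chunk and to cap the overlap rate at 0.5 are assumptions. The names `ChunkingSketch`, `chunkByDelimiter`, and `chunkByTokenLimit` are hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChunkingSketch {
    // Delimiter algorithm (sketch): cut after every delimiter occurrence,
    // keeping the delimiter at the end of the chunk it terminates.
    static List<String> chunkByDelimiter(String content, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        int nextDelimiterPosition = content.indexOf(delimiter);
        while (nextDelimiterPosition != -1) {
            int end = nextDelimiterPosition + delimiter.length();
            chunks.add(content.substring(start, end));
            start = end;
            nextDelimiterPosition = content.indexOf(delimiter, start);
        }
        if (start < content.length()) {
            chunks.add(content.substring(start)); // trailing text without a delimiter
        }
        return chunks;
    }

    // Fixed token length algorithm (sketch): windows of tokenLimit tokens,
    // where consecutive windows share floor(tokenLimit * overlapRate) tokens.
    static List<String> chunkByTokenLimit(String content, int tokenLimit, double overlapRate) {
        if (tokenLimit <= 0 || overlapRate < 0.0 || overlapRate > 0.5) {
            throw new IllegalArgumentException("tokenLimit must be positive and overlapRate within [0, 0.5]");
        }
        String trimmed = content.trim();
        List<String> chunks = new ArrayList<>();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] tokens = trimmed.split("\\s+"); // stand-in for a real tokenizer
        // BigDecimal avoids floating-point surprises when computing the overlap size.
        int overlap = BigDecimal.valueOf(overlapRate)
            .multiply(BigDecimal.valueOf(tokenLimit))
            .setScale(0, RoundingMode.DOWN)
            .intValue();
        int step = tokenLimit - overlap; // always >= 1 because overlapRate <= 0.5
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + tokenLimit, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) {
                break; // last window reached the end of the input
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunkByDelimiter("p1\n\np2\n\np3", "\n\n").size());
        System.out.println(chunkByTokenLimit("a b c d e f", 4, 0.5));
    }
}
```

With a token limit of 4 and an overlap rate of 0.5, each window advances by two tokens, so neighbouring chunks share their last and first two tokens.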
…en length and delimiter algorithm (#644) * feat: implement text chunking processor with fixed token length and delimiter algorithm (#607)
in text chunking processor unit test Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune method name in delimiter algorithm unit test Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit test for overlap rate too small in fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune method modifier for all classes Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune code Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune code Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune exception type in parameter parser Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * include max chunk limit in both algorithms Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * allow 0 for max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update runtime max chunk limit in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune code for chunker Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement test for multiple field max chunk limit exceed Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune methods name in text chunking proceesor unit tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit tests for both algorithms with max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * optimize code Signed-off-by: yuye-aws <yuyezhu@amazon.com> * extract max chunk limit check to util class Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix unit tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix: only update runtime max chunk limit when enabled Signed-off-by: yuye-aws <yuyezhu@amazon.com> --------- Signed-off-by: yuye-aws <yuyezhu@amazon.com> Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: zane-neo <zaniu@amazon.com> Signed-off-by: Yuye Zhu <yuyezhu@amazon.com> 
Signed-off-by: Lu <xinyual@88665a36eec8.ant.amazon.com> Co-authored-by: xinyual <xinyual@amazon.com> Co-authored-by: zane-neo <zaniu@amazon.com> Co-authored-by: Lu <xinyual@88665a36eec8.ant.amazon.com> (cherry picked from commit eea53aa) * bug fix: fix compile error in integration test (#645) Signed-off-by: yuye-aws <yuyezhu@amazon.com> --------- Signed-off-by: yuye-aws <yuyezhu@amazon.com> Co-authored-by: Yuye Zhu <yuyezhu@amazon.com>
// chunk the object when target key is of leaf type (null, string and list of string)
Object chunkObject = sourceAndMetadataMap.get(originalKey);
List<String> chunkedResult = chunkLeafType(chunkObject, runtimeParameters);
sourceAndMetadataMap.put(String.valueOf(targetKey), chunkedResult);
`sourceAndMetadataMap` contains some metadata fields such as `_index`, `_routing` and `_id`. If the `targetKey` equals the name of a metadata field, the chunked result may accidentally overwrite that field.
A simple solution is to prohibit targetKey from starting with "_".
Let me check the behavior of other ingestion processors.
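For reference, the guard suggested in this thread could look roughly like the sketch below. It is only an illustration of the idea, not code from this PR: the class name `TargetKeyGuard`, the method `validateTargetKey`, and the exception messages are all hypothetical, and the thread does not say whether other ingest processors apply the same restriction.

```java
class TargetKeyGuard {

    // Hypothetical validation helper: rejects target keys that could collide
    // with metadata fields such as _index, _routing and _id by refusing any
    // key that starts with an underscore.
    static void validateTargetKey(String targetKey) {
        if (targetKey == null || targetKey.isEmpty()) {
            throw new IllegalArgumentException("target key must be a non-empty string");
        }
        if (targetKey.startsWith("_")) {
            throw new IllegalArgumentException(
                "target key [" + targetKey + "] must not start with '_' because it may overwrite a metadata field"
            );
        }
    }

    public static void main(String[] args) {
        validateTargetKey("body_chunk"); // accepted: does not start with an underscore
        try {
            validateTargetKey("_id");    // rejected: would collide with a metadata field
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Rejecting every key with a leading underscore is stricter than checking against a fixed list of metadata names, but it does not need to be updated if more metadata fields are added.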
Description
This PR implements the text chunking processor proposed in the RFC. It provides two algorithms: the fixed token length algorithm and the delimiter algorithm. Users can use the chunking ingest processor as follows:
And then obtain the response:
Refer to the RFC for detailed parameter descriptions.
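As a rough illustration of what the fixed token length algorithm does, here is a minimal, self-contained sketch. It is not the implementation in this PR: it assumes whitespace tokenization instead of an OpenSearch analyzer, and the names `FixedTokenLengthSketch`, `tokenLimit`, and `overlapRate`, as well as the 0 to 0.5 bound on the overlap rate, are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FixedTokenLengthSketch {

    // Splits text into chunks of at most tokenLimit whitespace-separated tokens.
    // Consecutive chunks share floor(tokenLimit * overlapRate) tokens.
    static List<String> chunk(String text, int tokenLimit, double overlapRate) {
        if (tokenLimit <= 0) {
            throw new IllegalArgumentException("tokenLimit must be positive");
        }
        if (overlapRate < 0.0 || overlapRate > 0.5) {
            // Assumed bound: keeps the step between chunk starts at least 1.
            throw new IllegalArgumentException("overlapRate must be between 0 and 0.5");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] tokens = trimmed.split("\\s+");
        int overlap = (int) Math.floor(tokenLimit * overlapRate);
        int step = tokenLimit - overlap;
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + tokenLimit, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) {
                break; // the last token is covered, so no further chunk is needed
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Token limit 4 with overlap rate 0.5: each chunk repeats the last 2 tokens of the previous one.
        System.out.println(chunk("a b c d e f g h i j", 4, 0.5));
    }
}
```

The overlap keeps some shared context between neighbouring chunks, at the cost of producing more chunks for the same input.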
Use Cases
Text Embedding
After configuring the text_embedding processor and obtaining the model ID, we can chain the chunking processor with the text_embedding processor to obtain an embedding vector for each chunked passage. Here is an example:
And we obtain the following results:
Cascaded Chunking Processors
Users can chain multiple chunking processors together. For example, if a user wishes to split documents by paragraph, they can apply the Delimiter algorithm and set the delimiter parameter to "\n\n". If a paragraph still exceeds the token limit, the user can append another chunking processor that uses the Fixed Token Length algorithm. The ingestion pipeline in this example should be configured like:
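Independently of the pipeline configuration, the effect of such a cascade can be sketched in plain Java. This is an illustrative approximation rather than the processor code: it assumes whitespace tokens for the second stage, keeps the delimiter at the end of each first-stage chunk (an assumption about the delimiter algorithm's behavior), and uses hypothetical names (`CascadedChunkingSketch`, `splitByDelimiter`, `splitByTokenLimit`, `cascade`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CascadedChunkingSketch {

    // Stage 1: split on the delimiter, keeping the delimiter at the end of each chunk.
    static List<String> splitByDelimiter(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        int next = text.indexOf(delimiter);
        while (next != -1) {
            int end = next + delimiter.length();
            chunks.add(text.substring(start, end));
            start = end;
            next = text.indexOf(delimiter, start);
        }
        if (start < text.length()) {
            chunks.add(text.substring(start)); // trailing text without a delimiter
        }
        return chunks;
    }

    // Stage 2: split into chunks of at most tokenLimit whitespace tokens (no overlap).
    static List<String> splitByTokenLimit(String text, int tokenLimit) {
        if (tokenLimit <= 0) {
            throw new IllegalArgumentException("tokenLimit must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start < tokens.length; start += tokenLimit) {
            int end = Math.min(start + tokenLimit, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
        }
        return chunks;
    }

    // Cascade: paragraphs first, then the token limit applied to every paragraph.
    static List<String> cascade(String text, String delimiter, int tokenLimit) {
        List<String> result = new ArrayList<>();
        for (String paragraph : splitByDelimiter(text, delimiter)) {
            result.addAll(splitByTokenLimit(paragraph, tokenLimit));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(cascade("one two three four five\n\nsix seven", "\n\n", 3));
    }
}
```

Paragraphs that already fit within the token limit pass through the second stage as a single chunk, so only over-long paragraphs are split further. Note that the second stage in this sketch trims whitespace, which drops the trailing delimiter kept by the first stage.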
Issues Resolved
Implement document chunking processor and fixed token length algorithm
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.