[RFC] Text chunking design #548
Comments
Could you provide examples of the index after chunking? In your example (1000 tokens, overlap 0.5), will there be three KNN index records created? Are they all pointing to the same source data?
In the example where the raw document has 1000 tokens, the token limit is 500, and the overlap degree is 0.5, we will get 3 documents. The tokens range from 1 ~ 500, 251 ~ 750, and 501 ~ 1000. Suppose the input field is
Are there any connections among those three documents? Or are they just three independent docs?
These three documents are chunked results from the raw document. The first document takes the 1st token to the 500th. The second document takes the 251st token to the 750th. The third document takes the 501st token to the 1000th. Once the raw document gets chunked into three passages, we can consider them independent and apply downstream processors.
I see. I wonder if we can store the relationship among those docs so we can also return the full doc for user reference.
Suppose the input field holds the raw document and the chunked passages are written to a separate output field. The full doc is then still available in the input field.
Should we be able to combine delimiters with fixed token length? Also, can you please show some actual concrete examples of input and output? Currently the description of the RFC only shows various configurations; if you could show input/output examples, that would be great and would allow factoring in the edge cases.
If we build a KNN field after chunking, we still need the model_id to convert text to vectors.
Which tokenizer will be used? Can users customize the tokenizer?
Nice suggestions. I will elaborate more on my RFC. I will ping you when I update this doc.
We can apply a subsequent text embedding ingest processor to build the KNN field.
I think we can start with the built-in tokenizers listed in https://opensearch.org/docs/latest/analyzers/tokenizers/index/.
Based on the table, the tokenization results of these tokenizers are word-level, and they will drop some special characters like "-", ",", "(".
Agreed. Let me check whether tokenizers for the supported pretrained models in OpenSearch are available.
Hi @samuel-oci! I have updated the document with a few examples for the fixed token length and delimiter algorithms. Please take a look.
I think not. We came up with two solutions for the corner cases of overly long paragraphs.
For overly short paragraphs, I cannot come up with a perfect solution that eliminates all corner cases. Maybe we can try to merge the paragraph with the previous or the subsequent paragraph. Do you have any better ideas?
English tokenizers do not differ much from each other. For now, I think we can just implement the built-in tokenizers with the analyzer.
Can you provide an example of how to chain this processor together with the embedding processors? I expect that one should be able to do the following: data source / ingest processor > (in: string) text chunker (mapped output: string array?) > (in: string array?) embedding/inference processor (mapped output: nested field vector). Specifically, I would like clarity on the interface design for mapping the outputted text chunks to an inference processor to produce nested vector chunks.
Can you also describe your proposal for extensibility? Like the embedding processors, I would like the option to run this workload off-cluster (adding @ylwu-amzn). Ideally, we enhance the connector framework in ml-commons so that this processor is an abstraction which could be executed either through an external tool integrated through a connector (e.g. Apache Spark, Data Prepper/OpenSearch Ingestion Service, Kafka, etc.) or on the OS cluster.
Sure, I will include this example into my RFC document.
The proposal does not break any features in extensibility. Here, extensibility means the remote inference feature in the ml-commons plugin. The document chunking processor is just a preprocessing step before the text embedding processor, which provides an abstraction over remote inference. The model id specified in the embedding processor can be either a local model or a remote model.
@yuye-aws, as described in the first line of this feature brief, inference was the "first phase". There are more details provided in internal documents. The long-term intent is to create an extensibility framework that allows OpenSearch to be embedded into the entire ML lifecycle. This requires support for our users' existing ML tools. Thus, we need the ability to integrate with tools from data preparation to model training and tuning, not just "remote inference". With that said, "text chunking" support should be designed in a way:
With that said, I think it's valuable to provide an abstraction similar to what was provided for ML inference. This allows users to run the text chunking and embedding ingestion pipeline on the cluster for convenience during development, and gives them the ability to run the pipeline off cluster with their choice of data processing tools in production.
The first step is to chunk one field in the index based on the user's configuration. We are not supporting other open source software like Apache Spark and Data Prepper. We prefer to use the ingestion pipeline, which is very convenient.
Our proposed chunking processor is just a feature to split a string into a list of passages. Unlike inference models, chunking processors do not require much computational resource (no GPU is needed). It is already efficient enough to perform the chunking operation inside the OpenSearch cluster.
I have updated the RFC document with an example. You can refer to the Text Embedding use case for more details.
Using the Standard analyzer provided by Lucene can cause a problem: the analyzer can only produce 100 tokens. See this failing IT: https://github.com/opensearch-project/neural-search/actions/runs/8241329594/job/22538377969?pr=607. Looking at the text chunks which users will be creating, I can easily see the limit of 100 getting breached. What is our plan for fixing it? cc: @model-collapse , @yuye-aws , @zane-neo
By default, the analyzer can produce 10,000 tokens. If users want to generate more tokens, they can update their index settings like this:
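For reference, a sketch of such a settings update, assuming the relevant setting is index.analyze.max_token_count (the index name and value here are illustrative):

```json
PUT /my-index/_settings
{
  "index.analyze.max_token_count": 20000
}
```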
I am looking into the failing IT. It is taking some time because the test passes locally.
This makes the processor dependent on the index settings. Is this the right design choice?
The dependency of the chunking processor on index settings is due to the tokenization analyzer in the fixed token length algorithm: the analyzer needs to fetch the maximum token count from the index settings. Besides, the chunking processor is part of the ingestion pipeline. When ingesting documents with a configured pipeline, users should be aware which index they are ingesting documents into, so they should be able to configure the index settings.
Good news! I have just fixed the failing integration test.
@yuye-aws can we close this GH issue as the feature is released in 2.13?
We still need to implement the Markdown algorithm in the future. Shall I include the algorithm design in this RFC or raise a new issue?
Having a separate issue would be great. As most of the features in this RFC are completed, we should close this out.
Sure. I will close this issue.
Background
In the neural-search plugin, documents are first translated into vectors by embedding models via ingestion processors. These embedding models usually have a suggested token limit. For example, the sparse encoding model
amazon/neural-sparse/opensearch-neural-sparse-encoding-v1
has a token limit of 512. Truncating tokens for long documents leads to information loss. This problem is tracked in issue #482, and to solve it we propose adding a chunking processor to ingestion pipelines. In OpenSearch 2.13, we are planning to release two algorithms: Fixed Token Length and Delimiter. The Markdown algorithm will be available in the 2.14 release.
Different Options
Option 1
Implement the chunking processor with both chunking algorithms and text embedding. The input is a long document to be chunked and the output is an array with the text embeddings of the chunked passages.
Pros
Cons
Option 2 (Selected)
Implement the chunking processor with chunking algorithms alone. The input is a long document to be chunked and the output is the chunked passages. As we prefer this option, the following use cases and interface design are based on it.
Pros
Cons
Use Cases
Text Embedding
We can chain the chunking processor with the text_embedding processor to obtain embedding vectors for each chunked passage. Here is an example:
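A sketch of such a pipeline, assuming the chunking processor parameters described in the Parameters section below (algorithm_name, token_limit) and an illustrative field mapping from the input field to an output field; the exact request layout may differ in the released version:

```json
PUT _ingest/pipeline/chunking-embedding-pipeline
{
  "description": "Chunk the passage_text field, then embed each chunk",
  "processors": [
    {
      "chunking": {
        "field_map": { "passage_text": "passage_chunk" },
        "algorithm_name": "fixed_token_length",
        "token_limit": 384
      }
    },
    {
      "text_embedding": {
        "model_id": "<embedding model id>",
        "field_map": { "passage_chunk": "passage_chunk_embedding" }
      }
    }
  ]
}
```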
And we obtain the following results:
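Roughly, the ingested document would then carry both the chunked passages and one embedding per passage, along these lines (the structure is illustrative):

```json
{
  "passage_text": "<the original long document>",
  "passage_chunk": [
    "first chunked passage ...",
    "second chunked passage ..."
  ],
  "passage_chunk_embedding": [
    { "knn": [0.1, 0.2, 0.3] },
    { "knn": [0.4, 0.5, 0.6] }
  ]
}
```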
Cascaded Chunking Processors
Users can chain multiple chunking processors together. For example, if a user wishes to split documents by paragraphs, they can apply the Delimiter algorithm and specify the parameter to be "\n\n". In case a paragraph exceeds the token limit, the user can then append another chunking processor with the Fixed Token Length algorithm. The ingestion pipeline in this example should be configured like:
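A sketch of such a cascaded configuration, under the same assumptions about parameter and field names as the example above:

```json
PUT _ingest/pipeline/cascaded-chunking-pipeline
{
  "processors": [
    {
      "chunking": {
        "field_map": { "passage_text": "passage_paragraph" },
        "algorithm_name": "delimiter",
        "delimiter": "\n\n"
      }
    },
    {
      "chunking": {
        "field_map": { "passage_paragraph": "passage_chunk" },
        "algorithm_name": "fixed_token_length",
        "token_limit": 384
      }
    }
  ]
}
```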
Interface Design
Let’s take an overview of the class hierarchy related to the text chunking processor. Three chunkers (FixedTokenLengthChunker, DelimiterChunker and MarkdownChunker) will implement the Chunker interface. All three chunkers register themselves in the chunker factory. Upon initialization, the text chunking processor instantiates a chunker object and validates the parameters. When performing chunking, it also creates a chunker object and then applies the chunking algorithm.
Chunker Interface
This is the interface for developers to implement their chunking algorithms. All concrete chunkers should implement this interface, which includes two methods: validateParameters and chunk.
Chunker factory
The class ChunkerFactory provides two methods: create and getAllChunkers. The create method takes a String indicating the type of chunking algorithm and returns an instantiated Chunker object. The getAllChunkers method returns a set of strings indicating the available chunking algorithms, which enables the chunking processor to validate input parameters.
Chunking Algorithms
Intuitively, documents can be chunked according to token length or delimiter. We also provide a hierarchical segmentation algorithm as a commonly used method for markdown-formatted text.
Fixed Token Length
Chunking by token length is a simple solution. Here, documents are split into several smaller parts. As chunked passages would begin and end abruptly, users can specify the degree of overlap to retain more context: the current passage includes the ending part of the previous passage.
Suppose that a document is 2000 tokens long, the token limit is set to 500, and the overlap degree is 0.5, meaning that the first 50% of tokens in the current passage are the last 50% of tokens from the previous passage. Therefore, the first passage starts at the 1st token and ends with the 500th token, the second passage starts at the 251st token and ends with the 750th token, and so on.
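To make the arithmetic concrete, the passage boundaries for this configuration work out as follows:

```
token_limit = 500, overlap degree = 0.5  ->  step of 250 tokens between passage starts
passage 1: tokens    1 -  500
passage 2: tokens  251 -  750
passage 3: tokens  501 - 1000
passage 4: tokens  751 - 1250
passage 5: tokens 1001 - 1500
passage 6: tokens 1251 - 1750
passage 7: tokens 1501 - 2000
```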
Example 1
Example 2
Delimiter
Chunking by delimiter targets the scenario of segmenting by sentences or paragraphs, which are explicitly marked by punctuation. For example, we can segment English sentences using ‘.’ and chunk paragraphs using ‘\n’ or ‘\n\n’. The delimiter appears at the end of each chunk.
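For instance, splitting on ‘.’ behaves roughly as follows (the exact handling of whitespace between sentences is an implementation detail):

```
delimiter: "."
input : "OpenSearch is a search engine. It is open source. It supports neural search."
output: ["OpenSearch is a search engine.", " It is open source.", " It supports neural search."]
```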
Example 1
Example 2
Example 3
Markdown
This is a dedicated algorithm for markdown files. The hierarchical structure within a markdown document provides related context for subtitle passages. We can construct a tree based on the title levels in the doc. Given a node in the tree, we include the titles and contents along its path to the root title, including all ancestor nodes. Users can configure the max depth of a tree node. Here is a simple example of a markdown file:
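A minimal illustrative sample (the title names are placeholders only):

```markdown
# Title 1
Content under title 1.

# Title 2
Content under title 2.

## Title 2.1
Content under title 2.1.

## Title 2.2
Content under title 2.2.
```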
For this illustrative sample, the constructed tree would look roughly like:
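```
(document root)
├── Title 1
└── Title 2
    ├── Title 2.1
    └── Title 2.2
```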
For Title 2.2, the expected title should include the titles along its path from the root. The expected content should also include the title.
Please be aware that this algorithm only applies to documents with markdown format.
API
The following are a few examples to create a pipeline for different algorithms.
Fixed Token Length
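A sketch of creating a pipeline with the fixed token length algorithm; the parameter names follow this RFC (algorithm_name, token_limit), while the overlap and tokenizer parameter names below are assumptions based on the discussion above:

```json
PUT _ingest/pipeline/fixed-token-length-chunking-pipeline
{
  "processors": [
    {
      "chunking": {
        "field_map": { "passage_text": "passage_chunk" },
        "algorithm_name": "fixed_token_length",
        "token_limit": 384,
        "overlap_degree": 0.2,
        "tokenizer": "standard"
      }
    }
  ]
}
```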
Delimiter
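A sketch for the delimiter algorithm, under the same naming assumptions:

```json
PUT _ingest/pipeline/delimiter-chunking-pipeline
{
  "processors": [
    {
      "chunking": {
        "field_map": { "passage_text": "passage_chunk" },
        "algorithm_name": "delimiter",
        "delimiter": "\n\n"
      }
    }
  ]
}
```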
Markdown
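A sketch for the planned Markdown algorithm; the max_depth parameter name is an assumption based on the "max depth of tree node" setting described above:

```json
PUT _ingest/pipeline/markdown-chunking-pipeline
{
  "processors": [
    {
      "chunking": {
        "field_map": { "passage_text": "passage_chunk" },
        "algorithm_name": "markdown",
        "max_depth": 3
      }
    }
  ]
}
```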
Parameters
The following table lists the required and optional parameters for the chunking processor.
We set the default token_limit to 384 because most models have a token limit of 512. The standard tokenizer in OpenSearch mainly tokenizes text by words, and according to OpenAI, one token corresponds to approximately 0.75 words of English text. That is why we set 512 * 0.75 = 384 as the default token limit.
The algorithm_name parameter can be either fixed_token_length or delimiter.