[RFC] Text chunking design #548
Comments
Could you provide examples of the index after chunking? In your example (1000 tokens, overlap 0.5), will there be three KNN index records created? Are they all pointing to the same source data?
In the example where the raw document has 1000 tokens, the token limit is 500, and the overlap degree is 0.5, we will get 3 documents. The tokens range from 1 ~ 500, 251 ~ 750, and 501 ~ 1000. Suppose the input field is
Are there any connections among those three documents? Or are they just three independent docs?
These three documents are chunked results from the raw document. The first document takes the 1st token to the 500th. The second document takes the 251st token to the 750th. The third document takes the 501st token to the 1000th. Once the raw document gets chunked into three passages, we can consider them independent and apply downstream processors.
I see. I wonder if we can store the relationship among those docs so we can also return the full doc for user reference.
Suppose the input field holds the raw document and the chunked passages are written to a separate output field. The full doc is then still available in the input field.
Should we be able to combine delimiters with fixed token length? Also, can you please show some actual concrete examples of input and output? Currently the description of the RFC only shows various configurations; if you could show input/output examples, that would be great and would allow factoring in the edge cases.
If we build a KNN field after chunking, we still need the model_id to convert text to vectors.
Which tokenizer will be used? Can users customize the tokenizer?
Nice suggestions. I will elaborate more on my RFC. I will ping you when I update this doc.
We can apply a subsequent text embedding ingest processor to build the KNN field.
I think we can start with the built-in tokenizers listed in https://opensearch.org/docs/latest/analyzers/tokenizers/index/.
Based on the table, the tokenization results of these tokenizers are word-level, and they will drop some special characters like "-", ",", "(".
Agreed. Let me check whether tokenizers for the supported pretrained models in OpenSearch are available.
Hi @samuel-oci! I have updated the document with a few examples for the fixed token length and delimiter algorithms. Please take a look.
I think not. We came up with two solutions for the corner cases of overly long paragraphs.
For overly short paragraphs, I cannot come up with a perfect solution that eliminates all corner cases. Maybe we can try to merge the paragraph with the previous or the subsequent paragraph. Do you have any better ideas?
English tokenizers do not differ much from each other. For now, I think we can just implement the built-in tokenizers with the analyzer.
Can you provide an example of how to chain this processor together with the embedding processors? I expect that one should be able to do the following: data source / ingest processor > (in: string) text chunker (mapped output: string array?) > (in: string array?) embedding/inference processor (mapped output: nested field vector). Specifically, I would like clarity on the interface design for mapping the outputted text chunks to an inference processor to produce nested vector chunks.
Can you also describe your proposal for extensibility? Like the embedding processors, I would like the option to run this workload off-cluster (adding @ylwu-amzn). Ideally, we enhance the connector framework in ml-commons so that this processor is an abstraction which could be executed either through an external tool integrated through a connector (e.g. Apache Spark, Data Prepper/OpenSearch Ingestion Service, Kafka, etc.) or on the OS cluster.
Sure, I will include this example into my RFC document.
The proposal does not break any features in extensibility. Here, extensibility means the remote inference feature in the ml-commons plugin. The document chunking processor is just a preprocessing step before the text embedding processor, which provides an abstraction over remote inference. The model id specified in the embedding processor can be either a local model or a remote model.
@yuye-aws, as described in the first line of this feature brief, inference was the "first phase". There are more details provided in internal documents. The long-term intent is to create an extensibility framework that allows OpenSearch to be embedded into the entire ML lifecycle. This requires support for our users' existing ML tools. Thus, we need the ability to integrate with tools from data preparation to model training and tuning, not just "remote inference". With that said, "text chunking" support should be designed in a way:
With that said, I think it's valuable to provide an abstraction similar to what was provided for ML inference. This allows users to run the text chunking and embedding ingestion pipeline on the cluster for convenience during development, and gives them the ability to run the pipeline off cluster with their choice of data processing tools in production.
The first step is to chunk one field in the index based on the user's configuration. We are not supporting other open source software like Apache Spark and Data Prepper. We prefer to use the ingestion pipeline, which is very convenient.
Our proposed chunking processor is just a feature to split a string into a list of passages. Unlike inference models, chunking processors do not require much computational resource (no GPU is needed). It is already efficient enough to perform the chunking operation inside the OpenSearch cluster.
I have updated the RFC document with an example. You can refer to the Text Embedding use case for more details.
Using the Standard analyzer provided by Lucene can cause a problem: the analyzer can only produce 100 tokens. See this failing IT: https://github.com/opensearch-project/neural-search/actions/runs/8241329594/job/22538377969?pr=607. Looking at the text chunks which users will be creating, I can easily see the limit of 100 getting breached. What is our plan for fixing it? cc: @model-collapse , @yuye-aws , @zane-neo
By default, the analyzer can produce 10,000 tokens. If users want to generate more tokens, they can update their index settings like this:
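For reference, a sketch of such a settings update, assuming the relevant setting is index.analyze.max_token_count (the index name and value here are illustrative):

```json
PUT /my-index/_settings
{
  "index.analyze.max_token_count": 20000
}
```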
I am looking into the failing IT. It is taking some time because the test passes locally.
This makes the processor dependent on the index settings. Is this the right design choice?
The dependency of the chunking processor on index settings is due to the tokenization analyzer in the fixed token length algorithm: the analyzer needs to fetch the maximum token count from the index settings. Besides, the chunking processor is part of the ingestion pipeline. When ingesting documents with a configured pipeline, users should be aware which index they are ingesting documents into, so they should be able to configure the index settings.
Good news! I have just fixed the failing integration test.
@yuye-aws can we close this GH issue as the feature is released in 2.13?
We still need to implement the Markdown algorithm in the future. Shall I include the algorithm design in this RFC or raise a new issue?
Having a separate issue would be great. As most of the features in this RFC are completed, we should close this out.
Sure. I will close this issue.
Background
In the neural-search plugin, documents are first translated into vectors by embedding models via ingestion processors. These embedding models usually have a suggested token limit. For example, the sparse encoding model
amazon/neural-sparse/opensearch-neural-sparse-encoding-v1
has a token limit of 512. Truncating tokens for long documents leads to information loss. This problem is tracked in issue #482, and to solve it we propose adding a chunking processor to ingestion pipelines. In OpenSearch 2.13, we are planning to release two algorithms: Fixed Token Length and Delimiter. The Markdown algorithm will be available in the 2.14 release.
Different Options
Option 1
Implement the chunking processor with both chunking algorithms and text embedding. The input is a long document to be chunked and the output is an array with the text embeddings of the chunked passages.
Pros
Cons
Option 2 (Selected)
Implement the chunking processor with chunking algorithms alone. The input is a long document to be chunked and the output is the chunked passages. As we prefer this option, the following use cases and interface design are based on it.
Pros
Cons
Use Cases
Text Embedding
We can chain the chunking processor with the text_embedding processor to obtain embedding vectors for each chunked passage. Here is an example:
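A sketch of such a pipeline, assuming the chunking processor parameters described in the Parameters section below (algorithm_name, token_limit) and an illustrative field mapping from the input field to an output field; the exact request layout may differ in the released version:

```json
PUT _ingest/pipeline/chunking-embedding-pipeline
{
  "description": "Chunk the passage_text field, then embed each chunk",
  "processors": [
    {
      "chunking": {
        "field_map": { "passage_text": "passage_chunk" },
        "algorithm_name": "fixed_token_length",
        "token_limit": 384
      }
    },
    {
      "text_embedding": {
        "model_id": "<embedding model id>",
        "field_map": { "passage_chunk": "passage_chunk_embedding" }
      }
    }
  ]
}
```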
And we obtain the following results:
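Roughly, the ingested document would then carry both the chunked passages and one embedding per passage, along these lines (the structure is illustrative):

```json
{
  "passage_text": "<the original long document>",
  "passage_chunk": [
    "first chunked passage ...",
    "second chunked passage ..."
  ],
  "passage_chunk_embedding": [
    { "knn": [0.1, 0.2, 0.3] },
    { "knn": [0.4, 0.5, 0.6] }
  ]
}
```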
Cascaded Chunking Processors
Users can chain multiple chunking processors together. For example, if a user wishes to split documents by paragraphs, they can apply the Delimiter algorithm and specify the parameter to be "\n\n". In case a paragraph exceeds the token limit, the user can then append another chunking processor with the Fixed Token Length algorithm. The ingestion pipeline in this example should be configured like:
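A sketch of such a cascaded configuration, under the same assumptions about parameter and field names as the example above:

```json
PUT _ingest/pipeline/cascaded-chunking-pipeline
{
  "processors": [
    {
      "chunking": {
        "field_map": { "passage_text": "passage_paragraph" },
        "algorithm_name": "delimiter",
        "delimiter": "\n\n"
      }
    },
    {
      "chunking": {
        "field_map": { "passage_paragraph": "passage_chunk" },
        "algorithm_name": "fixed_token_length",
        "token_limit": 384
      }
    }
  ]
}
```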
Interface Design
Let’s take an overview of the class hierarchy related to the text chunking processor. Three chunkers (FixedTokenLengthChunker, DelimiterChunker and MarkdownChunker) will implement the Chunker interface. All three chunkers register themselves in the chunker factory. Upon initialization, the text chunking processor instantiates a chunker object and validates the parameters. When performing chunking, it also creates a chunker object and then applies the chunking algorithm.
Chunker Interface
This is the interface for developers to implement their chunking algorithms. All concrete chunkers should implement this interface, which includes two methods: validateParameters and chunk.
Chunker factory
The class ChunkerFactory provides two methods: create and getAllChunkers. The create method takes a String indicating the type of chunking algorithm and returns an instantiated Chunker object. The getAllChunkers method returns a set of strings indicating the available chunking algorithms, which enables the chunking processor to validate input parameters.
Chunking Algorithms
Intuitively, documents can be chunked according to token length or delimiter. We also provide a hierarchical segmentation algorithm as a commonly used method for markdown-formatted text.
Fixed Token Length
Chunking by token length is a simple solution. Here, documents are split into several smaller parts. As chunked passages would begin and end abruptly, users can specify the degree of overlap to retain more context: the current passage includes the ending part of the previous passage.
Suppose that a document is 2000 tokens long, the token limit is set to 500, and the overlap degree is 0.5, meaning that the first 50% of tokens in the current passage are the last 50% of tokens from the previous passage. Therefore, the first passage starts at the 1st token and ends with the 500th token, the second passage starts at the 251st token and ends with the 750th token, and so on.
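To make the arithmetic concrete, the passage boundaries for this configuration work out as follows:

```
token_limit = 500, overlap degree = 0.5  ->  step of 250 tokens between passage starts
passage 1: tokens    1 -  500
passage 2: tokens  251 -  750
passage 3: tokens  501 - 1000
passage 4: tokens  751 - 1250
passage 5: tokens 1001 - 1500
passage 6: tokens 1251 - 1750
passage 7: tokens 1501 - 2000
```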
Example 1
Example 2
Delimiter
Chunking by delimiter targets the scenario of segmenting by sentences or paragraphs, which are explicitly marked by punctuation. For example, we can segment English sentences using ‘.’ and chunk paragraphs using ‘\n’ or ‘\n\n’. The delimiter appears at the end of each chunk.
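For instance, splitting on ‘.’ behaves roughly as follows (the exact handling of whitespace between sentences is an implementation detail):

```
delimiter: "."
input : "OpenSearch is a search engine. It is open source. It supports neural search."
output: ["OpenSearch is a search engine.", " It is open source.", " It supports neural search."]
```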
Example 1
Example 2
Example 3
Markdown
This is a dedicated algorithm for markdown files. The hierarchical structure within a markdown document provides related context for subtitle passages. We can construct a tree based on the title levels in the doc. Given a node in the tree, we include the titles and contents along its path to the root title, including all ancestor nodes. Users can configure the max depth of a tree node. Here is a simple example of a markdown file:
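A minimal illustrative sample (the title names are placeholders only):

```markdown
# Title 1
Content under title 1.

# Title 2
Content under title 2.

## Title 2.1
Content under title 2.1.

## Title 2.2
Content under title 2.2.
```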
For this illustrative sample, the constructed tree would look roughly like:
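```
(document root)
├── Title 1
└── Title 2
    ├── Title 2.1
    └── Title 2.2
```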
For Title 2.2, the expected title should include the titles along its path from the root. The expected content should also include the title.
Please be aware that this algorithm only applies to documents with markdown format.
API
The following are a few examples to create a pipeline for different algorithms.
Fixed Token Length
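A sketch of creating a pipeline with the fixed token length algorithm; the parameter names follow this RFC (algorithm_name, token_limit), while the overlap and tokenizer parameter names below are assumptions based on the discussion above:

```json
PUT _ingest/pipeline/fixed-token-length-chunking-pipeline
{
  "processors": [
    {
      "chunking": {
        "field_map": { "passage_text": "passage_chunk" },
        "algorithm_name": "fixed_token_length",
        "token_limit": 384,
        "overlap_degree": 0.2,
        "tokenizer": "standard"
      }
    }
  ]
}
```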
Delimiter
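A sketch for the delimiter algorithm, under the same naming assumptions:

```json
PUT _ingest/pipeline/delimiter-chunking-pipeline
{
  "processors": [
    {
      "chunking": {
        "field_map": { "passage_text": "passage_chunk" },
        "algorithm_name": "delimiter",
        "delimiter": "\n\n"
      }
    }
  ]
}
```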
Markdown
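A sketch for the planned Markdown algorithm; the max_depth parameter name is an assumption based on the "max depth of tree node" setting described above:

```json
PUT _ingest/pipeline/markdown-chunking-pipeline
{
  "processors": [
    {
      "chunking": {
        "field_map": { "passage_text": "passage_chunk" },
        "algorithm_name": "markdown",
        "max_depth": 3
      }
    }
  ]
}
```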
Parameters
The following table lists the required and optional parameters for the chunking processor.
We set the default token_limit to 384 because most models have a token limit of 512. The standard tokenizer in OpenSearch mainly tokenizes text by words, and according to OpenAI, one token corresponds to approximately 0.75 words of English text. That is why we set 512 * 0.75 = 384 as the default token limit.
The algorithm_name parameter can be either fixed_token_length or delimiter.