Add documentation for text chunking processor #6707

yuye-aws · 2024-03-18T14:26:42Z

Description

Add documentation for text chunking processor

Issues Resolved

Implements RFC: opensearch-project/neural-search#548

Closes #6663

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

kolchfa-aws · 2024-03-18T16:28:39Z

Thanks so much, @yuye-aws! I will review shortly and push my edits in the same PR.

yuye-aws · 2024-03-19T01:05:45Z

Thanks, @kolchfa-aws! I will fix the style check today.

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

yuye-aws · 2024-03-19T05:21:29Z

Hi @kolchfa-aws! The style check errors have been fixed, so this PR is ready for your review. If you have any concerns or questions, feel free to reach out to me.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

yuye-aws · 2024-03-20T02:29:55Z

_search-plugins/neural-sparse-search.md

@@ -55,6 +55,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse
 ```
 {% include copy-curl.html %}

+To split long text into paragraphs, use the `text_chunking` ingest processor before the `sparse_encoding` processor. For more information, see [Text chunking processor]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/).


Can we be more specific and change "To split long text into paragraphs" to "To avoid information loss by truncating tokens for long documents"? The `text_chunking` processor is a text preprocessing step that runs before the embedding processor. You can refer to the RFC for details.
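
To illustrate the ordering discussed here, the following is a minimal sketch of a pipeline body in which `text_chunking` runs before `sparse_encoding`. It is built as a Python dictionary so that the ordering can be checked before the body is sent. The field names (`passage_text`, `passage_chunk`, `passage_chunk_embedding`), the model ID placeholder, and the `fixed_token_length` settings are assumptions made for illustration and are not taken from this PR.

```python
import json

# Hypothetical field names and model ID placeholder, used only for illustration.
pipeline = {
    "description": "Chunk long text, then generate sparse embeddings for each chunk",
    "processors": [
        {
            "text_chunking": {
                # Assumed algorithm settings; confirm the exact names in the processor documentation.
                "algorithm": {
                    "fixed_token_length": {"token_limit": 384, "overlap_rate": 0.2}
                },
                "field_map": {"passage_text": "passage_chunk"},
            }
        },
        {
            "sparse_encoding": {
                "model_id": "<model ID>",
                "field_map": {"passage_chunk": "passage_chunk_embedding"},
            }
        },
    ],
}


def processor_order(body):
    """Return the processor type names in the order in which they run."""
    return [next(iter(processor)) for processor in body["processors"]]


order = processor_order(pipeline)
# Chunking must come first so that the embedding processor sees short passages.
assert order.index("text_chunking") < order.index("sparse_encoding")
print(json.dumps(pipeline, indent=2))
```

The printed JSON could then serve as the request body for a `PUT /_ingest/pipeline/<pipeline name>` call.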

yuye-aws · 2024-03-20T02:30:04Z

_search-plugins/semantic-search.md

@@ -48,6 +48,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline
 ```
 {% include copy-curl.html %}

+To split long text into paragraphs, use the `text_chunking` ingest processor before the `text_embedding` processor. For more information, see [Text chunking processor]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/).


Same as above

yuye-aws · 2024-03-20T02:36:22Z

_ingest-pipelines/processors/text-chunking.md

 ---

 # Text chunking processor

-The `text_chunking` processor is used to chunk a long document into paragraphs. The following is the syntax for the `text_chunking` processor:
+The `text_chunking` processor is used to chunk a long document into paragraphs on a delimiter or chunks of a certain size. The following is the syntax for the `text_chunking` processor:


certain size -> certain token size
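
For context, the two splitting strategies named in this sentence can be sketched as follows. This is a simplified illustration, not the processor's implementation: it assumes whitespace-separated words as tokens and drops the delimiter from the output, and the function names are hypothetical.

```python
def chunk_by_delimiter(text, delimiter="\n\n"):
    """Split text wherever the delimiter occurs, dropping empty pieces."""
    return [part for part in text.split(delimiter) if part.strip()]


def chunk_by_token_count(text, token_limit):
    """Split text into chunks of at most token_limit whitespace-separated tokens."""
    if token_limit < 1:
        raise ValueError("token_limit must be at least 1")
    tokens = text.split()  # Assumption: one token per whitespace-separated word.
    return [
        " ".join(tokens[start:start + token_limit])
        for start in range(0, len(tokens), token_limit)
    ]


paragraphs = chunk_by_delimiter("First paragraph.\n\nSecond paragraph.")
fixed = chunk_by_token_count("one two three four five", token_limit=2)
print(paragraphs)
print(fixed)
```

The second function shows why "token size" is the more precise wording: the limit counts tokens, not characters.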

yuye-aws · 2024-03-20T02:36:55Z

_ingest-pipelines/processors/text-chunking.md


-Users can set parameter `overlap_rate` from 0 to 50 percent. According to [bedrock](https://aws.amazon.com/blogs/aws/knowledge-bases-now-delivers-fully-managed-rag-experience-in-amazon-bedrock/), we recommend users to set this parameter between 0–20 percent to help improve accuracy.
+You can set the `overlap_rate` to a decimal equal to 0 to 50 percent. Per [Bedrock](https://aws.amazon.com/blogs/aws/knowledge-bases-now-delivers-fully-managed-rag-experience-in-amazon-bedrock/), we recommend setting this parameter to a value of 0–20 percent to improve accuracy.


a decimal equal to 0 to 50 percent -> a decimal equal from 0 to 50 percent
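
To make the effect of `overlap_rate` concrete, here is a small sketch of fixed-size chunking in which consecutive chunks share tokens. The function name, the `token_limit` parameter, and the rounding of the overlap (floor) are assumptions for illustration; the processor itself may compute the overlap differently.

```python
def chunk_with_overlap(tokens, token_limit, overlap_rate):
    """Split tokens into chunks of up to token_limit tokens that share a leading overlap."""
    if token_limit < 1:
        raise ValueError("token_limit must be at least 1")
    if not 0 <= overlap_rate <= 0.5:
        raise ValueError("overlap_rate must be a decimal from 0 to 0.5")

    overlap = int(token_limit * overlap_rate)  # Assumption: round down.
    step = token_limit - overlap               # Always >= 1 because overlap_rate <= 0.5.

    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + token_limit])
        if start + token_limit >= len(tokens):
            break  # The last chunk already reaches the end of the input.
        start += step
    return chunks


tokens = list(range(10))
chunks = chunk_with_overlap(tokens, token_limit=5, overlap_rate=0.2)
print(chunks)
```

With a limit of 5 tokens and an overlap rate of 0.2, each chunk repeats the last token of the previous chunk, so text that straddles a chunk boundary appears in both chunks.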

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

natebower

@yuye-aws @kolchfa-aws Please see my comments and changes and let me know if you have any questions. Thanks!

natebower · 2024-03-21T10:55:41Z

_ingest-pipelines/processors/text-chunking.md

+
+# Text chunking processor
+
+The `text_chunking` processor splits a long document into shorter passages. The processor supports the following algorithms for splitting text:


"text splitting" instead of "splitting text"?

natebower · 2024-03-21T10:57:57Z

_ingest-pipelines/processors/text-chunking.md

+|:---|:---|:---|:---|
+| `field_map` | Object | Required | Contains key-value pairs that specify the mapping of a text field to the output field for the text chunking processor. |
+| `field_map.<input_field>` | String | Required | The name of the field from which to obtain text for generating chunked passages. |
+| `field_map.<output_field>` | String | Required | The name of the field in which to store the chunking results. |


"chunked" instead of "chunking"?

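As a rough illustration of what the `field_map` rows above describe, the sketch below reads text from each input field and stores the list of chunks in the mapped output field. The helper name `apply_field_map`, the field names, and the paragraph-based chunker are hypothetical and stand in for the processor's real behavior.

```python
def apply_field_map(document, field_map, chunker):
    """Return a copy of document with chunked output fields added per field_map."""
    result = dict(document)
    for input_field, output_field in field_map.items():
        if input_field in document:
            # The input field is left untouched; the chunks go to the output field.
            result[output_field] = chunker(document[input_field])
    return result


doc = {"passage_text": "First paragraph.\n\nSecond paragraph."}
chunked_doc = apply_field_map(
    doc,
    field_map={"passage_text": "passage_chunk"},
    chunker=lambda text: text.split("\n\n"),
)
print(chunked_doc)
```

The output field holds a list of passages, which is why "chunked" passages or results may read more naturally in the table than "chunking results".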
natebower · 2024-03-21T11:24:01Z

_ingest-pipelines/processors/text-chunking.md

+}
+```
+
+Once you have created an ingest pipeline, you need to create an index for ingestion and ingest documents into the index. To learn more, see [Step 2: Create an index for ingestion]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-2-create-an-index-for-ingestion) and [Step 3: Ingest documents into the index]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-3-ingest-documents-into-the-index) of [Neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/).


At the end, let's clarify slightly and lowercase "neural". "of the neural sparse search documentation"?

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

kolchfa-aws

LGTM. Thank you, @yuye-aws!

* add documentation for text chunking processor. Signed-off-by: yuye-aws <yuyezhu@amazon.com>
* fix style check errors. Signed-off-by: yuye-aws <yuyezhu@amazon.com>
* fix style check errors. Signed-off-by: yuye-aws <yuyezhu@amazon.com>
* fix style check errors. Signed-off-by: yuye-aws <yuyezhu@amazon.com>
* Doc review. Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* add example on text embedding and cascade chunking. Signed-off-by: yuye-aws <yuyezhu@amazon.com>
* fix style check. Signed-off-by: yuye-aws <yuyezhu@amazon.com>
* add comment. Signed-off-by: yuye-aws <yuyezhu@amazon.com>
* Review additional requested info. Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Apply suggestions from code review. Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>

add documentation for text chunking processor

a7d04b2

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

yuye-aws requested review from hdhalter, kolchfa-aws, Naarcha-AWS, vagimeli, AMoo-Miki, natebower, dlvenable and stephen-crawford as code owners March 18, 2024 14:26

yuye-aws marked this pull request as draft March 18, 2024 14:26

yuye-aws mentioned this pull request Mar 18, 2024

[DOC] Document chunking processor. #6663

Closed

3 tasks

kolchfa-aws self-assigned this Mar 18, 2024

kolchfa-aws added the v2.13.0 and release-notes (PR: Include this PR in the automated release notes) labels Mar 18, 2024

hdhalter changed the title ~~add documentation for text chunking processor~~ Add documentation for text chunking processor Mar 18, 2024

yuye-aws added 3 commits March 19, 2024 13:14

fix style check errors

f1313f9

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

fix style check errors

9bf6bfe

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

fix style check errors

b71e621

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

Doc review

8c2a029

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

yuye-aws marked this pull request as ready for review March 20, 2024 02:15

yuye-aws commented Mar 20, 2024

yuye-aws and others added 5 commits March 20, 2024 11:16

add example on text embedding and cascade chunking

4ceffbe

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

fix style check

f4e8a97

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

add comment

e8b1343

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

Merge branch 'main' into text_chunking

fd1ec59

Review additional requested info

59a808c

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

hdhalter added the 4 - Doc review (PR: Doc review in progress) label Mar 20, 2024

natebower reviewed Mar 21, 2024

kolchfa-aws reviewed Mar 21, 2024

kolchfa-aws reviewed Mar 21, 2024

kolchfa-aws reviewed Mar 21, 2024

kolchfa-aws reviewed Mar 21, 2024

Apply suggestions from code review

cec9b8a

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

kolchfa-aws approved these changes Mar 21, 2024

kolchfa-aws merged commit 69b9384 into opensearch-project:main Mar 21, 2024
3 checks passed

hdhalter added the 3 - Done (Issue is done/complete) label and removed the 4 - Doc review (PR: Doc review in progress) label Mar 22, 2024

yuye-aws deleted the text_chunking branch March 26, 2024 02:24

yuye-aws mentioned this pull request Mar 27, 2024

Add example to text chunking processor documentation #6794

Merged

1 task
