Add documentation for text chunking processor #6707

Merged (11 commits) on Mar 21, 2024

Conversation

yuye-aws
Member

@yuye-aws yuye-aws commented Mar 18, 2024

Description

Add documentation for text chunking processor

Issues Resolved

Implements RFC: opensearch-project/neural-search#548

Closes #6663

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
@kolchfa-aws
Collaborator

Thanks so much, @yuye-aws! I will review shortly and push my edits in the same PR.

@kolchfa-aws kolchfa-aws self-assigned this Mar 18, 2024
@kolchfa-aws kolchfa-aws added the v2.13.0 and release-notes labels Mar 18, 2024
@hdhalter hdhalter changed the title add documentation for text chunking processor Add documentation for text chunking processor Mar 18, 2024
@yuye-aws
Member Author

Thanks, @kolchfa-aws! I will fix the style check today.

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
@yuye-aws
Member Author

Hi @kolchfa-aws! The style check errors have been fixed. You can review this PR. If you have any concerns or questions, feel free to reach out to me.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
@yuye-aws yuye-aws marked this pull request as ready for review March 20, 2024 02:15
@@ -55,6 +55,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse
```
{% include copy-curl.html %}

To split long text into paragraphs, use the `text_chunking` ingest processor before the `sparse_encoding` processor. For more information, see [Text chunking processor]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/).
Member Author

Can we be more specific and change "To split long text into paragraphs" to "To avoid information loss by truncating tokens for long documents"? The `text_chunking` processor is a text preprocessing step that runs before the embedding processor. You can refer to the RFC for details.
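
To make the truncation concern concrete, here is a minimal Python sketch. It assumes a hypothetical embedding model that reads only the first `MODEL_TOKEN_LIMIT` tokens, and it uses whitespace splitting as a stand-in tokenizer; neither detail comes from the RFC.

```python
# Toy illustration: truncation drops tokens, chunking keeps them all.
# MODEL_TOKEN_LIMIT and the whitespace "tokenizer" are assumptions for this sketch.
MODEL_TOKEN_LIMIT = 8


def truncate(tokens, limit=MODEL_TOKEN_LIMIT):
    """Return what a length-limited model would effectively see."""
    return tokens[:limit]


def chunk(tokens, limit=MODEL_TOKEN_LIMIT):
    """Split tokens into consecutive passages of at most `limit` tokens."""
    return [tokens[i:i + limit] for i in range(0, len(tokens), limit)]


document = " ".join(f"word{i}" for i in range(20))
tokens = document.split()

seen_after_truncation = truncate(tokens)
passages = chunk(tokens)

lost = len(tokens) - len(seen_after_truncation)
covered = sum(len(p) for p in passages)
print(f"truncation loses {lost} of {len(tokens)} tokens")
print(f"chunking yields {len(passages)} passages covering {covered} tokens")
```

Every passage fits within the limit, so each one can be embedded without anything being cut off.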

@@ -48,6 +48,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline
```
{% include copy-curl.html %}

To split long text into paragraphs, use the `text_chunking` ingest processor before the `text_embedding` processor. For more information, see [Text chunking processor]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/).
Member Author

Same as above

---

# Text chunking processor

- The `text_chunking` processor is used to chunk a long document into paragraphs. The following is the syntax for the `text_chunking` processor:
+ The `text_chunking` processor is used to chunk a long document into paragraphs on a delimiter or chunks of a certain size. The following is the syntax for the `text_chunking` processor:
Member Author

certain size -> certain token size
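
The distinction matters because the size limit counts tokens, not characters. A rough Python sketch of the two splitting styles follows. The function names are hypothetical, whitespace splitting stands in for a real tokenizer, and keeping the delimiter at the end of each chunk is a choice made for this illustration rather than confirmed processor behavior.

```python
def chunk_by_delimiter(text, delimiter="\n\n"):
    """Split text on `delimiter`, keeping the delimiter at the end of each chunk."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    chunks = []
    start = 0
    while True:
        end = text.find(delimiter, start)
        if end == -1:
            break
        chunks.append(text[start:end + len(delimiter)])
        start = end + len(delimiter)
    if start < len(text):
        # Trailing text after the last delimiter forms the final chunk.
        chunks.append(text[start:])
    return chunks


def chunk_by_token_count(text, token_limit):
    """Group whitespace tokens into chunks of at most `token_limit` tokens."""
    if token_limit < 1:
        raise ValueError("token_limit must be at least 1")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + token_limit])
        for i in range(0, len(tokens), token_limit)
    ]


sample = "First paragraph.\n\nSecond paragraph.\n\nThird."
by_delimiter = chunk_by_delimiter(sample)
by_tokens = chunk_by_token_count("a b c d e", token_limit=2)
print(by_delimiter)
print(by_tokens)
```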


- Users can set parameter `overlap_rate` from 0 to 50 percent. According to [bedrock](https://aws.amazon.com/blogs/aws/knowledge-bases-now-delivers-fully-managed-rag-experience-in-amazon-bedrock/), we recommend users to set this parameter between 0–20 percent to help improve accuracy.
+ You can set the `overlap_rate` to a decimal equal to 0 to 50 percent. Per [Bedrock](https://aws.amazon.com/blogs/aws/knowledge-bases-now-delivers-fully-managed-rag-experience-in-amazon-bedrock/), we recommend setting this parameter to a value of 0–20 percent to improve accuracy.
Member Author

a decimal equal to 0 to 50 percent -> a decimal equal from 0 to 50 percent
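
For readers unsure what the decimal means in practice, here is a small Python sketch of fixed-size chunking with overlap. It assumes the overlap is computed as `floor(token_limit * overlap_rate)` and that each chunk starts `token_limit - overlap` tokens after the previous one; the function and the `token_limit` argument are illustrative and are not taken from the processor source.

```python
import math


def chunk_with_overlap(tokens, token_limit, overlap_rate):
    """Split tokens into chunks of `token_limit`, sharing tokens between neighbors."""
    if token_limit < 1:
        raise ValueError("token_limit must be at least 1")
    if not 0 <= overlap_rate <= 0.5:
        raise ValueError("overlap_rate must be a decimal from 0 to 0.5")

    # Assumption: overlap size is rounded down to a whole number of tokens.
    overlap = math.floor(token_limit * overlap_rate)
    step = token_limit - overlap  # always positive because overlap_rate <= 0.5

    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + token_limit])
        if start + token_limit >= len(tokens):
            break  # this chunk already reaches the end of the document
        start += step
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, token_limit=5, overlap_rate=0.2)
for c in chunks:
    print(c)
```

With a limit of 5 tokens and an `overlap_rate` of 0.2, neighboring chunks share one token, so a sentence that straddles a chunk boundary keeps a little of its surrounding context.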

yuye-aws and others added 5 commits March 20, 2024 11:16
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
@hdhalter hdhalter added the 4 - Doc review label Mar 20, 2024
Collaborator

@natebower natebower left a comment

@yuye-aws @kolchfa-aws Please see my comments and changes and let me know if you have any questions. Thanks!


# Text chunking processor

The `text_chunking` processor splits a long document into shorter passages. The processor supports the following algorithms for splitting text:
Collaborator

"text splitting" instead of "splitting text"?

_ingest-pipelines/processors/text-chunking.md Outdated Show resolved Hide resolved
|:---|:---|:---|:---|
| `field_map` | Object | Required | Contains key-value pairs that specify the mapping of a text field to the output field for the text chunking processor. |
| `field_map.<input_field>` | String | Required | The name of the field from which to obtain text for generating chunked passages. |
| `field_map.<output_field>` | String | Required | The name of the field in which to store the chunking results. |
Collaborator

"chunked" instead of "chunking"?
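
As a reading aid for the `field_map` rows above, this Python sketch mimics how an input field could map to an output field that holds a list of passages. The field names `passage_text` and `passage_chunk`, the `apply_field_map` helper, and the toy chunker are all hypothetical, and the algorithm settings of the real processor are omitted.

```python
import json

# Only `text_chunking` and `field_map` come from the table; the field names are made up.
processor = {
    "text_chunking": {
        "field_map": {"passage_text": "passage_chunk"},
    }
}


def toy_chunker(text, size=3):
    """Stand-in chunker: groups whitespace tokens, `size` per passage."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def apply_field_map(document, field_map, chunker):
    """Return a copy of `document` with each mapped output field filled in."""
    result = dict(document)
    for input_field, output_field in field_map.items():
        if input_field in document:
            result[output_field] = chunker(document[input_field])
    return result


doc = {"passage_text": "one two three four five"}
out = apply_field_map(doc, processor["text_chunking"]["field_map"], toy_chunker)
print(json.dumps(out, indent=2))
```

The original text stays in the input field, and the output field receives the list of passages.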

}
```

Once you have created an ingest pipeline, you need to create an index for ingestion and ingest documents into the index. To learn more, see [Step 2: Create an index for ingestion]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-2-create-an-index-for-ingestion) and [Step 3: Ingest documents into the index]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-3-ingest-documents-into-the-index) of [Neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/).
Collaborator

At the end, let's clarify slightly and lowercase "neural". "of the neural sparse search documentation"?

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Collaborator

@kolchfa-aws kolchfa-aws left a comment

LGTM. Thank you, @yuye-aws!

@kolchfa-aws kolchfa-aws merged commit 69b9384 into opensearch-project:main Mar 21, 2024
3 checks passed
CaptainDredge pushed a commit to CaptainDredge/documentation-website that referenced this pull request Mar 22, 2024
* add documentation for text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix style check errors

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix style check errors

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix style check errors

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* Doc review

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* add example on text embedding and cascade chunking

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix style check

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add comment

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* Review additional requested info

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
@hdhalter hdhalter added the 3 - Done label and removed the 4 - Doc review label Mar 22, 2024
@yuye-aws yuye-aws deleted the text_chunking branch March 26, 2024 02:24
Labels
3 - Done, release-notes, v2.13.0
Development

Successfully merging this pull request may close these issues.

[DOC] Document chunking processor.
4 participants