Add documentation for text chunking processor #6707
Conversation
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Thanks so much, @yuye-aws! I will review shortly and push my edits in the same PR.
Thanks, @kolchfa-aws! I will fix the style check today.
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Hi @kolchfa-aws! The style check errors have been fixed. You can review this PR. If you have any concerns or questions, feel free to reach out to me.
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
@@ -55,6 +55,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse
```
{% include copy-curl.html %}

To split long text into paragraphs, use the `text_chunking` ingest processor before the `sparse_encoding` processor. For more information, see [Text chunking processor]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/).
Can we be more specific and change "To split long text into paragraphs" to "To avoid information loss by truncating tokens for long documents"? The `text_chunking` processor is a text preprocessing step before the embedding processor. You can refer to the RFC for details.
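For context, a cascaded pipeline of the kind the added sentence describes (chunking before sparse encoding) could be sketched as follows. This is a minimal illustration, not the exact request from this PR: the field names, model ID, and `token_limit` value are placeholders.

```json
PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse
{
  "description": "Chunk long text, then generate sparse embeddings for each chunk",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "sparse_encoding": {
        "model_id": "your-sparse-model-id",
        "field_map": {
          "passage_chunk": "passage_embedding"
        }
      }
    }
  ]
}
```

Because the processors run in order, the `sparse_encoding` processor receives the already-chunked passages rather than the full document, which is what avoids token truncation for long documents.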
_search-plugins/semantic-search.md
Outdated
@@ -48,6 +48,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline
```
{% include copy-curl.html %}

To split long text into paragraphs, use the `text_chunking` ingest processor before the `text_embedding` processor. For more information, see [Text chunking processor]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/).
Same as above
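The dense-embedding variant of the same cascaded setup could be sketched as below. As with the previous example, the field names, model ID, and `token_limit` are illustrative placeholders, not values taken from this PR.

```json
PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "Chunk long text, then generate dense embeddings for each chunk",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "your-dense-model-id",
        "field_map": {
          "passage_chunk": "passage_embedding"
        }
      }
    }
  ]
}
```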
---

# Text chunking processor

The `text_chunking` processor is used to chunk a long document into paragraphs. The following is the syntax for the `text_chunking` processor:
The `text_chunking` processor is used to chunk a long document into paragraphs on a delimiter or chunks of a certain size. The following is the syntax for the `text_chunking` processor:
"certain size" -> "certain token size"
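To make the "on a delimiter or chunks of a certain size" distinction concrete, a delimiter-based configuration could look like the sketch below, which splits on blank lines (paragraph breaks). The pipeline and field names are illustrative.

```json
PUT /_ingest/pipeline/paragraph-chunking-pipeline
{
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "delimiter": {
            "delimiter": "\n\n"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    }
  ]
}
```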
Users can set parameter `overlap_rate` from 0 to 50 percent. According to [bedrock](https://aws.amazon.com/blogs/aws/knowledge-bases-now-delivers-fully-managed-rag-experience-in-amazon-bedrock/), we recommend users to set this parameter between 0–20 percent to help improve accuracy.
You can set the `overlap_rate` to a decimal equal to 0 to 50 percent. Per [Bedrock](https://aws.amazon.com/blogs/aws/knowledge-bases-now-delivers-fully-managed-rag-experience-in-amazon-bedrock/), we recommend setting this parameter to a value of 0–20 percent to improve accuracy.
"a decimal equal to 0 to 50 percent" -> "a decimal equal from 0 to 50 percent"
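As a sketch of how `overlap_rate` is expressed as a decimal in the algorithm configuration (the `token_limit` value and field names here are illustrative, and `0.2` corresponds to the recommended 20 percent upper bound):

```json
{
  "text_chunking": {
    "algorithm": {
      "fixed_token_length": {
        "token_limit": 500,
        "overlap_rate": 0.2
      }
    },
    "field_map": {
      "passage_text": "passage_chunk"
    }
  }
}
```

With these values, each chunk would share roughly 100 tokens (20 percent of 500) with the previous chunk, so context at chunk boundaries is not lost.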
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
@yuye-aws @kolchfa-aws Please see my comments and changes and let me know if you have any questions. Thanks!
# Text chunking processor

The `text_chunking` processor splits a long document into shorter passages. The processor supports the following algorithms for splitting text:
"text splitting" instead of "splitting text"?
|:---|:---|:---|:---|
| `field_map` | Object | Required | Contains key-value pairs that specify the mapping of a text field to the output field for the text chunking processor. |
| `field_map.<input_field>` | String | Required | The name of the field from which to obtain text for generating chunked passages. |
| `field_map.<output_field>` | String | Required | The name of the field in which to store the chunking results. |
"chunked" instead of "chunking"?
}
```

Once you have created an ingest pipeline, you need to create an index for ingestion and ingest documents into the index. To learn more, see [Step 2: Create an index for ingestion]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-2-create-an-index-for-ingestion) and [Step 3: Ingest documents into the index]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/#step-3-ingest-documents-into-the-index) of [Neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/).
At the end, let's clarify slightly and lowercase "neural". "of the neural sparse search documentation"?
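Before wiring a chunking pipeline into an index, it can be sanity-checked with the standard ingest `_simulate` endpoint, which runs the pipeline against sample documents without indexing anything. The pipeline name and field name below are illustrative:

```json
POST /_ingest/pipeline/nlp-ingest-pipeline-sparse/_simulate
{
  "docs": [
    {
      "_source": {
        "passage_text": "First paragraph of a long document.\n\nSecond paragraph of the same document."
      }
    }
  ]
}
```

The response shows each transformed document, so you can confirm that the output field contains the expected chunked passages before ingesting real data.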
Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
LGTM. Thank you, @yuye-aws!
* add documentation for text chunking processor
* fix style check errors
* fix style check errors
* fix style check errors
* Doc review
* add example on text embedding and cascade chunking
* fix style check
* add comment
* Review additional requested info
* Apply suggestions from code review

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Description
Add documentation for text chunking processor
Issues Resolved
Implements RFC: opensearch-project/neural-search#548
Closes #6663
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.