
feat: Port NLTKDocumentSplitter from dC to Haystack #8350

Merged: 11 commits into main from document_splitter (Sep 17, 2024)
Conversation

vblagoje
Member

@vblagoje vblagoje commented Sep 10, 2024

Why:

Introduces a new document splitter component utilizing NLTK for enhanced text processing.

What:

  • Implemented NLTKDocumentSplitter: A new component that leverages the Natural Language Toolkit (NLTK) for splitting documents based on words, sentences, passages, or pages.
  • Configuration Options: The splitter supports various configuration parameters such as split_by, split_length, split_overlap, respect_sentence_boundary, language, use_split_rules, and extend_abbreviations for fine-tuning the document splitting process.
  • Split Mechanics Enhancement: Improvements in split mechanics, including respecting sentence boundaries when splitting by words and addressing special cases with split rules and extended abbreviations for improved tokenization.
  • Testing: Comprehensive test suite for validating the functionality of the NLTKDocumentSplitter across different splitting scenarios and configurations.

How can it be used:

  • Dataset Preprocessing: Before feeding text data into machine learning models, use the splitter to preprocess documents into smaller, more manageable sizes or into specific formats required by downstream processing stages.
# Example usage of the NLTKDocumentSplitter component
from haystack import Document
from haystack.components.preprocessors import NLTKDocumentSplitter

document_splitter = NLTKDocumentSplitter(
    split_by="sentence",
    split_length=10,
    split_overlap=1,
    respect_sentence_boundary=True,
    language="en"
)

# Split documents; run() returns a dictionary with the resulting
# chunks under the "documents" key
result = document_splitter.run(documents=[Document(content="Your input text here.")])
split_documents = result["documents"]

How did you test it:

  • Unit Tests: Implemented unit tests for each critical functionality of the NLTKDocumentSplitter, including various split_by configurations, handling different languages, and respecting sentence boundaries.
  • Integration Testing: Tested the component within a sample text preprocessing pipeline to ensure it integrates well with other components and handles real-world data as expected.

Notes for the reviewer:

  • Check if lazy imports are correctly used
  • This component has been used extensively in dC and is ported from dC into Haystack

@github-actions github-actions bot added type:documentation, topic:tests, and topic:build/distribution and removed type:documentation labels Sep 10, 2024
@coveralls
Collaborator

coveralls commented Sep 10, 2024

Pull Request Test Coverage Report for Build 10792807724

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 6 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.09%) to 90.398%

Files with Coverage Reduction         New Missed Lines   %
components/generators/azure.py        3                  92.68%
components/generators/chat/azure.py   3                  92.5%

Totals Coverage Status:
Change from base Build 10774951533: +0.09%
Covered Lines: 7315
Relevant Lines: 8092

💛 - Coveralls

Member Author

@vblagoje vblagoje left a comment


These are just my notes to help the reviewer orient more easily

:param keep_white_spaces: If True, the tokenizer will keep white spaces between sentences.
:returns: nltk sentence tokenizer.
"""
try:
Member Author


What's the right thing to use here? See the issue described in #8238 (and the corrective action recommended).
In the 1.26.x branch we load these NLTK resources a bit differently; see these LOCs.

Contributor


As it is now should be fine; punkt_tab is the recommended resource to load, and it already works for us.

Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, keeping it as is.
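
Following up on the punkt_tab discussion: the usual NLTK pattern for ensuring the resource is available locally looks roughly like this (a sketch of the general pattern, not necessarily the exact code in this PR):

import nltk

# Check whether the punkt_tab tokenizer data is already installed;
# download it only if the lookup fails.
try:
    nltk.data.find("tokenizers/punkt_tab")
except LookupError:
    nltk.download("punkt_tab")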

return True

# True if the next sentence starts with a bracket; otherwise False
return re.search(r"^\s*[\(\[]", text[next_start:next_end]) is not None
Member Author


The linters and checkers didn't allow me to keep this LOC as it was in the original code. Please double-check, @sjrl.

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the original line of code?

Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was:

if re.search(r"^\s*[\(\[]", text[next_start:next_end]) is not None:
    return True
    
return False

But then the linter didn't allow such code :-)
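
For the record, the simplified one-liner behaves identically to the quoted original; a quick illustrative check (the sample text and slice bounds are made up):

import re

text = "He sold it. (Cheaply, too.)"
next_start, next_end = 12, len(text)  # hypothetical bounds of the next sentence
# Prints True: the next sentence starts with an opening bracket.
print(re.search(r"^\s*[\(\[]", text[next_start:next_end]) is not None)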

@@ -103,6 +103,8 @@ extra-dependencies = [
"python-pptx", # PPTXToDocument
"python-docx", # DocxToDocument

"nltk", # NLTKDocumentSplitter
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nltk is treated as an optional dependency, but we need to add it here for tests.
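
For reference, this is the general lazy-import pattern Haystack uses for optional dependencies (a sketch; the exact message and variable names in this PR may differ):

from haystack.lazy_imports import LazyImport

# The import is deferred; users only see a helpful install message when the
# component is actually used without nltk installed.
with LazyImport("Run 'pip install nltk' to use the NLTKDocumentSplitter.") as nltk_imports:
    import nltk

# Inside the component, the guard is checked before NLTK is first used:
nltk_imports.check()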

@vblagoje vblagoje marked this pull request as ready for review September 10, 2024 13:09
@vblagoje vblagoje requested review from a team as code owners September 10, 2024 13:09
@vblagoje vblagoje requested review from dfokina, davidsbatista and sjrl and removed request for a team September 10, 2024 13:09
@vblagoje
Member Author

@davidsbatista you won the lottery here, but let's allow @sjrl a first pass to make sure all the pieces were migrated properly 🙏

@sjrl
Contributor

sjrl commented Sep 10, 2024

Hey @vblagoje, broad question: would it be better to fold this functionality into the existing DocumentSplitter instead of creating a new component?

@vblagoje
Member Author

Force-pushed to properly credit @sjrl for all the work.

@vblagoje
Member Author

vblagoje commented Sep 10, 2024

Hey @vblagoje, broad question: would it be better to fold this functionality into the existing DocumentSplitter instead of creating a new component?

I'm afraid of unintended side effects for the existing users of DocumentSplitter, @sjrl. Perhaps we can keep it as is now and carefully merge it for the next release, I'd say. Wdyt? Wdyt, @julian-risch?

@vblagoje vblagoje changed the title draft: Port NLTKDocumentSplitter from dC to Haystack feat: Port NLTKDocumentSplitter from dC to Haystack Sep 10, 2024
@vblagoje
Member Author

@davidsbatista I converted a few more methods to static; they seem to be really tied to SentenceSplitter, and as such I didn't make them free-standing.

@vblagoje
Member Author

@sjrl please have another look. I spoke to @julian-risch, and he also agreed that we integrate NLTKDocumentSplitter now and later investigate options to perhaps merge NLTKDocumentSplitter and DocumentSplitter.

@davidsbatista
Contributor

Name                                                          Stmts   Miss  Cover   Missing
-------------------------------------------------------------------------------------------
haystack/components/preprocessors/__init__.py                     5      0   100%
haystack/components/preprocessors/document_cleaner.py           104      2    98%   90, 311
haystack/components/preprocessors/document_splitter.py           96      1    99%   127
haystack/components/preprocessors/nltk_document_splitter.py      98      0   100%
haystack/components/preprocessors/text_cleaner.py                29      0   100%
haystack/components/preprocessors/utils.py                       83     15    82%   91-95, 102-107, 174-176, 202, 208, 212, 230-231
-------------------------------------------------------------------------------------------
TOTAL                                                           415     18    96%

Running the test coverage locally, it seems there are a few edge cases in utils.py that might be worth testing. This is what's not currently being tested:

  • 174-176: inside _apply_split_rules(), tests never go inside the second while loop
  • 202, 208, 212: _needs_join never falls into a return True case
  • 230-231: _read_abbreviations always falls into the first return case

Do you think it's worth extending the tests for these edge cases? (See the sketch below for one possibility.)
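
A hedged sketch of one such edge-case test, exercising the _needs_join "return True" branch through the public API rather than the private helper (the sample text and expected join behavior are illustrative assumptions):

from haystack import Document
from haystack.components.preprocessors import NLTKDocumentSplitter

def test_bracketed_sentence_is_joined_to_previous():
    # With use_split_rules=True, a sentence opening with a bracket should be
    # joined to the preceding sentence rather than becoming its own split.
    splitter = NLTKDocumentSplitter(split_by="sentence", split_length=1, use_split_rules=True)
    result = splitter.run(documents=[Document(content="He sold it. (Cheaply, too.) Then he left.")])
    assert any("(Cheaply, too.)" in doc.content for doc in result["documents"])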

@vblagoje
Member Author

Sure @davidsbatista, let's increase coverage and see about compiling those expressions 🙏

@vblagoje
Member Author

Ah, pre-integration checks say we need to add a new documentation page for this component. Not yet ready for integration, @davidsbatista @sjrl.

@vblagoje
Member Author

@dfokina I created an initial version of the doc for this component. The main info centers around why someone would choose this splitter over the default one.

@vblagoje
Member Author

What prevents us from integrating this PR, @davidsbatista and @sjrl?

@davidsbatista
Contributor

To be complete, maybe just the docs, but I wouldn't hold up the merge because of that.

Contributor

@davidsbatista davidsbatista left a comment


LGTM! Thanks @sjrl for this 👍🏽

@sjrl
Contributor

sjrl commented Sep 17, 2024

@vblagoje I'm doing one last quick look over now!

:param language: The language to read the abbreviations for.
:returns: List of abbreviations.
"""
abbreviations_file = Path(__file__).parent.parent / f"data/abbreviations/{language}.txt"
Contributor


Hey @vblagoje, I noticed that we didn't add these files. Could we do that?
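
For context, a plausible shape of the helper around the quoted line, inferred from the docstring and the coverage notes above (the actual implementation in the PR may differ):

from pathlib import Path
from typing import List

def read_abbreviations(language: str) -> List[str]:
    # Hypothetical reconstruction: return the bundled abbreviations for the
    # language if the file exists; otherwise fall back to an empty list.
    abbreviations_file = Path(__file__).parent.parent / f"data/abbreviations/{language}.txt"
    if not abbreviations_file.exists():
        return []
    return abbreviations_file.read_text(encoding="utf-8").splitlines()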


@sjrl
Contributor

sjrl commented Sep 17, 2024

Thanks @vblagoje, this looks great! Just left a few comments.

Also, all code in the utils.py file was contributed by @tstadel except for the CustomPunktLanguageVars class. So if possible it would be great to attribute him instead :)

@vblagoje
Member Author

Thanks @vblagoje, this looks great! Just left a few comments.

Also, all code in the utils.py file was contributed by @tstadel except for the CustomPunktLanguageVars class. So if possible it would be great to attribute him instead :)

Ah, no problem, will do - thanks @davidsbatista and @sjrl 🙏

@vblagoje
Member Author

Spoke to @tstadel; he waived attribution. Merging this now. @dfokina, let's not forget to include this component in the 2.6 docs release.

@vblagoje vblagoje merged commit badd059 into main Sep 17, 2024
19 checks passed
@vblagoje vblagoje deleted the document_splitter branch September 17, 2024 11:59