Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade Haystack 1.x to NLTK 3.9 #8238

Closed
julian-risch opened this issue Aug 15, 2024 · 4 comments
Closed

Upgrade Haystack 1.x to NLTK 3.9 #8238

julian-risch opened this issue Aug 15, 2024 · 4 comments
Assignees
Labels
1.x P1 High priority, add to the next sprint topic:preprocessing

Comments

@julian-risch
Copy link
Member

In Haystack 1.26.x we should replace the nltk.download("punkt") with nltk.download('punkt_tab') here


so that users can use Haystack 1.26.x with NLTK 3.9. Prior NLTK versions are affected by https://nvd.nist.gov/vuln/detail/CVE-2024-39705. We should therefore also pin NLTK to >=3.9.

While the NLTK release notes list 3.8.2 https://pypi.org/project/nltk/#history with the fix, that release disappeared from pypi. https://pypi.org/project/nltk/#history
There is a comment on GitHub saying that the release was deleted and there will be a 3.9 nltk/nltk#3301 (comment)

@julian-risch julian-risch added topic:preprocessing P1 High priority, add to the next sprint 1.x labels Aug 15, 2024
@sagarneeldubey
Copy link

sagarneeldubey commented Aug 15, 2024

Our pre-processing pipeline broke since the nltk update because of the following issues:

  1. Unpickling restricted to simple types to avoid CVE issues (the reason we have to upgrade from 3.8.1 asap). So use "PunktTokenizer" to safely load the nltk model rather than using "nltk.data.load".
  2. "punkt" package contains unsafe pickles and is replaced by "punkt_tab" [https://github.com/[BUG] punkt_tab breaking change nltk/nltk#3293]

We had to create a custom preprocessor component to fix the above issues. Happy to contribute to a fix if needed.

@vblagoje
Copy link
Member

vblagoje commented Aug 29, 2024

Closing as #8256 has been integrated on 1.26.x branch and a new 1.26.3 release has been released with this fix.

@sagarneeldubey
Copy link

We upgraded farm-haystack to 1.26.3 today and we confirm that the preprocessor is working fine with nltk 3.9.1 , so we don't need the custom preprocessor anymore. Thanks a lot for your prompt response to this issue!

@vblagoje
Copy link
Member

vblagoje commented Sep 1, 2024

Awesome, thanks for reporting back @sagarneeldubey much appreciated 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.x P1 High priority, add to the next sprint topic:preprocessing
Projects
None yet
Development

No branches or pull requests

3 participants