
Chinese documents and candidates #247

Open
bsariturk opened this issue Aug 8, 2024 · 2 comments
@bsariturk

I'm using jieba to tokenize my Chinese documents, as suggested here in the issues and in the documentation. The documentation also says that if I use a vectorizer, I cannot use a candidates list. In that case, is there a way to use a candidates list with Chinese documents?

@MaartenGr
Owner

When you pass candidates to KeyBERT, the only thing you are doing is adding them to the CountVectorizer vocabulary. So if you have a custom CountVectorizer, simply add the list of candidate words to its vocabulary parameter.
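
For reference, a minimal sketch of what this could look like (the tokenizer wrapper, candidate terms, and example document below are illustrative, not taken from this thread):

```python
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer
import jieba

def tokenize_zh(text):
    # Segment Chinese text into words with jieba
    return jieba.lcut(text)

# Hypothetical candidate keywords; each entry should correspond to a
# token (or n-gram) that the jieba tokenizer can actually produce.
candidates = ["机器学习", "算法", "数据"]

# Passing the candidates as the vocabulary restricts keyword
# extraction to exactly these terms.
vectorizer = CountVectorizer(tokenizer=tokenize_zh, vocabulary=candidates)

kw_model = KeyBERT()
doc = "机器学习算法需要大量数据进行训练。"
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
print(keywords)
```

One caveat: once vocabulary is set, CountVectorizer only counts terms from that list, so candidates that jieba never produces as tokens for your documents will simply never be returned.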

@bsariturk
Author

Thank you so much Maarten. I managed to use my candidates list by providing it as the vocabulary to a custom vectorizer.
