
Chinese documents and candidates #247

Open
bsariturk opened this issue Aug 8, 2024 · 2 comments
@bsariturk

I'm using jieba to tokenize my Chinese documents, as suggested here in the issues and in the documentation. The documentation also says that if I use a vectorizer, I cannot use a candidates list. In that case, is there a way to use a candidates list with Chinese documents?

@MaartenGr
Owner

When you pass candidates to KeyBERT, the only thing you are doing is adding them to the CountVectorizer vocabulary. So if you have a custom CountVectorizer, simply add the list of candidate words to its vocabulary parameter.
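
For reference, a minimal sketch of what this could look like (the tokenizer wrapper, candidate terms, and example document below are illustrative, not taken from this thread):

```python
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer
import jieba

def tokenize_zh(text):
    # Segment Chinese text into words with jieba
    return jieba.lcut(text)

# Hypothetical candidate keywords; each entry should correspond to a
# token (or n-gram) that the jieba tokenizer can actually produce.
candidates = ["机器学习", "算法", "数据"]

# Passing the candidates as the vocabulary restricts keyword
# extraction to exactly these terms.
vectorizer = CountVectorizer(tokenizer=tokenize_zh, vocabulary=candidates)

kw_model = KeyBERT()
doc = "机器学习算法需要大量数据进行训练。"
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
print(keywords)
```

One caveat: once vocabulary is set, CountVectorizer only counts terms from that list, so candidates that jieba never produces as tokens for your documents will simply never be returned.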

@bsariturk
Author

Thank you so much Maarten. I managed to use my candidates list by providing it as the vocabulary to a custom vectorizer.
