Some Thai word2vec problem #1489
Comments
@billy800413 can you give some examples of the Expected/Actual results? I'm thinking it may be somehow related to text preprocessing or tokenization for Thai.
I may be wrong, but isn't Thai also a language without spaces between words, making tokenization potentially much more difficult and dependent on contextual analysis?
It is, that's why I ask for the examples :)
In Thai, words are written with vowel marks (for example วัตถุท้องฟ้า), but after converting thwiki-20170701-pages-articles.xml.bz2 to text I see all the words without vowels (for example วตถทองฟา). I found a dataset from
Eh, I see they ripped off gensim, just copy&pasting gensim code and changing the author & license, without any attribution 😳 @billy800413 regarding Thai tokenization: do you have a tokenizer that can split Thai text into words? If so, we can help you plug it in.
The code which I use to do tokenization:
OK, so before running your code in gensim, you want to plug in this Thai tokenizer. You can do that simply by replacing the default tokenization:

```python
import polyglot
from polyglot.text import Text, Word

def tokenize(content):
    zen = Text(content)
    return [word for word in zen.words]  # or any other text processing you like
```

I don't know how fast Polyglot is (on Thai texts), but gensim has no Thai tokenizer, so it's probably your best bet anyway.
Resolved in #1537; WikiCorpus is now more flexible, @billy800413.
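For reference, #1537 added a `tokenizer_func` hook to `WikiCorpus`, a callable taking `(text, token_min_len, token_max_len, lower)`. Below is a minimal sketch of adapting an arbitrary Thai word segmenter (Polyglot or any other) to that hook; `make_tokenizer` is an illustrative helper, not part of gensim:

```python
def make_tokenizer(segment):
    """Wrap a word-segmentation function into a callable with the
    (text, token_min_len, token_max_len, lower) signature that
    WikiCorpus's tokenizer_func hook expects."""
    def tokenizer_func(text, token_min_len, token_max_len, lower):
        words = segment(text)
        if lower:
            words = [w.lower() for w in words]
        # WikiCorpus normally drops very short/long tokens; mirror that here.
        return [w for w in words if token_min_len <= len(w) <= token_max_len]
    return tokenizer_func

# With Polyglot as the segmenter (assumes polyglot is installed):
# from polyglot.text import Text
# thai_tokenize = make_tokenizer(lambda t: [str(w) for w in Text(t).words])
#
# from gensim.corpora import WikiCorpus
# wiki = WikiCorpus('thwiki-20170701-pages-articles.xml.bz2',
#                   tokenizer_func=thai_tokenize)
```

Any segmenter with the same list-of-strings output could be swapped in for Polyglot here.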
Hi, I'm trying to train word2vec. I downloaded thwiki-20170701-pages-articles.xml.bz2 and converted it to text, but none of the words in the output can be read, because the vowels are missing.
Code to Reproduce
```python
from gensim.corpora import WikiCorpus

wiki_corpus = WikiCorpus('thwiki-20170701-pages-articles.xml.bz2')
with open("wiki_texts.txt", 'w', encoding='utf-8') as output:
    for text in wiki_corpus.get_texts():
        output.write(' '.join(text) + '\n')
```
Expected Results
words with vowels
Actual Results
words without vowels
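The vowel loss is consistent with gensim's default tokenization at the time, which kept maximal runs of "word" characters: Thai vowel and tone signs are combining marks (Unicode category Mn) that Python's `\w` does not match, so each mark both splits a token and disappears. A sketch; the regex approximates the `PAT_ALPHABETIC` pattern in `gensim.utils` of that era:

```python
import re
import unicodedata

# Approximation of the alphabetic-run pattern gensim's default
# tokenizer used: maximal runs of non-digit word characters.
PAT_ALPHABETIC = re.compile(r'(((?![\d])\w)+)', re.UNICODE)

text = 'วัตถุท้องฟ้า'  # Thai sample from the report, with vowel/tone marks
tokens = [m.group() for m in PAT_ALPHABETIC.finditer(text)]

# The marks are category Mn, not matched by \w, so they vanish
# and the consonants are left fused together:
stripped = ''.join(tokens)
```

On top of this, `WikiCorpus` also filters out very short tokens, so the single-consonant fragments produced by the splitting would be dropped entirely, compounding the loss.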
Versions
Python 3.4.3
gensim (2.2.0)