
some thai word2vec problem #1489

Closed

billy800413 opened this issue Jul 18, 2017 · 8 comments

Comments

@billy800413

Hi, I'm trying to train word2vec. I downloaded thwiki-20170701-pages-articles.xml.bz2 and converted it to text, but none of the words in the output are readable, because the vowel marks are missing.

Code to Reproduce

from gensim.corpora import WikiCorpus

wiki_corpus = WikiCorpus('thwiki-20170701-pages-articles.xml.bz2')
with open("wiki_texts.txt", 'w', encoding='utf-8') as output:
    for text in wiki_corpus.get_texts():
        output.write(' '.join(text) + '\n')

Expected Results
Words with their vowel marks intact

Actual Results
Words with the vowel marks stripped

Versions
Python 3.4.3
gensim (2.2.0)

@piskvorky
Owner

@billy800413 can you give some examples of the Expected/Actual results?

I'm thinking it may be somehow related to text preprocessing or tokenization for Thai.

@gojomo
Collaborator

gojomo commented Jul 18, 2017

I may be wrong, but isn't Thai also a language without spaces between words, making tokenization potentially much more difficult and dependent on contextual analysis?

@piskvorky
Owner

piskvorky commented Jul 18, 2017

It is; that's why I asked for examples :)
Vowels are also realized very differently in Thai (below the consonant, above it, ...).

@billy800413
Author

In Thai, words carry vowel marks (for example วัตถุท้องฟ้า), but after converting thwiki-20170701-pages-articles.xml.bz2 to text, every word comes out without them (for example วตถทองฟา). I found a dataset at
https://sites.google.com/site/rmyeid/projects/polyglot -> th_wiki_text.tar.lzma; it is also built from Wikipedia, but its words keep their vowels. So for now I use th_wiki_text.tar.lzma and Polyglot to tokenize the sentences and train my word2vec.

@piskvorky
Owner

piskvorky commented Jul 19, 2017

Eh, I see they ripped off gensim, just copy&pasting gensim code and changing the author & license, without any attribution 😳
https://bitbucket.org/aboSamoor/polyglot2/src/2aba7d03dd2bfc97aa8c620161c2489df827b538/polyglot2/polyglot2.py?at=master&fileviewer=file-view-default

@billy800413 regarding Thai tokenization: do you have a tokenizer that can split Thai text into words? If so, we can help you plug it in.

@billy800413
Author

The code I use to do the tokenization (sentence and output are defined elsewhere in my script):

import polyglot
from polyglot.text import Text, Word

zen = Text(sentence)
words = zen.words  # the result is a list of tokens
for word in words:
    output.write(word + ' ')

@piskvorky
Owner

piskvorky commented Jul 19, 2017

OK, so before running your code in gensim, you want to plug in this Thai tokenizer. You can do that simply by replacing the tokenize function here with your own tokenizer:

import polyglot
from polyglot.text import Text, Word

def tokenize(content):
    zen = Text(content)
    return [word for word in zen.words]  # or any other text processing you like

I don't know how fast Polyglot is (on Thai texts), but gensim has no Thai tokenizer, so it's probably your best bet anyway.
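
To make "replacing the tokenize function" concrete, here is a minimal sketch of the monkey-patch approach. It assumes gensim 2.x, where WikiCorpus's article-processing code calls the module-level tokenize() in gensim.corpora.wikicorpus; verify that call path against your installed version before relying on it.

import gensim.corpora.wikicorpus as wikicorpus
from polyglot.text import Text

def thai_tokenize(content):
    # Polyglot segments scripts written without inter-word spaces,
    # so Thai vowel marks survive tokenization.
    return [str(word) for word in Text(content).words]

# Rebind the module-level name before building the corpus (assumed call path).
wikicorpus.tokenize = thai_tokenize

wiki_corpus = wikicorpus.WikiCorpus('thwiki-20170701-pages-articles.xml.bz2')
with open('wiki_texts.txt', 'w', encoding='utf-8') as output:
    for text in wiki_corpus.get_texts():
        output.write(' '.join(text) + '\n')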

@menshikh-iv
Contributor

Resolved in #1537; WikiCorpus is now more flexible, @billy800413.
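
For reference, a sketch of what the more flexible API looks like, assuming the tokenizer_func parameter that #1537 added (a callable taking text, token_min_len, token_max_len, lower):

from gensim.corpora import WikiCorpus
from polyglot.text import Text

def thai_tokenizer(text, token_min_len, token_max_len, lower):
    # Segment with Polyglot, then apply the same length bounds
    # that WikiCorpus's default tokenizer enforces.
    return [str(w) for w in Text(text).words
            if token_min_len <= len(w) <= token_max_len]

wiki_corpus = WikiCorpus('thwiki-20170701-pages-articles.xml.bz2',
                         tokenizer_func=thai_tokenizer)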
