
some thai word2vec problem #1489

Closed

billy800413 opened this issue Jul 18, 2017 · 8 comments

Comments

@billy800413

Hi, I'm trying to train word2vec. I downloaded thwiki-20170701-pages-articles.xml.bz2 and converted it to text, but none of the words in the output are readable, because the vowel marks are missing.

Code to Reproduce

from gensim.corpora import WikiCorpus

wiki_corpus = WikiCorpus('thwiki-20170701-pages-articles.xml.bz2')
with open("wiki_texts.txt", 'w', encoding='utf-8') as output:
    for text in wiki_corpus.get_texts():
        output.write(' '.join(text) + '\n')

Expected Results
Words with their vowel marks intact

Actual Results
Words with the vowel marks stripped

Versions
Python 3.4.3
gensim (2.2.0)

@piskvorky
Owner

@billy800413 can you give some examples of the Expected/Actual results?

I'm thinking it may be somehow related to text preprocessing or tokenization for Thai.

@gojomo
Collaborator

gojomo commented Jul 18, 2017

I may be wrong, but isn't Thai also a language without spaces between words, making tokenization potentially much more difficult and dependent on contextual analysis?

@piskvorky
Owner

piskvorky commented Jul 18, 2017

It is; that's why I asked for examples :)
Vowels are also realized very differently in Thai (below the consonant, above it, ...).

@billy800413
Author

In Thai, words carry vowel marks (for example วัตถุท้องฟ้า), but after converting thwiki-20170701-pages-articles.xml.bz2 to text, every word comes out without them (for example วตถทองฟา). I found a dataset at
https://sites.google.com/site/rmyeid/projects/polyglot -> th_wiki_text.tar.lzma; it is also built from Wikipedia, but its words keep their vowels. So for now I use th_wiki_text.tar.lzma and Polyglot to tokenize the sentences and train my word2vec.

@piskvorky
Owner

piskvorky commented Jul 19, 2017

Eh, I see they ripped off gensim, just copy&pasting gensim code and changing the author & license, without any attribution 😳
https://bitbucket.org/aboSamoor/polyglot2/src/2aba7d03dd2bfc97aa8c620161c2489df827b538/polyglot2/polyglot2.py?at=master&fileviewer=file-view-default

@billy800413 regarding Thai tokenization: do you have a tokenizer that can split Thai text into words? If so, we can help you plug it in.

@billy800413
Author

The code I use to do the tokenization (sentence and output are defined elsewhere in my script):

import polyglot
from polyglot.text import Text, Word

zen = Text(sentence)
words = zen.words  # the result is a list of tokens
for word in words:
    output.write(word + ' ')

@piskvorky
Owner

piskvorky commented Jul 19, 2017

OK, so before running your code in gensim, you want to plug in this Thai tokenizer. You can do that simply by replacing the tokenize function here with your own tokenizer:

import polyglot
from polyglot.text import Text, Word

def tokenize(content):
    zen = Text(content)
    return [word for word in zen.words]  # or any other text processing you like

I don't know how fast Polyglot is (on Thai texts), but gensim has no Thai tokenizer, so it's probably your best bet anyway.
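
To make "replacing the tokenize function" concrete, here is a minimal sketch of the monkey-patch approach. It assumes gensim 2.x, where WikiCorpus's article-processing code calls the module-level tokenize() in gensim.corpora.wikicorpus; verify that call path against your installed version before relying on it.

import gensim.corpora.wikicorpus as wikicorpus
from polyglot.text import Text

def thai_tokenize(content):
    # Polyglot segments scripts written without inter-word spaces,
    # so Thai vowel marks survive tokenization.
    return [str(word) for word in Text(content).words]

# Rebind the module-level name before building the corpus (assumed call path).
wikicorpus.tokenize = thai_tokenize

wiki_corpus = wikicorpus.WikiCorpus('thwiki-20170701-pages-articles.xml.bz2')
with open('wiki_texts.txt', 'w', encoding='utf-8') as output:
    for text in wiki_corpus.get_texts():
        output.write(' '.join(text) + '\n')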

@menshikh-iv
Contributor

Resolved in #1537; WikiCorpus is now more flexible, @billy800413.
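
For reference, a sketch of what the more flexible API looks like, assuming the tokenizer_func parameter that #1537 added (a callable taking text, token_min_len, token_max_len, lower):

from gensim.corpora import WikiCorpus
from polyglot.text import Text

def thai_tokenizer(text, token_min_len, token_max_len, lower):
    # Segment with Polyglot, then apply the same length bounds
    # that WikiCorpus's default tokenizer enforces.
    return [str(w) for w in Text(text).words
            if token_min_len <= len(w) <= token_max_len]

wiki_corpus = WikiCorpus('thwiki-20170701-pages-articles.xml.bz2',
                         tokenizer_func=thai_tokenizer)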
