Download the Wikipedia dump used to train the model; you can choose whichever dump you like from the Data Link.
- Output :
word2vec-model/data.xml.bz2
Use WikiCorpus from gensim.corpora to extract the article sentences from the dump (a minimal sketch of this step is shown below).
python wiki_to_txt.py <filename of data.xml.bz2>
- Output :
word2vec-model/wiki_texts.txt
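A minimal sketch of what wiki_to_txt.py might look like, assuming the standard gensim WikiCorpus API; the exact options in the real script may differ.

```python
# -*- coding: utf-8 -*-
# Sketch of wiki_to_txt.py (assumed implementation, not the original script).
import sys
from gensim.corpora import WikiCorpus

def main():
    if len(sys.argv) != 2:
        print("Usage: python wiki_to_txt.py <filename of data.xml.bz2>")
        sys.exit(1)

    # WikiCorpus streams the compressed dump and yields each article as a list of tokens.
    wiki_corpus = WikiCorpus(sys.argv[1], dictionary={})
    with open("wiki_texts.txt", "w", encoding="utf-8") as output:
        for i, tokens in enumerate(wiki_corpus.get_texts()):
            output.write(" ".join(tokens) + "\n")
            if (i + 1) % 10000 == 0:
                print("Processed %d articles" % (i + 1))

if __name__ == "__main__":
    main()
```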
Some articles contain Simplified Chinese words that have the same meaning as their Traditional Chinese counterparts; we use opencc to convert them to Traditional Chinese.
opencc -i wiki_texts.txt -o wiki_zh_tw.txt -c s2tw.json
- Output :
word2vec-model/wiki_zh_tw.txt
Use jieba to segment the Chinese sentences into words (see the sketch after this step).
python wiki_seg_jieba.py
- Output :
word2vec-model/wiki_seg.txt
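A minimal sketch of the jieba segmentation step, assuming wiki_seg_jieba.py reads wiki_zh_tw.txt and writes space-separated words to wiki_seg.txt; any stopword filtering in the actual script may differ.

```python
# -*- coding: utf-8 -*-
# Sketch of wiki_seg_jieba.py (assumed implementation, not the original script).
import jieba

def main():
    # Segment each line of the Traditional-Chinese corpus into words,
    # writing them back out separated by spaces for word2vec training.
    with open("wiki_zh_tw.txt", "r", encoding="utf-8") as reader, \
         open("wiki_seg.txt", "w", encoding="utf-8") as writer:
        for line in reader:
            words = jieba.cut(line.strip(), cut_all=False)
            writer.write(" ".join(words) + "\n")

if __name__ == "__main__":
    main()
```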
With wiki_seg.txt, we use word2vec to train the matching model (a training sketch follows below).
python w2v-train.py
- Output :
word2vec-model/
[model.bin](https://drive.google.com/file/d/0B9bH77JfnfxlZlhFaXdudjEwVEU/view?usp=sharing)
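A minimal sketch of w2v-train.py using gensim's Word2Vec (gensim 4.x API; the vector size and window below are illustrative assumptions, not necessarily the values used to produce model.bin).

```python
# -*- coding: utf-8 -*-
# Sketch of w2v-train.py (assumed implementation, not the original script).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

def main():
    # LineSentence streams the pre-segmented corpus one line at a time.
    sentences = LineSentence("wiki_seg.txt")

    # Hyperparameters are illustrative; tune them to taste.
    model = Word2Vec(sentences, vector_size=250, window=5, min_count=5, workers=4)

    # Save only the word vectors in binary format for later lookup.
    model.wv.save_word2vec_format("model.bin", binary=True)

if __name__ == "__main__":
    main()
```

The saved vectors can later be reloaded with `KeyedVectors.load_word2vec_format("model.bin", binary=True)` and queried with `most_similar`.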
Put a medium-length article into target_article.txt in article-keywords.
- Recommended: use jieba to segment the words if your model was trained with jieba segmentation.
- But if your model was trained the CKIP way, use CKIP instead for a better matching result.
python jieba_seg.py target_article.txt
Register an account and password on CKIP. CKIP segments Chinese sentences better than jieba; however, its shortcoming is that it is slower than jieba, because it has to send the sentences over the Internet in chunks and wait for the results to come back.
python ckip_seg.py target_article.txt
Remember to put your account and password into ckip_account.txt on two separate lines.
- Output :
article-keywords/target_article_seg.txt
First, use Counter() to get the frequency of each word in the target article. Second, for each word, add the frequencies of its similar words to its own count, and sort the results. Third, going from the top of the list, eliminate any later word that is very similar to one kept before it (a sketch of this procedure follows below).
python find_key_weight.py target_article_seg.txt
- Output :
article-keywords/target_article_keywords.txt
, a list of keywords from the target article, none of which are similar to each other.
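A minimal sketch of the weighting-and-deduplication idea behind find_key_weight.py, assuming the vectors in model.bin; the similarity threshold, top-N count, and file handling here are assumptions, not the script's actual values.

```python
# -*- coding: utf-8 -*-
# Sketch of find_key_weight.py (assumed implementation, not the original script).
import sys
from collections import Counter
from gensim.models import KeyedVectors

def main():
    seg_file = sys.argv[1]  # e.g. target_article_seg.txt
    wv = KeyedVectors.load_word2vec_format("model.bin", binary=True)

    # Step 1: frequency of each in-vocabulary word in the segmented target article.
    with open(seg_file, "r", encoding="utf-8") as f:
        words = f.read().split()
    freq = Counter(w for w in words if w in wv)

    # Step 2: add the frequencies of each word's similar words to its own count.
    weight = {}
    for word, count in freq.items():
        weight[word] = count + sum(freq.get(sim, 0) for sim, _ in wv.most_similar(word, topn=10))

    # Step 3: walk the sorted list and drop any word too similar to one already kept.
    keywords = []
    for word, _ in sorted(weight.items(), key=lambda x: x[1], reverse=True):
        if all(wv.similarity(word, kept) < 0.6 for kept in keywords):  # 0.6 is an assumed threshold
            keywords.append(word)

    with open("target_article_keywords.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(keywords))

if __name__ == "__main__":
    main()
```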