Download the Wikipedia dump used to train the model; you can choose whichever dump you like from the Data Link.
- Output :
word2vec-model/data.xml.bz2
Use WikiCorpus from gensim.corpora to extract the article sentences from the dump (a minimal sketch of this step is shown below).
python wiki_to_txt.py <filename of data.xml.bz2>
- Output :
word2vec-model/wiki_texts.txt
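A minimal sketch of what wiki_to_txt.py might look like, assuming the standard gensim WikiCorpus API; the exact options in the real script may differ.

```python
# -*- coding: utf-8 -*-
# Sketch of wiki_to_txt.py (assumed implementation, not the original script).
import sys
from gensim.corpora import WikiCorpus

def main():
    if len(sys.argv) != 2:
        print("Usage: python wiki_to_txt.py <filename of data.xml.bz2>")
        sys.exit(1)

    # WikiCorpus streams the compressed dump and yields each article as a list of tokens.
    wiki_corpus = WikiCorpus(sys.argv[1], dictionary={})
    with open("wiki_texts.txt", "w", encoding="utf-8") as output:
        for i, tokens in enumerate(wiki_corpus.get_texts()):
            output.write(" ".join(tokens) + "\n")
            if (i + 1) % 10000 == 0:
                print("Processed %d articles" % (i + 1))

if __name__ == "__main__":
    main()
```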
Some articles contain Simplified Chinese words that have the same meaning as their Traditional Chinese counterparts; we use opencc to convert them to Traditional Chinese.
opencc -i wiki_texts.txt -o wiki_zh_tw.txt -c s2tw.json
- Output :
word2vec-model/wiki_zh_tw.txt
Use jieba to segment the Chinese sentences into words (see the sketch after this step).
python wiki_seg_jieba.py
- Output :
word2vec-model/wiki_seg.txt
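A minimal sketch of the jieba segmentation step, assuming wiki_seg_jieba.py reads wiki_zh_tw.txt and writes space-separated words to wiki_seg.txt; any stopword filtering in the actual script may differ.

```python
# -*- coding: utf-8 -*-
# Sketch of wiki_seg_jieba.py (assumed implementation, not the original script).
import jieba

def main():
    # Segment each line of the Traditional-Chinese corpus into words,
    # writing them back out separated by spaces for word2vec training.
    with open("wiki_zh_tw.txt", "r", encoding="utf-8") as reader, \
         open("wiki_seg.txt", "w", encoding="utf-8") as writer:
        for line in reader:
            words = jieba.cut(line.strip(), cut_all=False)
            writer.write(" ".join(words) + "\n")

if __name__ == "__main__":
    main()
```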
With wiki_seg.txt, we use word2vec to train the matching model (a training sketch follows below).
python w2v-train.py
- Output :
word2vec-model/
[model.bin](https://drive.google.com/file/d/0B9bH77JfnfxlZlhFaXdudjEwVEU/view?usp=sharing)
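A minimal sketch of w2v-train.py using gensim's Word2Vec (gensim 4.x API; the vector size and window below are illustrative assumptions, not necessarily the values used to produce model.bin).

```python
# -*- coding: utf-8 -*-
# Sketch of w2v-train.py (assumed implementation, not the original script).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

def main():
    # LineSentence streams the pre-segmented corpus one line at a time.
    sentences = LineSentence("wiki_seg.txt")

    # Hyperparameters are illustrative; tune them to taste.
    model = Word2Vec(sentences, vector_size=250, window=5, min_count=5, workers=4)

    # Save only the word vectors in binary format for later lookup.
    model.wv.save_word2vec_format("model.bin", binary=True)

if __name__ == "__main__":
    main()
```

The saved vectors can later be reloaded with `KeyedVectors.load_word2vec_format("model.bin", binary=True)` and queried with `most_similar`.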
Put a medium-length article into target_article.txt in article-keywords.
- Recommended: use jieba to segment the words if your model was trained with jieba segmentation.
- But if your model was trained the CKIP way, use CKIP instead for a better matching result.
python jieba_seg.py target_article.txt
Register an account and password on CKIP. CKIP segments Chinese sentences better than jieba; however, its shortcoming is that it is slower than jieba, because it has to send the sentences over the Internet in chunks and wait for the results to come back.
python ckip_seg.py target_article.txt
Remember to put your account and password into ckip_account.txt on two separate lines.
- Output :
article-keywords/target_article_seg.txt
First, use Counter() to get the frequency of each word in the target article. Second, for each word, add the frequencies of its similar words to its own count, and sort the results. Third, going from the top of the list, eliminate any later word that is very similar to one kept before it (a sketch of this procedure follows below).
python find_key_weight.py target_article_seg.txt
- Output :
article-keywords/target_article_keywords.txt
, a list of keywords from the target article, none of which are similar to each other.
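A minimal sketch of the weighting-and-deduplication idea behind find_key_weight.py, assuming the vectors in model.bin; the similarity threshold, top-N count, and file handling here are assumptions, not the script's actual values.

```python
# -*- coding: utf-8 -*-
# Sketch of find_key_weight.py (assumed implementation, not the original script).
import sys
from collections import Counter
from gensim.models import KeyedVectors

def main():
    seg_file = sys.argv[1]  # e.g. target_article_seg.txt
    wv = KeyedVectors.load_word2vec_format("model.bin", binary=True)

    # Step 1: frequency of each in-vocabulary word in the segmented target article.
    with open(seg_file, "r", encoding="utf-8") as f:
        words = f.read().split()
    freq = Counter(w for w in words if w in wv)

    # Step 2: add the frequencies of each word's similar words to its own count.
    weight = {}
    for word, count in freq.items():
        weight[word] = count + sum(freq.get(sim, 0) for sim, _ in wv.most_similar(word, topn=10))

    # Step 3: walk the sorted list and drop any word too similar to one already kept.
    keywords = []
    for word, _ in sorted(weight.items(), key=lambda x: x[1], reverse=True):
        if all(wv.similarity(word, kept) < 0.6 for kept in keywords):  # 0.6 is an assumed threshold
            keywords.append(word)

    with open("target_article_keywords.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(keywords))

if __name__ == "__main__":
    main()
```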