A text-search script based on a Bayesian network and word2vec
gensim,
$ curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ 7z e enwiki-latest-pages-articles.xml.bz2
2. Parse the original XML dump using wikiextractor
$ git clone https://github.com/attardi/wikiextractor && cd wikiextractor
$ python3 WikiExtractor.py --html -s ../enwiki-latest-pages-articles.xml
$ cd .. && git clone https://github.com/ethanmiles/Bayesian-Network-for-NLP && cd Bayesian-Network-for-NLP/src/py/
$ python3 xmlParser.py --input /path/to/wikiextractor/text/ --work /path/to/workdir/ -p
$ python3 xmlParser.py --work /path/to/workdir/ -q feedback
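The `--input` directory above is the wikiextractor output. As a rough illustration of what that input looks like, the sketch below assumes each output file is a sequence of `<doc ...> ... </doc>` blocks with attributes such as `id`, `url` and `title`. The helper `iter_docs` and the sample text are hypothetical and are not taken from `xmlParser.py`.

```python
import re

# One <doc ...>...</doc> block; the attribute string is parsed separately
# so that extra or reordered attributes do not break the match.
DOC_RE = re.compile(r'<doc\s+([^>]*)>\n?(.*?)</doc>', re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def iter_docs(text):
    """Yield (attributes, body) for every <doc> block in `text` (hypothetical helper)."""
    for match in DOC_RE.finditer(text):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        yield attrs, match.group(2).strip()

# Made-up sample in the assumed wikiextractor layout.
sample = '''<doc id="1" url="https://example.org/?curid=1" title="First">
First

Body of the first article.
</doc>
<doc id="2" url="https://example.org/?curid=2" title="Second">
Second

Body of the second article.
</doc>
'''

docs = list(iter_docs(sample))
for attrs, body in docs:
    print(attrs["id"], attrs["title"], len(body))
```

A regular expression is used here rather than an XML parser because the concatenated `<doc>` blocks have no single root element, and with `--html` the bodies may contain markup that is not well-formed XML.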
xml_parser.hpp contains a set of XML processing tools that cover the common needs of students and researchers.
Method Preview:
- split tags
- edges
- nodes
- probability table
- TF-IDF
- Bayesian network inference
- building graph from structured documents
- graph_test
- xml_parser_test
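For readers unfamiliar with the TF-IDF entry listed above, here is a minimal Python sketch of one common variant: term frequency normalized by document length, multiplied by `log(N / df)`. It is not the implementation in this repository; the function names `tfidf` and `score` and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

def score(query, doc_weights):
    """Sum the TF-IDF weights of the query terms found in one document."""
    return sum(doc_weights.get(term, 0.0) for term in query)

# Toy corpus, already tokenized.
corpus = [
    "bayesian network inference".split(),
    "bayesian word vectors".split(),
    "graph nodes and edges".split(),
]
weights = tfidf(corpus)

query = ["network", "inference"]
ranking = sorted(range(len(corpus)),
                 key=lambda i: score(query, weights[i]),
                 reverse=True)
print(ranking[0])  # index of the best-matching document
```

Terms that occur in many documents (here `bayesian`) get a lower weight than terms unique to one document, which is what makes the score useful for ranking search results.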