WTMF-pipeline

Documentation:

This package runs the WTMF pipeline. It contains the Python code for a distributional similarity model, Orthogonal Matrix Factorization (OrMF), and a Perl pipeline that preprocesses the data and uses the OrMF model to extract latent vectors for short texts.

The OrMF model is an unsupervised dimension reduction algorithm. It uses exactly the same information that LSA and LDA exploit, namely word-document co-occurrence, and outperforms LSA and LDA by a large margin on the sentence similarity data sets.
It trains a model on a corpus and then finds a latent K-dimensional vector for each short text in the test data. A larger K usually leads to better performance. In this package, the default is K=100.
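
Once each short text has a latent vector, two texts can be compared by comparing their vectors. The sketch below assumes cosine similarity as the comparison measure and uses made-up 4-dimensional vectors in place of real K=100 output. The function name `cosine_similarity` and the example values are illustrative and not part of this package.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension K")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical latent vectors for two short texts (real ones would have K=100).
text_a = [0.2, -0.1, 0.4, 0.0]
text_b = [0.1, -0.2, 0.5, 0.1]
similarity = cosine_similarity(text_a, text_b)
```

Values close to 1 indicate texts whose latent vectors point in nearly the same direction.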

Please cite these papers if you use this code.

[1] Weiwei Guo and Mona Diab. 2012. Modeling Sentences in the Latent Space. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 864–872, Jeju Island, Korea. Association for Computational Linguistics.

[2] Weiwei Guo, Wei Liu, and Mona Diab. 2014. Fast Tweet Retrieval with Compact Binary Codes. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 486–496, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

How to run on ICS-ACI

Follow the two steps below if you are running the project on ICS-ACI (The Institute for CyberScience Advanced CyberInfrastructure), Penn State's high-performance research cloud:

Upload the project to @submit.aci.ics.psu.edu
Load the proper GCC with:

module load gcc

Run the following for testing:

python3 test.py

Or run the following for training:

python3 train.py

How to run locally

Make sure your GCC version is later than 5.3
Make sure the Armadillo archive (armadillo-9.800.4.tar.xz) has been extracted
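
Both local checks can be scripted. The following is a minimal sketch, not part of this package: the helper names are hypothetical, "after 5.3" is read as strictly greater than 5.3.0 (switch to `>=` if 5.3.0 itself works for you), and the Armadillo archive is assumed to sit in the directory you pass in.

```python
import re
import subprocess
import tarfile

MIN_GCC = (5, 3, 0)  # threshold taken from the note above


def parse_version(text):
    """Return (major, minor, patch) from a string such as '9.3.0' or '9'."""
    match = re.search(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    # Missing components (e.g. plain '9') are treated as zero.
    return tuple(int(part or 0) for part in match.groups())


def gcc_is_new_enough(version_text, minimum=MIN_GCC):
    # Assumption: "after 5.3" means strictly greater than 5.3.0.
    return parse_version(version_text) > minimum


def installed_gcc_version():
    """Ask the gcc found on PATH for its version string."""
    result = subprocess.run(
        ["gcc", "-dumpversion"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def extract_archive(archive_path, dest_dir="."):
    """Unpack an xz-compressed tar archive into dest_dir."""
    with tarfile.open(archive_path, "r:xz") as archive:
        archive.extractall(dest_dir)
```

A typical use would be `gcc_is_new_enough(installed_gcc_version())` followed by `extract_archive("armadillo-9.800.4.tar.xz")`, run from the directory that holds the archive.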

Additional notes

Change the basedir variable in config.ini to your local project path
Change the base_dir variable in WTMF/ormf.cpp to your local project path
Check that the line count of train.txt matches the count the program computes
Check that the word count of train.txt matches the value given as input in WTMF/script/intimate.py
The results are written to a file in model.mat format
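
The line and word counts mentioned above can be obtained independently of the pipeline. This sketch assumes train.txt holds one short text per line with whitespace-separated tokens. Because the notes do not say whether "word count" means total tokens or distinct words, it reports both. The function names are illustrative only.

```python
def corpus_counts(lines):
    """Return (line_count, token_count, vocabulary_size) for an iterable of lines."""
    line_count = 0
    token_count = 0
    vocabulary = set()
    for line in lines:
        line_count += 1
        tokens = line.split()  # assumption: tokens are separated by whitespace
        token_count += len(tokens)
        vocabulary.update(tokens)
    return line_count, token_count, len(vocabulary)


def corpus_counts_from_file(path, encoding="utf-8"):
    """Same counts, read from a text file such as train.txt."""
    with open(path, encoding=encoding) as handle:
        return corpus_counts(handle)
```

Calling `corpus_counts_from_file("train.txt")` gives the three numbers to compare against what the program reports and what the script expects.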
