Authors: Jisun An, Haewoon Kwak, and Yong-Yeol Ahn
Because word semantics can substantially change across communities and contexts, capturing domain-specific word semantics is an important challenge. Here, we propose SemAxis, a simple yet powerful framework to characterize word semantics using many semantic axes in word-vector spaces beyond sentiment. We demonstrate that SemAxis can capture nuanced semantic representations in multiple online communities. We also show that, when the sentiment axis is examined, SemAxis outperforms the state-of-the-art approaches in building domain-specific sentiment lexicons.
For example, the /r/The_Donald community perceives "gun" as closer to "safe" than the /r/SandersForPresident community does.
If you make use of this work in your research, please cite the following paper:
Jisun An, Haewoon Kwak, and Yong-Yeol Ahn. 2018. SemAxis: A Lightweight Framework to Characterize Domain-Specific Word Semantics Beyond Sentiment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL'18)
@InProceedings{P18-1228,
author = "An, Jisun
and Kwak, Haewoon
and Ahn, Yong-Yeol",
title = "SemAxis: A Lightweight Framework to Characterize Domain-Specific Word Semantics Beyond Sentiment",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "2450--2461",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1228"
}
To use SemAxis, you will first need to download pre-trained word embeddings. Once this is done, specify the path to these embeddings (variable: EMBEDDING_PATH) in semaxis.py. The file semaxis.py contains implementations for computing a semantic axis from two pole words and for projecting a target word onto that axis, along with comments/documentation on how to use them.
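The two operations above can be sketched as follows. This is an illustrative sketch, not the actual API of semaxis.py: the function names and the toy vectors are made up, and in real use the embeddings would be loaded from EMBEDDING_PATH (e.g. with gensim) rather than hard-coded.

```python
import numpy as np

# Toy embedding lookup standing in for real pre-trained vectors
# (in practice, loaded from EMBEDDING_PATH).
embeddings = {
    "safe":      np.array([0.9, 0.1, 0.0]),
    "dangerous": np.array([-0.8, 0.2, 0.1]),
    "gun":       np.array([0.5, 0.3, 0.2]),
}

def semantic_axis(pos_word, neg_word, emb):
    """Axis vector pointing from the negative pole toward the positive pole."""
    return emb[pos_word] - emb[neg_word]

def project(word, axis, emb):
    """Score a target word as its cosine similarity with the axis vector."""
    v = emb[word]
    return float(np.dot(v, axis) / (np.linalg.norm(v) * np.linalg.norm(axis)))

axis = semantic_axis("safe", "dangerous", embeddings)
score = project("gun", axis, embeddings)  # in [-1, 1]; > 0 leans toward "safe"
```

A positive score places the target word nearer the positive pole, a negative score nearer the negative pole.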
We make the pre-trained word embeddings used in this study available for download.
- Google300D (note that for SemAxis, the .bin file needs to be converted to a text file)
- Reddit20M (we randomly sampled 1M comments from the top 200 subreddits and trained the word embeddings on this corpus)
- Subreddits (we update the reference model, Reddit20M, with a corpus of each subreddit)
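The .bin-to-text conversion mentioned for Google300D can be done with gensim (load with `KeyedVectors.load_word2vec_format(path, binary=True)`, then `save_word2vec_format(out_path, binary=False)`). A dependency-free sketch of the same conversion, shown here on a tiny hand-made binary file rather than the real GoogleNews vectors:

```python
import os
import struct
import tempfile

def bin_to_text(bin_path, txt_path):
    """Convert word2vec binary format to the plain-text format SemAxis reads."""
    with open(bin_path, "rb") as fin, open(txt_path, "w", encoding="utf-8") as fout:
        vocab_size, dim = map(int, fin.readline().split())  # header: "<vocab> <dim>"
        fout.write("%d %d\n" % (vocab_size, dim))
        for _ in range(vocab_size):
            # Each entry is a space-terminated word followed by dim float32s.
            word = b""
            while True:
                ch = fin.read(1)
                if ch == b" ":
                    break
                word += ch
            vec = struct.unpack("%df" % dim, fin.read(4 * dim))
            fout.write(word.decode("utf-8", errors="replace").strip())
            fout.write(" " + " ".join("%.6f" % x for x in vec) + "\n")

# Demonstration on a two-word, two-dimensional binary file.
tmp = tempfile.mkdtemp()
bin_path = os.path.join(tmp, "toy.bin")
txt_path = os.path.join(tmp, "toy.txt")
with open(bin_path, "wb") as f:
    f.write(b"2 2\n")
    f.write(b"hello " + struct.pack("2f", 1.0, 2.0))
    f.write(b"world " + struct.pack("2f", 3.0, 4.0))
bin_to_text(bin_path, txt_path)
with open(txt_path) as f:
    lines = f.read().splitlines()
```

For the actual GoogleNews file, the gensim one-liner is the more robust route; the sketch is only meant to show what the conversion does.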
We systematically induce 732 semantic axes based on antonym pairs from ConceptNet. They are available as 732 Pre-defined Semantic Axes for download. The file is tab-separated and includes 732 antonym word pairs.
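Reading the axes file amounts to parsing one tab-separated antonym pair per line. The sample string below is illustrative; the actual 732 pairs come from the downloaded file, and the column order of the poles is an assumption.

```python
import csv
import io

# Stand-in for the downloaded TSV; each line holds one antonym pair.
sample = "good\tbad\nsafe\tdangerous\nclean\tdirty\n"

# In real use: open("732_semaxis_axes.tsv") instead of io.StringIO(sample).
axes = [tuple(row) for row in csv.reader(io.StringIO(sample), delimiter="\t")]
```

Each tuple can then be passed as the two pole words when building a semantic axis.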
An up-to-date Python 3.5 distribution, with the standard packages provided by the Anaconda distribution, is required.
In particular, the code was tested with:
numpy (1.14.0)
gensim (3.4.0)
scipy (1.0.0)