Understanding the Influence of Hyperparameters on Text Embeddings for Natural Language Processing Tasks
This is the accompanying material for the paper "Understanding the Influence of Hyperparameters on Text Embeddings for Natural Language Processing Tasks" by Nils Witt and Christin Seifert. The paper explores how the various hyperparameters of a doc2vec-based kNN classifier influence the accuracy of the model. For further details, please refer to the paper.
- There are two notebooks that transform a dataset into a common format (1, 2)
- This notebook carries out grid search as well as Bayesian optimization.
- The last notebook visualizes the results of the previous step in several ways.
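For readers unfamiliar with grid search, the idea can be sketched in a few lines. This is a minimal illustration, not the code from the notebooks: the parameter names (`vector_size`, `window`, `k`), their values, and the `evaluate` function are assumptions made up for the example, and `evaluate` is a toy stand-in for training a doc2vec model and scoring a kNN classifier.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only.
param_grid = {
    "vector_size": [50, 100],
    "window": [5, 10],
    "k": [1, 5],
}


def evaluate(params):
    """Toy stand-in for 'train doc2vec + kNN with params, return accuracy'."""
    # Replace this with real training and scoring; the formula below only
    # exists so that the sketch runs and has a unique best configuration.
    distance = (
        abs(params["vector_size"] - 100)
        + abs(params["window"] - 5)
        + abs(params["k"] - 5)
    )
    return 1.0 / (1 + distance)


def grid_search(grid, score_fn):
    """Score every combination in the grid and return the best one."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score_fn(params), params))
    best = max(results, key=lambda item: item[0])
    return best, results


(best_score, best_params), all_results = grid_search(param_grid, evaluate)
```

The number of configurations grows multiplicatively with every added hyperparameter, which is one reason the full search is expensive.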
This software is very demanding in terms of hardware. The actual demand depends on some parameters of the grid search (the size of the training set and the number of processes executed in parallel), but in general a large machine (>128 GB RAM, >12 CPU cores) is required.
The project is based on Python 3.5 and Jupyter notebooks. We recommend using the Anaconda distribution to install the environment. Furthermore, we use the Python machine learning stack, including NumPy, scikit-learn, Matplotlib, Pandas and Gensim.
Incorporating new corpora into this framework is fairly easy: to satisfy the input format, a Pandas dataframe must be pickled to disk. The dataframe must contain a column `text`, which holds the text of one document per row, and a column `category`, which assigns that document to a category.
text | category |
---|---|
Lorem Ipsum... | greek philosophy |
Live long and prosper... | sci-fi |
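
A minimal sketch of preparing such a file is shown below. The two example rows mirror the table above; the file name `my_corpus.pkl` and the use of the system temporary directory are assumptions for illustration, so adjust the path to wherever the notebooks expect their input.

```python
import os
import tempfile

import pandas as pd

# Example corpus with the two required columns: one document per row.
corpus = pd.DataFrame(
    {
        "text": ["Lorem Ipsum...", "Live long and prosper..."],
        "category": ["greek philosophy", "sci-fi"],
    }
)

# Hypothetical output location; replace with the path the notebooks read from.
path = os.path.join(tempfile.gettempdir(), "my_corpus.pkl")
corpus.to_pickle(path)

# Round-trip check: the reloaded frame still has the required columns.
reloaded = pd.read_pickle(path)
assert list(reloaded.columns) == ["text", "category"]
```

Pickle files are tied to the Python and Pandas versions that wrote them, so it is safest to create the file in the same environment that runs the notebooks.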