This Repository contains the code that generated the data and plots for the paper "Understanding the Influence of Hyperparameters on Text Embeddings for Natural Language Processing Tasks"

n-witt/Influence-of-Hyperparameter-on-Text-Embeddings

Understanding the Influence of Hyperparameters on Text Embeddings for Natural Language Processing Tasks

This is the accompanying material for the paper "Understanding the Influence of Hyperparameters on Text Embeddings for Natural Language Processing Tasks" by Nils Witt and Christin Seifert. The paper explores how the various hyperparameters of a doc2vec-based kNN classifier influence the accuracy of the model. For further details, please refer to the paper.

Repository structure

  1. There are two notebooks that transform a dataset into a common format (1, 2)
  2. This notebook carries out grid search as well as Bayesian optimization.
  3. The last notebook visualizes the results of the previous step in several ways.
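
To illustrate what the grid search in step 2 does, here is a minimal sketch of enumerating every combination of hyperparameter values. The parameter names, the value ranges, and the `evaluate` placeholder are assumptions made for illustration; they are not the search space or the code used in the notebooks.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative
# assumptions, not the grid used in the paper.
param_grid = {
    "vector_size": [50, 100, 300],
    "window": [5, 10],
    "epochs": [10, 20],
}


def iter_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def evaluate(params):
    """Placeholder for training doc2vec + kNN and measuring accuracy.

    Returns a dummy score so that the sketch runs without Gensim.
    """
    return 1.0 / (1 + abs(params["vector_size"] - 100))


configs = list(iter_grid(param_grid))
best = max(configs, key=evaluate)
print(len(configs), "configurations; best:", best)
```

The number of configurations is the product of the list lengths, so every added hyperparameter value multiplies the number of models that have to be trained.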

Hardware requirements

This software is demanding in terms of hardware. The actual demand depends on parameters of the grid search (the size of the training set and the number of processes executed in parallel), but in general a large machine (>128 GB RAM, >12 CPU cores) is required.

Dependencies

The project is based on Python 3.5 and Jupyter notebooks. We recommend using the Anaconda distribution to install the environment. Furthermore, we use the Python machine learning stack, including NumPy, scikit-learn, Matplotlib, Pandas and Gensim.

Adapting new corpora

Incorporating new corpora into this framework is fairly easy: to satisfy the input format, a Pandas dataframe must be pickled to disk. The dataframe must contain a column text, which holds the text of one document per row, and a column category, which assigns each document to a category.

| text                     | category         |
|--------------------------|------------------|
| Lorem Ipsum...           | greek philosophy |
| Live long and prosper... | sci-fi           |
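
A minimal sketch of producing such a file is shown below. It builds a dataframe with the two required columns from the example rows above and pickles it. The file name `my_corpus.pkl` and the temporary output directory are assumptions for illustration; the notebooks may expect a different location.

```python
import os
import tempfile

import pandas as pd

# One document per row: "text" holds the document, "category" its label.
df = pd.DataFrame(
    {
        "text": ["Lorem Ipsum...", "Live long and prosper..."],
        "category": ["greek philosophy", "sci-fi"],
    }
)

# Check that the required columns are present before writing.
missing = {"text", "category"} - set(df.columns)
if missing:
    raise ValueError("missing required columns: {}".format(sorted(missing)))

# Hypothetical output path; adjust it to where the notebooks read their input.
path = os.path.join(tempfile.mkdtemp(), "my_corpus.pkl")
df.to_pickle(path)

# Reading the file back confirms that the pickle round-trips.
restored = pd.read_pickle(path)
print(restored.columns.tolist(), len(restored))
```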
