Understanding the Influence of Hyperparameters on Text Embeddings for Natural Language Processing Tasks

This is the accompanying material for the paper "Understanding the Influence of Hyperparameters on Text Embeddings for Natural Language Processing Tasks" by Nils Witt and Christin Seifert. The paper explores how the various hyperparameters of a doc2vec-based kNN classifier influence the accuracy of the model. For further details, please refer to the paper.

Repository structure

Two notebooks transform a dataset into a common format: 1.0_Amazon_corpus_to_pandas.ipynb and 1.1_20Newsgroups_to_pandas.ipynb.
2_Hyperparameter_search.ipynb carries out grid search as well as Bayesian optimization.
3_Analysis.ipynb visualizes the results of the previous step in several ways.
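
To illustrate what a grid search over hyperparameters involves, here is a minimal sketch using only the Python standard library. It is not the code from the notebook: the parameter names (vector_size, window, k), their values, and the helper functions iter_grid and grid_search are assumptions chosen for illustration, and the evaluate callback stands in for training a doc2vec model and scoring a kNN classifier. Bayesian optimization is not covered here.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only
# and are not taken from the paper or the notebooks.
param_grid = {
    "vector_size": [50, 100],
    "window": [5, 10],
    "k": [1, 5],
}


def iter_grid(grid):
    """Yield one dict per combination of the values in `grid`."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, evaluate):
    """Evaluate every combination and return (best_params, best_score).

    `evaluate` is a caller-supplied function mapping a parameter dict
    to a score where higher is better (e.g. classification accuracy).
    """
    best_params, best_score = None, float("-inf")
    for params in iter_grid(grid):
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The number of combinations is the product of the list lengths (2 x 2 x 2 = 8 in this sketch), which is why the size of the grid has a direct effect on runtime and memory demand.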

Hardware requirements

This software is very demanding in terms of hardware. The actual demand depends on parameters of the grid search (the size of the training set and the number of processes executed in parallel), but in general a large machine (>128 GB RAM, >12 CPU cores) is required.

Dependencies

The project is based on Python 3.5 and Jupyter notebooks. We recommend using the Anaconda distribution to install the environment. Furthermore, we use the Python machine learning stack, including NumPy, scikit-learn, Matplotlib, Pandas and Gensim.

Adapting new corpora

Incorporating new corpora into this framework is fairly easy: to satisfy the input format, a Pandas dataframe must be pickled to disk. The dataframe must contain a column text, which holds the text of one document per row, and a column category, which specifies the category the document belongs to.

text	category
Lorem Ipsum...	greek philosophy
Live long and prosper...	sci-fi
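
A minimal sketch of producing such a file with Pandas follows. The two rows mirror the example above; the file name my_corpus.pkl and the temporary directory are assumptions made for illustration, since the location the notebooks expect is not stated here.

```python
import os
import tempfile

import pandas as pd

# One document per row: the raw text and the category it belongs to.
df = pd.DataFrame({
    "text": ["Lorem Ipsum...", "Live long and prosper..."],
    "category": ["greek philosophy", "sci-fi"],
})

# Hypothetical output path; adjust it to wherever the notebooks read from.
path = os.path.join(tempfile.mkdtemp(), "my_corpus.pkl")
df.to_pickle(path)

# Sanity check: reload the pickle and confirm the required columns exist.
loaded = pd.read_pickle(path)
missing = {"text", "category"} - set(loaded.columns)
if missing:
    raise ValueError("missing required columns: %s" % sorted(missing))
```

Reloading the file right after writing it is a cheap way to catch a misnamed column before starting an expensive hyperparameter search.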
