As you may know, Facebook has quietly dropped future support for fastText, so it is important to install the correct versions of all the dependencies. To make sure the project runs correctly, we strongly recommend creating a new virtual environment and installing the dependencies from the `requirements.txt` file:
conda create -n <your_env_name> python=3.9
conda activate <your_env_name>
pip install -r requirements.txt
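To double-check that the fastText Python bindings installed correctly, a quick smoke test along the following lines should run without errors. This is not part of the project; the file name and toy labels are arbitrary.

```python
# Minimal fastText installation check (file name and labels are placeholders).
import fasttext

# fastText's supervised format: each line starts with one or more __label__ tags.
with open("smoke_test.txt", "w", encoding="utf-8") as f:
    f.write("__label__a this is a tiny example\n")
    f.write("__label__b another tiny example\n")

model = fasttext.train_supervised(input="smoke_test.txt", epoch=1)
print(model.predict("a tiny example"))  # e.g. (('__label__a',), array([...]))
```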
The full dataset is available at this link; it contains a total of 142 authors.
After unzipping the file, you will find two files:
- `label2ind.json`: a JSON file that maps each author name to an index.
- `train.txt`: a text file that contains the training data (approx. 1.2 GB).

Please put these two files in the `data` folder.
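For reference, `label2ind.json` can be inspected with the standard `json` module. The key/value layout shown in the comment is an assumption (author name mapped to an integer index) based on the description above.

```python
# Sketch: load the label map shipped with the dataset.
import json

with open("data/label2ind.json", encoding="utf-8") as f:
    label2ind = json.load(f)  # assumed layout: {"Charles Dickens": 0, ...}

# Reverse map, handy for turning predicted indices back into author names.
ind2label = {v: k for k, v in label2ind.items()}
print(len(label2ind), "authors")
```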
Upon evaluation, we found that the full dataset is too large for the current implementation and makes the model difficult to train. Therefore, we decided to use a subset of the full dataset.
A partial dataset is available at this link; it contains 10 manually selected authors:
- Charles Dickens,
- Agatha Christie,
- Jane Austen,
- Mark Twain,
- O Henry,
- Oscar Wilde,
- P G Wodehouse,
- Walt Whitman,
- Winston Churchill, and
- Zane Grey.
If you wish to generate the dataset from scratch, here are the steps:
- Download the raw data from a preprocessed Gutenberg dataset.
- Unzip the file and put the `Gutenberg` folder under the `data` folder.
- In the `data` folder, run the following command to generate the dataset:
python preprocess.py --num_authors <number_of_authors_you_want>
This will generate a dataset with the specified number of randomly chosen authors. The generated dataset will be saved in the `data` folder.
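The random selection presumably boils down to something like the sketch below; this is hypothetical and does not mirror the actual code in `preprocess.py`.

```python
# Hypothetical sketch of random author selection (not the actual preprocess.py logic).
import random

def pick_authors(all_authors, num_authors, seed=42):
    """Randomly choose `num_authors` authors; a fixed seed keeps the subset reproducible."""
    return random.Random(seed).sample(all_authors, num_authors)
```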
If you wish to generate the dataset with specific authors, modify the `manually_selected_authors` variable in the `preprocess.py` file and run:
python preprocess.py --enable_author_selection
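For illustration, `manually_selected_authors` would hold the names of the authors you want to keep, e.g. the ten authors of the partial dataset above; check `preprocess.py` for the exact format it expects (the spelling of the names here is an assumption).

```python
# Example value for manually_selected_authors in preprocess.py (exact format assumed).
manually_selected_authors = [
    "Charles Dickens", "Agatha Christie", "Jane Austen", "Mark Twain",
    "O Henry", "Oscar Wilde", "P G Wodehouse", "Walt Whitman",
    "Winston Churchill", "Zane Grey",
]
```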
If you wish to split the dataset into training and testing sets, first set the `ORIGINAL_DATASET_PATH` variable in the `splitdataset.py` file, then run:
python splitdataset.py --train_split <train_split_ratio>
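Conceptually, the split is a shuffle followed by a cut at the given ratio; the sketch below is hypothetical (the file names and variable handling in the real `splitdataset.py` may differ).

```python
# Hypothetical sketch of a line-level train/test split (not the actual splitdataset.py).
import random

ORIGINAL_DATASET_PATH = "data/train.txt"  # assumed: point this at the generated dataset

def split_dataset(path, train_split, seed=42):
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)  # shuffle before splitting
    cut = int(len(lines) * train_split)
    with open("data/train_split.txt", "w", encoding="utf-8") as f:
        f.writelines(lines[:cut])
    with open("data/test_split.txt", "w", encoding="utf-8") as f:
        f.writelines(lines[cut:])

split_dataset(ORIGINAL_DATASET_PATH, train_split=0.8)
```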
To train the model, run the following command in the `model` folder:
python train.py --train <path_to_the_training_data> --type <basic_or_autotune> --test <path_to_the_testing_data> --val <path_to_the_validation_data> --model <path_to_save_the_model> --label2ind <path_to_the_label2ind_file>
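The two `--type` options most likely correspond to fastText's plain supervised training and its built-in hyperparameter autotuning; the sketch below shows the idea, but the exact arguments used by `train.py` may differ.

```python
# Hedged sketch of the two training modes (check train.py for the real arguments).
import fasttext

def train_basic(train_path, model_path):
    model = fasttext.train_supervised(input=train_path)  # default hyperparameters
    model.save_model(model_path)
    return model

def train_autotune(train_path, val_path, model_path, duration=600):
    # fastText searches hyperparameters against the validation file for `duration` seconds.
    model = fasttext.train_supervised(
        input=train_path,
        autotuneValidationFile=val_path,
        autotuneDuration=duration,
    )
    model.save_model(model_path)
    return model
```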
We have provided a pretrained model at this link. You can download it and put it in the `model` folder.
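Once a model file is in place, it can be loaded and queried directly through the fastText Python API; the path below is a placeholder for whatever file you trained or downloaded.

```python
# Load a trained fastText model and classify a text snippet.
import fasttext

model = fasttext.load_model("model/some_model_file")  # placeholder path
labels, probs = model.predict("It was the best of times, it was the worst of times.", k=3)
print(list(zip(labels, probs)))  # top-3 predicted labels with their probabilities
```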
The user interface is implemented using Tornado, a Python web framework. To start it, run the following command from the root directory:
python index.py --model <path_to_the_model> --label2ind <path_to_the_label2ind_file>
Then, you can access the user interface by visiting http://localhost:9263 in your web browser.
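For orientation, here is a heavily simplified sketch of what `index.py` roughly does; the handler names, form field, and argument parsing are assumptions, not the project's actual code.

```python
# Hypothetical, stripped-down Tornado app (not the project's index.py).
import argparse

import fasttext
import tornado.ioloop
import tornado.web


class IndexHandler(tornado.web.RequestHandler):
    def get(self):
        self.render("index.html")  # the form page in templates/


class SubmitHandler(tornado.web.RequestHandler):
    def post(self):
        text = self.get_body_argument("text")  # assumed form field name
        labels, probs = self.application.model.predict(text)
        self.render("submit.html", label=labels[0], prob=float(probs[0]))


def make_app(model_path):
    app = tornado.web.Application(
        [(r"/", IndexHandler), (r"/submit", SubmitHandler)],
        template_path="templates",
    )
    app.model = fasttext.load_model(model_path)  # share the model with the handlers
    return app


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    args = parser.parse_args()
    make_app(args.model).listen(9263)  # port matches the URL above
    tornado.ioloop.IOLoop.current().start()
```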
Important
To run the server successfully, your folder structure should at least look like this:
.
├── data
│ └── label2ind.json
├── model
│ ├── __init__.py
│ ├── expl.py
│ ├── train.py
│ └── some_model_file
├── templates
│ ├── index.html
│ └── submit.html
└── index.py
This project is part of the coursework for ECS 171 Machine Learning (SS2 2024) at the University of California, Davis. The project was done by the following members (in alphabetical order):
We would like to thank the following open-source contributors for their work.
- Some portions of the code in this project are adapted from Ankie Fan's project testurtext-algo.
- The dataset used in this project is based on the Gutenberg dataset provided by Lahiri et al. (2014).
- The implementation of the model is based on the fastText library provided by Facebook Research.
- The user interface is implemented using the Tornado web framework.