Copyright (c) 2021 Yanjun Gao
This is the GitHub repository for the ACL 2021 paper: ABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences. Please cite our paper if you use ABCD (BibTeX at the end). You can also find the slides from my oral presentation.
ABCD is a linguistically motivated sentence editor that decomposes a complex sentence into N simple sentences, where N is the number of predicates in the complex sentence. It first constructs a sentence graph from dependency parsing information, then edits the graph into subgraphs with a neural classifier that predicts one of four graph operations: A (accept), B (break), C (copy) and D (drop). Depending on your application and data, ABCD can be trained flexibly to keep (or drop) connectives, or to perform simplification. See the paper for more details.
We provide an ABCD model trained on MinWiki (Wikipedia text). You can test this pre-trained model on your own data. If you are interested in training your own ABCD model, we provide scripts for doing so. The main difference between testing and training is the distant supervision labels generated by the preprocessor. At training time, we use distant supervision labels to train the neural classifier to predict the four edit types. At testing time, we no longer need the distant supervision signals.
Dependencies
Prerequisites
Testing
Training
Corpus
Python 3.6, PyTorch 1.6.0, numpy, nltk (word_tokenize and sent_tokenize), networkx, dgl, pickle, torchtext.
Install Stanford CoreNLP (version 4.2.0, Link) and Generate Dependency Parses with CoreNLP
ABCD relies on an external package to generate dependency parses; here we use Stanford CoreNLP. The output is a ".out" file, e.g., if the input file is "test.complex", the output file is "test.complex.out". The input file should have one sentence per line. If you see special characters in your text, please remove them; otherwise they may cause token errors in preprocessing.
There are many ways to run Stanford CoreNLP. E.g., the following command generates the .out file with enhanced dependency parses:
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file $filename
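If you only need the dependency parses, a leaner variant (untested here, but using standard CoreNLP flags: -annotators, -outputFormat, -file) restricts the pipeline to the annotators required for dependency parsing; the text output format writes the .out file the preprocessor expects:

java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,depparse -outputFormat text -file $filename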
ABCD uses pre-trained GloVe embeddings (100d) to initialize the word representation. To download the embedding, go to GloVe website and download glove.6B.zip.
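As a quick sanity check that the embeddings are in place, they can be loaded through torchtext (a minimal sketch; the cache argument is an assumption and should point to the directory where you unpacked glove.6B.zip):

from torchtext.vocab import GloVe

# point cache at the directory containing glove.6B.100d.txt
glove = GloVe(name="6B", dim=100, cache="path/to/glove")
print(glove["sentence"].shape)  # torch.Size([100])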
The default is the ABCD model trained on MinWiki, with an MLP classifier (Link) or with a Bilinear classifier (Link).
python process_test.py
The output is a pickle file with sentence ids as keys and preprocessed graphs as values. You can change the filename and data directory in lines 23-24 (variables batch_id and data_path). By default, the data loader reads from data/test.pkl.
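To sanity-check the preprocessing step, you can load the pickle and inspect a few entries (a minimal sketch, assuming only the key/value layout described above):

import pickle

with open("data/test.pkl", "rb") as f:  # default path used by the data loader
    graphs = pickle.load(f)

print(len(graphs), "sentences preprocessed")
for sent_id in list(graphs)[:3]:
    print(sent_id, type(graphs[sent_id]))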
python test.py
Remember to modify root_dir (the code directory), data_filename (the input .pkl filename after preprocessing) and glove_dir (where you store glove.6B.100d.txt). Also modify pretrained_path to specify the folder of pretrained models, and classifer to match the type of classifier the pre-trained model uses. The output of this script is a pickle file storing an output dictionary where the keys are sentence indices and the values are the predicted strings and action predictions. Another argument, output_str_to_file, can be set to True to additionally generate a clean output txt file.
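The output dictionary can be inspected the same way (a sketch; predictions.pkl is a placeholder for whatever filename test.py writes, and we assume only the key/value layout described above):

import pickle

with open("predictions.pkl", "rb") as f:  # placeholder filename (assumption)
    output = pickle.load(f)

for sent_id, value in output.items():
    # value holds the predicted simple sentences and the A/B/C/D action predictions
    print(sent_id, value)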
We provide scripts to help you train your own ABCD model. You need to run your data through Stanford CoreNLP first.
python process_data.py # different preprocessor than test time
Recall that we rely on distant supervision labels to train the network, which are generated at this step.
As we mention in the paper, the distribution across A, B, C, D can differ greatly depending on the dataset and its linguistic phenomena. We encourage you to compute inverse-frequency weights from the gold labels created in Step 1, and replace the weights in main.py (under config) for your data. An example sketch follows.
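Inverse-frequency weights can be computed along these lines (a sketch; gold_labels is a hypothetical flat list of "A"/"B"/"C"/"D" labels collected from the Step 1 output):

from collections import Counter

# gold_labels: flat list of distant supervision labels from Step 1 (hypothetical)
gold_labels = ["A", "A", "B", "A", "D", "C", "A"]
counts = Counter(gold_labels)
total = sum(counts.values())
weights = [total / counts[c] for c in "ABCD"]
# rescale so the average weight is 1, then paste the values into the config in main.py
mean = sum(weights) / len(weights)
weights = [round(w / mean, 4) for w in weights]
print(weights)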
python main.py
Remember to change root_dir and glove_dir. The parameters of the encoder, graph attention module and classifier will be stored separately in three checkpoints.
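Since the parameters are split across three checkpoints, restoring a trained model means loading all three state dicts. The pattern looks like this (a generic sketch with stand-in modules and hypothetical filenames; match them to whatever main.py actually writes):

import torch
import torch.nn as nn

# stand-in modules; in ABCD these are the encoder, graph attention module and classifier
encoder, graph_att, classifier = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)
checkpoints = [(encoder, "encoder.pt"), (graph_att, "graph_att.pt"), (classifier, "classifier.pt")]

# save each module's parameters to its own checkpoint (hypothetical filenames)
for module, path in checkpoints:
    torch.save(module.state_dict(), path)

# restoring a trained model means loading all three state dicts
for module, path in checkpoints:
    module.load_state_dict(torch.load(path))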
@inproceedings{gao-etal-2021-abcd,
title = "{ABCD}: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences",
author = "Gao, Yanjun and
Huang, Ting-Hao and
Passonneau, Rebecca J.",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.303",
doi = "10.18653/v1/2021.acl-long.303",
pages = "3919--3931",
}
Our ACL paper mentions two corpora that we train and evaluate ABCD on: MinWiki and DeSSE. If you are interested in training/evaluating your model on these two corpora, refer to this GitHub repository for more details.