Assuming that you're working on AMR 2.0 (LDC2017T10), unzip the corpus to data/AMR/LDC2017T10 and make sure it has the following structure:
data/AMR/LDC2017T10
├── data
│ ├── alignments
│ ├── amrs
│ └── frames
├── docs
│ ├── AMR-alignment-format.txt
│ ├── amr-guidelines-v1.2.pdf
│ ├── file.tbl
│ ├── frameset.dtd
│ ├── PropBank-unification-notes.txt
│ └── README.txt
└── index.html
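To confirm the layout before running the scripts, a small check along these lines may help. This is only a sketch: `check_layout` is a hypothetical helper, not part of this repository, and it only tests that the top-level entries from the tree above exist.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): report entries of the
# expected LDC2017T10 layout that are missing under the given root directory.
check_layout() {
    root="${1:-data/AMR/LDC2017T10}"
    missing=0
    for entry in data/alignments data/amrs data/frames docs index.html; do
        if [ ! -e "$root/$entry" ]; then
            echo "missing: $root/$entry"
            missing=1
        fi
    done
    # Exit status 0 if everything was found, 1 otherwise.
    return "$missing"
}
```

For example, `check_layout data/AMR/LDC2017T10 && echo "layout looks fine"` prints the confirmation only when every listed entry is present.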
- Download artifacts:
./scripts/download_artifacts.sh
- Prepare training/dev/test data:
./scripts/prepare_data.sh -v 2 -p data/AMR/LDC2017T10
- We use Stanford CoreNLP (version 3.9.2) for tokenization. First, start a CoreNLP server:
sh run_standford_corenlp_server.sh
Then, annotate AMR sentences:
./scripts/annotate_features.sh data/AMR/amr_2.0
- Preprocess the data:
./scripts/preprocess_2.0.sh
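The steps above can also be chained so that a failure stops the run early. The following is a minimal sketch, not part of this repository: `run_step`, `run_pipeline`, and the `DRY_RUN` variable are hypothetical names, and it assumes you run it from the repository root with the CoreNLP server already started in a separate terminal.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the repository).
# Set DRY_RUN=1 to print each command instead of executing it.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run the data steps in the order listed above; "&&" stops at the first failure.
run_pipeline() {
    run_step ./scripts/download_artifacts.sh &&
    run_step ./scripts/prepare_data.sh -v 2 -p data/AMR/LDC2017T10 &&
    run_step ./scripts/annotate_features.sh data/AMR/amr_2.0 &&
    run_step ./scripts/preprocess_2.0.sh
}
```

Setting `DRY_RUN=1` before calling `run_pipeline` is a quick way to review the commands before committing to the full run.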
(Acknowledgements) A large part of the AMR preprocessing code is adapted from sheng-z/stog.