This repository provides the experimental code for the CoNLL 2023 paper Tree-shape Uncertainty for Analyzing the Inherent Branching Bias of Unsupervised Parsing Models.
For example, the conda environment can be created as follows:
conda create -n env python=3.10
conda activate env
conda install pytorch==1.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
conda install numpy
conda install matplotlib
conda install pyyaml
conda install tqdm
conda install nltk
conda install h5py
Detailed version information is available in requirements.txt.
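After creating the environment, it can be useful to confirm which versions were actually installed before comparing them against requirements.txt. The following is a minimal sketch, not part of this repository: the helper name `installed_versions` is hypothetical, and the list assumes that PyTorch is registered under the distribution name `torch`.

```python
from importlib import metadata

# Names taken from the conda commands above. "torch" is assumed to be the
# distribution name under which PyTorch is registered.
PACKAGES = ["torch", "numpy", "matplotlib", "pyyaml", "tqdm", "nltk", "h5py"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it inside the activated environment and compare the printed versions with requirements.txt by hand.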
- English Penn Treebank
- Note that obtaining PTB requires an LDC account
- Japanese Keyaki Treebank
- Preprocess the installed treebanks using UnsupConstParseEval
- Generate random automorphisms:
src/gen_automorphism.py
- Generate texts that do not contain potential branching bias using the automorphisms:
src/gen_sym_corpus.py
- Generate preprocessed files for URNNG:
src/preprocess_for_urnng.py
The whole preprocessing pipeline can be run by executing script/preprocess.sh.
This script generates 10 datasets.
For each dataset generated in the previous step:
- Train models with the train and dev split:
src/train_many.py
- Run the trained models and obtain predicted parses for the train, dev and test splits:
src/train_many.py
Detailed setups of the training and evaluation parts are described in script/train.sh and script/parse.sh, respectively.
Note that training all models takes a very long time (especially URNNG). Appropriate parallelization is required (by default, script/train.sh assumes 8 GPUs are available).
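If fewer or more GPUs are available, the training jobs need to be redistributed. The following is a minimal sketch of one way to do this, not the logic of script/train.sh: it spreads jobs round-robin over GPUs and pins a process to one device through the `CUDA_VISIBLE_DEVICES` environment variable. The function names and the job labels are hypothetical.

```python
import os


def assign_jobs_to_gpus(jobs, num_gpus=8):
    """Distribute jobs round-robin over GPUs; returns {gpu_id: [job, ...]}."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    assignment = {gpu: [] for gpu in range(num_gpus)}
    for index, job in enumerate(jobs):
        assignment[index % num_gpus].append(job)
    return assignment


def env_for_gpu(gpu_id):
    """Copy the current environment and restrict CUDA to a single device."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


if __name__ == "__main__":
    # Hypothetical job labels: three models times ten datasets.
    jobs = [
        f"{model}-dataset{i}"
        for model in ("diora", "prpn", "urnng")
        for i in range(10)
    ]
    for gpu, assigned in assign_jobs_to_gpus(jobs, num_gpus=8).items():
        print(gpu, assigned)
```

The dictionary returned by `env_for_gpu` can be passed as the `env` argument of `subprocess.run` when launching each training command, so that every process sees only its assigned GPU.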
- Calculate basic statistics of the generated datasets:
src/dataset_stats.py
- Plot histograms of branching directions of gold trees:
src/analyze_treebank.py
- Plot the averaged branching directions of the model predictions:
src/analyze_models.py
- Plot histograms of branching directions of model predictions:
src/analyze_model_hist.py
Detailed setups of these analyses are provided in script/dataset_stats.sh, script/analyze_treebank.sh, script/analyze_models.sh, and script/analyze_model_hist.sh.
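To give an intuition for what a "branching direction" of a tree can mean, here is a toy sketch. It is an illustrative assumption, not necessarily the measure implemented in the analysis scripts or defined in the paper: binary trees are written as nested 2-tuples with string leaves, a node counts as right-branching when only its right child is a subtree, and as left-branching when only its left child is. The function names are hypothetical.

```python
def count_directions(tree):
    """Return (left, right) counts of one-sided branching nodes in a binary tree.

    A tree is either a string (leaf) or a 2-tuple of subtrees.
    """
    if isinstance(tree, str):
        return 0, 0
    left_child, right_child = tree
    # A node is left-branching if only its left child is a subtree,
    # and right-branching if only its right child is a subtree.
    left = int(isinstance(left_child, tuple) and isinstance(right_child, str))
    right = int(isinstance(right_child, tuple) and isinstance(left_child, str))
    for child in (left_child, right_child):
        child_left, child_right = count_directions(child)
        left += child_left
        right += child_right
    return left, right


def branching_score(tree):
    """Score in [-1, 1]: -1 is fully left-branching, +1 is fully right-branching."""
    left, right = count_directions(tree)
    total = left + right
    return 0.0 if total == 0 else (right - left) / total


if __name__ == "__main__":
    print(branching_score(("a", ("b", ("c", "d")))))   # right-branching chain
    print(branching_score(((("a", "b"), "c"), "d")))   # left-branching chain
    print(branching_score((("a", "b"), ("c", "d"))))   # balanced tree
```

Collecting such a score over all trees of a treebank or of a model's predictions gives the kind of distribution that a histogram can visualize.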
Our data preprocessing procedure is mostly based on the scripts of UnsupConstParseEval.
For the models, we slightly modified the code distributed by the original authors to unify the training environment. This repository includes the modified code:
- DIORA:
src/diora
- Originally distributed with Apache License 2.0
- PRPN:
src/prpn
- Originally distributed with MIT License
- URNNG:
src/urnng
- Originally distributed with MIT License
This code is distributed under the MIT License.