The command for training a CRF constituency parser is simple. We follow the instructions of Benepar to preprocess the data.
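The processed .pid files are expected to contain one PTB-style bracketed tree per line. The line below is only an illustration; the exact labels depend on the treebank:

(TOP (S (NP (PRP She)) (VP (VBZ enjoys) (S (VP (VBG playing) (NP (NN tennis))))) (. .)))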
To train a BiLSTM-based model (here with character-level features enabled by -f char and MBR decoding enabled by --mbr):
$ python -u -m supar.cmds.const.crf train -b -d 0 -c con-crf-en -p model -f char --mbr \
    --train ptb/train.pid \
    --dev ptb/dev.pid \
    --test ptb/test.pid \
    --embed glove-6b-100
To finetune roberta-large:
$ python -u -m supar.cmds.const.crf train -b -d 0 -c con-crf-roberta-en -p model \
    --train ptb/train.pid \
    --dev ptb/dev.pid \
    --test ptb/test.pid \
    --encoder=bert \
    --bert=roberta-large \
    --lr=5e-5 \
    --lr-rate=20 \
    --batch-size=5000 \
    --epochs=10 \
    --update-steps=4
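Note that --batch-size in SuPar counts tokens rather than sentences; together with --update-steps=4, gradients are accumulated over four batches, so each optimizer update covers roughly 5000 × 4 = 20000 tokens.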
The command for finetuning xlm-roberta-large on the merged treebanks of nine languages in the SPMRL dataset is:
$ python -u -m supar.cmds.const.crf train -b -d 0 -c con-crf-xlmr -p model \
    --train spmrl/train.pid \
    --dev spmrl/dev.pid \
    --test spmrl/test.pid \
    --encoder=bert \
    --bert=xlm-roberta-large \
    --lr=5e-5 \
    --lr-rate=20 \
    --batch-size=5000 \
    --epochs=10 \
    --update-steps=4
Unlike the conventional evaluation manner of executing EVALB, we internally integrate Python code for constituency tree evaluation. The delete and equal arguments below play the same roles as EVALB's DELETE_LABEL and EQ_LABEL parameters. As different treebanks do not share the same evaluation parameters, it is recommended to evaluate the results in interactive mode.
To evaluate English and Chinese models:
>>> Parser.load('con-crf-en').evaluate('ptb/test.pid',
                                       delete={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''},
                                       equal={'ADVP': 'PRT'},
                                       verbose=False)
(0.21318972731630007, UCM: 50.08% LCM: 47.56% UP: 94.89% UR: 94.71% UF: 94.80% LP: 94.16% LR: 93.98% LF: 94.07%)
>>> Parser.load('con-crf-zh').evaluate('ctb7/test.pid',
                                       delete={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''},
                                       equal={'ADVP': 'PRT'},
                                       verbose=False)
(0.3994724107416053, UCM: 24.96% LCM: 23.39% UP: 90.88% UR: 90.47% UF: 90.68% LP: 88.82% LR: 88.42% LF: 88.62%)
To evaluate the multilingual model:
>>> Parser.load('con-crf-xlmr').evaluate('spmrl/eu/test.pid',
                                         delete={'TOP', 'ROOT', 'S1', '-NONE-', 'VROOT'},
                                         equal={},
                                         verbose=False)
(0.45620645582675934, UCM: 53.07% LCM: 48.10% UP: 94.74% UR: 95.53% UF: 95.14% LP: 93.29% LR: 94.07% LF: 93.68%)
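Trained or downloaded models can also parse new sentences through the same Python API. Below is a minimal sketch assuming the pretrained con-crf-en model from above; the input is pre-tokenized and the printed tree is only illustrative:

>>> from supar import Parser
>>> parser = Parser.load('con-crf-en')
>>> dataset = parser.predict(['She', 'enjoys', 'playing', 'tennis', '.'], verbose=False)
>>> print(dataset[0])
(TOP (S (NP (_ She)) (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis))))) (_ .)))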