Constituency Parsing

The command for training a CRF constituency parser is simple. We follow the instructions of Benepar to preprocess the data.

To train a BiLSTM-based model:

$ python -u -m supar.cmds.const.crf train -b -d 0 -c con-crf-en -p model -f char  \
    --train ptb/train.pid  \
    --dev ptb/dev.pid  \
    --test ptb/test.pid  \
    --embed glove-6b-100  \
    --mbr

To finetune roberta-large:

$ python -u -m supar.cmds.const.crf train -b -d 0 -c con-crf-roberta-en -p model  \
    --train ptb/train.pid  \
    --dev ptb/dev.pid  \
    --test ptb/test.pid  \
    --encoder=bert  \
    --bert=roberta-large  \
    --lr=5e-5  \
    --lr-rate=20  \
    --batch-size=5000  \
    --epochs=10  \
    --update-steps=4

The command for finetuning xlm-roberta-large on the merged treebanks of 9 languages in the SPMRL dataset is:

$ python -u -m supar.cmds.const.crf train -b -d 0 -c con-crf-roberta-en -p model  \
    --train spmrl/train.pid  \
    --dev spmrl/dev.pid  \
    --test spmrl/test.pid  \
    --encoder=bert  \
    --bert=xlm-roberta-large  \
    --lr=5e-5  \
    --lr-rate=20  \
    --batch-size=5000  \
    --epochs=10  \
    --update-steps=4

Unlike the conventional approach of running EVALB as an external program, we integrate Python code for constituency tree evaluation internally. Because different treebanks do not share the same evaluation parameters (the labels to delete and the labels to treat as equal), we recommend evaluating results in interactive mode.

To evaluate English and Chinese models:

>>> from supar import Parser
>>> Parser.load('con-crf-en').evaluate('ptb/test.pid',
                                       delete={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''},
                                       equal={'ADVP': 'PRT'},
                                       verbose=False)
(0.21318972731630007, UCM: 50.08% LCM: 47.56% UP: 94.89% UR: 94.71% UF: 94.80% LP: 94.16% LR: 93.98% LF: 94.07%)
>>> Parser.load('con-crf-zh').evaluate('ctb7/test.pid',
                                       delete={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''},
                                       equal={'ADVP': 'PRT'},
                                       verbose=False)
(0.3994724107416053, UCM: 24.96% LCM: 23.39% UP: 90.88% UR: 90.47% UF: 90.68% LP: 88.82% LR: 88.42% LF: 88.62%)

To evaluate the multilingual model:

>>> Parser.load('con-crf-xlmr').evaluate('spmrl/eu/test.pid',
                                         delete={'TOP', 'ROOT', 'S1', '-NONE-', 'VROOT'},
                                         equal={},
                                         verbose=False)
(0.45620645582675934, UCM: 53.07% LCM: 48.10% UP: 94.74% UR: 95.53% UF: 95.14% LP: 93.29% LR: 94.07% LF: 93.68%)
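
To make the role of the `delete` and `equal` arguments more concrete, here is a minimal sketch of EVALB-style bracket scoring. It is not SuPar's implementation: the names `normalize`, `prf`, and `bracket_scores` are hypothetical, and the sketch assumes that constituents have already been extracted as `(start, end, label)` spans. A full evaluation would typically also drop deleted preterminals (such as punctuation) from the token sequence before computing span boundaries, which this sketch does not attempt.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label); hypothetical span format


def normalize(spans: Iterable[Span], delete: Set[str], equal: Dict[str, str]) -> Counter:
    """Drop spans whose label is in `delete`, then map labels through `equal`."""
    kept = Counter()
    for start, end, label in spans:
        if label in delete:
            continue
        kept[(start, end, equal.get(label, label))] += 1
    return kept


def prf(pred: Counter, gold: Counter) -> Tuple[float, float, float]:
    """Precision, recall and F1 over two multisets of brackets."""
    matched = sum((pred & gold).values())  # multiset intersection
    n_pred, n_gold = sum(pred.values()), sum(gold.values())
    p = matched / n_pred if n_pred else 0.0
    r = matched / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def bracket_scores(pred, gold, delete, equal):
    """Return unlabeled and labeled (P, R, F1) for one sentence."""
    pred_l, gold_l = normalize(pred, delete, equal), normalize(gold, delete, equal)
    # Unlabeled scores ignore the label and compare span boundaries only.
    pred_u = Counter({})
    gold_u = Counter({})
    for (start, end, _), n in pred_l.items():
        pred_u[(start, end)] += n
    for (start, end, _), n in gold_l.items():
        gold_u[(start, end)] += n
    return {"unlabeled": prf(pred_u, gold_u), "labeled": prf(pred_l, gold_l)}


# Toy example: the prediction differs from gold only in ADVP vs. PRT.
gold = [(0, 5, "TOP"), (0, 5, "S"), (0, 2, "NP"), (2, 5, "VP"), (3, 5, "ADVP")]
pred = [(0, 5, "TOP"), (0, 5, "S"), (0, 2, "NP"), (2, 5, "VP"), (3, 5, "PRT")]

with_equal = bracket_scores(pred, gold, delete={"TOP"}, equal={"ADVP": "PRT"})
without_equal = bracket_scores(pred, gold, delete={"TOP"}, equal={})
print(with_equal["labeled"], without_equal["labeled"])
```

In this toy case, treating `ADVP` and `PRT` as equal turns the one mismatched bracket into a match, so the labeled scores rise while the unlabeled scores stay the same. This is why the evaluation parameters need to be chosen per treebank.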