PDTB parser based on:
Ziheng Lin, Hwee Tou Ng and Min-Yen Kan (2014). A PDTB-Styled End-to-End Discourse Parser. Natural Language Engineering, 20, pp 151-184. Cambridge University Press.
Developer: Ilija Ilievski
Version: 2.0.2
Last update: 7-Nov-2015
Requires Java 1.7+. Tested only on Mac and Linux OS.
- Download the parser from here.
- Extract the file with:
tar -xzvf pdtb-parser.tar.gz
- From the extracted pdtb-parser folder run:
java -jar parser.jar examples/wsj_2300.txt
Replace the argument examples/wsj_2300.txt with the file, or the folder containing the text files, that you want to parse. The resulting pipe and auxiliary files are written to a folder named output inside each folder containing text files. Note that when the argument is a folder, the parser searches for files ending in .txt in that folder and all of its subfolders.
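As a rough illustration of the folder behaviour described above, the following Python sketch lists the .txt files under a folder together with the output folder where their results would be expected. It only mirrors the description in this README and is not the parser's own code; the function name expected_outputs is made up for this example.

```python
from pathlib import Path


def expected_outputs(root):
    """Map each .txt file under root to the output folder described above.

    Assumption: results land in a folder named "output" next to the text
    files, as this README states. The parser itself is not consulted.
    """
    root = Path(root)
    mapping = {}
    # rglob walks the folder and all of its subfolders.
    for txt in sorted(root.rglob("*.txt")):
        if txt.is_file():
            mapping[txt] = txt.parent / "output"
    return mapping


if __name__ == "__main__":
    for txt, out in expected_outputs("examples").items():
        print(f"{txt} -> {out}")
```

Running it before the parser gives a quick preview of which files will be picked up when a folder is passed as the argument.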
If you want to use level 1 type of relations (for more info see this or read the PDTB 2.0 annotation manual), open config.properties and set SEMANTIC_LEVEL=1 and MODEL_PATH=models/level_1/.
Check config.properties for all the options.
- Download the BioDRB parser from here.
- Extract the file with:
tar -xzvf biodrb-parser.tar.gz
- In the extracted folder biodrb-parser unzip the BioDRB_corpus.zip file.
- Check in config.properties if the paths to the corpus are correct. BIO_DRB_RAW_PATH should point to GeniaRaw/Genia/ and BIO_DRB_ANN_PATH to GeniaAnn/Genia/. The BIO_DRB_TREE_PATH will be created by the parser.
- From the extracted biodrb-parser folder run:
java -jar bio-parser.jar [program_arguments]
Program arguments can be one of the following:
- --train-only: will build a BioDRB model using all 24 articles. The model files will be stored in MODEL_PATH.
- --cross-validation: will do 10-fold cross-validation. The model and test files will be stored in MODEL_PATH/CV_K, where K is the fold index.
- --score-pdtb pdtb_pipe_folder biodrb_pipe_folder: will score the PDTB parser on the BioDRB corpus. pdtb_pipe_folder should contain the pipes generated by the PDTB parser and biodrb_pipe_folder should contain the BioDRB gold standard pipes (the file names should end in .pipe). You can use the pre-generated pipe files with --score-pdtb pdtb_vs_biodrb pdtb_vs_biodrb/bio_drb_gold.
The mapping from PDTB relation sense types to BioDRB is done according to this table. Precompiled results for 10-fold cross-validation, compared with the PDTB results, can be found here.
Check config.properties for all the options.
To train and/or test the parser on different PDTB sections follow these steps:
- Clone with git clone https://github.com/WING-NUS/pdtb-parser.git or download it from here.
- Obtain the PTB and PDTB corpus files and move them to external/data/. The external/data/ directory should look like this.
- From the project root directory run the following:
java -jar runnable_jars/pdtb-tools/span-tree-extractor.jar
To generate auxiliary files (external/data/ should now look like this)
java -jar runnable_jars/pdtb-tools/train-parser.jar
To train the parser
java -jar runnable_jars/pdtb-tools/test-parser.jar
To test the parser (GS+EP option)
- Set the output folder, train and test sections in config.properties.
The parser uses the PDTB pipe-delimited format where every relation is represented on a single line and values are delimited by the pipe symbol. There must be 48 columns, but certain values may be blank.
The following lists the column values. For precise definitions of the terms used, please consult the PDTB 2.0 annotation manual.
Note that the column index is zero-based.
- Col 0: Relation type (Explicit/Implicit/AltLex/EntRel/NoRel)
- Col 1: Section number (0-24)
- Col 2: File number (0-99)
- Col 3: Connective/AltLex SpanList (only for Explicit and AltLex)
- Col 4: Connective/AltLex GornAddressList (only for Explicit and AltLex)
- Col 5: Connective/AltLex RawText (only for Explicit and AltLex)
- Col 6: String position (only for Implicit, EntRel and NoRel)
- Col 7: Sentence number (only for Implicit, EntRel and NoRel)
- Col 8: ConnHead (only for Explicit)
- Col 9: Conn1 (only for Implicit)
- Col 10: Conn2 (only for Implicit)
- Col 11: 1st Semantic Class corresponding to ConnHead, Conn1 or AltLex span (only for Explicit, Implicit and AltLex)
- Col 12: 2nd Semantic Class corresponding to ConnHead, Conn1 or AltLex span (only for Explicit, Implicit and AltLex)
- Col 13: 1st Semantic Class corresponding to Conn2 (only for Implicit)
- Col 14: 2nd Semantic Class corresponding to Conn2 (only for Implicit)
- Col 15: Relation-level attribution: Source (only for Explicit, Implicit and AltLex)
- Col 16: Relation-level attribution: Type (only for Explicit, Implicit and AltLex)
- Col 17: Relation-level attribution: Polarity (only for Explicit, Implicit and AltLex)
- Col 18: Relation-level attribution: Determinacy (only for Explicit, Implicit and AltLex)
- Col 19: Relation-level attribution: SpanList (only for Explicit, Implicit and AltLex)
- Col 20: Relation-level attribution: GornAddressList (only for Explicit, Implicit and AltLex)
- Col 21: Relation-level attribution: RawText (only for Explicit, Implicit and AltLex)
- Col 22: Arg1 SpanList
- Col 23: Arg1 GornAddress
- Col 24: Arg1 RawText
- Col 25: Arg1 attribution: Source (only for Explicit, Implicit and AltLex)
- Col 26: Arg1 attribution: Type (only for Explicit, Implicit and AltLex)
- Col 27: Arg1 attribution: Polarity (only for Explicit, Implicit and AltLex)
- Col 28: Arg1 attribution: Determinacy (only for Explicit, Implicit and AltLex)
- Col 29: Arg1 attribution: SpanList (only for Explicit, Implicit and AltLex)
- Col 30: Arg1 attribution: GornAddressList (only for Explicit, Implicit and AltLex)
- Col 31: Arg1 attribution: RawText (only for Explicit, Implicit and AltLex)
- Col 32: Arg2 SpanList
- Col 33: Arg2 GornAddress
- Col 34: Arg2 RawText
- Col 35: Arg2 attribution: Source (only for Explicit, Implicit and AltLex)
- Col 36: Arg2 attribution: Type (only for Explicit, Implicit and AltLex)
- Col 37: Arg2 attribution: Polarity (only for Explicit, Implicit and AltLex)
- Col 38: Arg2 attribution: Determinacy (only for Explicit, Implicit and AltLex)
- Col 39: Arg2 attribution: SpanList (only for Explicit, Implicit and AltLex)
- Col 40: Arg2 attribution: GornAddressList (only for Explicit, Implicit and AltLex)
- Col 41: Arg2 attribution: RawText (only for Explicit, Implicit and AltLex)
- Col 42: Sup1 SpanList (only for Explicit, Implicit and AltLex)
- Col 43: Sup1 GornAddress (only for Explicit, Implicit and AltLex)
- Col 44: Sup1 RawText (only for Explicit, Implicit and AltLex)
- Col 45: Sup2 SpanList (only for Explicit, Implicit and AltLex)
- Col 46: Sup2 GornAddress (only for Explicit, Implicit and AltLex)
- Col 47: Sup2 RawText (only for Explicit, Implicit and AltLex)
Example relation:
Explicit|18|70|262..265|1,0|But|||but|||Comparison.Contrast||||Wr|Comm|Null|Null||||9..258|0|From a helicopter a thousand feet above Oakland after the second-deadliest earthquake in U.S. history, a scene of devastation emerges: a freeway crumbled into a concrete sandwich, hoses pumping water into once-fashionable apartments, abandoned autos|Inh|Null|Null|Null||||266..354|1,1;1,2;1,3|this quake wasn't the big one, the replay of 1906 that has been feared for so many years|Inh|Null|Null|Null|||||||||
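The column layout above can be read with a few lines of Python. This is only a minimal sketch, not part of the parser: the function names and the dictionary keys are made up for this example, it assumes that multiple spans in a SpanList are separated by ';', and the example line is the relation shown above with the Arg1 and Arg2 raw text shortened for readability.

```python
PDTB_NUM_COLUMNS = 48


def parse_span_list(value):
    """Turn a span list such as "9..258" into [(9, 258)].

    Assumption: multiple spans are separated by ';'. Blank values give [].
    """
    spans = []
    for part in value.split(";"):
        if part:
            start, end = part.split("..")
            spans.append((int(start), int(end)))
    return spans


def parse_pdtb_relation(line):
    """Split one pipe-delimited line and pick out a few commonly used columns."""
    cols = line.rstrip("\n").split("|")
    if len(cols) != PDTB_NUM_COLUMNS:
        raise ValueError(f"expected {PDTB_NUM_COLUMNS} columns, got {len(cols)}")
    return {
        "type": cols[0],                               # Col 0
        "section": int(cols[1]),                       # Col 1
        "file": int(cols[2]),                          # Col 2
        "connective_spans": parse_span_list(cols[3]),  # Col 3
        "connective_text": cols[5],                    # Col 5
        "conn_head": cols[8],                          # Col 8
        "sense": cols[11],                             # Col 11
        "arg1_spans": parse_span_list(cols[22]),       # Col 22
        "arg1_text": cols[24],                         # Col 24
        "arg2_spans": parse_span_list(cols[32]),       # Col 32
        "arg2_text": cols[34],                         # Col 34
    }


# The example relation from above, with the raw argument text shortened.
example = (
    "Explicit|18|70|262..265|1,0|But|||but|||Comparison.Contrast||||"
    "Wr|Comm|Null|Null||||9..258|0|"
    "From a helicopter a thousand feet above Oakland|"
    "Inh|Null|Null|Null||||266..354|1,1;1,2;1,3|"
    "this quake wasn't the big one|"
    "Inh|Null|Null|Null|||||||||"
)

relation = parse_pdtb_relation(example)
print(relation["type"], relation["sense"], relation["arg1_spans"], relation["arg2_spans"])
```

Checking the column count before indexing catches truncated lines early, since blank values still occupy a column.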
The BioDRB pipe format has 27 columns; however, only 7 are used.
- Col 0: Relation type (Explicit, Implicit, AltLex, NoRel)
- Col 1: (Sets of) Text span offset for connective (when explicit) (e.g. 472..474)
- Col 7: Connective string “inserted” for Implicit relation
- Col 8: Sense1 of Explicit Connective (or Implicit Connective)
- Col 9: Sense2 of Explicit Connective (or Implicit Connective)
- Col 14: (Sets of) Text span offset for Arg1
- Col 20: (Sets of) Text span offset for Arg2
More details at:
Prasad R, McRoy S, Frid N, Joshi A and Yu H. 2011. BioDRB: The Biomedical Discourse Relation Bank. BMC Bioinformatics
Example relation:
Explicit|258..260|Wr|Comm|Null|Null|||Purpose.Enablement||||||182..257|Inh|Null|Null|Null||261..298|Inh|Null|Null|Null||
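The seven columns that are used can be pulled out of a BioDRB line in the same spirit. Again, this is a hedged sketch rather than the parser's code: the function names and dictionary keys are invented for illustration, and it assumes that a set of text span offsets is separated by ';'.

```python
BIODRB_NUM_COLUMNS = 27


def parse_offsets(value):
    """Turn offsets such as "182..257" into [(182, 257)].

    Assumption: a set of spans is separated by ';'. Blank values give [].
    """
    spans = []
    for part in value.split(";"):
        if part:
            start, end = part.split("..")
            spans.append((int(start), int(end)))
    return spans


def parse_biodrb_relation(line):
    """Split one BioDRB pipe line and keep only the seven columns in use."""
    cols = line.rstrip("\n").split("|")
    if len(cols) != BIODRB_NUM_COLUMNS:
        raise ValueError(f"expected {BIODRB_NUM_COLUMNS} columns, got {len(cols)}")
    return {
        "type": cols[0],                               # Col 0
        "connective_spans": parse_offsets(cols[1]),    # Col 1
        "implicit_connective": cols[7],                # Col 7
        "sense1": cols[8],                             # Col 8
        "sense2": cols[9],                             # Col 9
        "arg1_spans": parse_offsets(cols[14]),         # Col 14
        "arg2_spans": parse_offsets(cols[20]),         # Col 20
    }


# The example relation shown above.
example = (
    "Explicit|258..260|Wr|Comm|Null|Null|||Purpose.Enablement||||||"
    "182..257|Inh|Null|Null|Null||261..298|Inh|Null|Null|Null||"
)

relation = parse_biodrb_relation(example)
print(relation["type"], relation["sense1"], relation["arg1_spans"], relation["arg2_spans"])
```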
The parser uses Stanford's CoreNLP Natural Language Processing Toolkit for reading and generating parse trees.
Reference:
- Manning, Christopher D., Surdeanu, Mihai, Bauer, John, Finkel, Jenny, Bethard, Steven J., and McClosky, David. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.
The parser uses two old versions of the Charniak parser. Copyright Mark Johnson, Eugene Charniak, 24th November 2005 --- August 2006. References:
- Eugene Charniak and Mark Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.
- Eugene Charniak. A maximum-entropy-inspired parser. Proceedings of the 1st North American chapter of the Association for Computational linguistics conference. Association for Computational Linguistics, 2000.
Copyright © 2015 WING, NUS and NUS NLP Group.
This program is free software: you can redistribute it and/or modify it under the terms of the
GNU General Public License as published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If
not, see http://www.gnu.org/licenses/.
Other licensing terms are available; please contact the authors if you require them.