How did BioRED process relations that span multiple sentences? #7

Meiling-Sun · 2024-05-06T13:59:53Z

Hi, thanks for this amazing work. I have some questions.
The annotation is done at the abstract level. When you use the PubMedBERT model for relation extraction, how does the tokenizer handle sentence segmentation? As far as I know, BERT's maximum input length is 512 tokens, so how do you proceed if an abstract is longer than 512 tokens?
Another question concerns coreference during annotation. Did you also annotate pronouns such as 'it' and 'this' as entities? Do they become noise for the NER task? Before the RE task, do you replace them with the original entity names, keep them, or use some other strategy?

ptlai · 2024-05-07T23:22:09Z

Hi @Meiling-Sun,
We don't handle abstracts longer than 512 tokens in the PubMedBERT model. If you would like to do this, you may consider using the "stride" parameter of Hugging Face's tokenizer.
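
For illustration only, here is a minimal pure-Python sketch of what an overlapping-window split does. It is not code from the BioRED repository. The function name `sliding_windows` and the values 512 and 128 are assumptions chosen for the example; with Hugging Face tokenizers the comparable call passes `max_length`, `truncation=True`, `stride`, and `return_overflowing_tokens=True`, where `stride` is the number of tokens shared by consecutive windows.

```python
def sliding_windows(token_ids, max_len=512, stride=128):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens, so each new window
    starts (max_len - stride) tokens after the previous one.
    Hypothetical helper for illustration; special tokens are ignored here.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Roughly comparable Hugging Face call (assumes `tokenizer` is already loaded):
# enc = tokenizer(text, max_length=512, truncation=True,
#                 stride=128, return_overflowing_tokens=True)

if __name__ == "__main__":
    ids = list(range(1000))  # stand-in for the token ids of a long abstract
    for window in sliding_windows(ids):
        print(window[0], window[-1], len(window))
```

One caveat with any windowing approach: an entity pair is only visible to the model if both entities land in the same window, so pairs that are far apart may still need separate handling.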

No, our BioRED corpus doesn't contain pronoun annotations, so pronouns are not used in NER or RE. In our dataset, coreference cases are entities that share the same database identifier, e.g. a MeSH or Entrez ID. For the RE task, I don't normalize the entities in the text; instead, I insert special tokens to tag those entities in the text.
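
As a rough sketch of the general idea of tagging entities with marker tokens, the snippet below wraps character spans in bracketed markers. The marker strings (`[SRC]`, `[/SRC]`, `[TGT]`, `[/TGT]`), the function name `tag_entities`, and the example sentence are all hypothetical; the actual special tokens used in the BioRED code may differ.

```python
def tag_entities(text, spans):
    """Wrap each (start, end, label) character span in marker tokens.

    Spans are assumed to be non-overlapping. They are processed from right
    to left so that earlier offsets stay valid after each insertion.
    The marker format is an assumption made for this example.
    """
    out = text
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        mention = out[start:end]
        out = out[:start] + f"[{label}] {mention} [/{label}]" + out[end:]
    return out


if __name__ == "__main__":
    sentence = "BRCA1 mutations increase breast cancer risk."
    # (start, end, label) offsets for the two entities in the candidate pair
    spans = [(0, 5, "SRC"), (25, 38, "TGT")]
    print(tag_entities(sentence, spans))
```

If markers like these are added to a BERT-style model, they would typically also be registered as additional special tokens in the tokenizer so that they are not split into word pieces.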

Meiling-Sun · 2024-05-08T08:22:25Z

Thank you very much for the reply :)
