how did BioRED process the relations span in multiple sentences? #7

Open
Meiling-Sun opened this issue May 6, 2024 · 2 comments

Comments

@Meiling-Sun

Hi, thanks for this amazing work. I have some questions.
The annotation is done at the abstract level, but when you use the PubMedBERT model for relation extraction, how does the tokenizer handle sentence segmentation? As far as I know, the maximum input length of BERT is 512 tokens, so how do you proceed if an abstract is longer than 512 tokens?
Another question: how did you handle coreference during annotation? Did you also annotate pronouns such as 'it' and 'this' as entities? Do they become noise for the NER task? Before the RE task, do you replace them with the original entity names, keep them as they are, or use some other strategy?

@ptlai
Collaborator

ptlai commented May 7, 2024

Hi @Meiling-Sun,
We don't handle abstracts longer than 512 tokens in the PubMedBERT model. If you would like to do this, you may consider using the "stride" parameter of Hugging Face's tokenizer.
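
For readers who want to try this, here is a minimal sketch of what an overlapping-window split does. It is not the BioRED code. In Hugging Face tokenizers, `stride` is the number of tokens shared between consecutive chunks, and it takes effect together with `max_length`, `truncation=True` and `return_overflowing_tokens=True`. The helper `sliding_windows` below is a hypothetical pure-Python illustration of that behaviour; it ignores the special tokens that a real tokenizer adds to each chunk.

```python
def sliding_windows(token_ids, max_length=512, stride=128):
    """Split a token sequence into overlapping windows.

    `stride` follows the Hugging Face convention: it is the number of
    tokens shared by consecutive windows, not the step size.
    Hypothetical helper for illustration only.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the input
        start += step
    return windows


# A 1000-token "abstract" represented by dummy ids
windows = sliding_windows(list(range(1000)), max_length=512, stride=128)
print([len(w) for w in windows])
```

Note that with windowing, two entities that are far apart may never fall inside the same window, so relation pairs spanning distant sentences would still need separate handling.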

No, our BioRED corpus doesn't contain pronoun annotations, so they are not used in NER or RE. In our dataset, coreference cases are entities that share the same database identifier, e.g. a MeSH or Entrez ID. For the RE task, I don't normalize the entities in the text; instead, I insert special tokens to tag those entities in the text.
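
The tagging step described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual BioRED preprocessing: the function name `insert_entity_markers`, the marker strings `[E1]`/`[E2]`, and the identifiers `gene:A`/`gene:B` are all hypothetical. Every mention that shares an identifier with one of the two candidate entities gets wrapped, which is how same-identifier coreference is carried into the RE input. Mentions are assumed not to overlap.

```python
import re


def insert_entity_markers(text, mentions, subj_id, obj_id):
    """Wrap every mention of two entities in marker tokens.

    mentions: list of (start, end, identifier) character spans,
    assumed non-overlapping. Marker strings are hypothetical.
    """
    markers = {subj_id: ("[E1]", "[/E1]"), obj_id: ("[E2]", "[/E2]")}
    out = text
    # Work right to left so that earlier character offsets stay valid
    for start, end, ident in sorted(mentions, key=lambda m: m[0], reverse=True):
        if ident not in markers:
            continue  # mention belongs to an entity outside this pair
        open_tag, close_tag = markers[ident]
        out = out[:start] + f"{open_tag} {out[start:end]} {close_tag}" + out[end:]
    return out


# Placeholder identifiers; a real corpus would use MeSH or Entrez IDs
text = "TP53 regulates MDM2. Loss of TP53 raises MDM2 levels."
mentions = [(m.start(), m.end(), "gene:A") for m in re.finditer("TP53", text)]
mentions += [(m.start(), m.end(), "gene:B") for m in re.finditer("MDM2", text)]
print(insert_entity_markers(text, mentions, "gene:A", "gene:B"))
```

In practice the marker strings would also be registered with the tokenizer as additional special tokens so they are not split into word pieces.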

@Meiling-Sun
Author

Thank you very much for the reply :)
