Hi, I am working on a COVID-19 antiviral. While spot-checking antivirals in scispacy, I was surprised that remdesivir is not tagged as a chemical in any of the 1,338 PubMed abstracts containing it. I'm using en_ner_bc5cdr_md to extract CHEMICAL and DISEASE entities; spacy: '3.0.4', scispacy: '0.4.0'.
As you see below, remdesivir is not tagged as a CHEMICAL when I run en_ner_bc5cdr_md in Jupyter Lab.
However, when I put the same text into your demo, I was surprised that remdesivir is found.
Questions
1. Is the version running on the demo the same one that I used in my notebook (spacy: '3.0.4', scispacy: '0.4.0')?
2. Is remdesivir perhaps not found because it wasn't present in earlier training sets?
3. Can we expect new chemicals to be recognized (e.g., ones published for the first time)?
4. It's especially surprising that remdesivir wasn't detected as a CHEMICAL even in the following line from the text used in my example, where it is explicitly called a 'drug':
Though the drug remdesivir (RDV) is not approved by the FDA, still the "Emergency Use Authorization" (EUA) for compassionate use in severe cases is endorsed.
5. In the demo, remdesivir is detected only once even though it is mentioned several times in that passage. Is that expected?
Thanks,
vikram
1) The version on the demo is probably not the latest release version. I should check and update that.
2/3/4) First, this is a trained model, so inconsistent and surprising output is to be expected, as is some memorization (@DeNeutoy, it looks like data augmentation could help a lot here). Second, the BC5CDR corpus was annotated following specific guidelines (https://biocreative.bioinformatics.udel.edu/media/store/files/2015/bc5_CDR_data_guidelines.pdf); you may want to read them and check whether they align with your expectations of what would be annotated as a chemical. Here is some output for a mix of real and made-up chemical names. I don't really conclude anything from it, other than that the model is definitely using some combination of the form of the name itself and the context:
In [29]: for drug_name in ["mesna", "remdesivir", "mebane", "relidate", "novila", "aspirin", "coloxal", "inovivir", "scopolamine", "entamine", "valimine", "henirin", "noonirin", "halirin"]:
    ...:     text = f"The drug {drug_name} is used to treat the virus"
    ...:     doc = nlp(text)
    ...:     print(doc.ents)
    ...:
(mesna,)
()
(mebane,)
()
()
(aspirin,)
()
()
(scopolamine,)
(entamine,)
(valimine,)
(henirin,)
()
()
It looks like the model is also sensitive to capitalization:
In [56]: doc = nlp("Remdesivir is a chemical")
In [57]: doc.ents
Out[57]: (Remdesivir,)
In [58]: doc = nlp("remdesivir is a chemical")
In [59]: doc.ents
Out[59]: ()
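The two probes above (a fixed sentence template and a casing change) can be combined into one small helper. This is only a sketch: `probe_names` and `casing_variants` are hypothetical names, and the helper assumes nothing about `nlp` beyond it being a callable that returns an object with an `.ents` sequence whose items have a `.text` attribute, as a loaded spaCy pipeline does.

```python
from typing import Callable, Dict, Iterable, List


def casing_variants(name: str) -> List[str]:
    """Return lowercase, Capitalized and UPPERCASE forms, without duplicates."""
    variants = [name.lower(), name.capitalize(), name.upper()]
    return list(dict.fromkeys(variants))  # dict keys keep insertion order


def probe_names(
    nlp: Callable,
    names: Iterable[str],
    template: str = "The drug {name} is used to treat the virus",
) -> Dict[str, bool]:
    """Map each casing variant of each name to whether it came back as an entity.

    `nlp` is assumed to behave like a spaCy pipeline: calling it on a string
    returns a document whose `.ents` items expose `.text`.
    """
    results: Dict[str, bool] = {}
    for name in names:
        for variant in casing_variants(name):
            doc = nlp(template.format(name=variant))
            results[variant] = any(ent.text == variant for ent in doc.ents)
    return results


# Possible usage, assuming the model package is installed:
#   import spacy
#   nlp = spacy.load("en_ner_bc5cdr_md")
#   for variant, found in probe_names(nlp, ["remdesivir", "aspirin"]).items():
#       print(variant, found)
```

Running the same names through a few different templates would help separate the effect of the name's form from the effect of the context.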
I don't have much else to add at the moment. We were thinking about running some data augmentation experiments to try to improve the NER, but haven't done it yet (I'd be thrilled to have a contribution along those lines).
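One possible shape for such an augmentation experiment is sketched below. It is not something the scispacy authors have implemented or endorsed: the function names, the `(start, end, label)` span format, and the choice of augmentations (casing changes plus swapping in other names) are all assumptions made for illustration, and spans are assumed not to overlap.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label) as character offsets


def swap_mention(
    text: str, spans: List[Span], index: int, replacement: str
) -> Tuple[str, List[Span]]:
    """Replace the mention at spans[index] and shift later spans accordingly.

    Assumes spans do not overlap.
    """
    start, end, label = spans[index]
    delta = len(replacement) - (end - start)
    new_text = text[:start] + replacement + text[end:]
    new_spans: List[Span] = []
    for i, (s, e, lab) in enumerate(spans):
        if i == index:
            new_spans.append((start, start + len(replacement), label))
        elif s >= end:
            new_spans.append((s + delta, e + delta, lab))  # after the edit
        else:
            new_spans.append((s, e, lab))  # before the edit, unchanged
    return new_text, new_spans


def augment(
    text: str, spans: List[Span], label: str, replacements: Iterable[str]
) -> List[Tuple[str, List[Span]]]:
    """Create variants of one annotated example for mentions with `label`.

    Each mention is lowercased, capitalized, and swapped for each replacement.
    """
    out = []
    for i, (s, e, lab) in enumerate(spans):
        if lab != label:
            continue
        mention = text[s:e]
        candidates = [mention.lower(), mention.capitalize(), *replacements]
        for cand in dict.fromkeys(candidates):  # de-duplicate, keep order
            if cand != mention:
                out.append(swap_mention(text, spans, i, cand))
    return out


example = ("Aspirin reduces fever.", [(0, 7, "CHEMICAL"), (16, 21, "DISEASE")])
for new_text, new_spans in augment(*example, label="CHEMICAL", replacements=["remdesivir"]):
    print(new_text, new_spans)
```

The augmented examples would then be added to the training data before retraining; whether this actually improves robustness is exactly what the experiment would need to show.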
5) Definitely the model takes into account the context that the word occurs in, so it is not wholly surprising to me that the same word could be classified differently in different contexts.
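To see which occurrences of a term are tagged and which are missed in a given passage, a per-mention report can help. The sketch below uses a hypothetical helper, `mention_report`; it relies only on the `start_char`, `end_char`, and `label_` attributes of the entities in `doc.ents`, and finds occurrences with a simple case-insensitive string search, which is an assumption that may not suit every term.

```python
import re
from typing import Callable, List, Tuple


def mention_report(
    nlp: Callable,
    text: str,
    term: str,
    label: str = "CHEMICAL",
    window: int = 30,
) -> List[Tuple[int, bool, str]]:
    """List every occurrence of `term` with whether an entity of `label` covers it.

    Returns (character offset, tagged?, surrounding context) per occurrence.
    """
    doc = nlp(text)
    ent_spans = [(e.start_char, e.end_char) for e in doc.ents if e.label_ == label]
    report = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        tagged = any(s <= match.start() and match.end() <= e for s, e in ent_spans)
        context = text[max(0, match.start() - window): match.end() + window]
        report.append((match.start(), tagged, context))
    return report


# Possible usage, assuming the model package is installed:
#   import spacy
#   nlp = spacy.load("en_ner_bc5cdr_md")
#   for offset, tagged, context in mention_report(nlp, abstract, "remdesivir"):
#       print(offset, tagged, repr(context))
```

Comparing the contexts of tagged and untagged occurrences side by side makes it easier to judge how much the surrounding words are driving the prediction.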
dakinggg changed the title from "Expectations on CHEMICAL NER?" to "Explore data augmentation for NER robustness" on Mar 24, 2021