This is purely exploratory: the goal is to learn and practice more.
The task focused on biomedical data annotated with BIO tags. Text about genes, proteins, and biochemical interactions has its own grammar, almost its own finite-state structure, and therefore calls for state-of-the-art NLP models specific to the biomedical domain. Can we learn the language of proteins?
The assignment: submit a report of what was done, the Python code, and test results. A test set is posted before the due date; run the test data through the system and submit the output. The output format must match the training data: one token per line, with a tab separating the token from its tag, and a blank line separating sentences.
The dataset reads like text from a medical journal: articles on biomedical and/or clinical topics with frequent references to genes, proteins, protein interactions, and biochemical interactions. Each word of each sentence carries one of the I, O, B tags. There are about 13,795 sentences, and the vocabulary differs markedly from general English text. The homework task is to detect "Gene" entities, which appear as tag patterns such as "BII" or "BIO", i.e. roughly the regex "BI*" over the tag string.
The vocabulary is likewise underpinned by specific nomenclature: scientific names of proteins, DNA structures, and protein interactions, reflecting the intersection of biomedical literature and the domain knowledge needed, for example, to build useful word embeddings. One interesting pattern is that simple English words appear interleaved, almost at random, with biomedical terminology. Despite these intricate word combinations, the dataset itself provides only minimal features: word and tag pairs for each sentence. Subword information therefore looks like the key signal for this dataset.
Given a word in a sentence, identify the tag from {B, I, O} that most likely represents the entity.
Entity sequence required for this problem: determine whether a gene exists from a sequence of tags. A gene is defined by a B tag followed by one or more I tags (e.g. "BII"), optionally terminated by an O, irrespective of the preceding tags.
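As a concrete illustration (not part of the original pipeline), this pattern can be matched with a regular expression over a sentence's concatenated tag string; `find_gene_spans` below is a hypothetical helper:

```python
import re

def find_gene_spans(tags):
    """Return (start, end) token-index pairs (end exclusive) for each gene,
    taken as a B tag followed by zero or more I tags."""
    tag_string = "".join(tags)            # e.g. ['O','B','I','I','O'] -> "OBIIO"
    return [(m.start(), m.end()) for m in re.finditer(r"BI*", tag_string)]

print(find_gene_spans(list("OBIIOOBO")))  # -> [(1, 4), (6, 7)]
```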
The vocabulary is far removed from that of non-domain-specific tasks such as POS tagging with BERT models. There are many rare words: much of the medical and biomedical terminology in this dataset occurs with frequency below 10.
About the dataset:
Total number of sentences: 13,795
Total number of words/tokens in the dataset: 308,229
Max number of words in a sentence: ~102
Vocabulary size: 27,282
Total number of B tags: 13,304
Total number of I tags: 19,527
Total number of O tags: 276,009
O tags are clearly far more common than B and I tags. The maximum B count for any single word is 265, small compared to the maximum O count; similarly, the maximum I count for a word is 4,393, about 25% of the maximum O count.
The top 10 most common words are: ".", "the", "of", "-", ",", "and", "in", "a", "(", "to". Among the least common words are: "K713", "hypercholesterolemic", "lutein", "P69", "conference", "Talk", "Tele", "cruciform", "TE105".
Most importantly, most of the tokens drawn from biomedical/bioinformatics terminology are infrequent or rare.
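For reference, a minimal sketch of how these statistics can be computed; the filename `train.tsv` is a placeholder, and the reader assumes the tab-separated, blank-line-delimited format described above:

```python
from collections import Counter

def read_sentences(path):
    """Yield sentences as lists of (token, tag) pairs from a file with one
    tab-separated token/tag pair per line and blank lines between sentences."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                  # blank line ends the sentence
                if sentence:
                    yield sentence
                sentence = []
            else:
                token, tag = line.split("\t")
                sentence.append((token, tag))
    if sentence:
        yield sentence

sentences = list(read_sentences("train.tsv"))   # placeholder filename
word_counts = Counter(tok for sent in sentences for tok, _ in sent)
tag_counts = Counter(tag for sent in sentences for _, tag in sent)

print(len(sentences), sum(word_counts.values()), len(word_counts))
print(tag_counts)                    # B, I, O totals
print(word_counts.most_common(10))   # the most frequent tokens
```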
Facts about the problem, drawn from the dataset: many words occur with more than one of the three tags B, I, O across different sentences, but each word token carries exactly one tag. Given a tag sequence t1, t2, t3, ..., each tag t_i depends on its neighbours t_{i-1} and t_{i+1}; the Markov property applies conveniently to this tag-sequence problem. Some tag sequences and their frequencies:
The two-tag sequences and their frequencies:
The three-tag sequences and their frequencies:
The tag frequencies show that OOO is by far the most frequent 3-tag sequence, followed by OOB, OBI, IOO, OBO, and BII.
First, I implemented a Hidden Markov Model with Viterbi decoding for the task. Given the transition probability matrix, the emission probabilities, and the initial probabilities over states (tags) and words, it is straightforward to compute the predicted tag for each word. The HMM-with-Viterbi approach is as follows:
Defining input probabilities:
States: the states in this case are tag sequences.
Two-tag sequences for the bigram case: with tags B, I, O there are 3^2 = 9 arrangements, repetitions included.
Three-tag sequences for the trigram case: 3^3 = 27 arrangements, repetitions included.
Solving for the most common subsequence: concatenate all tag sequences into a single string of all patterns. The resulting super-sequence of tags looks like "OOOOOOOOOOOOOBIIIOBIIOOOOOOOOOOOOOOOOBIIIOOOOOOOOBIII…". Algorithmically, the counts for each state (tag sequences such as "IB", "BII", "BIO") can be generated with a Counter (or a suffix-based algorithm, or other (string, index) sequence-extraction algorithms). There are about 92 occurrences of "BII" in the super-sequence.
The transition probability of I given B is count("BI") / count("B").
Total single-tag counts: count(B) = 13,363, count(I) = 19,779, count(O) = 276,565.
All possible two-tag sequences: BB, BI, BO, IB, II, IO, OB, OI, OO.
Counts of 2-tag sequences, i.e. the two-tag sequences and their frequencies:
All possible 3-tag sequences: IBO BII IIB BOB IOI OII OBI BIO IBI OBO BOI BIB IOB IOO OIB OBB OIO BBO OOO BBI BBB BOO OOB IBB IIO OOI III
Counts of 3-tag sequences, i.e. the three-tag sequences and their frequencies:
Transition probabilities: the probability of moving from one state to another, under the Markov independence assumption.
Bigram case: P(t_1, …, t_n) ≈ ∏_i P(t_i | t_{i-1})
Trigram case: P(t_1, …, t_n) ≈ ∏_i P(t_i | t_{i-1}, t_{i-2})
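A minimal sketch of how the bigram counts and transition probabilities can be estimated with a Counter, assuming `tag_seqs` holds one tag list per sentence (counting within sentences rather than over one concatenated super-sequence, which avoids spurious cross-sentence bigrams):

```python
from collections import Counter

# tag_seqs: one tag sequence per sentence, e.g. [['O', 'B', 'I'], ...]
unigram = Counter(t for seq in tag_seqs for t in seq)
bigram = Counter((seq[i], seq[i + 1])
                 for seq in tag_seqs for i in range(len(seq) - 1))

def transition_prob(prev_tag, tag):
    """P(tag | prev_tag) = count(prev_tag tag) / count(prev_tag)."""
    return bigram[(prev_tag, tag)] / unigram[prev_tag]

print(transition_prob("B", "I"))   # e.g. P(I | B)
```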
Emission probabilities: the probability of emitting a word given a tag, estimated as P(w | t) = count(w, t) / count(t).
For 2-tag sequences:
For 3-tag sequences:
Building a count for each word and tag sequence (which can be used to build the trellis for dynamic programming):
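A minimal sketch of the emission table and Viterbi decoding, assuming `sentences` from the reader above; the nested-dictionary layout of `trans_p`/`emit_p` and the `unk` fallback for unseen words are my assumptions, not the original implementation:

```python
import math
from collections import Counter

TAGS = ["B", "I", "O"]

# Emission counts and probabilities: how often each tag emits each word.
emit_counts = {t: Counter() for t in TAGS}
for sent in sentences:
    for word, tag in sent:
        emit_counts[tag][word] += 1
emit_p = {t: {w: c / sum(ctr.values()) for w, c in ctr.items()}
          for t, ctr in emit_counts.items()}

def viterbi(words, init_p, trans_p, emit_p, unk=1e-8):
    """Most likely tag sequence under a bigram HMM. init_p[t], trans_p[prev][t]
    and emit_p[t][w] are probabilities; unseen events fall back to `unk`."""
    # trellis[i][t] = (best log-prob of a path ending in tag t at word i, backpointer)
    trellis = [{t: (math.log(init_p.get(t, unk))
                    + math.log(emit_p[t].get(words[0], unk)), None)
                for t in TAGS}]
    for i in range(1, len(words)):
        row = {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: trellis[i - 1][p][0]
                       + math.log(trans_p[p].get(t, unk)))
            score = (trellis[i - 1][prev][0]
                     + math.log(trans_p[prev].get(t, unk))
                     + math.log(emit_p[t].get(words[i], unk)))
            row[t] = (score, prev)
        trellis.append(row)
    # Backtrace from the best final tag.
    tag = max(TAGS, key=lambda t: trellis[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = trellis[i][tag][1]
        path.append(tag)
    return path[::-1]
```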
Results: after an 80/20 train/test split and training the HMM with Viterbi decoding, the model, as expected, predicted the O tag almost everywhere, which inflates false positives for O while missing most B and I entities. The overall f1-score was 0.327.
The tag-wise confusion matrix, f1-score, precision, and recall results for the HMM with Viterbi are as follows:
The f-scores for the I and B tags are poor; the score looks good only for the O tag, since O is by far the most frequent tag.
Summary of results: given the f1-score above, continuing with the HMM-and-Viterbi approach for named entity recognition is not recommended; the pitfall of the HMM is that it is not flexible enough to handle unknown words or new vocabulary. One can reason that if the previous tag is B, the next word is more likely to carry an I or O tag than another B; but such considerations are not sufficient to generalize and meaningfully improve HMM with Viterbi.
We could add features such as: the first letter is a capital, all letters are capitals, the previous word is a hyphen, the next word is a number, or the previous and next words form an alphanumeric pattern, all of which could serve as better features for an HMM. The idea is to collect frequencies for each of these features and their likelihood of being assigned a tag, perhaps under a 5-gram approach; without a pretrained model, an HMM might survive on small problems, but only for small, limited real-world applications.
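For illustration, a small feature extractor along those lines (the exact feature set here is my choice, not the one used in the experiments):

```python
def word_features(words, i):
    """Shape features for the word at position i in a token list."""
    w = words[i]
    return {
        "init_cap": w[:1].isupper(),              # first letter is a capital
        "all_caps": w.isupper(),                  # e.g. "DNA"
        "has_digit": any(c.isdigit() for c in w),
        "alnum_mix": w.isalnum() and not w.isalpha() and not w.isdigit(),
        "prev_hyphen": i > 0 and words[i - 1] == "-",
        "next_is_number": i + 1 < len(words) and words[i + 1].isdigit(),
    }

print(word_features(["the", "-", "TE105", "gene"], 2))
```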
Given the data analysis and the statistics over words, tags, and their frequencies from the sections above, I found it might be better to ignore the O tag: consider only the I and B tags, and once prediction of I and B tags is complete, assume all remaining untagged words are O.
So in this approach, I trained the en_core_web_md model with the spaCy library, extracted entities (with start and end offsets for each word in a sentence), and processed the data through a minimal spaCy pipeline.
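A minimal sketch of this step, assuming spaCy v3 and sentences in the `read_sentences` format above; the "GENE" label name, the space-joining of tokens, and the `train_data` variable holding (tokens, tags) pairs are my assumptions, and spans that do not align with spaCy's own tokenization are treated as missing by `Example.from_dict`:

```python
import spacy
from spacy.training import Example

def bio_to_spans(tokens, tags, label="GENE"):
    """Convert per-token B/I tags into (char_start, char_end, label) spans
    over the space-joined sentence text."""
    spans, start, pos = [], None, 0
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            if start is not None:                 # close the previous entity
                spans.append((start, pos - 1, label))
            start = pos
        elif tag != "I" and start is not None:    # an O tag ends the entity
            spans.append((start, pos - 1, label))
            start = None
        pos += len(tok) + 1                       # +1 for the joining space
    if start is not None:
        spans.append((start, pos - 1, label))
    return spans

nlp = spacy.load("en_core_web_md")
nlp.get_pipe("ner").add_label("GENE")
optimizer = nlp.resume_training()
for tokens, tags in train_data:                   # (tokens, tags) per sentence
    text = " ".join(tokens)
    example = Example.from_dict(nlp.make_doc(text),
                                {"entities": bio_to_spans(tokens, tags)})
    nlp.update([example], sgd=optimizer)
```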
Evaluation: during model training, the validation precision, recall, and f1-scores for the I and B tags on the NER task are as follows:
The precision, recall, and f1-scores on the test data are worse, as expected. This evaluation used the provided evaluation script.
Finally, computing the tag-wise confusion matrix, f1-score, precision, and recall gives the following results:
The f1-score is very low and the number of true positives is nearly zero. This model and approach are not suitable and should not be considered further.
Summary of results: the O tag is important for recognizing frequent sequence patterns when training a model for the NER task on this dataset with IOB tags.
Given the same data analysis of words, tags, and their frequencies, and the results of the I/B-only experiment above, the natural next step is to include the O tag after all.
So in this approach, I again trained the en_core_web_md model with the spaCy library, this time keeping the O tag, extracted entities (with start and end offsets for each word in a sentence), and used the same minimal spaCy pipeline.
Evaluation: the validation results during the training phase, with f1-score, recall, and precision, are as follows:
The precision, recall, and f1-scores on the test data are again worse, as expected. This evaluation used the provided evaluation script.
Additionally, the tag-wise confusion matrix, f1-score, precision, and recall fall even lower than in the previous case:
Summary of results: this model is not suitable for named entity tagging on bioinformatics data. The vocabulary contains too many rare words, and the model clearly does not adapt to unknown or rare words that differ significantly from the words it has seen. Since there are 27,000+ distinct words in this dataset, most with very unusual subwords (names originating in chemistry and biology), a more domain-specific pretrained model could help.
After a little research on Google Scholar and NIH, BioBERT stood out as a fast and widely used model in bioinformatics. BioBERT differs from BERT in that it continues pre-training on biomedical corpora (PubMed abstracts and PMC articles), so its subword handling and contextual embeddings are better matched to bioinformatics text. This seemed a reasonable direction to explore, without hoping for too much too soon.
A special case to consider: the least common words in the homework dataset, 'K713', 'hypercholesterolemic', 'lutein', 'P69', 'conference', 'Talk', 'Tele', 'cruciform', 'TE105'. They are not only rare; they require domain-specific subword tokenization, which is very different from, and harder than, tokenizing common English words. This makes BioBERT especially reasonable to explore, and it fits the observation that domain-specific expertise adds the information needed to understand named entity tags, and vice versa. The reverse case: to represent knowledge and "reason" over a medical-journal text corpus, named entity recognition is needed to identify domain-specific instances. In other words, for domain-specific knowledge mining, NER is a prerequisite.
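To see the tokenization difference concretely, the BioBERT checkpoint published on the Hugging Face hub can be probed with the transformers library (the checkpoint name below is the one released by the BioBERT authors; the resulting subword splits depend on the vocabulary of the chosen release):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

# Inspect how rare dataset words are split into subwords.
for word in ["TE105", "hypercholesterolemic", "cruciform"]:
    print(word, "->", tokenizer.tokenize(word))
```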
BioBERT, like BERT-based NER pipelines generally, can use POS tags alongside entity tags. The steps for this NER experiment were therefore: first, generate POS tags with the spaCy library for every word of every sentence in the dataset, adding the POS tag as a new feature; second, train the model; third, evaluate it.
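A minimal sketch of the POS-tagging step, constructing the Doc from the dataset's own tokens so that spaCy does not re-tokenize the text (pipeline components are applied one by one, which spaCy supports for pre-tokenized input):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_md")

def add_pos_column(tokens):
    """Return (token, POS) pairs for a pre-tokenized sentence."""
    doc = Doc(nlp.vocab, words=tokens)     # keep the dataset tokenization
    for _, component in nlp.pipeline:      # run tagger, parser, etc. in order
        doc = component(doc)
    return [(tok.text, tok.pos_) for tok in doc]

print(add_pos_column(["The", "TE105", "gene", "binds", "DNA", "."]))
```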
The approach is simple and at first seemed to work reasonably well: validation during training, as in the other cases, reached about 84%. The tag-wise results with confusion matrix, f1-score, precision, and recall, shown below, are nevertheless very poor.

Summary of results: POS tags are useful information for identifying named entity tags, yet additional features would contribute more. Also, when training smaller models on top of BERT or BERT-derived models, more data appears to be needed.

A few further observations emerged from additional reading. On structure: in most predictions from the three experiments above, some proteins are almost always recognized correctly because they co-occur with certain words, often other rare words. The language of proteins in chemical and biological data contributes to co-occurrence in intricate ways: the representation style, structure, format, and sequences of capital letters and numbers denoting a chemical compound or a double-helix structure all contribute significantly to linguistic tasks. For example, the structure of a gene name suggests that TE105 denotes a 105-nt TE (TE105) [https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=1040&context=plantpath_pubs]. So if a phrase such as "105-nt TE (TE105)" occurs, a model pretrained on a biomedical/bioinformatics text corpus should be more likely to recognize it and identify the right subwords. Stepping slightly outside the scope of this topic, the article [https://bair.berkeley.edu/blog/2019/11/04/proteins/] argues that understanding protein structures can be treated as a linguistics task. This supports the inference that any domain-specific knowledge representation requires significant understanding of the underlying structures, whether the data is biomedical text or something as complex as proteins and DNA bindings represented in free-form text.
I attempted to explore pretraining a model from scratch; however, due to insufficient computational resources, I could not pretrain a model for bioinformatics on a PubMed corpus, which might have improved the NER task and f1-scores on this dataset. Adding a feature such as chunk tags alongside the POS tags, and using domain-specific tag information such as protein or gene, or even a richer tag set such as B-BIO, B-CHEM, B-GENE, B-PROT, B-DNA (analogous to B-PER, B-GEO, etc.), may help. I also attempted word-vector embeddings to further improve BioBERT, which seemed an even better approach; however, this requires domain-specific word embeddings, as I learned from this informative article: https://www.nature.com/articles/s41597-019-0055-0
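As a pointer for that direction, in-domain word vectors can be trained directly on the homework sentences with gensim (a sketch only; the hyperparameters are illustrative, and `most_similar` assumes the query word actually occurs in the corpus):

```python
from gensim.models import Word2Vec

# sentences: (token, tag) pairs per sentence, e.g. from read_sentences above
token_lists = [[tok for tok, _ in sent] for sent in sentences]

# min_count=1 keeps the many rare biomedical terms a higher cutoff would drop.
w2v = Word2Vec(token_lists, vector_size=100, window=5, min_count=1, epochs=10)
print(w2v.wv.most_similar("protein", topn=5))    # assumes "protein" occurs
```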
To summarize: additional features, such as whether the first letter is a capital, whether all letters are capitals, whether the previous word is a hyphen, whether the previous or next word is a number, or whether the previous and next words form an alphanumeric string, could all serve as better features for an HMM. The vocabulary consists of many rare words, and the models tried here clearly do not adapt to unknown or rare words that differ significantly from the known vocabulary, so a more domain-specific pretrained model could help. Domain-specific knowledge mining requires significant understanding of the underlying structures, whether the data is biomedical text or a domain-specific representation such as proteins and DNA bindings in free-form text. I could also have explored a few more suggestions the professor made in class before the break, such as marking rare words as UNKNOWN, exploring the BIOES scheme, or using PyTorch and word2vec. Finally, including chunk features on top of the POS tags, and domain-specific tags such as B-BIO, B-CHEM, B-GENE, B-PROT, B-DNA (analogous to B-PER, B-GEO, etc.), may be helpful.