Hilmar Lapp edited this page Dec 11, 2019 · 16 revisions

Table of contents:

  1. OntoTrace
  2. Data Science & Statistics
  3. Phylogenetic Comparative Approaches

OntoTrace

Customized Presence-Absence Matrix

Laura is a comparative biologist. Her research is to compare presence-absence data for a defined list of taxa and a defined list of anatomical entities, and she therefore needs to obtain a presence-absence matrix for the taxa and anatomical entities as synthesized by OntoTrace. Because the taxa and anatomical entities don't fit nicely underneath higher-level groupings, Laura can't use the query-by-OWL expression feature for OntoTrace. Instead, she uses R to read in the list of taxa and the list of anatomical entities, to query OntoTrace for each anatomical entity across the list of taxa, and to merge the results into a single non-redundant presence-absence matrix. Because her list of anatomical terms is already curated, parts or entities connected by other relationships are not included in the results. However, the presence/absence of entities implied by those in her list is included in the result, just as would otherwise be the case for an OntoTrace synthesis. In the end, Laura merges this customized presence-absence matrix with a phylogenetic tree for the list of taxa used in the input: the matrix is read into Mesquite and merged with the phylogeny to perform ancestral reconstructions for each anatomical entity in the customized matrix.
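The merge step can be sketched as follows. This is a minimal illustration in Python of the logic only (the use case itself calls for R). The input shape is an assumption: one dictionary per anatomical entity that maps taxon to a state of "1" (present), "0" (absent), or "0&1" (polymorphic). The function name `merge_results` and the "?" placeholder for missing data are hypothetical and not part of OntoTrace or any Phenoscape API.

```python
from typing import Dict, Iterable

# Assumed shape of one per-entity query result: taxon -> state string.
EntityResult = Dict[str, str]


def merge_results(
    taxa: Iterable[str], results: Dict[str, EntityResult]
) -> Dict[str, Dict[str, str]]:
    """Merge per-entity results into one taxon x entity matrix.

    Duplicate taxa in the input list are collapsed (first occurrence wins),
    each entity yields exactly one column, and cells without data are "?".
    """
    unique_taxa = list(dict.fromkeys(taxa))  # de-duplicate, keep input order
    entities = sorted(results)               # one column per queried entity
    return {
        taxon: {entity: results[entity].get(taxon, "?") for entity in entities}
        for taxon in unique_taxa
    }
```

Keeping the taxa in input order makes it easier to line the matrix up with the tips of the phylogeny later on.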

Filtered Presence-Absence Matrix

Laura is a comparative biologist. Her research is to compare presence-absence data for types and parts underneath an anatomical entity grouping class across her taxonomic groups of interest. The taxonomic groups are several sister clades among a larger group of sibling groups. She uses R to obtain a presence-absence matrix for her taxonomic groups and for her group of anatomical entities as synthesized by OntoTrace, using the query-by-OWL expression feature. However, the anatomical entities in her list have parts (or subsume structures) that do not occur in her taxonomic groups of interest, and therefore the initial query result includes many anatomical entities that are irrelevant to her study. (Simply requesting only variable characters to avoid the extraneous anatomical structures is unsuitable, because structures that do occur in her groups of interest but aren't variable would then be excluded as well.) Also, her taxonomic groups of interest don't form a monophyletic clade, and the query-by-OWL expression feature thus yields many taxa she is not interested in.

To remove those entities and taxa, she uses R to filter from her matrix all taxa that aren't members of her groups, and to filter out all anatomical entities that don't occur in her groups of interest. When using the query-by-OWL expression query, it is not apparent which terms do not occur in the group of interest without further investigation. Therefore, it is necessary to filter out these individual terms after the matrix has been generated. In the end Laura uses this matrix to merge with a phylogenetic tree for her taxonomic group. This matrix is read into Mesquite and merged with the phylogeny to perform ancestral reconstructions for each anatomical entity in the filtered matrix.
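The two filtering steps can be sketched as follows, in Python for illustration only (the use case itself calls for R). The matrix shape (taxon mapped to entity mapped to state) and the function name `filter_matrix` are assumptions. The sketch also assumes that an entity "occurs" in the groups of interest when at least one retained taxon is scored present ("1") or polymorphic ("0&1") for it; a different definition may be more appropriate for a given study.

```python
from typing import Dict, Set

Matrix = Dict[str, Dict[str, str]]  # assumed: taxon -> entity -> state


def filter_matrix(matrix: Matrix, taxa_of_interest: Set[str]) -> Matrix:
    """Drop taxa outside the groups of interest, then drop entities that
    are not scored as present in any of the remaining taxa."""
    kept = {t: row for t, row in matrix.items() if t in taxa_of_interest}
    # Assumption: "1" and "0&1" both count as evidence of occurrence.
    occurring = {
        entity
        for row in kept.values()
        for entity, state in row.items()
        if "1" in state
    }
    return {
        taxon: {e: s for e, s in row.items() if e in occurring}
        for taxon, row in kept.items()
    }
```

Note that the entity filter runs after the taxon filter, so an entity that is present only in discarded taxa is removed as well.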

Obtaining supporting states

Laura is a comparative biologist. She has obtained an OntoTrace-synthesized presence/absence matrix for her taxonomic and anatomical groups of interest. To better understand unexpected presence and absence states, and polymorphic states that Laura believes to be in error, she needs to know which presence/absence states were directly asserted and which were inferred by OntoTrace. She can then use this information to arrange the data into separate columns for originally asserted states and for states inferred by OntoTrace. She also needs to inspect the original character states (along with the publications they came from) that support the synthesized presence and absence states, especially to check conflicting statements in the original data for accuracy.
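Arranging asserted and inferred states into separate columns could look like the following sketch, in Python for illustration only (the use case itself calls for R). The record shape (taxon, entity, state, inferred flag) and the function name `split_by_provenance` are assumptions; how the supporting states are actually delivered is not specified here.

```python
from typing import Dict, Iterable, Tuple

# Assumed record: (taxon, entity, state, is_inferred)
Record = Tuple[str, str, str, bool]


def split_by_provenance(records: Iterable[Record]) -> Dict[str, Dict[str, str]]:
    """Return taxon -> column -> state, with one column per entity for
    asserted states and one for inferred states. Conflicting states in
    the same cell are joined with "&" (for example "0&1")."""
    cells: Dict[str, Dict[str, set]] = {}
    for taxon, entity, state, inferred in records:
        column = f"{entity} ({'inferred' if inferred else 'asserted'})"
        cells.setdefault(taxon, {}).setdefault(column, set()).add(state)
    return {
        taxon: {col: "&".join(sorted(states)) for col, states in cols.items()}
        for taxon, cols in cells.items()
    }
```

A cell such as "0&1" in an asserted column points to conflicting statements in the original data, which are the ones worth tracing back to their publications.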

Data Science & Statistics

Mutually exclusive phenotypes

Jordan develops probabilistic models for comparative analysis of discrete traits. To make these models more tractable and biologically realistic, she is interested in incorporating into these models which traits are mutually exclusive with each other, that is, which phenotypes an organism cannot exhibit simultaneously. This is typically not inferable from the domain knowledge (anatomy, phenotypic quality, etc.) representation. Instead, Jordan is interested in learning from the data which phenotypes can be hypothesized as being mutually exclusive, and which as being compatible. As a starting hypothesis, two phenotypes that are found associated only with different character states of the same character are likely mutually exclusive, because based on the data they would not both be reported for the same taxon. In contrast, phenotypes found to be associated with different characters of the same study would likely be compatible, because, at least if the characters were truly independent, they could both be observed for the same taxon. Jordan is also interested in looking at characters across studies, hypothesizing that even if associated with different characters, two phenotypes are still likely mutually exclusive if their respective character states never co-occur for the same taxon.
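The starting hypothesis for a single study can be sketched as follows, in Python for illustration only. The input shape (one tuple of character, state, and phenotype per annotation) and the function name `hypothesize` are assumptions. Pairs that fit neither rule, such as two phenotypes annotated to the same state, are labeled "undetermined", which is a choice made for this sketch rather than something the use case prescribes. The cross-study variant based on taxon co-occurrence is not covered.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Tuple

Annotation = Tuple[str, str, str]  # assumed: (character, state, phenotype)


def hypothesize(annotations: Iterable[Annotation]) -> Dict[Tuple[str, str], str]:
    """Label each phenotype pair from one study as 'exclusive',
    'compatible', or 'undetermined'."""
    states = defaultdict(set)  # phenotype -> {(character, state), ...}
    for character, state, phenotype in annotations:
        states[phenotype].add((character, state))

    labels = {}
    for p, q in combinations(sorted(states), 2):
        p_chars = {c for c, _ in states[p]}
        q_chars = {c for c, _ in states[q]}
        if states[p] & states[q]:
            label = "undetermined"  # annotated to the very same state
        elif p_chars == q_chars:
            label = "exclusive"     # only different states of the same character(s)
        elif not (p_chars & q_chars):
            label = "compatible"    # only different characters
        else:
            label = "undetermined"  # mixed evidence
        labels[(p, q)] = label
    return labels
```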

Semantic similarity within and between characters

Haley is a data scientist interested in the distribution of semantic similarity properties of characters from published studies. In particular, Haley has a number of hypotheses about the average semantic similarity of phenotypes within a character versus across characters, and about comparing characters with multi-phenotype states to those with single-phenotype states. To test these hypotheses, Haley uses R to obtain the phenotypes (as ontological expressions) and the states and characters that use them, from a study under investigation, or from a list of studies. She also obtains the pairwise semantic similarity between all phenotypes according to one of several available metrics, and then uses the results to conduct the statistical tests for her hypotheses.
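The within-character versus across-character comparison can be sketched as follows, in Python for illustration only (the use case itself calls for R). Jaccard similarity over each phenotype's set of subsuming terms is used here as a stand-in for "one of several available metrics"; the input dictionaries, the simplification that each phenotype belongs to exactly one character, and the function names are all assumptions.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Optional, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two term sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def within_between(
    character_of: Dict[str, str], subsumers: Dict[str, Set[str]]
) -> Tuple[Optional[float], Optional[float]]:
    """Mean pairwise similarity for phenotype pairs within the same
    character and for pairs across different characters."""
    within, between = [], []
    for p, q in combinations(sorted(character_of), 2):
        sim = jaccard(subsumers[p], subsumers[q])
        same_character = character_of[p] == character_of[q]
        (within if same_character else between).append(sim)
    return (
        mean(within) if within else None,
        mean(between) if between else None,
    )
```

The two lists of pairwise values, rather than just their means, would be the input to whichever statistical test fits the hypothesis.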

Factors influencing semantic similarity

Haley is a data scientist interested in determining which factors, if any, account for the variance seen in, or can predict, semantic similarity between phenotypes. Such factors could include anatomy (such as anatomical regions or other groupings), taxonomy (such as clades at a certain level), size of matrix, year of publication, author of publication, etc. To test for such factors, Haley uses R to obtain the phenotypes (as ontological expressions) and the states, characters, and matrices that use them, from the knowledgebase, with those studies removed that would unduly bias the results (such as incompletely annotated studies). She also obtains the pairwise semantic similarity between all phenotypes according to one of several available metrics, considering all or only some relationship hierarchies, and then uses the results to conduct the appropriate factor analysis.

Visualizing semantic similarity properties for studies

Paulo is a systematist interested in better understanding the semantic information properties of different morphological character matrices. To generate questions for further investigation, he wants to start with visual representations of the statistics of semantic similarity metrics for a given character matrix, or across a set of matrices. Assuming that character matrices are identified by the study that published them, he uses R to obtain the phenotypes (as ontological expressions) and the states and characters that use them, for his study (or set of studies) of interest. He also obtains the pairwise semantic similarity between all phenotypes according to one of several available metrics, and then uses the results to generate a variety of visualizations, including frequency distribution plots for each study, and box plots comparing characters and studies.

Curation coverage

Walter is a comparative biologist who annotates morphological character matrices from published studies; the resulting annotations are loaded into the Phenoscape KB. To guide annotation plans and assess gaps in annotation coverage across taxa of interest, he needs a summary of the data annotated in the KB for a particular taxon or list of taxa of interest. First, Walter uses R to query the KB for the taxon or list of taxa, and to obtain a table listing each taxon with its corresponding number of annotated phenotypes, along with the family and order that the taxon belongs to (four columns in total). By seeing gaps in the curation coverage for a given taxon, and sorting the table by family or order, he can better target the taxa needing additional curation.

Second, Walter needs to determine which entities are lacking annotations for the taxa of interest. To do this, he uses R to obtain a table of the entities that have annotations for at least one taxon within the group of interest, with each taxon scored for whether or not it has an annotation for that entity. With this information, he can then prioritize annotation effort toward curating phenotypes for entities lacking data in the target taxa.
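Both coverage tables can be sketched as follows, in Python for illustration only (the use case itself calls for R). The annotation tuples (taxon, entity, phenotype), the `classification` lookup from taxon to family and order, and both function names are assumptions and do not reflect the Phenoscape KB API.

```python
from typing import Dict, Iterable, List, Tuple

Annotation = Tuple[str, str, str]  # assumed: (taxon, entity, phenotype)


def coverage_summary(
    taxa: List[str],
    annotations: Iterable[Annotation],
    classification: Dict[str, Tuple[str, str]],
) -> List[Tuple[str, str, str, int]]:
    """Rows of (taxon, family, order, number of distinct phenotypes)."""
    phenotypes = {taxon: set() for taxon in taxa}
    for taxon, _entity, phenotype in annotations:
        if taxon in phenotypes:
            phenotypes[taxon].add(phenotype)
    return [
        (taxon, *classification.get(taxon, ("?", "?")), len(phenotypes[taxon]))
        for taxon in taxa
    ]


def entity_gaps(
    taxa: List[str], annotations: Iterable[Annotation]
) -> Dict[str, Dict[str, bool]]:
    """entity -> taxon -> whether that taxon has an annotation for the
    entity, limited to entities annotated for at least one listed taxon."""
    wanted = set(taxa)
    annotated = {(t, e) for t, e, _ in annotations if t in wanted}
    entities = sorted({e for _, e in annotated})
    return {e: {t: (t, e) in annotated for t in taxa} for e in entities}
```

Sorting the summary rows by family or order, or scanning the second table for `False` cells, then points to the taxa and entities that need curation.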

Phylogenetic Comparative Approaches

Character and character state dependency

Presence/absence-based dependency

Sasha has a character matrix in R and wants to reconstruct stochastic character maps of these characters on a phylogeny using Sergei's PARAMO pipeline. To do so, Sasha must account for all dependencies between character states in the data matrix. For the example of presence/absence of anatomical entities, Sasha uses Rphenoscape to query with a list of anatomical trait entities, and receives in return a list of (bi)directional dependencies. Here, a dependency between characters A and B is defined as the presence of one implying the presence of the other, or the absence of one implying the absence of the other. This list of dependencies can then be used in Sergei's pipeline to amalgamate (lump) characters into suites of dependent characters so that they may be properly reconstructed using stochastic character mapping. This information allows the user to estimate ancestral states, compare rates across body regions, and analyze whole anatomy evolutionary rates.
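One plausible way to turn a list of pairwise dependencies into suites of dependent characters is to treat the dependencies as edges of a graph and take its connected components. The sketch below does this in Python with a small union-find structure, for illustration only. This is an assumption about how the lumping could be done, not a description of what the PARAMO pipeline or Rphenoscape actually implements; the function name `group_dependent` is hypothetical, direction of the dependencies is ignored, and every character named in a dependency is assumed to be in the input list.

```python
from typing import Dict, Iterable, List, Tuple


def group_dependent(
    characters: List[str], dependencies: Iterable[Tuple[str, str]]
) -> List[List[str]]:
    """Group characters into suites: two characters share a suite if they
    are linked by a chain of dependencies (direction is ignored)."""
    parent: Dict[str, str] = {c: c for c in characters}

    def find(c: str) -> str:
        # Follow parent links to the root, shortening the path on the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in dependencies:
        parent[find(a)] = find(b)  # merge the two suites

    suites: Dict[str, List[str]] = {}
    for c in characters:
        suites.setdefault(find(c), []).append(c)
    return sorted(suites.values())
```

Characters that take part in no dependency end up in a suite of their own and can be mapped independently.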

Mutual exclusivity of characters and character states

Sasha has a character matrix in R and wants to reconstruct stochastic character maps of these characters on a phylogeny using Sergei's PARAMO pipeline. To do so, aside from dependencies between character states, Sasha must account for mutual exclusivity between characters in the data matrix. For the example of presence/absence of anatomical entities, Sasha uses Rphenoscape to query with a list of anatomical trait entities, and receives in return a list of which of the input characters are mutually exclusive. This list can then be used in Sergei's pipeline to amalgamate (lump) characters into suites of dependent characters so that they may be properly reconstructed using stochastic character mapping. This information allows the user to estimate ancestral states, compare rates across body regions, and analyze whole anatomy evolutionary rates.