Josef Uyeda edited this page Feb 22, 2019 · 16 revisions

OntoTrace

Customized Presence-Absence Matrix

Laura is a comparative biologist. Her research compares presence-absence data for a defined list of taxa and a defined list of anatomical entities, so she needs a presence-absence matrix for those taxa and entities as synthesized by OntoTrace. Because the taxa and anatomical entities don't fit neatly under higher-level groupings, Laura can't use the query-by-OWL expression feature of OntoTrace. Instead, she uses R to read in the list of taxa and the list of anatomical entities, to query OntoTrace for each anatomical entity across the list of taxa, and to merge the results into a single non-redundant presence-absence matrix. Because her list of anatomical terms is already curated, parts of those entities, and entities connected to them by other relationships, are not included in the results. However, the presence or absence of entities implied by those in her list is included, just as it would be in any other OntoTrace synthesis. In the end, Laura reads this customized matrix into Mesquite and merges it with a phylogenetic tree for the input taxa to perform ancestral reconstructions for each anatomical entity in the matrix.
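The merge step in this use case can be sketched as follows. The use case itself calls for R; this is a minimal Python illustration with made-up data, and it makes no calls to OntoTrace or to any Phenoscape client. The function names, the `(taxon, entity, state)` row layout, and the `0`/`1`/`0&1`/`?` cell encoding are assumptions chosen for the example.

```python
def encode(states):
    """Encode a set of 0/1 observations as '?', '0', '1', or '0&1'."""
    if not states:
        return "?"
    return "&".join(str(s) for s in sorted(states))


def merge_results(per_entity_results, taxa):
    """Merge per-entity query results into one taxon-by-entity matrix."""
    cells = {}
    for rows in per_entity_results:
        for taxon, entity, state in rows:
            # A set absorbs duplicate cells returned by more than one query.
            cells.setdefault((taxon, entity), set()).add(state)
    # Columns include implied entities, not only the ones queried for.
    entities = sorted({entity for _, entity in cells})
    matrix = {
        taxon: {e: encode(cells.get((taxon, e), set())) for e in entities}
        for taxon in taxa
    }
    return entities, matrix


# Hypothetical results of two per-entity queries (made-up values).
results = [
    [("Taxon A", "fin", 1), ("Taxon B", "fin", 0)],
    [("Taxon A", "fin", 1), ("Taxon A", "fin ray", 1),
     ("Taxon B", "fin ray", 0), ("Taxon B", "fin ray", 1)],
]
entities, matrix = merge_results(results, ["Taxon A", "Taxon B", "Taxon C"])
for taxon, row in matrix.items():
    print(taxon, [row[e] for e in entities])
```

Keying cells by `(taxon, entity)` is what makes the merged matrix non-redundant: a cell that comes back from several queries is stored once, and conflicting observations collapse into a polymorphic `0&1` state.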

Filtered Presence-Absence Matrix

Laura is a comparative biologist. Her research compares presence-absence data for the types and parts that fall under an anatomical entity grouping class across her taxonomic groups of interest. The taxonomic groups are several sister clades among a larger group of sibling groups. She uses R to obtain a presence-absence matrix for her taxonomic groups and for her group of anatomical entities as synthesized by OntoTrace, using the query-by-OWL expression feature. Because the anatomical entities in her list do XYZ, and her taxonomic groups of interest don't form a monophyletic clade, the query returns many anatomical entities that don't occur in her taxonomic groups of interest, and the resulting dataset also includes taxa she is not interested in. With the query-by-OWL expression feature, it is not apparent which terms do not occur in the groups of interest without further investigation, so these terms have to be filtered out after the matrix has been generated. To do this, she uses R to remove from her matrix all taxa that aren't members of her groups and all anatomical entities that don't occur in her groups of interest. In the end, Laura reads the filtered matrix into Mesquite and merges it with a phylogenetic tree for her taxonomic groups to perform ancestral reconstructions for each anatomical entity in the matrix.
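The filtering step might look like the sketch below. It is written in Python for illustration only (the use case itself uses R), and it rests on one loudly labeled assumption: an entity "does not occur" in the groups of interest when every retained taxon has the missing state `?` for it. The function name and the nested-dictionary matrix layout are hypothetical.

```python
def filter_matrix(matrix, taxa_of_interest, missing="?"):
    """Drop unwanted taxa, then drop entities with no data in what remains."""
    # Step 1: keep only rows for taxa that belong to the groups of interest.
    kept = {t: row for t, row in matrix.items() if t in taxa_of_interest}
    # Step 2 (assumption): an entity is kept only if at least one retained
    # taxon has a non-missing state for it.
    informative = sorted(
        {e for row in kept.values() for e, state in row.items() if state != missing}
    )
    return {t: {e: row.get(e, missing) for e in informative} for t, row in kept.items()}


# Made-up matrix: "Taxon X" is outside the groups of interest, and "barbel"
# has data only for that taxon.
matrix = {
    "Taxon A": {"fin": "1", "fin ray": "1", "barbel": "?"},
    "Taxon B": {"fin": "0", "fin ray": "?", "barbel": "?"},
    "Taxon X": {"fin": "1", "fin ray": "1", "barbel": "1"},
}
filtered = filter_matrix(matrix, {"Taxon A", "Taxon B"})
for taxon, row in filtered.items():
    print(taxon, row)
```

The order matters: taxa are removed first, because an entity may have data only for taxa that are about to be dropped.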

Obtaining supporting states

Laura is a comparative biologist. She has obtained an OntoTrace-synthesized presence/absence matrix for her taxonomic and anatomical groups of interest. To better understand unexpected presence and absence states, and polymorphic states that she believes to be in error, she needs to know which presence/absence states were directly asserted and which were inferred by OntoTrace. She also needs to inspect the original character states, along with the publications they came from, that support the synthesized presence and absence states. Laura uses the matrix as input to a Python pipeline that extracts this information into a table, with separate columns for asserted data and for data inferred by OntoTrace. This allows her to identify conflicting statements in the original data and to check the original character states in the publications for accuracy.
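A minimal sketch of the tabulation step, assuming the supporting states have already been retrieved as a list of records. The record fields (`taxon`, `entity`, `state`, `asserted`, `source_state`, `publication`) are hypothetical names for this example and do not describe the actual format returned by OntoTrace or the KB; the data are invented.

```python
def support_table(records):
    """Group supporting states per cell into asserted and inferred columns."""
    table = {}
    for r in records:
        row = table.setdefault(
            (r["taxon"], r["entity"]), {"asserted": [], "inferred": []}
        )
        column = "asserted" if r["asserted"] else "inferred"
        row[column].append((r["state"], r["source_state"], r["publication"]))
    for row in table.values():
        # A cell conflicts when its support points to more than one state.
        states = {state for col in ("asserted", "inferred") for state, _, _ in row[col]}
        row["conflict"] = len(states) > 1
    return table


# Invented records; "Study 1" and "Study 2" are placeholders, not citations.
records = [
    {"taxon": "Taxon A", "entity": "fin ray", "state": "present", "asserted": True,
     "source_state": "fin rays branched", "publication": "Study 1"},
    {"taxon": "Taxon A", "entity": "fin ray", "state": "absent", "asserted": False,
     "source_state": "fin absent", "publication": "Study 2"},
    {"taxon": "Taxon B", "entity": "fin", "state": "present", "asserted": True,
     "source_state": "fin present", "publication": "Study 1"},
]
table = support_table(records)
for cell, row in table.items():
    print(cell, row)
```

Cells flagged with `conflict` are the ones worth checking against the original publications first.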

Data Science & Statistics

Semantic similarity within and between characters

Haley is a data scientist interested in the distribution of semantic similarity properties of characters from published studies. In particular, Haley has a number of hypotheses about the average semantic similarity of phenotypes within a character versus across characters, and about comparing characters with multi-phenotype states to those with single-phenotype states. To test these hypotheses, Haley uses R to obtain the phenotypes (as ontological expressions) and the states and characters that use them, from a study under investigation, or from a list of studies. She also obtains the pairwise semantic similarity between all phenotypes according to one of several available metrics, and then uses the results to conduct the statistical tests for her hypotheses.
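The within-versus-across comparison can be sketched as below, assuming the pairwise similarity scores are already in hand. This is an illustrative Python sketch rather than the R workflow the use case describes. It simplifies by assuming each phenotype belongs to exactly one character. The function name and the scores are invented, and no particular similarity metric is implied.

```python
from itertools import combinations
from statistics import mean


def within_between(phenotype_character, similarity):
    """Return mean similarity within characters and across characters."""
    within, between = [], []
    for a, b in combinations(sorted(phenotype_character), 2):
        score = similarity[frozenset((a, b))]
        if phenotype_character[a] == phenotype_character[b]:
            within.append(score)
        else:
            between.append(score)
    # None signals that no pair of that kind exists.
    return (mean(within) if within else None,
            mean(between) if between else None)


# Invented data: phenotypes p1 and p2 share character c1; p3 belongs to c2.
phenotype_character = {"p1": "c1", "p2": "c1", "p3": "c2"}
similarity = {
    frozenset(("p1", "p2")): 0.75,
    frozenset(("p1", "p3")): 0.25,
    frozenset(("p2", "p3")): 0.5,
}
print(within_between(phenotype_character, similarity))
```

The two lists of scores, rather than only their means, are what a statistical test of the hypotheses would take as input.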

Factors influencing semantic similarity

Haley is a data scientist interested in determining which factors, if any, account for the variance seen in semantic similarity between phenotypes, or can predict it. Such factors could include anatomy (such as anatomical regions or other groupings), taxonomy (such as clades at a certain level), size of matrix, year of publication, author of publication, etc. To test for such factors, Haley uses R to obtain from the knowledgebase the phenotypes (as ontological expressions) and the states, characters, and matrices that use them, excluding studies that would unduly bias the results (such as incompletely annotated ones). She also obtains the pairwise semantic similarity between all phenotypes according to one of several available metrics, and then uses the results to conduct the appropriate factor analysis.

Visualizing semantic similarity properties for studies

Paulo is a systematist interested in better understanding the semantic information properties of different morphological character matrices. To generate questions for further investigation, he wants to start with visual representations of the statistics of semantic similarity metrics for a given character matrix, or across a set of matrices. Assuming that character matrices are identified by the study that published them, he uses R to obtain the phenotypes (as ontological expressions) and the states and characters that use them, for his study (or set of studies) of interest. He also obtains the pairwise semantic similarity between all phenotypes according to one of several available metrics, and then uses the results to generate a variety of visualizations, including frequency distribution plots for each study, and box plots comparing characters and studies.

Curation coverage

Walter is a comparative biologist who annotates morphological character matrices from published studies; the resulting annotations are loaded into the Phenoscape KB. To guide annotation plans and assess gaps in annotation coverage across taxa of interest, he needs a summary of the data annotated in the KB for a particular taxon or list of taxa. First, Walter uses R to query the KB for the taxon or list of taxa, and to obtain a four-column table listing each taxon, its number of annotated phenotypes, and the family and order it belongs to. By looking for gaps in curation coverage and sorting the table by family or order, he can better target the taxa needing additional curation.
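The four-column summary could be assembled along these lines once the annotations have been retrieved. This is a hedged Python sketch with invented data; it does not query the KB, and the input shapes (a list of `(taxon, phenotype)` pairs and a `{taxon: (family, order)}` lookup) are assumptions made for the example.

```python
def coverage_table(annotations, classification):
    """Build rows of (taxon, phenotype count, family, order)."""
    # Start every taxon of interest at zero so that gaps stay visible.
    phenotypes = {taxon: set() for taxon in classification}
    for taxon, phenotype in annotations:
        if taxon in phenotypes:
            phenotypes[taxon].add(phenotype)  # a set counts each phenotype once
    rows = [
        (taxon, len(found), *classification[taxon])
        for taxon, found in phenotypes.items()
    ]
    # Sort by order, then family, then fewest annotations first.
    return sorted(rows, key=lambda r: (r[3], r[2], r[1], r[0]))


# Invented classification and annotations.
classification = {
    "Taxon A": ("Family 1", "Order 1"),
    "Taxon B": ("Family 1", "Order 1"),
    "Taxon C": ("Family 2", "Order 1"),
}
annotations = [
    ("Taxon A", "p1"), ("Taxon A", "p2"), ("Taxon A", "p1"), ("Taxon C", "p3"),
]
for row in coverage_table(annotations, classification):
    print(row)
```

Within each family, the least-annotated taxa sort to the top, which is one way to surface curation targets.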

Second, Walter needs to determine which entities are lacking annotations for the taxa of interest. To do this, he uses R to obtain a table of the entities that have annotations for at least one taxon within the group of interest, with each taxon scored for whether or not it has an annotation for that entity. With this information, he can prioritize annotation effort toward curating phenotypes for entities lacking data in the target taxa.
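The entity-by-taxon scoring can be sketched as follows, again in Python with invented data and without any KB query. The function name, the `(taxon, entity)` input pairs, and the "most gaps first" ranking are assumptions for illustration.

```python
def annotation_gaps(annotations, taxa):
    """Score each taxon for each entity annotated in at least one taxon."""
    # Ignore annotations for taxa outside the group of interest.
    annotated = {(t, e) for t, e in annotations if t in taxa}
    entities = sorted({e for _, e in annotated})
    table = {e: {t: (t, e) in annotated for t in taxa} for e in entities}
    # Rank entities by how many taxa lack an annotation, most gaps first.
    priority = sorted(
        entities, key=lambda e: (-sum(not has for has in table[e].values()), e)
    )
    return table, priority


# Invented data: "Taxon X" is outside the group of interest.
taxa = ["Taxon A", "Taxon B", "Taxon C"]
annotations = [
    ("Taxon A", "fin"), ("Taxon B", "fin"), ("Taxon C", "fin"),
    ("Taxon A", "scale"), ("Taxon X", "tooth"),
]
table, priority = annotation_gaps(annotations, taxa)
print(table)
print(priority)
```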

Phylogenetic Comparative Approaches

PARAMO pipeline query

A user has a character matrix in R and wants to reconstruct stochastic character maps of these characters on a phylogeny using Sergei's PARAMO pipeline. To do so, they should ideally account for all dependencies between character states as well as mutual exclusivity between characters in the data matrix. For the example of presence/absence of anatomical entities, the user would like to query Phenoscape with a list of their anatomical trait entities (possibly by IRI) and get back a list of (bi)directional dependencies. A similar query could report which of the supplied characters are mutually exclusive (this may be less relevant for anatomical characters). The resulting list can then be used in Sergei's pipeline to amalgamate (lump) characters into suites of dependent characters so that they can be properly reconstructed using stochastic character mapping. This information allows the user to estimate ancestral states, compare rates across body regions, and analyze evolutionary rates across the whole anatomy.
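One way to turn a list of pairwise dependencies into suites of dependent characters is to take the connected components of the dependency graph. The Python sketch below does that with a small union-find. It is an illustration of the grouping idea only, not the PARAMO implementation, and it treats every dependency as an undirected link. The function name, the pair layout, and the example entities are assumptions.

```python
def dependency_groups(characters, dependencies):
    """Group characters into suites linked by one or more dependencies."""
    parent = {c: c for c in characters}

    def find(c):
        # Follow parent links to the root, halving the path along the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in dependencies:
        # Skip pairs that mention a character outside the supplied list.
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    groups = {}
    for c in characters:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(group) for group in groups.values())


# Invented example: presence of "fin ray" depends on presence of "fin".
characters = ["fin", "fin ray", "scale", "tooth"]
dependencies = [("fin ray", "fin")]
print(dependency_groups(characters, dependencies))
```

Characters that end up alone in a group have no recorded dependency on any other character in the list and can be handled on their own.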
