Hilmar Lapp edited this page Jan 17, 2019 · 16 revisions

Downloading Data

Phenotypes for a publication

Walter is a comparative biologist who annotates morphological character matrices from published studies; the resulting annotations are loaded into the Phenoscape KB. To be able to do X, Y, and Z, he needs to download from the KB the phenotypes for any of the studies (publications) that he has annotated. Walter has a list of publication identifiers (for example, DOIs). To accomplish his goal, Walter uses R to read the list of identifiers for the desired studies from a file, and to obtain the phenotypes from the KB for each study. The phenotypes are the non-redundant set of ontological expressions with which the studies' character data were annotated. He then does A, B, and C with the result.
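The read-and-collect step of Walter's workflow could be sketched as follows. This is only an illustration under stated assumptions: `fetch_phenotypes()` is a hypothetical stand-in for whatever function ends up querying the KB, and the identifiers and phenotype labels are placeholders, not real data.

```r
# Hypothetical stand-in for a KB lookup: returns the phenotype labels for one
# study. A real version would query the Phenoscape KB; this stub only returns
# canned placeholder values so that the sketch is runnable offline.
fetch_phenotypes <- function(study_id) {
  canned <- list(
    "doi:10.0000/example.1" = c("phenotype a", "phenotype b"),
    "doi:10.0000/example.2" = c("phenotype b", "phenotype c")
  )
  canned[[study_id]]
}

# Write a small identifier file so the example is self-contained.
id_file <- tempfile(fileext = ".txt")
writeLines(c("doi:10.0000/example.1", "", "doi:10.0000/example.2"), id_file)

# Read one identifier per line, dropping surrounding whitespace and blank lines.
study_ids <- trimws(readLines(id_file))
study_ids <- study_ids[nzchar(study_ids)]

# Fetch the phenotypes study by study, keeping track of which study gave what.
per_study <- lapply(study_ids, fetch_phenotypes)
names(per_study) <- study_ids

# The non-redundant set of phenotypes across all requested studies.
all_phenotypes <- unique(unlist(per_study, use.names = FALSE))
```

Keeping the per-study list alongside the combined set lets later steps work either study by study or on the pooled phenotypes.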

OntoTrace

Customized Presence-Absence Matrix

Laura is a comparative biologist. Her research compares presence-absence data for a defined list of taxa and a defined list of anatomical entities, and she therefore needs to obtain a presence-absence matrix for those taxa and anatomical entities as synthesized by OntoTrace. Because the taxa and anatomical entities don't fit nicely underneath higher-level groupings, Laura can't use the query-by-OWL expression feature for OntoTrace. Instead, she uses R to read in the list of taxa and the list of anatomical entities, to query OntoTrace for each anatomical entity across the list of taxa, and to merge the results into a single non-redundant presence-absence matrix. Because her list of anatomical terms is already curated, parts of the listed entities, and entities connected to them by other relationships, are not included in the results. However, the presence or absence of entities implied by those in her list is included in the result, just as would otherwise be the case for an OntoTrace synthesis. In the end Laura does A, B, and C with the merged dataset.
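The merge step described above might look roughly like the following base R sketch. It assumes that each per-entity query result has already been turned into a long-format data frame with `taxon`, `entity`, and `state` columns (0 for absent, 1 for present); those column names, the coding, and the toy values are assumptions made for illustration, not the format OntoTrace returns.

```r
# Toy per-entity results in long format. In practice each list element would
# come from one OntoTrace query; the values here are placeholders. The second
# result repeats a row from the first, as an implied entity might.
results <- list(
  data.frame(taxon = c("taxon A", "taxon B"),
             entity = "entity 1",
             state = c(1L, 1L),
             stringsAsFactors = FALSE),
  data.frame(taxon = c("taxon A", "taxon B", "taxon A"),
             entity = c("entity 2", "entity 2", "entity 1"),
             state = c(0L, 1L, 1L),
             stringsAsFactors = FALSE)
)

merge_presence_absence <- function(results) {
  # Stack all results and drop rows that are exact duplicates.
  long <- unique(do.call(rbind, results))
  taxa <- sort(unique(as.character(long$taxon)))
  entities <- sort(unique(as.character(long$entity)))
  # Taxa in rows, entities in columns; NA marks combinations with no data.
  m <- matrix(NA_integer_, nrow = length(taxa), ncol = length(entities),
              dimnames = list(taxa, entities))
  # Fill cells by name. Conflicting states for the same cell are not
  # reconciled in this sketch: the last row encountered wins.
  m[cbind(as.character(long$taxon), as.character(long$entity))] <- long$state
  m
}

merged <- merge_presence_absence(results)
```

A fuller version would need a rule for polymorphic or conflicting states rather than letting the last value overwrite earlier ones.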

Filtered Presence-Absence Matrix

Laura is a comparative biologist. Her research compares presence-absence data for types and parts underneath an anatomical entity grouping class across her taxonomic groups of interest. The taxonomic groups are several sister clades among a larger group of sibling groups. She uses R to obtain a presence-absence matrix for her taxonomic groups and for her group of anatomical entities as synthesized by OntoTrace, using the query-by-OWL expression feature. Because the anatomical entities in her list do XYZ, and her taxonomic groups of interest don't form a monophyletic clade, the query-by-OWL expression feature returns many anatomical entities that don't occur in her taxonomic groups of interest, and the resulting dataset also includes taxa she is not interested in. To remove those entities and taxa, she uses R to filter out of her matrix all taxa that aren't members of her groups, and then all anatomical entities that don't occur in her groups of interest. In the end Laura does A, B, and C with the filtered dataset.
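The two filtering steps could be sketched in base R as below. The sketch assumes the matrix has taxa as row names and anatomical entities as column names, with 1 for present, 0 for absent, and NA for no data. It also interprets "doesn't occur" as "has no presence state in any retained taxon"; that reading, like the toy values, is an assumption.

```r
# Toy presence-absence matrix (placeholder values): taxa in rows,
# anatomical entities in columns.
m <- matrix(
  c(1L, 0L, 1L,
    0L, 0L, 1L,
    NA, 1L, 1L),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("taxon A", "taxon B", "taxon C"),
                  c("entity 1", "entity 2", "entity 3"))
)

filter_matrix <- function(m, taxa_of_interest) {
  # Step 1: keep only the taxa that belong to the groups of interest.
  kept <- m[rownames(m) %in% taxa_of_interest, , drop = FALSE]
  # Step 2: keep only entities scored as present in at least one retained
  # taxon. NA (no data) does not count as an occurrence.
  occurs <- colSums(kept == 1L, na.rm = TRUE) > 0
  kept[, occurs, drop = FALSE]
}

filtered <- filter_matrix(m, taxa_of_interest = c("taxon B", "taxon C"))
```

Filtering taxa first matters here: "entity 1" is present only in "taxon A", so it drops out once that taxon has been removed.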

Obtaining supporting states

Laura is a comparative biologist. She has obtained an OntoTrace-synthesized presence/absence matrix for her taxonomic and anatomical groups of interest. To better understand unexpected presence and absence states, and polymorphic states that Laura believes to be in error, she needs to understand which presence/absence states were directly asserted and which were inferred by OntoTrace. She also needs to inspect the original character states (along with the publications they came from) that support the synthesized presence and absence states. Laura therefore does X, Y, Z.

Data Science & Statistics

Semantic similarity within and between characters

Haley is a data scientist interested in the distribution of semantic similarity properties of characters from published studies. In particular, Haley has a number of hypotheses about the average semantic similarity of phenotypes within a character versus across characters, and about comparing characters with multi-phenotype states to those with single-phenotype states. To test these hypotheses, Haley uses R to obtain the phenotypes (as ontological expressions) and the states and characters that use them, from a study under investigation, or from a list of studies. She also obtains the pairwise semantic similarity between all phenotypes according to one of several available metrics, and then uses the results to conduct the statistical tests for her hypotheses.
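As an illustration of the within-character versus across-character comparison, the following base R sketch splits pairwise similarity scores by whether the two phenotypes belong to the same character. The similarity values and the phenotype-to-character mapping are invented placeholders; in Haley's workflow both would come from the KB.

```r
# Toy symmetric similarity matrix between four phenotypes (placeholder values).
ids <- c("p1", "p2", "p3", "p4")
sim <- matrix(
  c(1.0, 0.8, 0.2, 0.1,
    0.8, 1.0, 0.3, 0.2,
    0.2, 0.3, 1.0, 0.6,
    0.1, 0.2, 0.6, 1.0),
  nrow = 4, byrow = TRUE, dimnames = list(ids, ids)
)

# Which character each phenotype annotates (an assumed mapping).
character_of <- c(p1 = "char 1", p2 = "char 1", p3 = "char 2", p4 = "char 2")

# Each unordered pair of distinct phenotypes once: the upper triangle.
pairs <- which(upper.tri(sim), arr.ind = TRUE)
values <- sim[pairs]

# TRUE where both phenotypes of a pair belong to the same character.
same_character <- unname(
  character_of[rownames(sim)[pairs[, "row"]]] ==
    character_of[colnames(sim)[pairs[, "col"]]]
)

within_scores <- values[same_character]
between_scores <- values[!same_character]

mean_within <- mean(within_scores)
mean_between <- mean(between_scores)
```

The two score vectors could then be passed to a statistical test of Haley's choosing, for example `wilcox.test(within_scores, between_scores)`.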

Factors influencing semantic similarity

Haley is a data scientist interested in determining which factors, if any, account for the variance seen in semantic similarity between phenotypes, or can predict it. Such factors could include anatomy (such as anatomical regions or other groupings), taxonomy (such as clades at a certain level), size of matrix, year of publication, author of publication, etc. To test for such factors, Haley uses R to obtain from the knowledgebase the phenotypes (as ontological expressions) and the states, characters, and matrices that use them, excluding studies that would unduly bias the results (such as incompletely annotated studies). She also obtains the pairwise semantic similarity between all phenotypes according to one of several available metrics, and then uses the results to conduct the appropriate factor analysis.

Visualizing semantic similarity properties for studies

Paulo is a systematist interested in better understanding the semantic information properties of different morphological character matrices. To generate questions for further investigation, he wants to start with visual representations of the statistics of semantic similarity metrics for a given character matrix, or across a set of matrices. Assuming that character matrices are identified by the study that published them, he uses R to obtain the phenotypes (as ontological expressions) and the states and characters that use them, for his study (or set of studies) of interest. He also obtains the pairwise semantic similarity between all phenotypes according to one of several available metrics, and then uses the results to generate a variety of visualizations, including frequency distribution plots for each study, and box plots comparing characters and studies.
