In year 1, we started with environmental data and phenotype and genotype data files from TERRA-REF and later added data from KSU and Clemson. We developed ML/statistical models and a Bayesian Belief Network for making predictions.
In year 2, we found an excellent cottonwood and corn data set. We are currently exploring those data sets to see if they are large enough to train a neural network. We starting collaborating with the Dogwood Genome Project to examine NPN data from the same dogwood clone and to sequence additional clones. We investigated the possibility of applying the sorghum-trained model on Sorghastrum, a wild relative of sorghum with data in NPN, but there were not enough data points.
The documentation for the genomic data is here. There are HapMap files and VCF files available. The difference between VCF and HapMap files is explained here. The HapMap file was downloaded from Cyverse. The initial decision to use HapMap instead of VCF was likely due to the convenience of already having one HapMap file. The VCF files need to be concatenated. The genotype data contain information on X cultivars.
The HapMap file and the phenotype file was used with Tassel to find the relevant SNPs (rows in the HapMap file) using GWAS. The result of the GWAS is here. The hapmap-slim.vcf file contains only the SNPs that were found to be correlated to the phenotypes of interest by the GWAS. The results with the HapMap file did not contain any information about heterozygosity, so we decided to use the VCF files even though they had to be concatenated first.
The VCF files were concatenated and can be found here. These files were exported to Tassel for GWAS. The output can be found here. The GWAS was used to filter the concatenated VCF file to the "Slim VCF" file that contains only the genes that the GWAS found to be useful for the phenotypes of interest. The Slim VCF was used to create a distance matrix for ML input. Unfortunately, this data did not improve the performance of the algorithms. We suspected that was because the cultivars were so closely related, that this distance matrix was not capturing useful genomic features. More information about this approach can be found here.
After this result, we decided to go back to the VCF files and explore creating a distance matrix based on information theory rather than genomics. Documentation for this work can be found here and here. First, The VCF files were filtered to remove SNPs that were identical to the reference sorghum genome. The SNPs were filtered down by alternately filtered for the SNPs that are most informative and filtering out cultivars that don't have enough data. The filtering process produced a matrix of jaccard distances for the ML. This matrix worked better, but still not as well as we would like. Now we will remove all cultivars from the genomic data set that don't have data in the phenotype data set.
The knowledge graphs at Monarch, Gramene, and Planteome were queried to find genes that impact the phenotypes of interest. The knowledge graphs were queried using ontology terms that mapped to the phenotypes here and additional related GO terms. The results of these queries are here. We used Gramene biomart to find Sorghum orthologs for Arabidopsis genes that were identified as potentially relevant by Monarch. Additionally, we used the flowering time pathway for Arabidopsis in WikiPathways to identify more relevant genes. There is much less data for the other phenotypes. Instructions for using Gramene Biomart are here. When looking at the SNPs of interest using information theory, we looked to see if any of the genes are also present in the knowledge graph queries. there were very few - perhaps seven. We anticpate that the KG queries will become more relevant when we start examining more species.
After a discussion with Addie Thompson at MSU (in July 2021), we realized we took the wrong approach with the gene data. We will now repeat the gene data manipulations for the sorghum files with different parameters for the GWAS analysis. We will also convert the entirety of the VCF file into a matrix instead of relying on filtered data.
The dogwood genome is available here for the appalachian spring cultivar. Our plan is to look at NPN data from the same cultivar (see this project) in different environments. We are collaborating with the Dogwood Genome Project to sequence additional dogwood cultivars with NPN data in the same environment for comparison.
Populus trichocarpa has been the subject of research by the Dept of Energy, especially at the Oak Ridge National Laboratory, for decades. There is a large data set of phenotypes from sequenced trees collected over decades. We are collaborating with this research group to prepare these data for ML training. So far, it looks like we have data from about 1200 cultivars. The genomic and phenotype data are being processed.
We have become aware of the large Genomes2Fields data set and are examining it to see if it is large enough to train a neural network.
The phenotype data was transformed into this file for use in the ML model and this file for GWAS. Phenotypes were mapped to ontology terms. The phenotype data were obtained from TERRA-REF using the Traits R package. This includes the Arizona, KSU, and Clemson data.
Genes relevant for the phenotypes of interest were found using Gramene and are here.
Phenotypes for ML are:
- GDD to flowering
- GDD to flagleaf emergence
- End of season canopy height
- End of season above ground dry biomass
The code used to prepare the phenotype data are currently being refactored to create a "story notebook" that contains the data, the code, and documentation.
Environmental parameters are:
- Row and column position
- Soil
- Air temperature
- Relative humidity
- GDD
- Wind speed and direction
- Precipitation
The documentation for the Bayesian Belief Network can be found in its repo. https://github.com/genophenoenvo/phenophasebbn
The machine learning team used the phenotype, genomic, and environment data to train and test a model to predict flowering time, canopy height, and flag leaf emergence. This work is described here.
A graph was created by combining the Gene Ontology, the Plant Trait Ontology, the Plant Ontology, NCBITaxon, and Planteome using the KG-Hub code. The Planteome data used included gene, ortholog, QTL, and germplasm data for Sorghum bicolor, Zea mays, Oryza sativa, Populus trichocarpa, and Arabidopsis thaliana. The code for constructing the graph is here. The lastest version of the graph is here. We plan to use the graph to make an embedding for using in visualization and ML.