Current human genome sequencing assays in both clinical and research settings primarily use short-read sequencing and apply resequencing pipelines to detect genetic variants. However, these mapping-based data analysis pipelines still face a considerable challenge from an incomplete reference genome, mapping errors, and high sequence divergence.
To overcome this challenge, we propose ConnectedReads, an efficient and effective whole-read assembly workflow that runs unsupervised graph mining algorithms on Apache Spark, a large-scale data processing platform. By fully utilizing the information in short-read data, ConnectedReads generates assembled contigs that let downstream pipelines perform structural variant (SV) discovery at higher resolution than other methods, especially in regions that diverge strongly from the reference and in N-gap regions of the reference. Furthermore, we demonstrate a cost-effective approach that leverages ConnectedReads to investigate the full spectrum of genetic changes in population-scale studies.
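To make the idea of graph-based read assembly concrete, here is a minimal single-machine sketch in plain Python. It is not the ConnectedReads implementation and does not use Apache Spark: it only illustrates, under simplifying assumptions (error-free reads, exact suffix-prefix overlaps, no reverse complements), how reads can be linked in an overlap graph and merged into contigs along unambiguous paths. The names `overlap`, `assemble`, and `min_len` are hypothetical and chosen for this illustration.

```python
from collections import defaultdict


def overlap(a, b, min_len):
    """Length of the longest proper suffix of `a` equal to a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(reads, min_len=3):
    """Toy assembler: merge reads along unambiguous overlap-graph paths."""
    reads = sorted(set(reads))  # drop duplicates, fix a deterministic order
    succ, pred = defaultdict(list), defaultdict(list)
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(a, b, min_len)
                if k:
                    succ[a].append((b, k))  # edge a -> b with overlap k
                    pred[b].append(a)

    def is_chain_interior(r):
        # True when r is the only successor of its only predecessor.
        return len(pred[r]) == 1 and len(succ[pred[r][0]]) == 1

    contigs, used = [], set()

    def walk(start):
        # Extend the contig while the path has exactly one way forward.
        contig, cur = start, start
        used.add(start)
        while len(succ[cur]) == 1:
            nxt, k = succ[cur][0]
            if len(pred[nxt]) != 1 or nxt in used:
                break
            contig += nxt[k:]  # append only the non-overlapping tail
            used.add(nxt)
            cur = nxt
        return contig

    for r in reads:
        if not is_chain_interior(r):
            contigs.append(walk(r))
    for r in reads:  # any read still unused lies on a simple cycle
        if r not in used:
            contigs.append(walk(r))
    return contigs


if __name__ == "__main__":
    example_reads = ["ATGGCG", "GGCGTA", "CGTACC", "TACCTG", "CCTGAT", "TTTTTT"]
    print(assemble(example_reads))
```

The all-pairs overlap search here is quadratic in the number of reads, which is only workable for toy inputs; a distributed platform is one way to scale this kind of graph construction to whole-genome data.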
Interested in contributing? See CONTRIBUTING.
ConnectedReads is licensed under the terms of the Apache 2.0 License.
ConnectedReads happily makes use of many open source packages. We'd like to specifically call out a few key ones:
We thank all of the developers and contributors to these packages for their work.
- This is not an official Atgenomix product.
- For the full experience of the official product, please contact Atgenomix (info@atgenomix.com).