Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ref: simphenotype #64

Merged
merged 45 commits into from
Jul 20, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
0116bbf
start on simphenotype refactor
aryarm May 14, 2022
4fbb293
remove original simphenotype code
aryarm May 14, 2022
30e5949
continue adding support for other file formats
aryarm May 15, 2022
06e3806
refmt with black
aryarm May 16, 2022
52d90a0
move transform subcommand to its own module
aryarm Jun 13, 2022
e3c9df9
setup PhenoSimulator.__init__ method
aryarm Jun 13, 2022
49341e6
add Genotypes.subset() method
aryarm Jun 16, 2022
a0ae5e1
also allow indexing in GenotypesRefAlt class
aryarm Jun 16, 2022
f2a3017
simplify the transform method of the Haplotype and Haplotypes classes
aryarm Jun 16, 2022
a13f94a
add region and samples params to simphenotype
aryarm Jun 16, 2022
fa55ee8
raise error if any samples or variants are invalid in transform subco…
aryarm Jun 16, 2022
5542ccb
copy transform subroutine to simphenotypes
aryarm Jun 16, 2022
def4826
rename covar and pheno files
aryarm Jul 1, 2022
bffd27c
describe phenotypes in a separate page of the docs
aryarm Jul 1, 2022
afcdda5
support reading PLINK2-style pheno and covar files
aryarm Jul 1, 2022
2849dca
clarify transform documentation
aryarm Jul 1, 2022
a76e1a6
add Phenotypes.write() method
aryarm Jul 1, 2022
80bf943
refmt with black
aryarm Jul 1, 2022
d249206
prelim finish PhenoSimulator.run
aryarm Jul 1, 2022
e00ff2b
try simulating phenotypes from multiple haplotypes, additively
aryarm Jul 2, 2022
05430ee
merge haplotype module with sim_phenotypes module
aryarm Jul 2, 2022
0636294
compute heritability when not provided in simphenotype
aryarm Jul 2, 2022
54f48de
implement case/control
aryarm Jul 2, 2022
7fba06c
improve commenting in __main__
aryarm Jul 2, 2022
6c9d7ce
rename sim_phenotypes module to sim_phenotype (singular)
aryarm Jul 2, 2022
c6f1c5c
create test module for simphenotype
aryarm Jul 3, 2022
1d96775
try to use new pysam add_samples() method to speed up vcf creation
aryarm Jul 6, 2022
23a067a
use default heritability of 1 and add append method to Phenotypes class
aryarm Jul 6, 2022
30852bc
clarify installation and contribution guidelines
aryarm Jul 6, 2022
fe63e32
seed the simphenotype random number generator
aryarm Jul 7, 2022
24fef39
mark case/control in pheno output
aryarm Jul 8, 2022
9de40ee
add tests for simphenotype
aryarm Jul 8, 2022
9a616fd
finish testing case/control simphenotype
aryarm Jul 8, 2022
b7e0dc0
ensure written pheno file has unique names
aryarm Jul 8, 2022
f02cb93
refmt with black
aryarm Jul 8, 2022
d2f144c
add docs for phenosimulator to API section
aryarm Jul 8, 2022
7291380
rename phenotype simulation tests
aryarm Jul 8, 2022
81e48a3
revise simphenotype command docs after changes to CLI
aryarm Jul 14, 2022
8b5de5c
oops - fix typos
aryarm Jul 14, 2022
b237872
add non-gcta model to simphenotype
aryarm Jul 19, 2022
9f9d0ea
switch if else in simphenotype to fix bug
aryarm Jul 19, 2022
6aefc33
adjust logging of info in PhenoSimulator.run()
aryarm Jul 19, 2022
38931d7
warn about case/control encoding in simphenotype docs
aryarm Jul 19, 2022
3f9af3a
log expected heritability and expand docs
aryarm Jul 19, 2022
0359ced
fix conflicts with main branch
aryarm Jul 20, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 8 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,10 @@ Homepage: https://haptools.readthedocs.io/

## Installation

UNDER CONSTRUCTION
We have not officially published `haptools` yet, but in the meantime, you can install it directly from our Github repository.
```bash
pip install git+https://github.com/gymrek-lab/haptools.git
```

## Haptools utilities

Expand All @@ -21,12 +24,13 @@ Haptools consists of multiple utilities listed below. Click on a utility to see

* [`haptools karyogram`](docs/commands/karyogram.md): Visualize a "chromosome painting" of local ancestry labels based on breakpoints output by `haptools simgenome`.

* [`haptools transform`](https://haptools.readthedocs.io/en/latest/commands/transform.html): Transform a set of genotypes via a list of haplotypes. Create a new VCF containing haplotypes instead of variants.

Outputs produced by these utilities are compatible with each other. For example
`haptools simgenome` outputs a VCF file with local ancestry information annotated for each variant. The output VCF file can be used as input to `haptools simphenotype` to simulate phenotype information. `haptools simgenome` also outputs a list of local ancestry breakpoints which can be visualized using `haptools karyogram`.

## Contributing

If you are interested in contributing to `haptools`, please get in touch by submitting a Github issue or contacting us at mlamkin@ucsd.edu.


We gladly welcome any contributions to `haptools`!

Please read [our contribution guidelines](https://haptools.readthedocs.io/en/latest/project_info/contributing.html) and then submit a [Github issue](https://github.com/gymrek-lab/haptools/issues).
10 changes: 9 additions & 1 deletion docs/api/haptools.rst
Original file line number Diff line number Diff line change
Expand Up @@ -52,8 +52,16 @@ haptools.data.haplotypes module
:undoc-members:
:show-inheritance:

haptools.sim_phenotype module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: haptools.sim_phenotype
:members:
:undoc-members:
:show-inheritance:

haptools.sim_genotype module
~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: haptools.sim_genotype
:members:
Expand Down
2 changes: 1 addition & 1 deletion docs/commands/karyogram.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Haptools karyogram
# karyogram

`haptools karyogram` takes as input a breakpoints file (e.g. as output by `haptools simgenotype`) and a sample name, and plots a karyogram depicting local ancestry tracks.

Expand Down
76 changes: 0 additions & 76 deletions docs/commands/simphenotype.md

This file was deleted.

81 changes: 79 additions & 2 deletions docs/commands/simphenotype.rst
Original file line number Diff line number Diff line change
@@ -1,4 +1,81 @@
.. _commands-simphenotype:

.. include:: simphenotype.md
:parser: myst_parser.sphinx_

simphenotype
============

Simulates a complex trait, taking into account haplotype- or local-ancestry- specific effects as well as traditional variant-level effects. The user denotes causal variables to use within the simulation by specifying them in a ``.hap`` file.

The implementation is based on the `GCTA GWAS Simulation <https://yanglab.westlake.edu.cn/software/gcta/#GWASSimulation>`_ utility.

Usage
~~~~~
.. code-block:: bash

haptools simphenotype \
--replications INT \
--heritability FLOAT \
--prevalence FLOAT \
--region TEXT \
--sample SAMPLE \
--samples-file FILENAME \
--output PATH \
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
GENOTYPES HAPLOTYPES

Model
~~~~~
Each normalized haplotype :math:`\vec{Z_j}` is encoded as an independent causal variable in a linear model:

.. math::

\vec{y} = \sum_j \beta_j \vec{Z_j} + \vec \epsilon

where

.. math::

\epsilon_i \sim N(0, \sigma^2)

.. math::

\sigma^2 = Var[\sum_j \beta_j \vec{Z_j}] * (\frac 1 {h^2} - 1)

The heritability :math:`h^2` is user-specified, but if it is not provided, then :math:`\sigma^2` will be computed purely from the effect sizes, instead:

.. math::

\sigma^2 = \Biggl \lbrace {1 - \sum \beta_j^2 \quad \quad {\sum \beta_j^2 \le 1} \atop 0 \quad \quad \quad \quad \quad \text{ otherwise }}

If a prevalence for the disease is specified, the final :math:`\vec{y}` value will be thresholded to produce a binary case/control trait with the desired fraction of diseased individuals.

Output
~~~~~~
Phenotypes are output in the PLINK2-style ``.pheno`` file format. If ``--replications`` was set to greater than 1, additional columns are output for each simulated trait.

Note that case/control phenotypes are encoded as 0 (control) + 1 (case) **not** 1 (control) + 2 (case). The latter is used by PLINK2 unless the ``--1`` flag is used (see `the PLIN2 docs <https://www.cog-genomics.org/plink/2.0/input#input_missing_phenotype>`_). Therefore, you must use ``--1`` when providing our ``.pheno`` files to PLINK.

Examples
~~~~~~~~
.. code-block:: bash

haptools simphenotype -o simulated.pheno tests/data/example.vcf.gz tests/data/simphenotype.hap

Simulate two replicates of a case/control trait that occurs in 60% of your samples with a heritability of 0.8. Encode all of the haplotypes in ``tests/data/example.hap.gz`` as independent causal variables.

.. code-block:: bash

haptools simphenotype \
--replications 2 \
--heritability 0.8 \
--prevalence 0.6 \
--output simulated.pheno \
tests/data/example.vcf.gz tests/data/example.hap.gz

Detailed Usage
~~~~~~~~~~~~~~

.. click:: haptools.__main__:main
:prog: haptools
:show-nested:
:commands: simphenotype
2 changes: 1 addition & 1 deletion docs/commands/transform.rst
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ Examples
~~~~~~~~
.. code-block:: bash

haptools transform tests/data/example.vcf.gz tests/data/example.hap.gz | less
haptools transform tests/data/example.vcf.gz tests/data/basic.hap.gz | less

.. code-block:: bash

Expand Down
7 changes: 7 additions & 0 deletions docs/formats/genotypes.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
.. _formats-genotypes:


Genotypes
=========

Genotype files must be specified as VCF or BCF files.
4 changes: 2 additions & 2 deletions docs/formats/haplotypes.rst
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
.. _formats-haplotypes:


.hap
====
Haplotypes
==========

This document describes our custom file format specification for haplotypes: the ``.hap`` file.

Expand Down
43 changes: 0 additions & 43 deletions docs/formats/inputs.rst

This file was deleted.

28 changes: 28 additions & 0 deletions docs/formats/phenotypes.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
.. _formats-phenotypes:


Phenotypes and Covariates
=========================

Phenotype file format
---------------------
Phenotypes are expected to follow `the PLINK2 .pheno file format <https://www.cog-genomics.org/plink/2.0/input#pheno>`_. This is a
tab-separated format where the first column corresponds to the sample ID, and
subsequent columns contain each of your phenotypes.

The first line of the file corresponds with the header and must begin with ``#IID``.
The names of each of your phenotypes belong in the subbsequent columns of the header.

See `tests/data/simple.pheno <https://github.com/gymrek-lab/haptools/blob/main/tests/data/simple.pheno>`_ for an example of a phenotype file:

.. include:: ../../tests/data/simple.pheno
:literal:

Covariate file format
---------------------
Covariates follow the same format as phenotypes.

See `tests/data/simple.covar <https://github.com/gymrek-lab/haptools/blob/main/tests/data/simple.covar>`_ for an example of a covariate file:

.. include:: ../../tests/data/simple.covar
:literal:
17 changes: 17 additions & 0 deletions docs/formats/sample_info.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
.. _formats-sample_info:


Sample Info
===========

1000 Genomes sample_info file format
------------------------------------
Within the subcommand ``haptools simgenotype`` we use a file to map samples in the
reference to their population listed in the model file. This code is expected to work
out of the box with 1000 genomes data and we have pre-computed this mapping file as
well as given the command to how to compute it if you desire another as well.

``cut -f 1,4 igsr-1000\ genomes\ on\ grch38.tsv | sed '1d' | sed -e 's/ /\t/g' > 1000genomes_sampleinfo.tsv``

See ``example-files/1000genomes_sampleinfo.tsv`` for an example of the 1000genomes
GRCh38 samples mapped to their subpopulations.
5 changes: 4 additions & 1 deletion docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,10 @@
:hidden:
:maxdepth: 1

formats/inputs.rst
formats/genotypes.rst
formats/haplotypes.rst
formats/phenotypes.rst
formats/sample_info.rst

.. toctree::
:caption: Commands
Expand All @@ -20,6 +22,7 @@

commands/simgenotype.rst
commands/simphenotype.rst
commands/karyogram.rst
commands/transform.rst

.. toctree::
Expand Down
Loading