GitHub - cortes-ciriano-lab/DeMethify: DeMethify is a suite of methylation deconvolution algorithms that mainly revolves around adapted non-negative matrix factorization, offering a versatile toolbox for heterogeneous methylation data.

DeMethify is a suite of methylation deconvolution algorithms that mainly revolves around adapted non-negative matrix factorization, offering a versatile toolbox for heterogeneous methylation data.

Flags and Arguments

Option	Description
`--methfreq`	Methylation frequency file path (values between 0 and 1).
`--ref`	Methylation reference matrix file path.
`--iterations`	Number of iterations for the outer and inner loops (default without purity = 10000, 20; with purity = 100, 500).
`--nbunknown`	Number of unknown cell types to estimate.
`--purity`	The purities of the samples in percent [0,100], if known.
`--termination`	Termination condition for cost function (default = 1e-2).
`--init`	Initialisation option (default = `uniform_`). Options: `uniform`, `uniform_`, `beta`, `SVD`, `ICA`.
`--outdir`	Output directory.
`--fillna`	Replace every NA by 0 in the given data.
`--ic`	Select number of unknown cell types by minimising a criterion (`AIC`, `BIC`, `CCC`, `BCV`, `minka`).
`--confidence`	Outputs bootstrap confidence intervals; takes the confidence level (in percent) and the number of bootstrap iterations as arguments.
`--plot`	Plot cell type proportion estimates for each sample, optionally with confidence intervals.
`--restart`	Number of random restarts among which to select the one with the lowest cost/highest loglikelihood.
`--seed`	Set a seed integer number for random number generation for reproducibility.
`--noprint`	Does not show the logo.
`--bedmethyl`	Flag to indicate that the input will be bedmethyl files, modkit style.

Installing DeMethify

We recommend setting up a fresh conda environment with a Python version >= 3.6:

conda create --name demethify python=3.10.15
conda activate demethify

Then one can either use:

pip install git+https://github.com/cortes-ciriano-lab/DeMethify

Or:

git clone https://github.com/cortes-ciriano-lab/DeMethify
cd DeMethify
pip install .

Verify that the installation went well with:

demethify -h

Run DeMethify

After installing, you can finally run DeMethify.

The typical pipeline for bedmethyl files (such as those output by modkit) is:

1. Preprocessing
   - Optionally, feature selection, doable from the command line with preprocessing/feature_selection.py (see preprocessing/preprocessing.ipynb)
   - Intersection of the reference and the samples so that the CpG sites are consistent across files, doable from the command line with preprocessing/intersect_bed.py (see preprocessing/preprocessing.ipynb)
2. Run DeMethify depending on your use case

python feature_selection.py bed1.bed 100000
python intersect_bed.py bed1_select_ref.bed bed2.bed bed3.bed bed4.bed
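
To illustrate what the intersection step is meant to achieve, here is a minimal sketch. It is not the code of preprocessing/intersect_bed.py: the helper names `read_bed` and `intersect_sites` are hypothetical, the inputs are small in-memory tables with made-up values, and the only assumption is a tab-separated layout with a `chrom`, `start`, `end` header as in the Input format section.

```python
import csv
import io


def read_bed(text):
    """Parse tab-separated bed text with a header into {(chrom, start, end): row}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {(r["chrom"], r["start"], r["end"]): r for r in reader}


def intersect_sites(tables):
    """Restrict every table to the sites shared by all, in the first table's order."""
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    order = [site for site in tables[0] if site in shared]
    return [[table[site] for site in order] for table in tables]


# Hypothetical toy inputs: one reference table and one sample table.
ref = read_bed("chrom\tstart\tend\tMonocytes_EPIC\n"
               "chr12\t121416512\t121416513\t0.9484\n"
               "chr1\t6088550\t6088551\t0.0426\n")
sample = read_bed("chrom\tstart\tend\tvalid_coverage\tpercent_modified\n"
                  "chr1\t6088550\t6088551\t52\t4.1\n"
                  "chr1\t3210424\t3210425\t52\t88.46\n")

ref_rows, sample_rows = intersect_sites([ref, sample])
print([row["start"] for row in ref_rows])
```

After this step every table lists the same CpG sites in the same order, which is what allows rows of the reference and of the samples to be matched by position.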

Here is a flowchart to walk you through the different use cases for DeMethify.

Input format

The expected reference input format is either a csv file with methylation frequency values between 0 and 1, with rows corresponding to CpG sites and columns to cell types:

Monocytes_EPIC	B-cells_EPIC	CD4T-cells_EPIC	NK-cells_EPIC	CD8T-cells_EPIC	Neutrophils_EPIC
0.9484	0.9447	0.9438	0.9394	0.9527	0.9354
0.0426	0.0518	0.0425	0.0366	0.0398	0.0358

or a bedmethyl file of the same kind, in which case you need to specify the --bedmethyl flag:

chrom	start	end	Monocytes_EPIC	B-cells_EPIC	CD4T-cells_EPIC	NK-cells_EPIC	CD8T-cells_EPIC	Neutrophils_EPIC
chr12	121416512	121416513	0.9484	0.9447	0.9438	0.9394	0.9527	0.9354
chr1	6088550	6088551	0.0426	0.0518	0.0425	0.0366	0.0398	0.0358

Similarly, the expected sample format is either a csv file or a bedmethyl file in which rows correspond to CpG sites. It should contain one or two columns, in any order: the methylation frequency "percent_modified" (required) and, optionally, the total count "valid_coverage". Additional columns may be present and do not change anything.

For modkit users, using the --header flag should be enough to obtain the right format.

Methylation frequencies in bedmethyl files are expected to be percentages, as usually given by tools like modkit, whereas methylation frequencies in csv files are expected to be between 0 and 1:

chrom	start	end	valid_coverage	count_modified	percent_modified
chr1	227058070	227058071	55	4	7.2727272727272725
chr1	3210424	3210425	52	46	88.46153846153845

or:

valid_coverage	percent_modified
55	0.07272727272727273
52	0.8846153846153846
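
The two conventions differ only by a factor of 100. As a minimal sketch (not part of DeMethify; the function name `bedmethyl_to_fraction_rows` is hypothetical), the snippet below reads the bedmethyl example above, looks columns up by header name so that their order does not matter, and rescales `percent_modified` to the [0, 1] range used for csv input.

```python
import csv
import io


def bedmethyl_to_fraction_rows(text):
    """Read bedmethyl-style text (percent_modified in [0, 100]) and return
    rows with percent_modified rescaled to [0, 1], as expected for csv input."""
    rows = []
    for r in csv.DictReader(io.StringIO(text), delimiter="\t"):
        percent = float(r["percent_modified"])
        if not 0.0 <= percent <= 100.0:
            raise ValueError(f"percent_modified out of range: {percent}")
        rows.append({"valid_coverage": int(r["valid_coverage"]),
                     "percent_modified": percent / 100.0})
    return rows


bed_text = ("chrom\tstart\tend\tvalid_coverage\tcount_modified\tpercent_modified\n"
            "chr1\t227058070\t227058071\t55\t4\t7.2727272727272725\n"
            "chr1\t3210424\t3210425\t52\t46\t88.46153846153845\n")

for row in bedmethyl_to_fraction_rows(bed_text):
    print(row["valid_coverage"], row["percent_modified"])
```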

Unsupervised case

If you've got no methylation reference matrix, you can still use DeMethify in a totally unsupervised fashion. Just leave out the --ref flag:

demethify \
    --methfreq output_gen/sample{1..10}.bed \
    --nbunknown 4 \
    --outdir unsupervised \
    --bedmethyl \
    --plot

Reference based case

If you want to perform fully reference-based methylation deconvolution, just leave out the --nbunknown flag:

demethify \
    --ref output_gen/ref_matrix.bed \
    --methfreq output_gen/sample{1..10}.bed \
    --bedmethyl \
    --outdir output_ref_based \
    --plot

Partial-reference based case

If you have at least 2 samples, you can use the partial-reference based algorithm to jointly estimate the methylation profile of the unknown portion and the proportions of all known and unknown cell types. Otherwise, you can use the reference based algorithm (by not specifying --nbunknown) and hope that the unknown portion of the mixture isn't too high.

demethify \
    --ref output_gen/ref_matrix.bed \
    --methfreq output_gen/sample{1..10}.bed \
    --nbunknown 1 \
    --confidence 95 2500 \
    --outdir ci \
    --bedmethyl \
    --plot

Partial-reference based case with purity

If you know the sample purities, you can specify them (in percent) to improve the estimation. Doing so also makes the optimisation problem identifiable in the one sample, one known cell type case.

demethify \
    --ref output_gen/ref_matrix.bed \
    --methfreq output_gen/sample{1..10}.bed \
    --nbunknown 1 \
    --purity 60 80 90 20 50 90 100 30 50 10 \
    --outdir purity \
    --bedmethyl \
    --plot

Confidence intervals

With the --confidence flag (arguments: confidence level in percent and number of bootstrap iterations), you can obtain confidence intervals for the estimates. The --plot flag generates plots so that you can visualise the proportion estimates, like this:

demethify \
    --ref output_gen/ref_matrix.bed \
    --methfreq output_gen/sample{1..10}.bed \
    --nbunknown 1 \
    --confidence 95 2500 \
    --outdir ci \
    --bedmethyl \
    --plot
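
For readers unfamiliar with bootstrap intervals, the sketch below shows a generic percentile bootstrap whose two parameters play the same roles as the two arguments of `--confidence 95 2500`. It is an illustration under assumptions, not DeMethify's procedure: the function name `percentile_bootstrap_ci` is hypothetical, the toy values are made up, and a simple mean stands in for the deconvolution estimate that would be recomputed on each resample.

```python
import random
import statistics


def percentile_bootstrap_ci(values, estimator, level=95.0, n_boot=2500, seed=0):
    """Return (lower, upper) percentile-bootstrap bounds for estimator(values)."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        estimator(rng.choices(values, k=n))  # resample with replacement
        for _ in range(n_boot)
    )
    alpha = (100.0 - level) / 100.0
    lo_idx = int((alpha / 2) * (n_boot - 1))
    hi_idx = int((1 - alpha / 2) * (n_boot - 1))
    return estimates[lo_idx], estimates[hi_idx]


# Toy data (assumption): the mean is a stand-in for the real estimator.
toy = [0.07, 0.88, 0.52, 0.31, 0.95, 0.12, 0.64, 0.43]
low, high = percentile_bootstrap_ci(toy, statistics.fmean, level=95, n_boot=2500)
print(low, high)
```

A higher number of bootstrap iterations gives more stable interval bounds at the cost of a proportionally longer run time.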

Model selection

With the --ic flag, you can obtain the number of unknown cell types that minimises a chosen criterion. The criterion can be the corrected Bayesian Information Criterion (BIC), the corrected Akaike Information Criterion (AIC), Brunet's Cophenetic Correlation Coefficient method (CCC), an adapted version of Owen and Perry's bi-cross-validation method (BCV), or an adapted version of the Minka-PCA method (minka). For CCC and BCV, one can specify the number of restarts or folds by adding an integer argument after the method, as in --ic BCV 30:

demethify \
    --ref output_gen/ref_matrix.bed \
    --methfreq output_gen/sample{1..10}.bed \
    --bedmethyl \
    --ic AIC \
    --outdir bloblo \
    --plot

demethify \
    --ref output_gen/ref_matrix.bed \
    --methfreq output_gen/sample{1..10}.bed \
    --bedmethyl \
    --ic CCC 20 \
    --outdir bloblo \
    --plot

DeMethify automatically chooses the first number of unknown cell types that minimises the chosen criterion and outputs its results. However, it is recommended to try different criteria, look at the plots, and choose depending on the concrete situation.

Zone of overdetermination for the estimation problem

$n_s$: number of samples
$n_u$: number of unknown cell types to estimate
$n_c$: number of known cell types
$n_{cpg}$: number of CpG sites

In the partial-reference based case without purity, the estimation problem enters the realm of overdetermination (i.e., there are more equations than parameters to estimate) when:

$n_s \geq \frac{n_u n_{cpg}}{n_{cpg} - n_u - n_c + 1}$

When $n_u = 1$, this simplifies to:

$n_s \geq \frac{n_{cpg}}{n_{cpg} - n_c}$

The ratio on the right is in $(1, 2]$ for most real-life situations, which means that in the partial-reference based case without purity, estimating a single unknown cell type requires at least 2 samples to enter the realm of overdetermination.
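
One counting argument that reproduces the bound without purity is sketched below. It is an assumption about how the bound can be obtained, not a statement taken from the DeMethify authors: each of the $n_s$ samples provides $n_{cpg}$ observed values, the $n_u$ unknown methylation profiles contribute $n_u n_{cpg}$ parameters, and each sample contributes $n_u + n_c - 1$ free proportions because its proportions sum to 1.

```latex
\begin{aligned}
\underbrace{n_s \, n_{cpg}}_{\text{equations}}
  &\geq \underbrace{n_u \, n_{cpg}}_{\text{unknown profiles}}
      + \underbrace{n_s \,(n_u + n_c - 1)}_{\text{free proportions}} \\
n_s \,(n_{cpg} - n_u - n_c + 1) &\geq n_u \, n_{cpg} \\
n_s &\geq \frac{n_u \, n_{cpg}}{n_{cpg} - n_u - n_c + 1}
\end{aligned}
```

The last step assumes $n_{cpg} > n_u + n_c - 1$, so that the denominator is positive.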

In the partial-reference based case with purity, the estimation problem enters the realm of overdetermination when:

$n_s \geq \frac{n_u n_{cpg}}{n_{cpg} - n_u - n_c + 2}$

For $n_u = 1$, we have:

$n_s \geq \frac{n_{cpg}}{n_{cpg} - n_c + 1}$
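
The bounds above are easy to evaluate for a given design. The following sketch computes the smallest integer number of samples that satisfies them; the function name `min_samples` and its parameters are hypothetical helpers, not part of the DeMethify command line.

```python
def min_samples(n_cpg, n_c, n_u, purity_known=False):
    """Smallest integer n_s satisfying the overdetermination bound.

    Without purity: n_s >= n_u * n_cpg / (n_cpg - n_u - n_c + 1)
    With purity:    n_s >= n_u * n_cpg / (n_cpg - n_u - n_c + 2)
    """
    denominator = n_cpg - n_u - n_c + (2 if purity_known else 1)
    if denominator <= 0:
        raise ValueError("too few CpG sites for this many cell types")
    # Integer ceiling of n_u * n_cpg / denominator, avoiding float rounding.
    return -(-(n_u * n_cpg) // denominator)


# 100000 CpG sites, 6 known cell types, 1 unknown cell type, no purity.
print(min_samples(100000, 6, 1))                     # 2
# One known and one unknown cell type, with known purity.
print(min_samples(100000, 1, 1, purity_known=True))  # 1
# 6 known and 4 unknown cell types, no purity.
print(min_samples(100000, 6, 4))                     # 5
```

The second call matches the remark in the purity section that a known purity makes the one sample, one known cell type case identifiable.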
