Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chromosome X #10

Open
angelosarmen opened this issue Oct 9, 2023 · 5 comments
Open

Chromosome X #10

angelosarmen opened this issue Oct 9, 2023 · 5 comments

Comments

@angelosarmen
Copy link

Hi,

Can RUTH handle chromosome X? There isn't a mention of it in the manuscript, but the program accepts parameters for sex chromosomes. When I try to load a BCF file containing X-chromosome SNPs, however, I get a segmentation fault. It works fine if I remove those SNPs.

I compiled RUTH on macOS Ventura using htslib 1.18 installed with conda.

@hyunminkang
Copy link
Collaborator

It should work for X-chromosome, but it will assume diploid representation of genotype likelihood, and may not accept haploid genotype information. Also, the sex map should be provided. Please provide specific examples of the failure mode if you would like to report a bug.

@angelosarmen
Copy link
Author

angelosarmen commented Oct 12, 2023

Thank you for the prompt reply.

I confirmed that segmentation fault occurred because in my (PLINK-generated) BCF file, male non-PAR X genotypes were encoded by a single allele (as is the standard: "Haploid calls, e.g. on Y, male non-pseudoautosomal X, or mitochondria, should be indicated by having only one allele value."). After removing the male samples, RUTH ran without problems.

I'm attaching a set of test files that recreate the problem. samples.vcf has 20 samples, the first 10 of which are females and the remaining 10 are males (as specified in sex.tsv), and two X SNPs, one in the PAR and one in the non-PAR. Non-PAR male genotypes are represented by a single allele. The following command results in segmentation fault:

ruth --vcf samples.vcf --evec pcs.tsv --num-pc 2 --field GT --lambda 0 --lrt-em --skip-if --site-only --out ruth.vcf

After replacing the single alleles with two identical alleles and saving the file as samples_fixed.vcf, the following command succeeds:

ruth --vcf samples_fixed.vcf --evec pcs.tsv --num-pc 2 --field GT --lambda 0 --lrt-em --skip-if --site-only --out ruth.vcf

So, for RUTH to work, I should be replacing single alleles with two identical alleles for the male samples?

Finally, how does RUTH handle the PAR and the non-PAR? Does it perform the test using all samples for the PAR and females only for the non-PAR?

@angelosarmen
Copy link
Author

@hyunminkang, could you please provide an update on this.

@hyunminkang
Copy link
Collaborator

hyunminkang commented Oct 25, 2023

Did you provide --sex-map? I think you need to do so. Otherwise, it will consider all samples as females.

Also, you need to provide genotype likelihood (PL) as diploids. I think it may work well with --field GT as it is not very well for chrX. If you insist using GT, you can use it at your own risk, but it would require diploid representation, making each genotypes as homozygotes.

Yes, It will use ploidy information for non-PAR only.

@angelosarmen
Copy link
Author

I provided --sex-map in my analysis, but forgot to provide it in the tests. This still results in segmentation fault:

ruth --vcf samples.vcf --evec pcs.tsv --sex-map sex.tsv --num-pc 2 --field GT --lambda 0 --lrt-em --skip-if --site-only --out ruth.vcf

while this doesn't:

ruth --vcf samples_fixed.vcf --evec pcs.tsv --sex-map sex.tsv --num-pc 2 --field GT --lambda 0 --lrt-em --skip-if --site-only --out ruth.vcf

The error doesn't matter anymore, as I'm supposed to provide male genotypes as diploids anyway (I don't have the likelihoods, so I have to use the genotypes).

Using only females for the non-PAR is not the same as using ploidy information, though. To find out what you mean by "using ploidy information", I looked at the code of frequency_estimator::estimate_isaf_em_hwd() and understand that males are used to initialize individual-specific allele frequencies with estimate_pooled_af_em() and estimate the betas, but not theta (only females are used for that). What is being tested for the non-PAR, however, when all samples are male (i.e., all samples are haploid)? Same for chromosome Y and the mitochondria. I don't understand what testing for HWE means in those cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants