I'll preface this post by saying that it's a bit of a "stream-of-consciousness" type post - mainly me thinking my way through the process. But, it'd still be great/helpful to have some comments/input/feedback/insight and have this here for reference in the future.
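When querying a VCF file using bcftools query to identify/evaluate SNPs, I'd love some help understanding the output. The command I'm running prints out the following tab-delimited columns: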
%SAMPLE: Sample number:name
%CHROM: Chromosome name
%POS: Position
%REF: Reference allele
%GT: Genotype (0/0, 0/1, 1/1 representation)
%TGT: Translated genotype (e.g. C/C, C/T)
%ALT: Alternate allele
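A format string along these lines would produce those seven columns. It's a sketch rebuilt from the column list above, not the exact command that was run; `input.vcf` is a placeholder path, and the square brackets are what make bcftools query repeat the fields once per sample at each site.

```shell
# Sketch only: format string rebuilt from the column list, not the exact command from the post.
# The [ ] brackets tell bcftools query to repeat the enclosed fields once per sample.
fmt='[%SAMPLE\t%CHROM\t%POS\t%REF\t%GT\t%TGT\t%ALT\n]'

# Placeholder path (assumption); point this at your own copy of the VCF.
vcf="input.vcf"

# Only run if bcftools is installed and the file exists, so the snippet is safe to paste.
if command -v bcftools >/dev/null 2>&1 && [ -f "$vcf" ]; then
    bcftools query -f "$fmt" "$vcf" | head -n 26
fi
```

The `head -n 26` is only there to limit the output to the first block of per-sample rows.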
NOTE: If anyone's interested in messing around with this, the input VCF is currently on Raven. Also, it likely won't be here forever as I'm just using it for learning/exploration temporarily. So, if you want it, you should probably copy it to your own directory. This is where you can currently/temporarily find it:
/home/shared/8TB_HDD_01/sam/data/C_virginica/genomes/EpiDiv_merged.f.recode.vcf
Here's the resulting output:
1:unknown NC_035780.1 123 C 0/1 C/T T
2:unknown NC_035780.1 123 C 0/1 C/T T
3:unknown NC_035780.1 123 C 0/1 C/T T
4:unknown NC_035780.1 123 C 0/1 C/T T
5:unknown NC_035780.1 123 C 1/1 T/T T
6:unknown NC_035780.1 123 C 0/1 C/T T
7:unknown NC_035780.1 123 C 1/1 T/T T
8:unknown NC_035780.1 123 C 0/1 C/T T
9:unknown NC_035780.1 123 C 1/1 T/T T
10:unknown NC_035780.1 123 C 1/1 T/T T
11:unknown NC_035780.1 123 C 0/1 C/T T
12:unknown NC_035780.1 123 C 0/1 C/T T
13:unknown NC_035780.1 123 C 1/1 T/T T
14:unknown NC_035780.1 123 C 1/1 T/T T
15:unknown NC_035780.1 123 C 1/1 T/T T
16:unknown NC_035780.1 123 C 0/1 C/T T
17:unknown NC_035780.1 123 C 1/1 T/T T
18:unknown NC_035780.1 123 C 1/1 T/T T
19:unknown NC_035780.1 123 C 1/1 T/T T
20:unknown NC_035780.1 123 C 0/1 C/T T
21:unknown NC_035780.1 123 C 0/1 C/T T
22:unknown NC_035780.1 123 C 0/1 C/T T
23:unknown NC_035780.1 123 C 0/1 C/T T
24:unknown NC_035780.1 123 C 0/1 C/T T
25:unknown NC_035780.1 123 C 1/1 T/T T
26:unknown NC_035780.1 123 C 0/1 C/T T
The first question I have is regarding the reference allele vs. the translated genotype columns. I just want to confirm that I understand it. Looking at one of the samples from above, as an example:
5:unknown NC_035780.1 123 C 1/1 T/T T
The reference allele is listed as C. The translated genotype shows T/T, which is taken from the genotype values 1/1. This means that the two alleles in sample 5 at position 123 are both T, correct?
And, now that I've typed this all out, I'm certain this is the proper interpretation, but I'll leave the question anyway - it's all part of the process of helping me understand! I think part of my confusion stemmed from the fact that there is no alternate allele information for the reference genome, which makes sense.
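The GT-to-TGT translation described above can be sketched in a few lines of shell and awk. This is only an illustration, assuming a biallelic site where index 0 means REF and index 1 means ALT; `translate_gt` is a made-up helper name, not part of bcftools.

```shell
# Translate a diploid genotype (e.g. 1/1) into alleles using REF and a single ALT.
# Assumes a biallelic site: index 0 is REF, index 1 is ALT, anything else is printed as ".".
translate_gt() {
    # $1 = REF, $2 = ALT, $3 = GT (alleles separated by / or |)
    printf '%s\t%s\t%s\n' "$1" "$2" "$3" | awk -F'\t' '
        {
            sep = (index($3, "|") > 0) ? "|" : "/"
            n = split($3, idx, "[/|]")
            out = ""
            for (i = 1; i <= n; i++) {
                if (idx[i] == "0")      allele = $1
                else if (idx[i] == "1") allele = $2
                else                    allele = "."
                out = out (i > 1 ? sep : "") allele
            }
            print out
        }'
}

translate_gt C T 1/1   # sample 5 at position 123, prints T/T
translate_gt C T 0/1   # sample 3 at position 123, prints C/T
```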
On to the next part...
Following some guidelines from somewhere (don't ask - I don't know where they came from), we've decided to exclude positions with missing genotypes or homozygous genotypes. Since we're specifically interested in C > T SNPs for bisulfite analysis, I've added an awk command to just view those (i.e. reference allele is C and alternate allele is T). The output looks like this:
1:unknown NC_035780.1 9138525 C 0/1 C/T T
2:unknown NC_035780.1 9138525 C 0/1 C/T T
3:unknown NC_035780.1 9138525 C 0/1 C/T T
4:unknown NC_035780.1 9138525 C 0/1 C/T T
5:unknown NC_035780.1 9138525 C 0/1 C/T T
6:unknown NC_035780.1 9138525 C 0/1 C/T T
7:unknown NC_035780.1 9138525 C 0/1 C/T T
8:unknown NC_035780.1 9138525 C 0/1 C/T T
9:unknown NC_035780.1 9138525 C 0/1 C/T T
10:unknown NC_035780.1 9138525 C 0/1 C/T T
11:unknown NC_035780.1 9138525 C 0/1 C/T T
12:unknown NC_035780.1 9138525 C 0/1 C/T T
13:unknown NC_035780.1 9138525 C 0/1 C/T T
14:unknown NC_035780.1 9138525 C 0/1 C/T T
15:unknown NC_035780.1 9138525 C 0/1 C/T T
16:unknown NC_035780.1 9138525 C 0/1 C/T T
17:unknown NC_035780.1 9138525 C 0/1 C/T T
18:unknown NC_035780.1 9138525 C 0/1 C/T T
19:unknown NC_035780.1 9138525 C 0/1 C/T T
20:unknown NC_035780.1 9138525 C 0/1 C/T T
21:unknown NC_035780.1 9138525 C 0/1 C/T T
22:unknown NC_035780.1 9138525 C 0/1 C/T T
23:unknown NC_035780.1 9138525 C 0/1 C/T T
24:unknown NC_035780.1 9138525 C 0/1 C/T T
25:unknown NC_035780.1 9138525 C 0/1 C/T T
26:unknown NC_035780.1 9138525 C 0/1 C/T T
In this instance the only SNPs being selected are those where all samples at a given position have a genotype of 0/1.
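For anyone following along, an awk filter for C > T rows could look like the sketch below. It is not the exact awk command used for the output above; it assumes the seven tab-delimited columns listed earlier, so REF is column 4 and ALT is column 7. `filter_c_to_t` is a made-up name, and the second and third test rows are invented for illustration.

```shell
# Keep only rows whose REF (column 4) is C and whose ALT (column 7) is T.
# Assumes the seven tab-delimited columns listed earlier, in that order.
filter_c_to_t() {
    awk -F'\t' '$4 == "C" && $7 == "T"'
}

# Three rows for illustration; the last two are made up and are not C > T SNPs.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    1:unknown NC_035780.1 9138525 C 0/1 C/T T \
    1:unknown NC_035780.1 200 G 0/1 G/A A \
    1:unknown NC_035780.1 300 C 0/1 C/A A \
  | filter_c_to_t
```

Only the first row passes, since it is the only one with both REF equal to C and ALT equal to T.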
Here are some of the questions:
Does this type of filtering (excluding missing/homozygous genotypes) make sense? I.e. why would we want to only look at heterozygous individuals for SNPs?
Wouldn't using something like --include 'GT="alt"', which examines all samples at a given position where the genotype is the alternate allele, be more useful, since we want to find any position where there's a C > T conversion?
Why doesn't the result of this command show samples like:
3:unknown NC_035780.1 123 C 0/1 C/T T
This entry is not homozygous... So, why is it excluded?
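One way to dig into that last question is to tally genotypes per position. The sketch below assumes the seven tab-delimited columns listed earlier (GT is column 5); `count_genotypes` is a made-up helper name. If the exclusion is being applied to whole positions rather than to individual samples, which is an assumption worth checking against the bcftools documentation, then position 123 would drop out because some samples there (like sample 5) are 1/1.

```shell
# Count how many samples carry each genotype at every CHROM:POS.
# Assumes the seven tab-delimited columns listed earlier (GT is column 5).
count_genotypes() {
    awk -F'\t' '
        { n[$2 ":" $3 "\t" $5]++ }
        END { for (k in n) print k "\t" n[k] }
    ' | sort
}

# Four rows copied from the position 123 output above (samples 3 to 6).
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    3:unknown NC_035780.1 123 C 0/1 C/T T \
    4:unknown NC_035780.1 123 C 0/1 C/T T \
    5:unknown NC_035780.1 123 C 1/1 T/T T \
    6:unknown NC_035780.1 123 C 0/1 C/T T \
  | count_genotypes
```

For these four rows the tally is three samples with 0/1 and one with 1/1 at NC_035780.1:123, so the position is a mix of heterozygous and homozygous samples.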
Well, thanks for reading!