Genotype Microarray
The Genotype Microarray pipeline is a set of pre-processing steps performed on genotype files to prepare them for use in later analyses. It uses a dbSNP build to map the entries in the genotype file to their positions in a reference sequence and subsequently produce other derivative files that are used by QC models.
The pipeline takes two inputs. The first is a single Instrument Data record containing the genotype data.
The second is an Imported Variation List build containing a VCF imported from dbSNP.
This build produces a single directory of files, for example:
- x.copynumber
- x.genotype
- x.original
- x.original.vcf
- build.xml
- formatted_genotype_file_path.genotype
- formatted_genotype_file_path.genotype.gold2geno@
- gold_snp.v2.bed
- logs/
- reports/
The "logs" and "reports" directories as well as the build.xml file are standard products of running a Build in the GMS. The remaining files are described below in the steps in which they are generated.
The first step takes the genotype Instrument Data and produces two representations of the genotype data: a VCF file and a TSV file. Additional information about the genotypes is taken from dbSNP.
The TSV file is named for the Sample ID and has the following format:
2892852755.original
chromosome position alleles id sample_name log_r_ratio gc_score cnv_value cnv_confidence allele1 allele2
1 734462 TT rs12564807 example_sample1 NA -1 NA NA T T
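As a rough illustration of the TSV layout shown above, the following Python sketch reads such a file into dictionaries keyed by column name. It assumes the file is tab-separated and that the header row is its first line; `read_original` and `COLUMNS` are hypothetical names used only for this example and are not part of the GMS.

```python
import csv
import io

# Column names as listed in the header row of the .original file.
COLUMNS = ["chromosome", "position", "alleles", "id", "sample_name",
           "log_r_ratio", "gc_score", "cnv_value", "cnv_confidence",
           "allele1", "allele2"]

def read_original(handle):
    """Yield one dict per genotype row (assumes tab-separated input with a header)."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    for row in reader:
        row["position"] = int(row["position"])  # positions are integers
        yield row

# The example row from this page, joined with tabs.
example = "\t".join(COLUMNS) + "\n" + "\t".join(
    ["1", "734462", "TT", "rs12564807", "example_sample1",
     "NA", "-1", "NA", "NA", "T", "T"]) + "\n"
rows = list(read_original(io.StringIO(example)))
print(rows[0]["id"], rows[0]["position"])
```

The remaining values are left as strings because several of them can be "NA" in the example data.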
The VCF file contains the same information in the standard VCF format:
2892852755.original.vcf
This step reads the TSV file from the previous step and drops any genotype whose GC Score is present but below 0.7. The output file retains only the first three columns (chromosome, position, and alleles):
2892852755.genotype
1 734462 TT
For backwards compatibility, a symlink to this file is created in the build directory with the name formatted_genotype_file_path.genotype.gold2geno.
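The filtering rule in this step can be sketched in Python as follows. This is only an illustration, not the GMS implementation. It assumes that a GC Score of "NA" or "-1" means no score is present, which is inferred from the example row (its score is -1 and it still appears in the .genotype file). The second and third rows are hypothetical data, and the function names are made up for this sketch.

```python
def passes_gc_filter(gc_score, minimum=0.7):
    """Keep a row unless a GC Score is present and below the minimum."""
    # Assumption: "NA", "-1", and empty values mean "no score present".
    if gc_score in ("NA", "-1", ""):
        return True
    return float(gc_score) >= minimum

def to_genotype_lines(rows):
    """Reduce passing rows to chromosome, position, alleles (gc_score is column 7)."""
    return ["\t".join(r[:3]) for r in rows if passes_gc_filter(r[6])]

rows = [
    # Example row from this page.
    ["1", "734462", "TT", "rs12564807", "example_sample1", "NA", "-1", "NA", "NA", "T", "T"],
    # Hypothetical rows: one below the threshold, one above it.
    ["1", "800000", "AG", "rs_hypothetical_1", "example_sample1", "NA", "0.45", "NA", "NA", "A", "G"],
    ["1", "900000", "CC", "rs_hypothetical_2", "example_sample1", "NA", "0.92", "NA", "NA", "C", "C"],
]
for line in to_genotype_lines(rows):
    print(line)
```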
This step produces an alternate version of the original TSV with only the chromosome, position, and log_r_ratio columns:
2892852755.copynumber
1 734462 NA
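A minimal Python sketch of this column selection, assuming tab-separated input in the column order of the .original file (`to_copynumber_line` is a hypothetical helper name):

```python
def to_copynumber_line(fields):
    """Keep chromosome, position, and log_r_ratio from one .original row."""
    # log_r_ratio is the sixth column of the .original layout.
    return "\t".join((fields[0], fields[1], fields[5]))

# The example row from this page, joined with tabs.
row = "1\t734462\tTT\trs12564807\texample_sample1\tNA\t-1\tNA\tNA\tT\tT".split("\t")
print(to_copynumber_line(row))
```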
This step produces another alternate TSV with nine columns: the chromosome, the position (twice), the first allele, the second allele, and four columns that each contain either "ref" or "SNP". The first and third of these are "ref" if the first allele matches the reference at that position and "SNP" otherwise. The second and fourth are "ref" if the second allele matches the reference at that position and "SNP" otherwise:
formatted_genotype_file_path.genotype
1 734462 734462 T T SNP SNP SNP SNP
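The column logic described for this file can be sketched in Python as below. How the pipeline looks up the reference base is not covered here, so the sketch takes it as an argument; the value "G" used in the example is the reference allele shown for this position in the gold_snp.v2.bed example on this page. The function names are hypothetical.

```python
def call(allele, ref_base):
    """Return "ref" if the allele matches the reference base, else "SNP"."""
    return "ref" if allele == ref_base else "SNP"

def gold_row(chrom, pos, allele1, allele2, ref_base):
    """Build the nine columns: chrom, pos, pos, allele1, allele2, four calls."""
    c1 = call(allele1, ref_base)
    c2 = call(allele2, ref_base)
    # Calls alternate: first allele, second allele, first allele, second allele.
    return [chrom, str(pos), str(pos), allele1, allele2, c1, c2, c1, c2]

print("\t".join(gold_row("1", 734462, "T", "T", "G")))
```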
This step produces an alternate version of the previous file that conforms to the BED specification, so the start coordinate is zero-based (the position minus one):
gold_snp.v2.bed
1 734461 734462 G/T 0 0 SNP SNP SNP SNP
The fourth column is reference/genotype, where an IUB (IUPAC) ambiguity code is used if the genotype is heterozygous. The fifth and sixth columns are always zero; they are placeholders for quality scores that are not present in the genotype files.
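The BED row layout can be sketched in Python as follows. The two-base ambiguity codes are the standard IUPAC ones; the function names and the way the four ref/SNP calls are passed in are assumptions made for this illustration, not the pipeline's own code.

```python
# Standard IUPAC codes for heterozygous pairs of bases.
IUB = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}

def genotype_code(allele1, allele2):
    """Return the base for a homozygous genotype, else its ambiguity code."""
    if allele1 == allele2:
        return allele1
    return IUB[frozenset((allele1, allele2))]

def bed_row(chrom, pos, allele1, allele2, ref_base, calls):
    """Build one BED row: zero-based start, one-based end, ref/genotype, two zeros, calls."""
    name = f"{ref_base}/{genotype_code(allele1, allele2)}"
    return [chrom, str(pos - 1), str(pos), name, "0", "0", *calls]

print("\t".join(bed_row("1", 734462, "T", "T", "G", ["SNP"] * 4)))
```

Keying the lookup on a `frozenset` makes the code independent of allele order, so A/G and G/A both map to R.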
### Defining a new Genotype Microarray model
Three things are needed to define a new Genotype Microarray model:
- the identifier of the instrument data record for the genotype data to process
- the identifier of the dbSNP build to be used to process the genotype data
- the identifier of the processing profile corresponding to the instrument that produced the genotype data
Given these, the "genome model define genotype-microarray" command is used, e.g.:
genome model define genotype-microarray --instrument-data 2893884260 --variation-list-build 127786607 --processing-profile 2591110
The command produces output like the following:
'variation_list_build', 'instrument_data', and 'processing_profile' may require verification...
Resolving parameter 'variation_list_build' from command argument '127786607'... found 1
Resolving parameter 'instrument_data' from command argument '2893884260'... found 1
Resolving parameter 'processing_profile' from command argument '2591110'... found 1
Created model:
id: 82a30a15a299460fb3d39b83dc06e545
name: example_subject1.microarray.external.plink.GRCh37-lite-build37
subject: example_subject1 (2892852755)
processing_profile: plink wugc (2591110)
The identifier "82a30a15a299460fb3d39b83dc06e545" is then used to track the model for starting builds, querying status, or assigning to downstream models.
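When models are defined from a script, the command line shown above can be assembled from the three identifiers. The Python sketch below uses only the sub-command and options that appear in the example on this page; `define_command` is a hypothetical helper name, and the sketch only prints the command rather than running it.

```python
import shlex

def define_command(instrument_data, variation_list_build, processing_profile):
    """Return the argument list for defining a Genotype Microarray model."""
    return [
        "genome", "model", "define", "genotype-microarray",
        "--instrument-data", str(instrument_data),
        "--variation-list-build", str(variation_list_build),
        "--processing-profile", str(processing_profile),
    ]

argv = define_command(2893884260, 127786607, 2591110)
print(shlex.join(argv))
```

On a system where the `genome` command is installed, the same list could be passed to `subprocess.run`.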
### Determining which Processing Profile to use
The instrument data can be queried by its identifier for its instrument type, which is reported as the sequencing platform:
genome instrument-data list imported id=2893884260 --show sequencing_platform
This will produce output like:
SEQUENCING_PLATFORM
-------------------
plink
This value is then used to query the available Processing Profiles:
genome processing-profile list genotype-microarray instrument_type=plink
This will produce output like:
ID       NAME        INPUT_FORMAT  INSTRUMENT_TYPE
--       ----        ------------  ---------------
2591110  plink wugc  wugc          plink
The processing profile with the identifier "2591110" is then the one to select.
Under development