Genotype Microarray
- Overview
- Inputs
  - Instrument Data
  - dbSNP Build
- Data Products
- Build Processes
  - Create Original Genotype Files
  - Create Filtered Genotype TSV File
  - Create Copy Number TSV File
  - Create Gold SNP File
  - Create Gold SNP BED File
- Tutorials
  - Defining a new Genotype Microarray model
  - Determining which Processing Profile to use
- Known Issues
-
Overview
The Genotype Microarray pipeline is a set of pre-processing steps performed on genotype files to prepare them for use in later analyses. It uses a dbSNP build to map the entries in the genotype file to their positions in a reference sequence, and then produces derivative files that are used by QC models.
Inputs
Instrument Data: a single Instrument Data record containing the genotype data.
dbSNP Build: an Imported Variation List build containing a VCF imported from dbSNP.
Data Products
A Genotype Microarray build produces a single directory of files, for example:
- x.copynumber
- x.genotype
- x.original
- x.original.vcf
- build.xml
- formatted_genotype_file_path.genotype
- formatted_genotype_file_path.genotype.gold2geno@
- gold_snp.v2.bed
- logs/
- reports/
The "logs" and "reports" directories as well as the build.xml file are standard products of running a Build in the GMS. The remaining files are described below in the steps in which they are generated.
Build Processes
Create Original Genotype Files
This step takes the genotype Instrument Data and produces two representations of the genotype data: a VCF file and a TSV file. Additional information about the genotypes is taken from dbSNP.
2892852755.original
2892852755.original.vcf
Create Filtered Genotype TSV File
This step takes the TSV file from the previous step and removes any genotype whose GC Score is present but below 0.7. The resulting file retains only the first three columns.
2892852755.genotype
For backwards compatibility a symlink to this file is created in the build directory with the name formatted_genotype_file_path.genotype.gold2geno.
Create Copy Number TSV File
This step produces an alternate version of the original TSV with only the chromosome, position, and log_r_ratio columns:
2892852755.copynumber
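As a rough sketch of the column selection described for the copy number file, the following Python keeps only the three named columns. It assumes the original TSV has a header row using exactly the names `chromosome`, `position`, and `log_r_ratio`; the other columns in the sample are invented for illustration.

```python
import csv
import io

# Column names taken from the step description; a header row is assumed.
COPY_NUMBER_COLUMNS = ("chromosome", "position", "log_r_ratio")


def copy_number_rows(rows, columns=COPY_NUMBER_COLUMNS):
    """Yield only the requested columns, in order, from each row dict."""
    for row in rows:
        yield [row[name] for name in columns]


# Hypothetical input; the real file layout may differ.
original = io.StringIO(
    "chromosome\tposition\talleles\tgc_score\tlog_r_ratio\n"
    "1\t1000\tAG\t0.91\t0.013\n"
    "1\t2000\tCC\t0.42\t-0.250\n"
)
copy_number = list(copy_number_rows(csv.DictReader(original, delimiter="\t")))
```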
Create Gold SNP File
This step produces another alternate TSV. This one has the chromosome, the position twice, the first allele, the second allele, and four columns with either "ref" or "SNP". The first and third are "ref" if the first allele matches the reference at that position and "SNP" otherwise. The second and fourth are "ref" if the second allele matches the reference at that position and "SNP" otherwise:
formatted_genotype_file_path.genotype
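The nine-column layout described for the Gold SNP file can be illustrated with a small Python function. The function name and its arguments are hypothetical, and the sketch assumes the reference base has already been looked up and that alleles and reference use the same letter case.

```python
def gold_snp_row(chromosome, position, allele1, allele2, reference_base):
    """Build one Gold SNP row as a list of nine fields.

    Hypothetical helper: how the reference base is obtained is outside
    the scope of this sketch.
    """
    call1 = "ref" if allele1 == reference_base else "SNP"
    call2 = "ref" if allele2 == reference_base else "SNP"
    # Position appears twice; calls 1 and 3 describe allele1,
    # calls 2 and 4 describe allele2.
    return [chromosome, position, position,
            allele1, allele2,
            call1, call2, call1, call2]


heterozygous = gold_snp_row("1", "1000", "A", "G", "A")
homozygous_ref = gold_snp_row("2", "3000", "T", "T", "T")
line = "\t".join(heterozygous)  # one tab-separated line of the file
```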
Create Gold SNP BED File
This step produces an alternate version of the previous file that conforms to the BED specification:
gold_snp.v2.bed
The fourth column is reference/genotype, where an IUB code is used if the genotype is heterozygous. The fifth and sixth columns are always zero; they are placeholders for quality scores that are not present in the genotype files.
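The BED columns described above might be assembled roughly as follows. This is a sketch under several assumptions: input positions are 1-based (so the BED start is position minus one), a homozygous genotype is written as its single base, and the function names are invented for illustration. The ambiguity codes are the standard IUB/IUPAC two-base codes.

```python
# Standard IUB/IUPAC codes for two-base ambiguities.
IUB_CODES = {
    frozenset("AC"): "M",
    frozenset("AG"): "R",
    frozenset("AT"): "W",
    frozenset("CG"): "S",
    frozenset("CT"): "Y",
    frozenset("GT"): "K",
}


def genotype_code(allele1, allele2):
    """Return the base for a homozygous call, or the IUB code otherwise."""
    if allele1 == allele2:
        return allele1  # assumption: homozygous calls use the plain base
    return IUB_CODES[frozenset((allele1, allele2))]


def bed_row(chromosome, position, reference_base, allele1, allele2):
    """Build one six-column BED row (hypothetical helper).

    Assumes `position` is 1-based; BED intervals are 0-based, half-open.
    """
    position = int(position)
    name = f"{reference_base}/{genotype_code(allele1, allele2)}"
    # Columns five and six are fixed placeholders for absent quality scores.
    return [chromosome, position - 1, position, name, 0, 0]


het_row = bed_row("1", 1000, "A", "A", "G")
hom_row = bed_row("1", "2000", "C", "T", "T")
```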