This repository has been archived by the owner on Jan 31, 2020. It is now read-only.

Genotype Microarray

zskidmor edited this page Apr 17, 2014 · 10 revisions

Genotype Microarray [in progress]

  • Overview
    • Inputs
      • Instrument Data
      • dbSNP Build
    • Data Products
    • Build Processes
      • Create Original Genotype Files
      • Create Filtered Genotype TSV File
      • Create Copy Number TSV File
      • Create Gold SNP File
      • Create Gold SNP BED File
    • Tutorials
      • Defining a new Genotype Microarray model
      • Determining which Processing Profile to use
    • Known Issues

Overview

The Genotype Microarray pipeline is a set of pre-processing steps performed on genotype files to prepare them for use in later analyses. It uses a dbSNP build to map the entries in the genotype file to their positions in a reference sequence, and then produces derivative files that are used by QC models.

Inputs

Instrument Data

This is a single Instrument Data record containing the Genotype data.

dbSNP Build

This is an Imported Variation List build containing a VCF imported from dbSNP.

Data Products

This build produces a single directory of files, for example:

  1. x.copynumber
  2. x.genotype
  3. x.original
  4. x.original.vcf
  5. build.xml
  6. formatted_genotype_file_path.genotype
  7. formatted_genotype_file_path.genotype.gold2geno@
  8. gold_snp.v2.bed
  9. logs/
  10. reports/

The "logs" and "reports" directories as well as the build.xml file are standard products of running a Build in the GMS. The remaining files are described below in the steps in which they are generated.

Build Processes

Create Original Genotype Files

This step takes the genotype Instrument Data and produces two representations of the genotype data: a VCF file and a TSV file. Additional information about the genotypes is taken from dbSNP.

2892852755.original 2892852755.original.vcf

Create Filtered Genotype TSV File

This step takes the TSV file from the previous step and removes any genotype that has a GC Score present but below 0.7. Genotypes with no GC Score are kept. The resulting file retains only the first three columns.

2892852755.genotype

For backwards compatibility, a symlink to this file is created in the build directory with the name formatted_genotype_file_path.genotype.gold2geno.
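The filtering rule in this step can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code: the GC Score column index, the markers for a missing score, and the absence of a header line are all assumptions, since the layout of the original TSV is not documented here.

```python
import csv

GC_SCORE_THRESHOLD = 0.7  # threshold stated in the step description


def filter_genotypes(lines, gc_score_column=3, missing_values=("", "NA")):
    """Yield the first three columns of each row that passes the filter.

    gc_score_column and missing_values are hypothetical; adjust them to the
    real layout of the original TSV. The input is assumed to have no header.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        raw = row[gc_score_column] if len(row) > gc_score_column else ""
        # Drop only rows whose score is present AND below the threshold.
        if raw not in missing_values and float(raw) < GC_SCORE_THRESHOLD:
            continue
        yield row[:3]
```

Rows without a GC Score pass through unchanged, which matches the "present but below 0.7" wording above.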

Create Copy Number TSV File

This step produces an alternate version of the original TSV with only the chromosome, position, and log_r_ratio columns:

2892852755.copynumber
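As a rough sketch, the column selection for the copy number file could look like the Python below. The column indices are placeholders chosen for illustration, and the input is assumed to be tab-separated with no header line; neither detail is confirmed by this page.

```python
import csv


def to_copy_number(lines, chrom_col=0, pos_col=1, log_r_ratio_col=4):
    """Yield [chromosome, position, log_r_ratio] for each input row.

    The three column indices are hypothetical defaults, not the documented
    layout of the original TSV.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if row:  # skip blank lines
            yield [row[chrom_col], row[pos_col], row[log_r_ratio_col]]
```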

Create Gold SNP File

This step produces another alternate TSV. This one has the chromosome, the position twice, the first allele, the second allele, and four columns with either "ref" or "SNP". The first and third are "ref" if the first allele matches the reference at that position and "SNP" otherwise. The second and fourth are "ref" if the second allele matches the reference at that position and "SNP" otherwise:

formatted_genotype_file_path.genotype
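The nine-column layout described above can be illustrated with a small Python helper. The function name and its arguments are hypothetical, and the reference base is assumed to have been looked up already; this sketch only shows how the four "ref"/"SNP" columns follow from the two alleles.

```python
def gold_snp_row(chrom, pos, allele1, allele2, ref_base):
    """Build one Gold SNP row from a genotype and its reference base.

    Hypothetical helper: the name and argument order are for illustration.
    """
    call1 = "ref" if allele1 == ref_base else "SNP"  # status of first allele
    call2 = "ref" if allele2 == ref_base else "SNP"  # status of second allele
    # Chromosome, position twice, both alleles, then the calls repeated so
    # that columns 1 and 3 describe allele 1 and columns 2 and 4 allele 2.
    return [chrom, pos, pos, allele1, allele2, call1, call2, call1, call2]
```

For a heterozygous A/G genotype at a position where the reference is A, the four status columns would read ref, SNP, ref, SNP.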

Create Gold SNP BED File

This step produces an alternate version of the previous file that conforms to the BED specification:

gold_snp.v2.bed

The fourth column is reference/genotype, where an IUB (IUPAC ambiguity) code is used if the genotype is heterozygous. The fifth and sixth columns are always zero; these are placeholders for quality scores that are not present in the genotype files.
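A minimal Python sketch of one BED row is shown below. The ambiguity codes are the standard two-base IUB codes; the function names are hypothetical, and converting a 1-based position to a 0-based, half-open BED interval is an assumption based on the BED specification rather than something this page states.

```python
# Standard IUB/IUPAC codes for two-base heterozygous genotypes.
IUB_CODES = {
    frozenset("AC"): "M",
    frozenset("AG"): "R",
    frozenset("AT"): "W",
    frozenset("CG"): "S",
    frozenset("CT"): "Y",
    frozenset("GT"): "K",
}


def genotype_code(allele1, allele2):
    """Return the base itself if homozygous, else the IUB code."""
    if allele1 == allele2:
        return allele1
    return IUB_CODES[frozenset((allele1, allele2))]


def gold_snp_bed_row(chrom, pos, ref_base, allele1, allele2):
    """Build one BED row; pos is assumed to be a 1-based integer."""
    name = f"{ref_base}/{genotype_code(allele1, allele2)}"
    # Columns five and six are zero placeholders for absent quality scores.
    return [chrom, pos - 1, pos, name, 0, 0]
```

Under these assumptions, an A/G genotype at position 100 of chromosome 1 with reference base A would give the row 1, 99, 100, A/R, 0, 0.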

Tutorials
