This repository has been archived by the owner on Jan 31, 2020. It is now read-only.

Genotype Microarray

Obi Griffith edited this page Apr 21, 2014 · 10 revisions

[In Progress]

Overview

The Genotype Microarray pipeline is a set of pre-processing steps performed on genotype files to prepare them for use in later analyses. It uses a dbSNP build to map the entries in the genotype file to their positions in a reference sequence, and then produces derivative files that are used by QC models.

Inputs

Instrument Data

This is a single Instrument Data record containing the Genotype data.

dbSNP Build

This is an Imported Variation List build containing a VCF imported from dbSNP.

Data Products

This build produces a single directory of files, for example:

  1. x.copynumber
  2. x.genotype
  3. x.original
  4. x.original.vcf
  5. build.xml
  6. formatted_genotype_file_path.genotype
  7. formatted_genotype_file_path.genotype.gold2geno@
  8. gold_snp.v2.bed
  9. logs/
  10. reports/

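The "logs" and "reports" directories, as well as the build.xml file, are standard products of running a Build in the GMS. The remaining files are described below in the steps that generate them. In the listing above, "x" stands for the Sample ID that the files are named for, and the trailing "@" marks a symlink.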

Build Processes

Create Original Genotype Files

This step takes the genotype Instrument Data and produces two representations of the genotype data: a VCF file and a TSV file. Additional information about the genotypes is taken from dbSNP.

The TSV file is named for the Sample ID and has the following format:

2892852755.original

chromosome      position        alleles id      sample_name     log_r_ratio     gc_score        cnv_value       cnv_confidence  allele1 allele2
1       734462  TT      rs12564807      example_sample1 NA      -1      NA      NA      T       T
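As a rough illustration of the layout above, the following Python sketch reads such a file into dictionaries keyed by the header names. It assumes the file is tab-separated with the header row shown; the function name `read_original_genotypes` is hypothetical and not part of the GMS.

```python
import csv
import io


def read_original_genotypes(handle):
    """Yield one dict per genotype row, keyed by the header names."""
    # Assumption: the file is tab-separated and starts with the header row.
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        row["position"] = int(row["position"])  # positions are integers
        yield row


# The header and data row from the example above, joined with tabs.
example = (
    "chromosome\tposition\talleles\tid\tsample_name\tlog_r_ratio\tgc_score\t"
    "cnv_value\tcnv_confidence\tallele1\tallele2\n"
    "1\t734462\tTT\trs12564807\texample_sample1\tNA\t-1\tNA\tNA\tT\tT\n"
)
rows = list(read_original_genotypes(io.StringIO(example)))
print(rows[0]["id"], rows[0]["position"], rows[0]["alleles"])
```

All values other than the position are left as strings, since columns such as log_r_ratio and gc_score may hold "NA".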

The VCF file contains the same information in the standard VCF format:

2892852755.original.vcf

Create Filtered Genotype TSV File

This step takes the TSV file from the previous step and removes any genotype whose GC Score is present but below 0.7. The resulting file retains only the first three columns (chromosome, position, and alleles):

2892852755.genotype

1       734462  TT

For backwards compatibility, a symlink to this file is created in the build directory with the name formatted_genotype_file_path.genotype.gold2geno.
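The filtering rule in this step can be sketched in Python as follows. This is not the GMS implementation. It assumes that "NA", an empty value, and -1 all mean that no GC Score is present, which is consistent with the example row (gc_score of -1) being retained in the filtered file. The function names and the second input row are hypothetical.

```python
GC_SCORE_CUTOFF = 0.7  # threshold stated in the text


def passes_gc_filter(gc_score):
    """Return True if a genotype with this gc_score string should be kept."""
    # Assumption: "NA", an empty string, and -1 mean that no score is present.
    if gc_score in ("", "NA"):
        return True
    score = float(gc_score)
    if score == -1:
        return True
    return score >= GC_SCORE_CUTOFF


def filtered_genotype_lines(rows):
    """Yield chromosome, position, and alleles for rows that pass the filter."""
    for row in rows:
        if passes_gc_filter(row["gc_score"]):
            yield "\t".join(
                [row["chromosome"], str(row["position"]), row["alleles"]]
            )


rows = [
    {"chromosome": "1", "position": 734462, "alleles": "TT", "gc_score": "-1"},
    # Hypothetical row with a low score, invented for illustration.
    {"chromosome": "1", "position": 800000, "alleles": "AG", "gc_score": "0.35"},
]
kept = list(filtered_genotype_lines(rows))
print(kept)
```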

Create Copy Number TSV File

This step produces an alternate version of the original TSV with only the chromosome, position, and log_r_ratio columns:

2892852755.copynumber

1       734462  NA
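A minimal sketch of this projection, assuming each genotype is available as a dictionary keyed by the column names of the original TSV; `copynumber_line` is a hypothetical helper name.

```python
def copynumber_line(row):
    """Format one copy number line: chromosome, position, log_r_ratio."""
    return "\t".join(
        [row["chromosome"], str(row["position"]), row["log_r_ratio"]]
    )


# The example genotype from the original TSV.
example_row = {
    "chromosome": "1",
    "position": 734462,
    "alleles": "TT",
    "log_r_ratio": "NA",
}
line = copynumber_line(example_row)
print(line)
```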

Create Gold SNP File

This step produces another alternate TSV. Each row contains the chromosome, the position twice, the first allele, the second allele, and four columns holding either "ref" or "SNP". The first and third of these columns are "ref" if the first allele matches the reference at that position and "SNP" otherwise. The second and fourth are "ref" if the second allele matches the reference at that position and "SNP" otherwise:

formatted_genotype_file_path.genotype

1       734462  734462  T       T       SNP     SNP     SNP     SNP
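The column rules above can be sketched in Python as follows. The function name `gold_snp_line` is hypothetical, and the reference base is passed in directly because how the pipeline looks it up is not described here. The reference base "G" used in the example is taken from the BED example for the same position.

```python
def gold_snp_line(chromosome, position, allele1, allele2, reference_base):
    """Format one Gold SNP row from a genotype and the reference base."""
    call1 = "ref" if allele1 == reference_base else "SNP"
    call2 = "ref" if allele2 == reference_base else "SNP"
    fields = [
        chromosome,
        str(position),
        str(position),  # the position appears twice
        allele1,
        allele2,
        call1,  # first column: first allele vs. reference
        call2,  # second column: second allele vs. reference
        call1,  # third column repeats the first
        call2,  # fourth column repeats the second
    ]
    return "\t".join(fields)


homozygous = gold_snp_line("1", 734462, "T", "T", "G")
# Hypothetical heterozygous genotype at the same position, for illustration.
heterozygous = gold_snp_line("1", 734462, "G", "T", "G")
print(homozygous)
print(heterozygous)
```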

Create Gold SNP BED File

This step produces an alternate version of the previous file that conforms to the BED specification:

gold_snp.v2.bed

1       734461  734462  G/T     0       0       SNP     SNP     SNP     SNP

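The start coordinate is zero-based, as BED requires, so it is one less than the position (734461 for position 734462). The fourth column is reference/genotype, where an IUB code is used if the genotype is heterozygous. The fifth and sixth columns are always zero; they are placeholders for quality scores that are not present in the genotype files.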
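A minimal sketch of building one BED row under these rules. The helper names are hypothetical, the reference base is supplied by the caller, and the two-base ambiguity codes are the standard IUB/IUPAC nucleotide codes. Whether the pipeline handles other cases (for example indels or missing alleles) is not covered here.

```python
# Standard IUB/IUPAC ambiguity codes for two different bases.
IUB_CODES = {
    frozenset("AC"): "M",
    frozenset("AG"): "R",
    frozenset("AT"): "W",
    frozenset("CG"): "S",
    frozenset("CT"): "Y",
    frozenset("GT"): "K",
}


def genotype_code(allele1, allele2):
    """Return the base for a homozygous genotype, else its IUB code."""
    if allele1 == allele2:
        return allele1
    return IUB_CODES[frozenset((allele1, allele2))]


def gold_snp_bed_line(chromosome, position, allele1, allele2, reference_base):
    """Format one Gold SNP BED row; the BED start is position - 1."""
    call1 = "ref" if allele1 == reference_base else "SNP"
    call2 = "ref" if allele2 == reference_base else "SNP"
    fields = [
        chromosome,
        str(position - 1),  # zero-based start
        str(position),      # end
        f"{reference_base}/{genotype_code(allele1, allele2)}",
        "0",  # placeholder quality score
        "0",  # placeholder quality score
        call1,
        call2,
        call1,
        call2,
    ]
    return "\t".join(fields)


bed_line = gold_snp_bed_line("1", 734462, "T", "T", "G")
print(bed_line)
```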

Tutorials

### Defining a new Genotype Microarray model

Three things are needed to define a new Genotype Microarray model:

  1. the identifier of the instrument data record for the genotype data to process
  2. the identifier of the dbSNP build to be used to process the genotype data
  3. the identifier of the processing profile corresponding to the instrument that produced the genotype data

Given these, the "genome model define genotype-microarray" command is used, e.g.:

genome model define genotype-microarray --instrument-data 2893884260 --variation-list-build 127786607 --processing-profile 2591110

The command produces output like the following:

'variation_list_build', 'instrument_data', and 'processing_profile' may require verification...
Resolving parameter 'variation_list_build' from command argument '127786607'... found 1
Resolving parameter 'instrument_data' from command argument '2893884260'... found 1
Resolving parameter 'processing_profile' from command argument '2591110'... found 1
Created model:
id: 82a30a15a299460fb3d39b83dc06e545
name: example_subject1.microarray.external.plink.GRCh37-lite-build37
subject: example_subject1 (2892852755)
processing_profile: plink wugc (2591110)

The identifier "82a30a15a299460fb3d39b83dc06e545" is then used to track the model for starting builds, querying status, or assigning to downstream models.
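When scripting around this command, the model identifier can be pulled out of the captured output. The Python sketch below assumes the output has exactly the shape shown above (a "Created model:" line followed by "key: value" lines); `parse_created_model` is a hypothetical helper, not part of the GMS.

```python
def parse_created_model(output):
    """Collect the 'key: value' lines that follow 'Created model:'."""
    fields = {}
    in_model = False
    for line in output.splitlines():
        if line.strip() == "Created model:":
            in_model = True
            continue
        if in_model and ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields


# Tail of the example output shown above.
example_output = """\
Resolving parameter 'processing_profile' from command argument '2591110'... found 1
Created model:
id: 82a30a15a299460fb3d39b83dc06e545
name: example_subject1.microarray.external.plink.GRCh37-lite-build37
subject: example_subject1 (2892852755)
processing_profile: plink wugc (2591110)
"""
model = parse_created_model(example_output)
print(model["id"])
```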

### Determining which Processing Profile to use

Given the identifier for the instrument data, query the record for its instrument type:

genome instrument-data list imported id=2893884260 --show sequencing_platform

This will produce output like:

SEQUENCING_PLATFORM
-------------------
plink

This value is then used to query the available Processing Profiles:

genome processing-profile list genotype-microarray instrument_type=plink

This will produce output like:

ID        NAME         INPUT_FORMAT   INSTRUMENT_TYPE
--        ----         ------------   ---------------
2591110   plink wugc   wugc           plink

The identifier "2591110" is then the processing profile to pass to --processing-profile when defining the model.
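If the listing needs to be read by a script, a sketch like the following could turn it into records. It assumes the columns are separated by at least two spaces and that the second line is the dashed underline, as in the example above; the real output may be spaced differently. `parse_profile_table` is a hypothetical helper.

```python
import re


def parse_profile_table(output):
    """Parse the listing into dicts keyed by lower-cased column names."""
    lines = [line for line in output.splitlines() if line.strip()]
    # Assumption: columns are separated by two or more spaces.
    header = [name.lower() for name in re.split(r"\s{2,}", lines[0].strip())]
    profiles = []
    for line in lines[2:]:  # skip the header and the dashed underline
        values = re.split(r"\s{2,}", line.strip())
        profiles.append(dict(zip(header, values)))
    return profiles


example_listing = """\
ID        NAME         INPUT_FORMAT   INSTRUMENT_TYPE
--        ----         ------------   ---------------
2591110   plink wugc   wugc           plink
"""
profiles = parse_profile_table(example_listing)
matching = [p["id"] for p in profiles if p["instrument_type"] == "plink"]
print(matching)
```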

Known Issues
