Data file descriptions

This document describes all data files associated with this project. Each file has the following associated information:

  • File type will be one of:
    • Reference file: Obtained from an external source/database. When known, the obtained data and a link to the external source are included.
    • Modified reference file: Obtained from an external source/database but modified for OpenPBTA use.
    • PBTA data file: Pediatric Brain Tumor Atlas data that are processed upstream of the OpenPBTA project, e.g., the output of a somatic single nucleotide variant method. Links to the relevant D3B Center or Kids First workflow (and version where applicable) are included in Origin.
    • Analysis file: Any file created by a script in analyses/*.
  • Origin
    • For PBTA data files, a link to the relevant D3B Center or Kids First workflow (and version where applicable).
    • When applicable, a link to the specific script that produced (or modified, for Modified reference file types) the data.
  • File description
    • A brief one-sentence description of what the file contains (e.g., bed files contain coordinates for features XYZ).
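The three pieces of information listed above can be pictured as one record per file. The following is a minimal sketch, not part of the project: the names `FileType`, `DataFileEntry`, and `produced_in_repository` are hypothetical, and the enum covers only the four file types named in the list, whereas the table itself also uses more specific labels such as "Reference Target/Baits File".

```python
from dataclasses import dataclass
from enum import Enum


class FileType(Enum):
    """The four file types named in the list above."""
    REFERENCE = "Reference file"
    MODIFIED_REFERENCE = "Modified reference file"
    PBTA_DATA = "PBTA data file"
    ANALYSIS = "Analysis file"


@dataclass(frozen=True)
class DataFileEntry:
    """One row of the description table (hypothetical container)."""
    file_name: str
    file_type: FileType
    origin: str
    description: str


def produced_in_repository(entry: DataFileEntry) -> bool:
    """Return True when the file is created by a script in analyses/*."""
    return entry.file_type is FileType.ANALYSIS


# Example row, copied from the table for the current release.
example = DataFileEntry(
    file_name="pbta-fusion-putative-oncogenic.tsv",
    file_type=FileType.ANALYSIS,
    origin="analyses/fusion_filtering",
    description="Filtered and prioritized fusions",
)
```

A record like this makes it straightforward to separate files generated inside the repository from files delivered by upstream workflows, for instance by filtering on `produced_in_repository`.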

Current release (release-v16-20200320)

| File name | File Type | Origin | File Description |
|---|---|---|---|
| fusion_summary_embryonal_foi.tsv | Analysis file | analyses/fusion-summary | Summary file for presence of embryonal tumor fusions of interest |
| fusion_summary_ependymoma_foi.tsv | Analysis file | analyses/fusion-summary | Summary file for presence of ependymal tumor fusions of interest |
| fusion_summary_ewings_foi.tsv | Analysis file | analyses/fusion-summary | Summary file for presence of Ewing's sarcoma fusions of interest |
| gencode.v27.primary_assembly.annotation.gtf.gz | Reference file | GENCODE v27 | hg38 gene annotation on primary assembly (reference chromosomes and scaffolds) |
| GRCh38.primary_assembly.genome.fa.gz | Reference Genome file | GENCODE v27 | hg38 primary assembly genome sequence FASTA file |
| independent-specimens.wgs.primary-plus.tsv | Analysis file | analyses/independent-samples | Independent specimens list for WGS samples, primary + non-primary when no primary sample is available |
| independent-specimens.wgs.primary.tsv | Analysis file | analyses/independent-samples | Independent specimens list for WGS samples, primary only |
| independent-specimens.wgswxs.primary-plus.tsv | Analysis file | analyses/independent-samples | Independent specimens list for WGS and WXS samples, primary + non-primary when no primary sample is available |
| independent-specimens.wgswxs.primary.tsv | Analysis file | analyses/independent-samples | Independent specimens list for WGS and WXS samples, primary only |
| intersect_cds_lancet.bed | Analysis file | analyses/snv-callers | Intersection of gencode.v27.primary_assembly.annotation.gtf.gz CDS with WXS 100bp padded BED regions and Lancet's WXS regions |
| intersect_cds_lancet_strelka_mutect_WGS.bed | Analysis file | analyses/snv-callers | Intersection of gencode.v27.primary_assembly.annotation.gtf.gz CDS with Lancet, Strelka2, Mutect2 regions |
| intersect_strelka_mutect_WGS.bed | Analysis file | analyses/snv-callers | Intersection of gencode.v27.primary_assembly.annotation.gtf.gz CDS with Strelka2 and Mutect2 regions called |
| pbta-cnv-cnvkit-gistic.zip | PBTA data file | Workflow | Somatic CNV - GISTIC 2.0 output using pbta-cnv-cnvkit.seg file input (WGS samples only) |
| pbta-cnv-consensus-gistic.zip | Analysis file | Workflow | Somatic CNV - GISTIC 2.0 output using pbta-cnv-consensus.seg file input (WGS samples only) |
| pbta-cnv-cnvkit.seg.gz | PBTA data file | Copy number variant calling; Workflow | Somatic Copy Number Variant - CNVkit SEG file (WGS samples only) |
| pbta-cnv-consensus.seg.gz | Analysis file | CNV consensus calls | Somatic Copy Number Variant - consensus SEG file (WGS samples only) |
| pbta-cnv-controlfreec.tsv.gz | PBTA data file | Copy number variant calling; Workflow | Somatic Copy Number Variant - TSV file that is a merge of ControlFreeC *_CNVs files (WGS samples only) |
| consensus_seg_annotated_cn_autosomes.tsv.gz | Analysis file | Focal CNV consensus calls | TSV file containing genes with copy number changes per biospecimen; autosomes only |
| consensus_seg_annotated_cn_x_and_y.tsv.gz | Analysis file | Focal CNV consensus calls | TSV file containing genes with copy number changes per biospecimen; sex chromosomes only |
| pbta-fusion-arriba.tsv.gz | PBTA data file | Gene fusion detection; Workflow | Fusion - Arriba TSV, annotated with FusionAnnotator |
| pbta-fusion-putative-oncogenic.tsv | Analysis file | analyses/fusion_filtering | Filtered and prioritized fusions |
| pbta-fusion-recurrently-fused-genes-byhistology.tsv | Analysis file | analyses/fusion_filtering | Recurrently fused genes tabulated by broad histology |
| pbta-fusion-recurrently-fused-genes-bysample.tsv | Analysis file | analyses/fusion_filtering | Binary matrix that denotes the presence or absence of a recurrently fused gene in an individual RNA-seq specimen |
| pbta-fusion-starfusion.tsv.gz | PBTA data file | Gene fusion detection; Workflow | Fusion - STARFusion TSV |
| pbta-gene-counts-rsem-expected_count.polya.rds | PBTA data file | Gene expression abundance estimation; Workflow | Gene expression - RSEM expected counts for poly-A samples (gene-level) |
| pbta-gene-counts-rsem-expected_count.stranded.rds | PBTA data file | Gene expression abundance estimation; Workflow | Gene expression - RSEM expected counts for stranded samples (gene-level) |
| pbta-gene-expression-kallisto.polya.rds | PBTA data file | Gene expression abundance estimation; Workflow | Gene expression - kallisto TPM for poly-A samples (transcript-level) |
| pbta-gene-expression-kallisto.stranded.rds | PBTA data file | Gene expression abundance estimation; Workflow | Gene expression - kallisto TPM for stranded samples (transcript-level) |
| pbta-gene-expression-rsem-fpkm-collapsed.polya.rds | Analysis file | analyses/collapse-rnaseq | Gene expression - RSEM FPKM for poly-A samples collapsed to gene symbol (gene-level) |
| pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds | Analysis file | analyses/collapse-rnaseq | Gene expression - RSEM FPKM for stranded samples collapsed to gene symbol (gene-level) |
| pbta-gene-expression-rsem-fpkm.polya.rds | PBTA data file | Gene expression abundance estimation; Workflow | Gene expression - RSEM FPKM for poly-A samples (gene-level) |
| pbta-gene-expression-rsem-fpkm.stranded.rds | PBTA data file | Gene expression abundance estimation; Workflow | Gene expression - RSEM FPKM for stranded samples (gene-level) |
| pbta-gene-expression-rsem-tpm.polya.rds | PBTA data file | Gene expression abundance estimation; Workflow | Gene expression - RSEM TPM for poly-A samples (gene-level) |
| pbta-gene-expression-rsem-tpm.stranded.rds | PBTA data file | Gene expression abundance estimation; Workflow | Gene expression - RSEM TPM for stranded samples (gene-level) |
| pbta-histologies.tsv | PBTA data file | Clinical data harmonization | Harmonized clinical metadata file (see data dictionary here) |
| pbta-isoform-counts-rsem-expected_count.polya.rds | PBTA data file | Gene expression abundance estimation; Workflow | Gene expression - RSEM expected counts for poly-A samples (transcript-level) |
| pbta-isoform-counts-rsem-expected_count.stranded.rds | PBTA data file | Gene expression abundance estimation; Workflow | Gene expression - RSEM expected counts for stranded samples (transcript-level) |
| pbta-isoform-expression-rsem-tpm.polya.rds | PBTA data file | Gene expression abundance estimation; Workflow | Gene expression - RSEM TPM for poly-A samples (transcript-level) |
| pbta-isoform-expression-rsem-tpm.stranded.rds | PBTA data file | Gene expression abundance estimation; Workflow | Gene expression - RSEM TPM for stranded samples (transcript-level) |
| pbta-mend-qc-manifest.tsv | PBTA data file | MendQC analysis placeholder; Workflow | File to map MendQC output to biospecimen IDs |
| pbta-mend-qc-results.tar.gz | PBTA data file | MendQC analysis placeholder; Workflow | MendQC output files |
| pbta-snv-consensus-mutation.maf.tsv.gz | Analysis file | analyses/snv-callers | Consensus calls for SNVs and small indels; columns in the included file are derived from Strelka2. |
| pbta-snv-consensus-mutation-tmb-all.tsv | Analysis file | analyses/snv-callers | Tumor mutation burden statistics calculated from the Strelka2 and Mutect2 consensus SNVs and the size of the intersection of the Strelka2 and Mutect2 BED windows. |
| pbta-snv-consensus-mutation-tmb-coding.tsv | Analysis file | analyses/snv-callers | Coding-only tumor mutation burden statistics calculated from the number of coding sequence Strelka2, Mutect2, and Lancet consensus SNVs and the size of the intersection of all three callers' BED windows and the GENCODE v27 coding sequences. |
| pbta-snv-lancet.vep.maf.gz | PBTA data file | Somatic mutation calling; Workflow | Somatic SNV - Lancet annotated MAF file |
| pbta-snv-mutect2.vep.maf.gz | PBTA data file | Somatic mutation calling; Workflow | Somatic SNV - Mutect2 annotated MAF file |
| pbta-snv-strelka2.vep.maf.gz | PBTA data file | Somatic mutation calling; Workflow | Somatic SNV - Strelka2 annotated MAF file |
| pbta-snv-vardict.vep.maf.gz | PBTA data file | Somatic mutation calling; Workflow | Somatic SNV - VarDict annotated MAF file |
| pbta-star-log-final.tar.gz | PBTA data file | Gene expression abundance estimation; Workflow | STAR log final output files |
| pbta-star-log-manifest.tsv | PBTA data file | Gene expression abundance estimation; Workflow | File to map STAR output to biospecimen IDs |
| pbta-sv-manta.tsv.gz | PBTA data file | Structural variant calling; Workflow | Somatic Structural Variant - Manta output, annotated with AnnotSV (WGS samples only) |
| pbta-tcga-manifest.tsv | PBTA data file | Somatic mutation calling | Manifest of tumor/normal BAMs used for SNV calling, Tumor_Sample_Barcodes, and histologies |
| pbta-tcga-snv-lancet.vep.maf.gz | PBTA/TCGA data file | Somatic mutation calling; Workflow | Somatic SNV - Lancet annotated MAF file |
| pbta-tcga-snv-mutect2.vep.maf.gz | PBTA data file | Somatic mutation calling; Workflow | Somatic SNV - Mutect2 annotated MAF file |
| pbta-tcga-snv-strelka2.vep.maf.gz | PBTA data file | Somatic mutation calling; Workflow | Somatic SNV - Strelka2 annotated MAF file |
| StrexomeLite_hg38_liftover_100bp_padded.bed | Reference Target/Baits File | SNV and INDEL calling | hg38 targeted panel regions used for all variant callers, each region padded by 100 bp |
| StrexomeLite_Targets_CrossMap_hg38_filtered_chr_prefixed.bed | Target/Baits File | SNV and INDEL calling | hg38 targeted DNA panel bait capture regions provided by the kit manufacturer |
| WGS.hg38.lancet.300bp_padded.bed | Reference Target/Baits File | SNV and INDEL calling | WGS.hg38.lancet.unpadded.bed file with each region padded by 300 bp |
| WGS.hg38.lancet.unpadded.bed | Reference Regions File | SNV and INDEL calling | hg38 WGS regions created using UTR, exome, and start/stop codon features of the GENCODE 31 reference, augmented with PASS variant calls from Strelka2 and Mutect2 |
| WGS.hg38.mutect2.vardict.unpadded.bed | Reference Regions File | SNV and INDEL calling | hg38 Broad Institute interval calling list (restricted to Chr1-22,X,Y,M and non-N regions) used for the Mutect2 and VarDict variant callers |
| WGS.hg38.strelka2.unpadded.bed | Reference Regions File | SNV and INDEL calling | hg38 Broad Institute interval calling list (restricted to Chr1-22,X,Y,M) used for the Strelka2 variant caller |
| WGS.hg38.vardict.100bp_padded.bed | Reference Regions File | SNV and INDEL calling | WGS.hg38.mutect2.vardict.unpadded.bed with each region padded by 100 bp; used for the VarDict variant caller |
| WXS.hg38.100bp_padded.bed | Reference Target/Baits File | SNV and INDEL calling | hg38 WXS regions provided by the kit manufacturer, each padded by 100 bp; used for the Strelka2, Mutect2, and VarDict variant callers |
| WXS.hg38.lancet.400bp_padded.bed | Reference Target/Baits File | SNV and INDEL calling | hg38 WXS regions provided by the kit manufacturer, each padded by 400 bp; used for the Lancet variant caller |
| intersected_whole_exome_agilent_designed_120_AND_tcga_6k_genes.Gh38.bed | Reference Target/Baits File | SNV and INDEL calling | Generated using bedtools intersect from tcga_6k_genes.targetIntervals.Gh38.bed and whole_exome_agilent_designed_120.targetIntervals.Gh38.bed |
| intersected_whole_exome_agilent_plus_tcga_6k_AND_tcga_6k_genes.Gh38.bed | Reference Target/Baits File | SNV and INDEL calling | Generated using bedtools intersect from tcga_6k_genes.targetIntervals.Gh38.bed and whole_exome_agilent_plus_tcga_6k.targetIntervals.Gh38.bed |
| tcga_6k_genes.targetIntervals.Gh38.bed | Reference Target/Baits File | SNV and INDEL calling | hg38 version of tcga_6k_genes.targetIntervals.bed. Generated using CrossMap and bedtools sort and merge. Script in TCGA-capture-kit-analyses |
| tcga_6k_genes.targetIntervals.bed | Reference Target/Baits File | SNV and INDEL calling | hg19 WXS target capture regions downloaded from GDC website API endpoint. Script to retrieve file described in TCGA-capture-kit-analyses |
| whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals.Gh38.bed | Reference Target/Baits File | SNV and INDEL calling | hg38 version of whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals.bed. Generated using CrossMap and bedtools sort and merge. Script in TCGA-capture-kit-analyses |
| whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals.bed | Reference Target/Baits File | SNV and INDEL calling | hg19 WXS target capture regions downloaded from GDC website API endpoint. Script to retrieve file described in TCGA-capture-kit-analyses |
| whole_exome_agilent_designed_120.targetIntervals.Gh38.bed | Reference Target/Baits File | SNV and INDEL calling | hg38 version of whole_exome_agilent_designed_120.targetIntervals.bed. Generated using CrossMap and bedtools sort and merge. Script in TCGA-capture-kit-analyses |
| whole_exome_agilent_designed_120.targetIntervals.bed | Reference Target/Baits File | SNV and INDEL calling | hg19 WXS target capture regions downloaded from GDC website API endpoint. Script to retrieve file described in TCGA-capture-kit-analyses |
| whole_exome_agilent_plus_tcga_6k.targetIntervals.Gh38.bed | Reference Target/Baits File | SNV and INDEL calling | hg38 version of whole_exome_agilent_plus_tcga_6k.targetIntervals.bed. Generated using CrossMap and bedtools sort and merge. Script in TCGA-capture-kit-analyses |
| whole_exome_agilent_plus_tcga_6k.targetIntervals.bed | Reference Target/Baits File | SNV and INDEL calling | hg19 WXS target capture regions downloaded from GDC website API endpoint. Script to retrieve file described in TCGA-capture-kit-analyses |
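
Several BED files in the table are described as other region files "padded by" a fixed number of base pairs, and the tumor mutation burden files are described as being calculated from SNV counts and BED window sizes. The sketch below illustrates both ideas in plain Python. It is not the project's code: the function names are hypothetical, regions are assumed to follow the BED convention of 0-based, half-open coordinates, the padded end is not clamped to chromosome length (no chromosome sizes are assumed to be available), and mutation burden is assumed to be expressed as mutations per megabase of surveyed region.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end), assuming BED-style 0-based half-open coordinates
Region = Tuple[str, int, int]


def pad_regions(regions: Iterable[Region], padding: int) -> List[Region]:
    """Extend each region by `padding` bp on both sides, clamping the start at 0."""
    return [(chrom, max(0, start - padding), end + padding)
            for chrom, start, end in regions]


def merge_regions(regions: Iterable[Region]) -> List[Region]:
    """Merge overlapping or adjacent regions so that no base is counted twice."""
    merged: List[Region] = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def total_size_bp(regions: Iterable[Region]) -> int:
    """Sum of region lengths in base pairs (expects non-overlapping regions)."""
    return sum(end - start for _, start, end in regions)


def mutations_per_mb(n_mutations: int, regions: Iterable[Region]) -> float:
    """Mutation count divided by the surveyed region size, scaled to megabases."""
    size = total_size_bp(merge_regions(regions))
    if size == 0:
        raise ValueError("surveyed region is empty")
    return n_mutations * 1_000_000 / size


# Toy input: two regions that overlap once padded by 100 bp.
targets = [("chr1", 50, 150), ("chr1", 300, 400)]
padded = pad_regions(targets, 100)
```

Merging before summing matters because padding can make neighbouring regions overlap; without it, the denominator of the per-megabase rate would be inflated.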