This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Template markdown file for tracking data information/descriptions #336

Merged 14 commits on Dec 17, 2019


sjspielman
Member

Purpose/implementation Section

The purpose of this PR is to initiate a framework for tracking the source, usage, and description of all data associated with this project. The goal is NOT (currently) to track all plots, files, etc. in analyses/ but rather to describe the bulk of data in data/.

What scientific question is your analysis addressing?

The goal is to increase the transparency and reproducibility of this project while lowering the cost-of-entry for new contributors.

What was your approach?

A template markdown file was created for the purposes of tracking and describing data.
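
As a rough illustration of what such a tracking table could look like, the sketch below renders one row per data file as a markdown table. The column names and the placeholder row are assumptions made for this example; they are not taken from the actual template, and the `TODO` cells would be filled in by whoever knows each file best.

```python
from typing import Dict, List

# Hypothetical column names; the real template may use different ones.
COLUMNS = ["File name", "File type", "Origin", "Workflow", "Description"]


def render_table(rows: List[Dict[str, str]]) -> str:
    """Render the rows as a markdown table, leaving missing cells empty."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "|" + "|".join(["---"] * len(COLUMNS)) + "|"
    body = [
        "| " + " | ".join(row.get(col, "") for col in COLUMNS) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


# Placeholder row: only the file name comes from the project; the other
# cells are left as TODO rather than guessed.
rows = [
    {
        "File name": "`pbta-histologies.tsv`",
        "File type": "TODO",
        "Origin": "TODO",
        "Workflow": "TODO",
        "Description": "TODO",
    }
]

print(render_table(rows))
```

Keeping the columns in a single list means adding or dropping a column, which is one of the review questions below, only requires changing `COLUMNS`.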

What GitHub issue does your pull request address?

Issue #334

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Format of the markdown file, name and location of the markdown file, and whether the table is sufficient to describe the data (i.e., should there be more or fewer columns).

Is there anything that you want to discuss further?

We should discuss whether the README.md or CONTRIBUTING.md file should be modified to direct contributors to keep their data well documented in this markdown file.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

N/A

What is your summary of the results?

N/A

…k the source and description of data files
@sjspielman sjspielman added documentation Improvements or additions to documentation data discussion labels Dec 13, 2019
Collaborator

@jharenza jharenza left a comment


looks good, made some minor suggestions - see what you think!

Stephanie and others added 3 commits December 13, 2019 14:49
Co-Authored-By: Jo Lynne <jharenza@gmail.com>
Co-Authored-By: Jo Lynne <jharenza@gmail.com>
Co-Authored-By: Jo Lynne <jharenza@gmail.com>
@sjspielman
Member Author

Changes seem good to me! I figured the specifics for each data file could be added by those who are most familiar with the files, and this markdown would get the ball rolling in that direction. Thanks everyone for quick feedback!

Collaborator

@jharenza jharenza left a comment


will approve with the last change!

@jaclyn-taroni
Member

jaclyn-taroni commented Dec 13, 2019

I think this should live in the doc folder and be linked to "high up" in the main README and it should probably get filled out before it gets merged. I am happy to take the first pass at filling in the table.

Thoughts related to but outside of the scope of this pull request: We can also move the data formats section to doc (+ the reorganization mentioned here: #334 (comment)), but I think that's a separate PR. We might consider including both the notion of origin and associated analysis, but perhaps associated analysis in its own markdown document that isn't included in the download.

@jharenza
Collaborator

I think this should live in the doc folder and be linked to "high up" in the main README and it should probably get filled out before it gets merged. I am happy to take the first pass at filling in the table.

Thoughts related to but outside of the scope of this pull request: We can also move the data formats section to doc (+ the reorganization mentioned here: #334 (comment)), but I think that's a separate PR. We might consider including both the notion of origin and associated analysis, but perhaps associated analysis in its own markdown document that isn't included in the download.

Agree!

@jaclyn-taroni
Member

@sjspielman I made some changes last night — are those consistent with your goals for this document?

@jharenza I filled in everything that I felt comfortable filling in — can someone on your side fill in the rest and can you check what I filled in for accuracy and clarity?

Thank you!

Add missing fields
@cgreene cgreene removed their request for review December 16, 2019 14:46
jaclyn-taroni and others added 4 commits December 16, 2019 10:21
-add workflows
-note: `WGS.hg38.mutect2.unpadded.bed` should be renamed to `WGS.hg38.mutect2.vardict.unpadded.bed` in the next release, but kept as is for now since this description is for v11 files
@jaclyn-taroni
Member

I'm going to get this merged because we expect this to get updated as part of the pull request that includes the v12 release.

@jaclyn-taroni jaclyn-taroni merged commit b5f0cfc into AlexsLemonade:master Dec 17, 2019
jharenza pushed a commit that referenced this pull request Dec 17, 2019
### release-v12-20191217
- release date: 2019-12-17
- status: available
- changes:
  - Add `data-file-descriptions.md` with data release to better track file types, origins, and workflows per [#334](#334) and [#336](#336)
  - Add stranded RNA-Seq for 23 PNOC samples and 21 CBTTC samples previously sequenced using a polyA library prep. Files updated:
    - pbta-fusion-arriba.tsv.gz
    - pbta-fusion-starfusion.tsv.gz
    - pbta-gene-expression-rsem-tpm.stranded.rds
    - pbta-gene-expression-rsem-fpkm.stranded.rds
    - pbta-isoform-expression-rsem-tpm.stranded.rds
    - pbta-isoform-counts-rsem-expected_count.stranded.rds
    - pbta-gene-counts-rsem-expected_count.stranded.rds
    - pbta-gene-expression-kallisto.stranded.rds
    - pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
  - Add recurrently-fused genes by histology and matrix of recurrently-fused genes by biospecimen from [fusion filtering and prioritization analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering)
  - Update consensus TMB files and MAF [#333](#333)
  - Add RNA-Seq [collapsed matrices](#287) - wrong files (tables of transcripts removed) were included with [V10](#273)
  - Rename `WGS.hg38.mutect2.unpadded.bed` to `WGS.hg38.mutect2.vardict.unpadded.bed` to better reflect usage
  - Update `pbta-histologies.tsv`: add the new RNA-Seq samples listed above, harmonize disease separators ([#222](#222)), and rerun the [medulloblastoma classifier](https://github.com/d3b-center/medullo-classifier-package) using V12 RSEM FPKM collapsed files
    - BS_2Z1MKS84, BS_5VQP0E6K re-classified from Group4 to WNT and BS_3BDAG9YN, BS_8T7DZV2F, and BS_JTMXAMB7 re-classified from Group3 to WNT
  - Add CNVkit GISTIC results for focal CN analyses, e.g., [#244](#244) and [#8](#8)
@jharenza jharenza mentioned this pull request Dec 17, 2019
jaclyn-taroni pushed a commit that referenced this pull request Dec 19, 2019
* Release V12 data


* Update release-notes.md

fix link

* Update data-files-description.md

fix GISTIC table sectioning

* Update data-files-description.md

fix spacing on data description table

* Update data-files-description.md

fix more spacing in data file description file

* Update download-data.sh

add new release date to download script

* Update the TMB file descriptions

* Update TMB file formats section

* Update fusion section of data formats

Also more specific description of the by sample file

* Add GISTIC file to data-formats

* Update download-data.sh

* Update download-data.sh

* data description md is also included in md5sum

* TMB exon -> coding sequence

* Coding TMB CDS, not exon
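
Since the data description markdown is listed in the release's md5sum manifest alongside the data files, a downloaded release can be checked against that manifest. The following is a minimal sketch under stated assumptions: the manifest is assumed to use the usual `md5sum` output format (checksum, whitespace, file name), and the function names here are hypothetical helpers, not part of `download-data.sh`.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum(text: str) -> Dict[str, str]:
    """Map file name -> expected checksum from md5sum-style manifest text."""
    expected = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        checksum, name = line.split(maxsplit=1)
        # md5sum marks binary-mode entries with a leading '*' on the name.
        expected[name.lstrip("*").strip()] = checksum.lower()
    return expected


def verify(data_dir: Path, manifest_text: str) -> List[str]:
    """Return the names of files that are missing or fail their checksum."""
    failures = []
    for name, checksum in parse_md5sum(manifest_text).items():
        target = data_dir / name
        if not target.is_file() or md5_of(target) != checksum:
            failures.append(name)
    return failures
```

An empty list from `verify` means every file named in the manifest, including the description markdown, is present and matches its recorded checksum.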