Template markdown file for tracking data information/descriptions #336

sjspielman · 2019-12-13T16:59:33Z

Purpose/implementation Section

The purpose of this PR is to initiate a framework for tracking the source, usage, and description of all data associated with this project. The goal is NOT (currently) to track all plots, files, etc. in analyses/ but rather to describe the bulk of data in data/.

What scientific question is your analysis addressing?

The goal is to increase the transparency and reproducibility of this project while lowering the cost-of-entry for new contributors.

What was your approach?

A template markdown file was created for the purposes of tracking and describing data.

What GitHub issue does your pull request address?

Issue #334

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Format of the markdown file, name and location of the markdown file, and whether the table is sufficient to describe the data (i.e., should there be more or fewer columns?).

Is there anything that you want to discuss further?

We should discuss whether the README.md or CONTRIBUTING.md file should be modified to direct contributors to keep their data well documented in this markdown file.
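One way such a directive could be made checkable is sketched below. It is an illustration only, not something this PR adds: the function name `undocumented_files` is hypothetical, and it assumes every documented file name appears in backticks somewhere in the description markdown.

```python
import re
from pathlib import Path


def undocumented_files(data_dir, description_md):
    """Return sorted names of files in data_dir that the markdown never mentions.

    Assumption (not from the PR): documented file names are written in
    backticks, e.g. `pbta-histologies.tsv`, somewhere in description_md.
    """
    text = Path(description_md).read_text(encoding="utf-8")
    # Collect every backtick-quoted token in the markdown.
    documented = set(re.findall(r"`([^`]+)`", text))
    # Only regular files directly inside data_dir are considered.
    present = sorted(p.name for p in Path(data_dir).iterdir() if p.is_file())
    return [name for name in present if name not in documented]
```

A contributor could run this before opening a pull request and add a table row for each name it returns.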

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

N/A

What is your summary of the results?

N/A

jharenza

looks good, made some minor suggestions - see what you think!

DATA-DESCRIPTION.md

Co-Authored-By: Jo Lynne <jharenza@gmail.com>

sjspielman · 2019-12-13T19:50:43Z

Changes seem good to me! I figured the specifics for each data file could be added by those who are most familiar with the files, and this markdown would get the ball rolling in that direction. Thanks everyone for quick feedback!

jharenza

will approve with the last change!

DATA-DESCRIPTION.md

jaclyn-taroni · 2019-12-13T22:01:15Z

I think this should live in the doc folder and be linked to "high up" in the main README and it should probably get filled out before it gets merged. I am happy to take the first pass at filling in the table.

Thoughts related to but outside of the scope of this pull request: We can also move the data formats section to doc (+ the reorganization mentioned here: #334 (comment)), but I think that's a separate PR. We might consider including both the notion of origin and associated analysis, but perhaps associated analysis in its own markdown document that isn't included in the download.

jharenza · 2019-12-15T23:55:02Z

> I think this should live in the doc folder and be linked to "high up" in the main README and it should probably get filled out before it gets merged. I am happy to take the first pass at filling in the table.
>
> Thoughts related to but outside of the scope of this pull request: We can also move the data formats section to doc (+ the reorganization mentioned here: #334 (comment)), but I think that's a separate PR. We might consider including both the notion of origin and associated analysis, but perhaps associated analysis in its own markdown document that isn't included in the download.

Agree!

Change tense

jaclyn-taroni · 2019-12-16T11:03:57Z

@sjspielman I made some changes last night — are those consistent with your goals for this document?

@jharenza I filled in everything that I felt comfortable filling in — can someone on your side fill in the rest and can you check what I filled in for accuracy and clarity?

Thank you!

Add missing fields

- add workflows
- note: `WGS.hg38.mutect2.unpadded.bed` should be renamed to `WGS.hg38.mutect2.vardict.unpadded.bed` in the next release, but kept as is for now since this description is for v11 files

jaclyn-taroni · 2019-12-17T16:01:54Z

I'm going to get this merged because we expect this to get updated as part of the pull request that includes the v12 release.

### release-v12-20191217
- release date: 2019-12-17
- status: available
- changes:
  - Add `data-file-descriptions.md` with data release to better track file types, origins, and workflows per [#334](#334) and [#336](#336)
  - Add stranded RNA-Seq for 23 PNOC samples and 21 CBTTC samples previously sequenced using a polyA library prep. Files updated:
    - pbta-fusion-arriba.tsv.gz
    - pbta-fusion-starfusion.tsv.gz
    - pbta-gene-expression-rsem-tpm.stranded.rds
    - pbta-gene-expression-rsem-fpkm.stranded.rds
    - pbta-isoform-expression-rsem-tpm.stranded.rds
    - pbta-isoform-counts-rsem-expected_count.stranded.rds
    - pbta-gene-counts-rsem-expected_count.stranded.rds
    - pbta-gene-expression-kallisto.stranded.rds
    - pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
  - Add recurrently-fused genes by histology and matrix of recurrently-fused genes by biospecimen from [fusion filtering and prioritization analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering)
  - Update consensus TMB files and MAF [#333](#333)
  - Add RNA-Seq [collapsed matrices](#287) - wrong files (tables of transcripts removed) were included with [V10](#273)
  - Rename `WGS.hg38.mutect2.unpadded.bed` to `WGS.hg38.mutect2.vardict.unpadded.bed` to better reflect usage
  - Update `pbta-histologies.tsv` to add new RNA-Seq samples listed above, [#222 harmonize disease separators](#222), and reran [medulloblastoma classifier](https://github.com/d3b-center/medullo-classifier-package) using V12 RSEM fpkm collapsed files
    - BS_2Z1MKS84, BS_5VQP0E6K re-classified from Group4 to WNT and BS_3BDAG9YN, BS_8T7DZV2F, and BS_JTMXAMB7 re-classified from Group3 to WNT
  - Add CNVkit GISTIC results focal CN analyses, eg: [#244](#244) and [#8](#8)

* Release V12 data
* Update release-notes.md
  fix link
* Update data-files-description.md
  fix GISTIC table sectioning
* Update data-files-description.md
  fix spacing on data description table
* Update data-files-description.md
  fix more spacing in data file description file
* Update download-data.sh
  add new release date to download script
* Update the TMB file descriptions
* Update TMB file formats section
* Update fusion section of data formats
  Also more specific description of the by sample file
* Add GISTIC file to data-formats
* Update download-data.sh
* Update download-data.sh
* data description md is also included in md5sum
* TMB exon -> coding sequence
* Coding TMB CDS, not exon

Started a template markdown file for contributors to provide and track the source and description of data files

1228403

sjspielman added the documentation, data, and discussion labels Dec 13, 2019

sjspielman assigned jharenza and jaclyn-taroni Dec 13, 2019

jharenza suggested changes Dec 13, 2019

Stephanie and others added 3 commits December 13, 2019 14:49

Update DATA-DESCRIPTION.md

e56299c

Co-Authored-By: Jo Lynne <jharenza@gmail.com>

Update DATA-DESCRIPTION.md

c0716f2

Co-Authored-By: Jo Lynne <jharenza@gmail.com>

Update DATA-DESCRIPTION.md

01dfc46

Co-Authored-By: Jo Lynne <jharenza@gmail.com>

jharenza suggested changes Dec 13, 2019

jaclyn-taroni added 3 commits December 13, 2019 17:19

Merge branch 'master' into master

e987719

Move file to doc

d1ccaa8

@jharenza suggested change

e32f5c3

jharenza approved these changes Dec 15, 2019

jaclyn-taroni added 2 commits December 15, 2019 21:21

Introduce PBTA data concept

87a4f17

Change tense

First pass at filling in the table

9e0f962

update data descriptions

3202804

Add missing fields

jharenza requested a review from cgreene December 16, 2019 14:39

cgreene removed their request for review December 16, 2019 14:46

jaclyn-taroni and others added 4 commits December 16, 2019 10:21

Fix some description spacing

22db28d

Update data-description.md

88abd4b

- add workflows
- note: `WGS.hg38.mutect2.unpadded.bed` should be renamed to `WGS.hg38.mutect2.vardict.unpadded.bed` in the next release, but kept as is for now since this description is for v11 files

Merge branch 'master' into master

11a1fc7

Merge branch 'master' into master

af7266d

jaclyn-taroni merged commit b5f0cfc into AlexsLemonade:master Dec 17, 2019

This was referenced Dec 17, 2019

Data release documentation checklist #343

Closed

Documentation: Descriptions of all files in data/ #334

Closed

Add "modules at a glance" table #345

Merged

jharenza mentioned this pull request Dec 17, 2019

Release V12 data #347

Merged
