Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Subtype chordoma #475

Merged
merged 27 commits into from
Jan 31, 2020
Merged

Conversation

mkoptyra
Copy link
Contributor

@mkoptyra mkoptyra commented Jan 24, 2020

Purpose/implementation Section

What scientific question is your analysis addressing?

To subtype the chordoma pediatric tumors by the expression and/or loss of SMARCB1/SNF5 gene.

What was your approach?

What GitHub issue does your pull request address?

#250

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@jaclyn-taroni jaclyn-taroni added the molecular subtyping Related to molecular subtyping of tumors label Jan 25, 2020
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution @mkoptyra! I have a few small suggestions for organization, but most of my suggestions are about adding some documentation to better describe the steps as they happen. As you are working in a notebook, it is very convenient to add descriptive sections before code blocks to write out the intent of the code that follows to make it easier to follow.

As a reminder, as @jaclyn-taroni has made some additions to this branch on github, you will want to "Pull" the changes to the branch on Github to your local machine before making further changes to avoid conflicts between edits made at different times.

Comment on lines 24 to 28
```{r}
chordoma_samples <- histologies_df %>%
filter(short_histology == "Chordoma") %>%
pull(Kids_First_Biospecimen_ID)
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest moving this code block closer to where you will actually use chrodoma_samples. Keep all the reading of files in this section, but move processing later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you


## Prepare the data

```{r}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably worth adding some documentation here. What are you doing in this section? Something simple like:

"First we extract the chordoma samples that have a loss of SMARCB1 from focal_cn_df"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you - updated

Comment on lines 94 to 97
```{r}
# remove large copy number data frame
rm(focal_cn_df)
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems out of place here. It probably isn't necessary, but if you want to clean up, it is probably better to do it nearer to where the last use of the df occurred, so around line 69

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you - moved

smarcb1_expression
```

```{r}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment here to make it clear that you are joining the copy number data with the expression data in this step

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you - added

Kids_First_Participant_ID)),
by = "sample_id")

# combining the two biospecimen identifiers to a single column (all biospecimen IDs for a sampl separated by a comma)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing this? Is there a reason not to leave these as two separate columns? Perhaps naming one as WGS and one as expression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly - I suppose this was a step to make sure the Kids_First_Participant_ID is clearly the way to identify/select by?
@jaclyn-taroni - could you comment on this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We started out handling the biospecimen IDs this was for the reason that @mkoptyra points out. But since giving it more thought and realizing the clinical file this will go into is focused on the Kids_First_Biospecimen_ID, we now do as @jashapiro suggestions in the HGG (#249) and non-MB, non-ATRT embryonal tumors. I can change this part @mkoptyra and then leave a detailed comment about what I did. Let me know when you’ve pushed your changes and are ready for me to make edits!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we rename the columns Kids_First_Biospecimen_ID_DNA and Kids_First_Biospecimen_ID_RNA --

chordoma_smarcb1_df <- smarcb1_expression %>%
  # any missing samples will get filled with NA when using a full join
  full_join(chordoma_copy, by = "sample_id") %>%
  rename(Kids_First_Biospecimen_ID_DNA = Kids_First_Biospecimen_ID,
         Kids_First_Biospecimen_ID_RNA = biospecimen_id)

That's what the rename function is doing.

I also now do:

chordoma_smarcb1_df <- chordoma_id_df %>%
  select(sample_id, Kids_First_Participant_ID) %>%
  distinct() %>%
  inner_join(chordoma_smarcb1_df,
             by = "sample_id")

instead of:

chordoma_smarcb1_df <- chordoma_smarcb1_df %>%
  inner_join(distinct(select(chordoma_id_df, 
                             sample_id, 
                             Kids_First_Participant_ID)),
             by = "sample_id")

It's doing the same thing (adding the Kids_First_Participant_ID using the chordoma_id_df data frame) but it's a little easier to read in my opinion.


Write the table to file.

```{r}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this output file include a subtype "call"? Or is the deletion status sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by subtype "call"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @jashapiro is asking about whether we need a column that indicates whether or not a tumor is poorly-differentiated. I was hesitant to add this until the copy number was in better shape. We now have copy number consensus files (contain calls that multiple methods agree on) that we could use instead of the old version that uses the URL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the notebook to use the consensus file and also updated the documentation to reflect that change in c26db85

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some instances where the expression levels between losses and the neutral calls are similar (current state of plot below):
smarcb1_expression_copy_status

We may want to hold off on adding calls until we do a bit more digging as part of #387 and #486

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

mkoptyra and others added 8 commits January 30, 2020 04:32
Making folder name more consistent with other analyses (f270af8)
Updating directory name in CI (2fca75c)
Using older version of CNVkit file (4ded6e8)
Adding saving input (e7c23bd)
Refreshing notebook (5058543)
Adding more information to the README (4ec4aec)
Adding entry to modules at a glance (db258e9 )
Thank you for the suggestions

Co-Authored-By: jashapiro <jashapiro@gmail.com>
Thank you

Co-Authored-By: jashapiro <jashapiro@gmail.com>
Co-Authored-By: jashapiro <jashapiro@gmail.com>
- adding more informative description to some chunks; moving order of the "rm(focal_cn_df)" chunk; editing the graph
@mkoptyra
Copy link
Contributor Author

@jaclyn-taroni and @jashapiro - Thank you for all your work
Pulled all updated changes made by Jaclyn to my local notebook and updated some changes suggested by Josh
There are still 2 comments/questions in Josh's comments

@jaclyn-taroni
Copy link
Member

@mkoptyra I would check to make sure you have committed and pushed all your changes via GitKraken. There are a couple instances where you have commented that you’ve made an update but the version I can see using the “Files changed” tab doesn’t yet have those updates. For the outstanding questions from @jashapiro, I can take a look, make changes, and then make sure to leave a detailed comment explaining those changes.

@jaclyn-taroni
Copy link
Member

Thanks @mkoptyra, I'll take a look!

@jaclyn-taroni
Copy link
Member

@jashapiro ready for another look 👀 !

@@ -26,6 +26,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
| [`independent-samples`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/independent-samples) | `pbta-histologies.tsv` | Generates independent specimen lists for WGS/WXS samples | `results/independent-specimens.wgs.primary.tsv` <br> `results/independent-specimens.wgs.primary-plus.tsv` <br> `results/independent-specimens.wgswxs.primary.tsv` <br> `results/independent-specimens.wgswxs.primary-plus.tsv` (included in data download)
| [`interaction-plots`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/interaction-plots) | `independent-specimens.wgs.primary-plus.tsv` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` | Creates interaction plots for mutation mutual exclusivity/co-occurrence [#13](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/13); may be updated to include other data types (e.g., fusions) | N/A
| [`molecular-subtyping-ATRT`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-ATRT) | `analyses/gene-set-enrichment-analysis/results/gsva_scores_stranded.tsv` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz` <br> `pbta-snv-consensus-mutation-tmb-all.tsv` <br> `2019-01-28-consensus-cnv.zip` from [#453 (comment)](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/453#issuecomment-579340618) | Summarizing data into tabular format in order to molecularly subtype ATRT samples [#244](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/244); this analysis did not work | N/A
| [`molecular-subtyping-chordoma`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-chordoma) | `analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` | *In progress*; identifying poorly-differentiated chordoma samples per [#250](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/250) | N/A
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we now have an output file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have an output file, but not one I expect to be consumed by other analyses. I expect molecular subtype labels, when we have them, will go into the pbta-histologies.tsv file and be consumed that way.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!


Write the table to file.

```{r}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

@jaclyn-taroni
Copy link
Member

Thanks for your work on this @mkoptyra ! I'm going to merge now.

@jaclyn-taroni jaclyn-taroni merged commit eadb78d into AlexsLemonade:master Jan 31, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
molecular subtyping Related to molecular subtyping of tumors
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants