
Snv database re-run to include all maf columns #954

Closed
wants to merge 2 commits

Conversation

kgaonkar6
Collaborator

Purpose/implementation Section

What scientific question is your analysis addressing?

Can we update the SQL database creation step so that all MAF tables contain all MAF columns? We will mainly need this when we want to scavenge back hotspots called by one or more callers in MAF format.

What was your approach?

I added mutect2 and vardict to the list of table names that should contain all MAF columns.

What GitHub issue does your pull request address?

#819

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Adding all columns of the MAF files creates a huge ~50 GB database file. Is there some way to reduce the file size, or to save the database elsewhere for easy access?

Is there anything that you want to discuss further?

NA

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

This is just a code update; the SQL file gets saved into the scratch folder, which is ignored in this repo.

Results

What types of results are included (e.g., table, figure)?

NA

What is your summary of the results?

NA

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@kgaonkar6 kgaonkar6 changed the title from "Snv rerun" to "Snv database re-run to include all maf columns" on Mar 11, 2021
@jashapiro
Member

Adding all columns of the MAF files creates a huge ~50 GB database file. Is there some way to reduce the file size, or to save the database elsewhere for easy access?

This is the main reason this was not done in the first place. I am hesitant to add in all of the columns like this for a limited use case of searching for specific hotspots. That would likely be more efficiently accomplished by creating a subset of the original maf file for sites that match known hotspot locations.
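
For illustration, here is a minimal sketch of that subsetting idea in R. The object names (maf_df for the original MAF read into a data frame, hotspots_df for a table of known hotspot locations) and the join columns are hypothetical, not the module's actual code.

```r
# Sketch only: keep the MAF rows whose gene and genomic position match a
# known hotspot location. maf_df and hotspots_df are hypothetical objects.
library(dplyr)

maf_hotspots <- maf_df %>%
  semi_join(hotspots_df, by = c("Hugo_Symbol", "Chromosome", "Start_Position"))
```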

Alternatively, if there are specific columns you need from all tables, you might add those to needed_types. The only reason we kept all of them was so that we could write out a "complete" maf file for the consensus. As I understand the analysis goals for this project, I don't think you need that?

@kgaonkar6
Collaborator Author

kgaonkar6 commented Mar 11, 2021

The results of #819 would be a MAF file, so I believe we would need all columns. Are you suggesting that, instead of using the database, I filter the merged file from each caller for hotspots? I can give that a try as well; I think the issue was with reading the vardict merged file, but I can look into it.

@kgaonkar6
Collaborator Author

It seems reading the vardict file in R uses up all the RAM on my laptop, so I think I'll switch to Python to read and filter the files.
But since I already have the filtering script in R, I was also thinking of splitting the vardict merged file via split -l <line number> and then giving the chunks as input to the R script. Thoughts?

@jashapiro
Member

It seems reading the vardict file in R uses up all the RAM on my laptop, so I think I'll switch to Python to read and filter the files.
But since I already have the filtering script in R, I was also thinking of splitting the vardict merged file via split -l <line number> and then giving the chunks as input to the R script. Thoughts?

Yes, vardict is huge and this is a perennial problem. The reason for the DB was partly to alleviate this problem when we had to deal with the full data set. In this case, where we need only a small subset and have some simple things to filter by, I would expect that grep (or awk, if you want to get fancy) before reading in the file would be the most efficient approach.
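
As a rough sketch of the grep-before-read idea (not the project's actual code): the path and hotspot gene list below are hypothetical, and the pattern also matches the MAF header line so the column names survive the pre-filter.

```r
# Sketch only: pre-filter the merged MAF with grep, then read the much
# smaller result into R through a pipe connection.
library(readr)

maf_path <- "scratch/vardict_merged.maf.tsv"   # hypothetical path
pattern  <- "^Hugo_Symbol|TP53|BRAF"           # header line plus hotspot genes

vardict_subset <- read_tsv(pipe(sprintf("grep -E '%s' '%s'", pattern, maf_path)))
```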

There is a function in readr called read_csv_chunked() that could also make the read more efficient if you want to stay in R. You could write your filter function and pass it as the callback argument to filter each chunk of lines as it is read. But my memory is that this isn't the easiest thing to set up.
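
If staying in R, a minimal sketch of that chunked approach (hypothetical path, gene list, and chunk size; since MAF files are tab-delimited, this uses readr's read_tsv_chunked() variant):

```r
# Sketch only: read the merged MAF in chunks and keep the hotspot rows from
# each chunk, so the full file never has to sit in memory at once.
library(readr)
library(dplyr)

hotspot_genes <- c("TP53", "BRAF")   # hypothetical hotspot gene list

keep_hotspots <- DataFrameCallback$new(function(chunk, pos) {
  filter(chunk, Hugo_Symbol %in% hotspot_genes)
})

vardict_hotspots <- read_tsv_chunked(
  "scratch/vardict_merged.maf.tsv",  # hypothetical path
  callback = keep_hotspots,
  chunk_size = 1e5
)
```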

@kgaonkar6
Collaborator Author

Thanks @jashapiro. I did think about grepping for the genes in the terminal and then filtering for Amino_Acid_Position and genomic regions, but one of the input files is in Excel format (from the source linked here) with two sheets 😅

I ended up splitting the vardict file in #956 and filtering for hotspots, since I'd already run the other callers through that script. Hmm, maybe I should just save the hotspot database calls as a TSV file in the input folder so that I could grep it; I can try that out.

@kgaonkar6
Collaborator Author

Please close if #954 seems more reasonable for gathering the hotspots. Thanks!

@jaclyn-taroni
Member

Closing in favor of #956

@kgaonkar6 kgaonkar6 deleted the snv-rerun branch May 13, 2021 17:17