
Snv database re-run to include all maf columns #954

Closed
wants to merge 2 commits

Conversation

kgaonkar6
Collaborator

Purpose/implementation Section

What scientific question is your analysis addressing?

Can we update the SQL database creation step so that all MAF tables contain all MAF columns? We will mainly need this when we want to scavenge back hotspots called by one or more callers in MAF format.

What was your approach?

I added mutect2 and vardict to the list of table names that should contain all MAF columns.

What GitHub issue does your pull request address?

#819

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Adding all columns of the MAF files creates a huge ~50 GB database file. Is there some way to reduce the file size, or to save the database elsewhere for easy access?

Is there anything that you want to discuss further?

NA

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

This is just a code update; the SQL file gets saved into the scratch folder, which is ignored in this repo.

Results

What types of results are included (e.g., table, figure)?

NA

What is your summary of the results?

NA

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@kgaonkar6 kgaonkar6 changed the title from "Snv rerun" to "Snv database re-run to include all maf columns" on Mar 11, 2021
@jashapiro
Member

Adding all columns of the MAF files creates a huge ~50 GB database file. Is there some way to reduce the file size, or to save the database elsewhere for easy access?

This is the main reason this was not done in the first place. I am hesitant to add in all of the columns like this for a limited use case of searching for specific hotspots. That would likely be more efficiently accomplished by creating a subset of the original maf file for sites that match known hotspot locations.
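
For illustration, here is a minimal sketch of that subsetting idea in R. The object names (maf_df for the original MAF read into a data frame, hotspots_df for a table of known hotspot locations) and the join columns are hypothetical, not the module's actual code.

```r
# Sketch only: keep the MAF rows whose gene and genomic position match a
# known hotspot location. maf_df and hotspots_df are hypothetical objects.
library(dplyr)

maf_hotspots <- maf_df %>%
  semi_join(hotspots_df, by = c("Hugo_Symbol", "Chromosome", "Start_Position"))
```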

Alternatively, if there are specific columns you need from all tables, you might add those to needed_types. The only reason we kept all of them was so that we could write out a "complete" maf file for the consensus. As I understand the analysis goals for this project, I don't think you need that?

@kgaonkar6
Collaborator Author

kgaonkar6 commented Mar 11, 2021

The results of #819 would be a MAF file, so I believe we would need all columns. Are you suggesting that, instead of using the database, I filter the merged file from each caller for hotspots? I can give that a try as well; I think the issue was with reading the vardict merged file, but I can look into it.

@kgaonkar6
Collaborator Author

It seems reading the vardict file in R uses up all the RAM on my laptop, so I think I'll switch to Python to read and filter the files.
But since I already have the filtering script in R, I was also thinking of splitting the vardict merged file via split -l <line number> and then giving the chunks as input to the R script. Thoughts?

@jashapiro
Member

It seems reading the vardict file in R uses up all the RAM on my laptop, so I think I'll switch to Python to read and filter the files.
But since I already have the filtering script in R, I was also thinking of splitting the vardict merged file via split -l <line number> and then giving the chunks as input to the R script. Thoughts?

Yes, vardict is huge and this is a perennial problem. The reason for the DB was partly to alleviate this problem when we had to deal with the full data set. In this case, where we need only a small subset and have some simple things to filter by, I would expect that grep (or awk, if you want to get fancy) before reading in the file would be the most efficient approach.
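
As a rough sketch of the grep-before-read idea (not the project's actual code): the path and hotspot gene list below are hypothetical, and the pattern also matches the MAF header line so the column names survive the pre-filter.

```r
# Sketch only: pre-filter the merged MAF with grep, then read the much
# smaller result into R through a pipe connection.
library(readr)

maf_path <- "scratch/vardict_merged.maf.tsv"   # hypothetical path
pattern  <- "^Hugo_Symbol|TP53|BRAF"           # header line plus hotspot genes

vardict_subset <- read_tsv(pipe(sprintf("grep -E '%s' '%s'", pattern, maf_path)))
```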

There is a function in readr called read_csv_chunked() that could also make the read more efficient if you want to stay in R. You could write your filter function and pass it as the callback argument to filter each chunk of lines as it is read. But my memory is that this isn't the easiest thing to set up.
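
If staying in R, a minimal sketch of that chunked approach (hypothetical path, gene list, and chunk size; since MAF files are tab-delimited, this uses readr's read_tsv_chunked() variant):

```r
# Sketch only: read the merged MAF in chunks and keep the hotspot rows from
# each chunk, so the full file never has to sit in memory at once.
library(readr)
library(dplyr)

hotspot_genes <- c("TP53", "BRAF")   # hypothetical hotspot gene list

keep_hotspots <- DataFrameCallback$new(function(chunk, pos) {
  filter(chunk, Hugo_Symbol %in% hotspot_genes)
})

vardict_hotspots <- read_tsv_chunked(
  "scratch/vardict_merged.maf.tsv",  # hypothetical path
  callback = keep_hotspots,
  chunk_size = 1e5
)
```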

@kgaonkar6
Collaborator Author

Thanks @jashapiro. I did think about grepping for the genes in the terminal and then filtering for Amino_Acid_Position and genomic regions, but one of the input files is in Excel format (from the source linked here) with two sheets 😅

I ended up splitting the vardict file in #956 and filtering for hotspots, since I'd already run the other callers through that script. Hmm, maybe I should just save the hotspot database calls as a TSV file in the input folder so that I could grep it; I can try that out.

@kgaonkar6
Collaborator Author

Please close if #954 seems more reasonable for gathering the hotspots. Thanks!

@jaclyn-taroni
Member

Closing in favor of #956

@kgaonkar6 kgaonkar6 deleted the snv-rerun branch May 13, 2021 17:17