Snv database re-run to include all maf columns #954
This is the main reason this was not done in the first place. I am hesitant to add in all of the columns like this for a limited use case of searching for specific hotspots. That would likely be more efficiently accomplished by creating a subset of the original maf file for sites that match known hotspot locations. Alternatively, if there are specific columns you need from all tables, you might add those to
The results of #819 would be a maf file, so I believe we would need all columns. Are you suggesting that, instead of using the database, I filter the merged file from each caller for hotspots? I can give that a try as well. I think the issue was with reading the vardict merged file; I can look into it.
It seems reading vardict in R is using up all the RAM on my laptop, so I think I'll switch to python to read and filter the files.
Yes, vardict is huge and this is a perennial problem. The reason for the DB was partly to alleviate this problem when we had to deal with the full data set. In this case, when we need only a small subset and have some simple things to filter by, I would expect that There is a function in
Thanks @jashapiro. I did think about grepping for the genes in the terminal and then filtering for Amino_Acid_Position and genomic regions, but one of the input files is in Excel format (from the source here) with 2 sheets 😅. I ended up splitting the vardict file in #956 and filtering for hotspots, since I'd already run the other callers through that script. Hmm, maybe I should just save the hotspot database calls as a tsv file in the input folder so that I could grep; I can try it out.
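The streaming approach discussed above (filter the large merged vardict file without loading it all into memory) could look roughly like the sketch below. It is only an illustration, not the code from #956: the function name `filter_maf_hotspots` is hypothetical, the column name `Hugo_Symbol` is assumed to be present in the maf header, and the rows are toy values.

```python
import csv
import io


def filter_maf_hotspots(maf_lines, hotspot_genes, gene_col="Hugo_Symbol"):
    """Yield maf rows (as dicts) whose gene symbol is in hotspot_genes.

    Rows are read one at a time, so the full file is never held in memory.
    The gene_col default is an assumption about the maf header.
    """
    # Skip comment lines such as "#version 2.4" that can precede the header.
    data_lines = (line for line in maf_lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row[gene_col] in hotspot_genes:
            yield row


# Toy input standing in for a (much larger) merged caller file.
example = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\n"
    "BRAF\tchr7\t140753336\n"
    "TTN\tchr2\t178500000\n"
    "TERT\tchr5\t1295113\n"
)
hits = list(filter_maf_hotspots(example, {"BRAF", "TERT"}))
print([row["Hugo_Symbol"] for row in hits])
```

For a real file, the same generator can be fed an open file handle (`open(path, newline="")`), and a second condition on position or amino acid change can be added inside the loop once the hotspot list is available as a tsv.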
Please close if #954 seems more reasonable for gathering the hotspots. Thanks!
Closing in favor of #956
Purpose/implementation Section
What scientific question is your analysis addressing?
Can we update the sql database creation step so that all maf tables contain all maf columns? We will mainly need this when we want to scavenge back hotspots called by 1 or more callers, in maf format.
What was your approach?
I added mutect2 and vardict to the list of table names that should contain all maf columns.
What GitHub issue does your pull request address?
#819
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Adding all columns of a maf file creates a huge 50 GB file. Maybe there is some way to reduce the file size or save the database elsewhere for easy access?
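One way to keep the database small is to store only the columns that are actually needed for each table. The sketch below is a hedged illustration using Python's built-in `sqlite3` module, not the repository's database creation script: the helper name `load_maf_subset`, the choice of columns, and the toy rows are all assumptions.

```python
import csv
import io
import sqlite3


def load_maf_subset(conn, table, maf_lines, keep_cols):
    """Load only keep_cols from a tab-separated maf stream into a SQLite table."""
    # Skip comment lines such as "#version 2.4" that can precede the header.
    data_lines = (line for line in maf_lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    col_defs = ", ".join(f'"{col}" TEXT' for col in keep_cols)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in keep_cols)
    # executemany consumes the generator row by row, so memory use stays flat.
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        ([row[col] for col in keep_cols] for row in reader),
    )
    conn.commit()


# Toy input; a real maf file has many more columns than shown here.
example = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tHGVSp_Short\n"
    "BRAF\tchr7\t140753336\tp.V600E\n"
    "TTN\tchr2\t178500000\tp.A1B\n"
)
conn = sqlite3.connect(":memory:")
load_maf_subset(conn, "vardict", example, ["Hugo_Symbol", "Start_Position"])
row_count = conn.execute('SELECT COUNT(*) FROM "vardict"').fetchone()[0]
stored_cols = [info[1] for info in conn.execute('PRAGMA table_info("vardict")')]
print(row_count, stored_cols)
```

All values are stored as text here for simplicity; typed columns and indexes on gene or position would be a natural next step if the table is queried often.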
Is there anything that you want to discuss further?
NA
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
This is just a code update; the sql file gets saved into the scratch folder, which is ignored in this repo.
Results
What types of results are included (e.g., table, figure)?
NA
What is your summary of the results?
NA
Reproducibility Checklist
Documentation Checklist
README and it is up to date.
analyses/README.md and the entry is up to date.