CNA long format (RFC 65) #9847

Merged on Nov 17, 2022 (3 commits)


@BasLee BasLee commented Oct 12, 2022

Add long data format for CNA as described in RFC 65.

Looking for feedback on the new importer and functionality that should be shared between the new ImportCnaDiscreteLongData and ImportTabDelimData.

[Image: Sample of new data format]

In the current CNA data format the events of each gene x sample are stored in a tabular format, with a column for each sample and a row for each gene.

The new CNA 'long' data format contains a row for each gene x sample.
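To illustrate the difference between the two layouts, here is a minimal Python sketch that reshapes a small wide table into one row per gene x sample. The sample IDs, the values, and the long-format column names `Sample_Id` and `Value` are assumptions made for this example; RFC 65 is the authority on the actual columns.

```python
import csv
import io

# Hypothetical wide-format input: one row per gene, one column per sample.
WIDE = (
    "Hugo_Symbol\tEntrez_Gene_Id\tSAMPLE_A\tSAMPLE_B\n"
    "TP53\t7157\t-2\t0\n"
    "MYC\t4609\t2\t1\n"
)

# Columns that identify the gene; every other column is treated as a sample.
GENE_COLUMNS = ("Hugo_Symbol", "Entrez_Gene_Id")


def wide_to_long(wide_text):
    """Return one dict per gene x sample pair from a tab-delimited wide table."""
    reader = csv.DictReader(io.StringIO(wide_text), delimiter="\t")
    sample_ids = [c for c in reader.fieldnames if c not in GENE_COLUMNS]
    rows = []
    for record in reader:
        for sample_id in sample_ids:
            rows.append({
                "Hugo_Symbol": record["Hugo_Symbol"],
                "Entrez_Gene_Id": record["Entrez_Gene_Id"],
                "Sample_Id": sample_id,       # assumed long-format column name
                "Value": record[sample_id],   # assumed long-format column name
            })
    return rows


long_rows = wide_to_long(WIDE)
for row in long_rows:
    print("\t".join(row.values()))
```

Two genes and two samples give four long rows, so the long file grows with genes x samples while the wide file grows one column per sample.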

The new data format is imported by a new importer; the old data format is kept as is. Shared functionality should be extracted as much as possible.

Changes

  • Add new importer ImportCnaDiscreteLongData for CNA long format
  • Extract shared functionality of ImportTabDelimData and ImportCnaDiscreteLongData into CnaUtil

[Image: Database schema change]

Tests

  • Added new tests for ImportCnaDiscreteLongData in TestImportCnaDiscreteLongData

@BasLee BasLee changed the title Custom cna namespace Custom cna namespace (RFC 66) Oct 12, 2022
@inodb inodb changed the title Custom cna namespace (RFC 66) Custom cna namespace + CNA longform (RFC 66) Oct 12, 2022
@BasLee BasLee changed the title Custom cna namespace + CNA longform (RFC 66) Custom cna namespace + CNA longform (RFC 65) Oct 12, 2022
@BasLee BasLee force-pushed the custom-cna-namespace branch 8 times, most recently from 77adf0c to 9660ad2 Compare October 13, 2022 10:07
@BasLee BasLee changed the title Custom cna namespace + CNA longform (RFC 65) CNA long format (RFC 65) Oct 13, 2022
Member

@Luke-Sikina Luke-Sikina left a comment

Looks pretty good. Lots of small changes :)

Member

@dippindots dippindots left a comment

Thanks @BasLee! Just a few minor comments about this pr, it looks good to me.

@dippindots
Member

There is one more suggestion about this pull request, since this pr introduces a new data format, we should also add a new method to validate this new data format in validateData.py

@BasLee
Author

BasLee commented Oct 20, 2022

> There is one more suggestion about this pull request, since this pr introduces a new data format, we should also add a new method to validate this new data format in validateData.py

Good point! The validator should be updated. I discussed it with @pvannierop @oplantalech and we would like to do this in a separate PR.

@pvannierop
Contributor

pvannierop commented Oct 27, 2022

@averyniceday Yes, we restrict a study to have only one format. I do not see how the recognition of the new format will differ from any other data types. The meta file for discrete CNA will reference the format used (long or legacy wide). Or am I overlooking a complication here?

@averyniceday
Contributor

@pvannierop This might be specific to how we import on our end and how the importer is implemented (in conjunction with how we organize datatypes in Google Sheets). We currently have a one-to-one mapping of meta to data files (e.g. meta_cna to data_cna). We also have a step inside the importer that merges records for genes (I imagine this is implemented specifically for the legacy format).

If we are not planning on introducing a new meta/data pairing for the long format (e.g. data_cna_long/meta_cna_long) and instead only differentiating by a field inside the metafile, the importer will need to be updated to be able to differentiate between the two based on the metafile contents.
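A minimal sketch of what differentiating on the meta file contents could look like, assuming the meta file is a list of `key: value` lines with a `datatype` field set to either `DISCRETE` or `DISCRETE_LONG`. The importer class names come from this PR; the file name `data_cna_long.txt`, the helper functions, and the dispatch table are hypothetical and are not the actual importer code.

```python
# Hypothetical meta file contents for a long-format discrete CNA file.
META_LONG = """\
genetic_alteration_type: COPY_NUMBER_ALTERATION
datatype: DISCRETE_LONG
data_filename: data_cna_long.txt
"""

# Importer names are taken from this PR; the mapping itself is illustrative.
IMPORTER_BY_DATATYPE = {
    "DISCRETE": "ImportTabDelimData",
    "DISCRETE_LONG": "ImportCnaDiscreteLongData",
}


def parse_meta(text):
    """Parse 'key: value' lines into a dict, skipping blanks and comments."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta


def choose_importer(meta):
    """Pick the importer name from the meta file's datatype field."""
    datatype = meta.get("datatype")
    if datatype not in IMPORTER_BY_DATATYPE:
        raise ValueError(f"unsupported CNA datatype: {datatype!r}")
    return IMPORTER_BY_DATATYPE[datatype]


print(choose_importer(parse_meta(META_LONG)))
```

With this approach the meta/data file pairing stays one-to-one, and only the `datatype` value decides which code path reads the data file.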

@pvannierop
Contributor

@averyniceday The current wide format is still supported by this change (there is 100% backward compatibility). Therefore, I fail to see how this PR could possibly interfere with your current processes. This PR merely adds a format. If you internally decide not to use this long format for the time being you should be fine.

And on a note of process, I think it is not really proper procedure to have the merging of PRs into the cBioPortal codebase depend on updates to internal tooling.

@averyniceday
Contributor

@pvannierop @BasLee Agreed, this doesn't need to hold up merging into the cbioportal codebase. 👍

Could you also add documentation for the new format in the File Formats section?

@pvannierop
Contributor

@averyniceday The documentation will be added for sure! We will do this in the PR that updates the python validator script.

@BasLee BasLee force-pushed the custom-cna-namespace branch 2 times, most recently from e78f1b4 to b90f830 Compare November 3, 2022 08:57
@dippindots
Member

@BasLee CNA long format validator merged, please feel free to rebase this pr.

@BasLee BasLee force-pushed the custom-cna-namespace branch 3 times, most recently from 773bff4 to 3a09063 Compare November 16, 2022 10:31
@BasLee
Author

BasLee commented Nov 16, 2022

> @BasLee CNA long format validator merged, please feel free to rebase this pr.

Rebased and fixed some validation issues, most (relevant?) checks seem to run now

@BasLee
Author

BasLee commented Nov 17, 2022

After some discussion with @pvannierop:

The new DISCRETE_LONG format should probably not be passed to the front end, because the old 'wide' and the new 'long' CNA format should both result in exactly the same DISCRETE data after the import.

Only the importer should know about the DISCRETE_LONG format, and the importer should convert it to DISCRETE on import. All other CNA logic can keep using the existing DISCRETE format.

These changes can be found in 9c92524
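The idea can be sketched in Python as follows. This is an illustration under assumptions, not the code in that commit: the function names, the in-memory representation (a dict keyed by gene and sample), and the sample data are all hypothetical.

```python
def events_from_wide(header, rows):
    """header: ['Hugo_Symbol', sample ids...]; each row: [gene, value, ...]."""
    sample_ids = header[1:]
    return {
        (row[0], sample_id): int(value)
        for row in rows
        for sample_id, value in zip(sample_ids, row[1:])
    }


def events_from_long(rows):
    """rows: (gene, sample_id, value) triples, one per gene x sample."""
    return {(gene, sample_id): int(value) for gene, sample_id, value in rows}


def stored_datatype(file_datatype):
    """Only the importer sees DISCRETE_LONG; downstream code sees DISCRETE."""
    return "DISCRETE" if file_datatype == "DISCRETE_LONG" else file_datatype


wide_events = events_from_wide(
    ["Hugo_Symbol", "SAMPLE_A", "SAMPLE_B"],
    [["TP53", "-2", "0"], ["MYC", "2", "1"]],
)
long_events = events_from_long([
    ("TP53", "SAMPLE_A", "-2"), ("TP53", "SAMPLE_B", "0"),
    ("MYC", "SAMPLE_A", "2"), ("MYC", "SAMPLE_B", "1"),
])

# Both layouts describe the same events, so a single stored datatype suffices.
assert wide_events == long_events
assert stored_datatype("DISCRETE_LONG") == "DISCRETE"
```

Because both readers produce the same events, nothing after the import step needs to know which file layout was used.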

@sonarcloud

sonarcloud bot commented Nov 17, 2022

SonarCloud Quality Gate failed.

Bugs: 3 (rating E)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 24 (rating A)

Coverage: 0.0%
Duplication: 0.0%

5 participants