Add GTEx import resources #646

Merged
merged 22 commits into from
Jan 8, 2024

KoalaQin (Contributor) commented Nov 9, 2023

This PR adds an automated import function for GTEx data to the gnomAD resources. Instead of using a manually reheadered file, I downloaded the original expression and sample metadata files to get the tissue information while keeping the original GTEx sample IDs.
The only open question: where should we put these original files for importing?

@KoalaQin KoalaQin self-assigned this Nov 9, 2023
@KoalaQin KoalaQin requested a review from jkgoodrich November 9, 2023 22:42
@KoalaQin KoalaQin marked this pull request as draft November 9, 2023 22:49
@KoalaQin KoalaQin changed the title add GTEx import function Add GTEx import resources Nov 13, 2023
@KoalaQin KoalaQin marked this pull request as ready for review November 13, 2023 19:28
gnomad/resources/grch38/reference_data.py (outdated)
default_version="v7",
versions={
"v7": GnomadPublicTableResource(
path="gs://gnomad-public-requester-pays/resources/grch38/gtex_rsem/gtex_rsem_v7.mt",
Contributor:

Suggested change:
- path="gs://gnomad-public-requester-pays/resources/grch38/gtex_rsem/gtex_rsem_v7.mt",
+ path="gs://gnomad-public-requester-pays/resources/grch37/gtex/gtex_rsem_v7.mt",

gnomad/resources/grch37/reference_data.py (outdated)
.replace("\\(", "_")
.replace("\\)", "")
)
mt = mt.key_rows_by()
Contributor:

Would it work to just not use row_key="transcript_id" when importing?

KoalaQin (Contributor Author), Jan 8, 2024:

When it's a key, I can't remove the version number using annotate.

Contributor:

Yeah, but I think without row_key="transcript_id" in the import it won't already be keyed

Contributor Author:

You're right, I had row_key in my test code.
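
For readers following the thread, the "version number" is the suffix after the dot in a versioned transcript ID. Below is a minimal plain-Python sketch of that transformation only. It is not the code from this PR and it does not use Hail; the function name `strip_version` and the example ID are illustrative assumptions. In the PR itself, the equivalent step is applied to the `transcript_id` row field, which is why that field must not be a key when it is rewritten.

```python
def strip_version(transcript_id: str) -> str:
    """Drop a trailing ".<digits>" version suffix from a transcript ID.

    IDs without such a suffix are returned unchanged.
    """
    base, sep, version = transcript_id.rpartition(".")
    # Only strip when there is a dot and everything after it is numeric.
    if sep and version.isdigit():
        return base
    return transcript_id


# Illustrative ID, not taken from the PR.
print(strip_version("ENST00000456328.2"))
```
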

KoalaQin (Contributor Author) commented Jan 8, 2024

I also added a resource for CDS intervals. I was digging into Beryl's "CDS" BED file (located here as of now: gs://gnomad-public/papers/2019-tx-annotation/data/other_data/gencode.v19.CDS.Hail.021519.bed) and found that it's not accurate, for the following reasons:

  1. All the start positions are 1 bp smaller than in the Gencode GTF. This causes confusion because import_bed is left-inclusive but right-exclusive by default, while import_gtf is inclusive on both the left and the right by default.
  2. When merged back to the Gencode intervals, the intervals matched not only "CDS" but also other features:
    "CDS" 723784
    "Selenocysteine" 19
    "UTR" 54082
    "exon" 743326
    "gene" 162
    "start_codon" 84144
    "stop_codon" 76196
    "transcript" 171

The good news is that it includes all the CDS intervals, apart from the 1 bp left/right confusion. I think it's better to start directly from the Gencode GTF file.
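
The 1 bp offset described in point 1 is the usual difference between BED coordinates (0-based start, end-exclusive) and GTF coordinates (1-based, inclusive on both ends). The following is a minimal plain-Python sketch of that conversion, not code from this PR and not Hail's implementation; the `Interval` type, the function names, and the example coordinates are illustrative assumptions.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A 1-based interval that includes both start and end (GTF style)."""

    contig: str
    start: int
    end: int


def bed_to_closed(contig: str, bed_start: int, bed_end: int) -> Interval:
    """Convert a BED record (0-based start, exclusive end) to GTF style.

    The start shifts up by 1; the end is numerically unchanged because the
    exclusive 0-based end equals the inclusive 1-based end.
    """
    return Interval(contig, bed_start + 1, bed_end)


def closed_to_bed(interval: Interval) -> tuple:
    """Convert a 1-based inclusive interval back to BED coordinates."""
    return (interval.contig, interval.start - 1, interval.end)


# Illustrative coordinates only.
gtf_style = bed_to_closed("1", 69090, 70008)
print(gtf_style)
print(closed_to_bed(gtf_style))
```

Both representations cover the same number of bases: `bed_end - bed_start` in BED equals `end - start + 1` in the inclusive form, so comparing start positions across the two formats without converting shows exactly the 1 bp discrepancy reported above.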

@KoalaQin KoalaQin requested a review from jkgoodrich January 8, 2024 19:05
@KoalaQin KoalaQin requested a review from jkgoodrich January 8, 2024 20:06
@KoalaQin KoalaQin requested a review from jkgoodrich January 8, 2024 20:38
jkgoodrich (Contributor) left a comment:

I just have one comment, and then this is good to go. I think I would like to wait on adding the Gencode filter function until a PR where it is needed, so I can see it used.

gnomad/utils/filtering.py (outdated)
@KoalaQin KoalaQin requested a review from jkgoodrich January 8, 2024 21:55
@KoalaQin KoalaQin requested a review from jkgoodrich January 8, 2024 22:14
jkgoodrich (Contributor) left a comment:

LGTM!

@KoalaQin KoalaQin merged commit f4da260 into main Jan 8, 2024
3 checks passed
@jkgoodrich jkgoodrich deleted the qh/pext branch January 22, 2024 14:57