feat(susie_finemapper): ensure proper output paths #48

Merged: 7 commits, Oct 18, 2024

project-defiant
Collaborator

@project-defiant project-defiant commented Oct 17, 2024

Context

This PR resolves two issues with the output paths produced by finemapping of the ukb_ppp_eur data source.

❯ gsutil ls 'gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5*'
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c.log

gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/:
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/_SUCCESS
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/part-00000-e1d26427-69b6-473b-a6fb-d5c470f28280-c000.snappy.parquet
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/part-00003-e1d26427-69b6-473b-a6fb-d5c470f28280-c000.snappy.parquet

Finemapping results from ukb_ppp_eur placed the log file and the finemapped loci parquet output under the same directory, which could cause trouble when reading the loci files.

The output path of the finemapping results contained a studyLocusId=XXX segment, which Spark interpreted as a partition directory, as if the dataset were partitioned by the studyLocusId column. In reality it was not partitioned at all.
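To illustrate why a `studyLocusId=XXX` path segment reads as a partition, here is a minimal sketch of Hive-style `column=value` parsing. It is a simplified stand-in and not Spark's actual partition discovery code; the function name `infer_partition_columns` is hypothetical.

```python
def infer_partition_columns(path: str) -> dict[str, str]:
    """Collect Hive-style `column=value` segments from a path.

    Simplified illustration only: real partition discovery in Spark
    applies further rules (base path, type inference) not modelled here.
    """
    partitions: dict[str, str] = {}
    for segment in path.split("/"):
        name, sep, value = segment.partition("=")
        # A segment counts as a partition only if it looks like name=value.
        if sep and name and value:
            partitions[name] = value
    return partitions


old_style = (
    "gs://ukb_ppp_eur_data/credible_set_datasets/susie/"
    "studyLocusId=f6c6edb08ed02502869faea2d46e7c5c"
)
# The `studyLocusId=` prefix makes the directory look like a partition.
print(infer_partition_columns(old_style))

# Without the prefix, no partition column is inferred from the path.
new_style = (
    "gs://ukb_ppp_eur_data/credible_set_datasets/susie/"
    "f6c6edb08ed02502869faea2d46e7c5c"
)
print(infer_partition_columns(new_style))
```

Under this reading, dropping the `studyLocusId=` prefix is enough for the directory name to stop being treated as a partition value.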

See opentargets/gentropy#859 for more information.

This PR attempts to:

  • Change the output path of the finemapping loci so that it is no longer evaluated as a partition, by removing the studyLocusId= prefix from the path.
  • Introduce the log_path parameter, which is used as the base path for the finemapping logs.
  • Remove the task that moved logs after the finemapping batch job.
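The path changes listed above could be sketched roughly as follows. This is an assumption-laden illustration, not the code from this PR: the helper `build_finemapping_paths`, its `output_path` argument, and the exact file layout are hypothetical, and only the `log_path` parameter name comes from the description.

```python
def build_finemapping_paths(
    output_path: str, log_path: str, study_locus_id: str
) -> tuple[str, str]:
    """Return (loci_path, log_file) for one study locus.

    Hypothetical layout: loci go under `output_path` without a
    `studyLocusId=` prefix, and logs go under a separate `log_path` base.
    """
    loci_path = f"{output_path.rstrip('/')}/{study_locus_id}"
    log_file = f"{log_path.rstrip('/')}/{study_locus_id}.log"
    return loci_path, log_file


# Example bucket layout; the `finemapping_logs` directory is made up.
loci_path, log_file = build_finemapping_paths(
    output_path="gs://ukb_ppp_eur_data/credible_set_datasets/susie",
    log_path="gs://ukb_ppp_eur_data/finemapping_logs",
    study_locus_id="f6c6edb08ed02502869faea2d46e7c5c",
)
print(loci_path)
print(log_file)
```

Keeping the logs under their own base path means the loci directory contains only parquet output, so no follow-up task is needed to move log files out of the way.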

@project-defiant project-defiant marked this pull request as ready for review October 18, 2024 09:58
@project-defiant project-defiant requested review from d0choa and addramir and removed request for d0choa October 18, 2024 10:07
@project-defiant project-defiant merged commit e4e7997 into dev Oct 18, 2024
2 checks passed
@project-defiant project-defiant deleted the susie-log-file branch October 18, 2024 14:03