
feat(dag): add data transfer task group for release process #528

Merged: 13 commits into dev from ds_3238_update_ETL_DAG, Mar 14, 2024

Conversation

@DSuveges (Contributor) commented Mar 7, 2024

✨ Context

Updates for the next data release, plus minor improvements in the Airflow layer. Scope defined under #3238.

🛠 What does this PR implement

  • Updated configurations for release 24.03.
  • Targets and interactions are treated as static assets (the Airflow service account has no access to the platform buckets).
  • The ETL DAG now transfers required datasets from the datasource-specific buckets (see the sketch after this list).
  • FinnGen SuSiE ingestion now outputs to the canonical location.
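
As an illustration of that transfer step, here is a minimal sketch using Airflow's stock GCS-to-GCS transfer operator. The source bucket name matches the GWAS_CATALOG_BUCKET_NAME constant quoted later in this review; the object paths and the release bucket name are placeholders, not the actual values in the DAG:

from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

# Copy one datasource's outputs into the release bucket before the ETL runs.
# Bucket and object names below are illustrative placeholders.
transfer_gwas_catalog = GCSToGCSOperator(
    task_id="transfer_gwas_catalog_data",
    source_bucket="gwas_catalog_data",
    source_object="credible_sets/*",
    destination_bucket="release_bucket",
    destination_object="releases/24.03/credible_sets/",
)

In the actual DAG, one such task per datasource would be collected into a task group that runs ahead of the ETL steps.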

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@DSuveges marked this pull request as ready for review on March 11, 2024 20:24
@ireneisdoomed (Contributor) left a comment:

Thank you!
Changes here mostly concern adding a group of tasks that place all necessary files in the release bucket prior to the ETL run. Therefore, I'd suggest changing the PR title to something like feat(dag): add data transfer task group for release process.
This is the ETL DAG now:
[screenshot of the updated ETL DAG]

I made some comments, let me know your thoughts. Overall, I like the implementation, and the idea of introducing the ShortCircuitOperator to safeguard ourselves from rerunning unnecessary processes is very nice.

@@ -36,9 +36,9 @@ anderson: ${datasets.static_assets}/andersson2014/enhancer_tss_associations.bed
 javierre: ${datasets.static_assets}/javierre_2016_preprocessed
 jung: ${datasets.static_assets}/jung2019_pchic_tableS3.csv
 thurman: ${datasets.static_assets}/thurman2012/genomewideCorrs_above0.7_promoterPlusMinus500kb_withGeneNames_32celltypeCategories.bed8.gz
-target_index: ${datasets.release_folder}/targets # OTP 23.12 data
+target_index: ${datasets.static_assets}/targets # OTP 23.12 data
Contributor:

Would it make sense to just point to our OTP release bucket?

(The same question was raised on the gene_interactions line in the same hunk:)

gene_interactions: ${datasets.static_assets}/interaction # OTP 23.12 data

@DSuveges (Contributor Author) commented Mar 13, 2024:

It might make sense; however, the service account can't read that bucket, and I don't see a reason to extend its permissions to other buckets. In a way, while the platform and genetics ETLs are separated, this solution is fine.


# Datasource paths:
GWAS_CATALOG_BUCKET_NAME = "gwas_catalog_data"
EQTL_BUCKET_NAME = "eqtl_catalog_data"
Contributor:

It doesn't concern this PR and it's incredibly minor, but it annoys me a bit that the source name is incorrect. Do you think we should rename it to eqtl_catalogue_data?

@DSuveges (Contributor Author):

Yes, totally makes sense. It must be that the US form was derived from the GWAS Catalog.

Collaborator:

Because of funding, as far as I understand.



# Test if release folder exists:
def test_release_folder_exists() -> bool:
Contributor:

This util might be useful as a sensor in other instances.
You could parametrise it and move it to common_airflow.py

from airflow.providers.google.cloud.hooks.gcs import GCSHook

def test_release_folder_exists(bucket: str, path: str) -> bool:
    """This function tests if the release folder exists.

    Args:
        bucket (str): Name of the bucket in GCS
        path (str): Object name in GCS to check

    Returns:
        bool: False if the folder exists, True otherwise.
    """
    hook = GCSHook(gcp_conn_id="google_cloud_default")
    return not hook.exists(bucket, path)

Minor, but the GCSHook connector is interesting

@DSuveges (Contributor Author):

Agree, it makes sense to generalise.

DSuveges and others added 2 commits March 12, 2024 17:05
Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
@DSuveges changed the title from chore: update for 24.03 data release to eat(dag): add data transfer task group for release process on Mar 13, 2024
@DSuveges changed the title from eat(dag): add data transfer task group for release process to feat(dag): add data transfer task group for release process on Mar 13, 2024
@ireneisdoomed (Contributor) left a comment:

Thank you for addressing the comments!

@@ -59,6 +60,24 @@
}


# Test if release folder exists:
Contributor:

Suggested change (delete this line):
-# Test if release folder exists:

@DSuveges (Contributor Author):

Unfortunately we are not there yet (that's why I haven't asked for a re-review). For some reason GCSHook.exists() stopped working; I'm trying to find out what the problem is. I'll keep you posted.

Contributor:

I see, thanks for testing it.

Comment on lines +104 to +106
python_callable=lambda bucket, path: not common.check_gcp_folder_exists(
bucket, path
),
@DSuveges (Contributor Author) commented Mar 14, 2024:

It might be controversial, but I'm up for defending this decision: I don't like writing separate functions for testing the presence and the absence of something. I prefer to test for something positive, because that's easier to understand, e.g. the tested folder is there -> returns True.

However, this implies that if we need to test for absence, we have to flip the boolean value. We can either flip the return value at the call site or make the function flexible enough to return either presence or absence; I don't like the latter solution. It means that when I use an Airflow operator with a Python callable, I need a lambda. It might not be the nicest, but I think it's the right choice here.
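
For context, the wiring around the snippet above could look roughly like this. The task id and the lambda come from the diff in this thread; the common_airflow import and the op_kwargs values are assumptions:

import common_airflow as common  # assumed module name, per the earlier review suggestion
from airflow.operators.python import ShortCircuitOperator

# Proceed with downstream tasks only if the release folder does NOT exist yet:
# check_gcp_folder_exists returns True when the folder is present, so the
# lambda negates it to short-circuit reruns over an existing release.
test_release_folder = ShortCircuitOperator(
    task_id="test_release_folder_exists",
    python_callable=lambda bucket, path: not common.check_gcp_folder_exists(
        bucket, path
    ),
    op_kwargs={"bucket": "release_bucket", "path": "releases/24.03"},  # placeholder values
)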

Contributor:

Not controversial to me. Functions are meant to be reused. Thanks!

@ireneisdoomed (Contributor):

Thank you for the changes! Was the hook failing, or did it simply stop working?

@DSuveges (Contributor Author):

Was the hook failing, or did it simply stop working?

Apparently GCSHook.exists() cannot be used to test whether a folder exists, as there is no such concept in Google Cloud Storage, as explained by this SO post. Instead I need to list the contents of the bucket and check whether there is any content under the release folder. Not as nice, but it kind of makes sense.
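
A sketch of that listing-based check, reusing the check_gcp_folder_exists name from the diff above; the body is an approximation of the approach described, not necessarily the merged code:

from airflow.providers.google.cloud.hooks.gcs import GCSHook

def check_gcp_folder_exists(bucket: str, path: str) -> bool:
    """Return True if any object exists under the given prefix.

    GCS has no real folder objects, so GCSHook.exists() cannot answer
    "does this folder exist?"; instead, list the bucket with the folder
    path as a prefix and check whether anything comes back.
    """
    hook = GCSHook(gcp_conn_id="google_cloud_default")
    return bool(hook.list(bucket, prefix=path))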

@DSuveges merged commit 4dcecc3 into dev on Mar 14, 2024
4 checks passed
@DSuveges deleted the ds_3238_update_ETL_DAG branch March 14, 2024 12:10