Remove pudl_etl and ferc_to_sqlite commands in favor of dagster job execute #3161

Open
4 tasks
Tracked by #3956
bendnorman opened this issue Dec 15, 2023 · 1 comment
Labels
cli Scripts and other command line interfaces to PUDL. dagster Issues related to our use of the Dagster orchestrator

Comments

@bendnorman
Member

bendnorman commented Dec 15, 2023

Currently, the pudl_etl and ferc_to_sqlite CLI commands use the dagster.build_reconstructable_job method to execute multi-process dagster jobs. build_reconstructable_job is an experimental method and is kind of confusing. We can likely replace our pudl_etl and ferc_to_sqlite CLI command code entirely by creating preconfigured jobs and executing them with the dagster CLI:

dagster job execute <name of job>

Jobs will likely be the ones we currently have plus a nightly_build_etl_full and a nightly_build_ferc_to_sqlite_full job. If we move to this, we'll need to define the preconfigured jobs in Python. We mostly do this now, with the exception of the args people pass to the pudl_etl and ferc_to_sqlite CLI commands.
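To make the idea concrete, here is a rough sketch of what a thin wrapper around `dagster job execute` could look like if we still wanted a Python entry point. The `-m` (module), `-j` (job) and `-c` (config file) flags are the dagster CLI's own; the helper name `build_job_execute_command` and the default module `pudl.etl` are assumptions made for illustration, not existing PUDL code.

```python
import subprocess


def build_job_execute_command(job_name, module="pudl.etl", config_files=()):
    """Build the argv list for `dagster job execute`.

    `module` is assumed to be the importable module that holds the
    dagster definitions; `config_files` are optional YAML run-config files.
    """
    cmd = ["dagster", "job", "execute", "-m", module, "-j", job_name]
    for path in config_files:
        # Each extra run-config file is passed with its own -c flag.
        cmd += ["-c", str(path)]
    return cmd


def run_job(job_name, **kwargs):
    """Run the job in a subprocess and raise if dagster exits non-zero."""
    subprocess.run(build_job_execute_command(job_name, **kwargs), check=True)
```

With something like this, `run_job("etl_fast")` would be equivalent to typing `dagster job execute -m pudl.etl -j etl_fast` in a shell, so the wrapper could be dropped entirely once people are comfortable with the dagster CLI.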

How can we incorporate pudl_etl arguments into the dagster configuration system? The args that aren't covered right now are loglevel and logfile. The same goes for ferc_to_sqlite, with the addition of the dataset_only arg.
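One possible direction, sketched below, is to translate the logging args into a run-config fragment. This assumes the jobs use dagster's built-in console logger, which is configured under `loggers.console.config.log_level`. The `file` logger entry and its `path` option are hypothetical: dagster has no built-in file logger under that key, so we would have to define a custom logger for it. The function name `logging_run_config` is also just made up for this sketch.

```python
def logging_run_config(loglevel="INFO", logfile=None):
    """Turn pudl_etl-style logging args into a dagster run-config fragment."""
    level = loglevel.upper()
    config = {"loggers": {"console": {"config": {"log_level": level}}}}
    if logfile is not None:
        # HYPOTHETICAL: "file" would have to be a custom logger that we
        # define and register ourselves; it is not part of dagster.
        config["loggers"]["file"] = {
            "config": {"log_level": level, "path": str(logfile)}
        }
    return config
```

The resulting dict could be merged into a job's config in Python, or dumped to YAML and handed to `dagster job execute` with `-c`.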

How do we want to generate the configurations? 90% of our config is generated in pudl.etl.__init__.py via a few strategies:

  1. Loading default configuration of dagster resources

    define_asset_job(
        name="etl_full",
        description="This job executes all years of all assets.",
        config=default_config,
    ),

  2. Using default configuration + asset selection

    define_asset_job(
        name="etl_full_no_cems",
        selection=create_non_cems_selection(default_assets),
        description="This job executes all years of all assets except the "
        "core_epacems__hourly_emissions asset and all assets downstream.",
    ),

  3. Loading configuration from a yaml file

    define_asset_job(
        name="etl_fast",
        config=default_config
        | {
            "resources": {
                "dataset_settings": {
                    "config": load_dataset_settings_from_file("etl_fast")
                }
            }
        },
        description="This job executes the most recent year of each asset.",
    ),
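One thing to keep in mind with the third strategy: the dict `|` operator only merges top-level keys, so if default_config ever contains its own "resources" entry, the override replaces it wholesale. The sketch below shows the difference using a recursive merge. `deep_merge` is a hypothetical helper, and the config values are placeholders rather than our real config.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Recursively merge `override` into a copy of `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Placeholder config, only meant to show the merge behavior.
example_default = {
    "execution": {"config": {"multiprocess": {}}},
    "resources": {"datastore": {"config": {"use_local_cache": True}}},
}
example_override = {
    "resources": {"dataset_settings": {"config": {"profile": "etl_fast"}}}
}

shallow = example_default | example_override  # drops "datastore"
deep = deep_merge(example_default, example_override)  # keeps both resources
```

If default_config never defines "resources", the plain `|` is fine as it stands; the helper only matters once two strategies touch the same top-level key.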

We also have a default_config dictionary that should be shared by all jobs:

default_tag_concurrency_limits = [
    {
        "key": "memory-use",
        "value": "high",
        "limit": 4,
    },
]
default_config = pudl.helpers.get_dagster_execution_config(
    tag_concurrency_limits=default_tag_concurrency_limits
)
default_config |= pudl.analysis.ml_tools.get_ml_models_config()
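For readers who haven't looked inside the helper, the sketch below shows roughly what shape of run config those limits end up in, assuming dagster's multiprocess executor schema (`execution.config.multiprocess.tag_concurrency_limits`). `sketch_execution_config` is a stand-in written for this comment and not the actual body of pudl.helpers.get_dagster_execution_config, which may set more options.

```python
def sketch_execution_config(tag_concurrency_limits, max_concurrent=None):
    """Build a run-config fragment for dagster's multiprocess executor."""
    multiprocess = {"tag_concurrency_limits": list(tag_concurrency_limits)}
    if max_concurrent is not None:
        # Only set an explicit cap on worker processes when one is requested.
        multiprocess["max_concurrent"] = max_concurrent
    return {"execution": {"config": {"multiprocess": multiprocess}}}


example_limits = [{"key": "memory-use", "value": "high", "limit": 4}]
example_config = sketch_execution_config(example_limits)
```

Read this way, the limit above means at most four steps tagged `memory-use: high` run at the same time, regardless of how many worker processes are available.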

@bendnorman bendnorman converted this from a draft issue Dec 15, 2023
@zaneselvans zaneselvans added cli Scripts and other command line interfaces to PUDL. dagster Issues related to our use of the Dagster orchestrator labels Jul 3, 2024
@jdangerx
Member

This would be nice! What's the difference between the nightly build versions of the jobs & the current etl_full and ferc_to_sqlite_full jobs?
