
Nightly Build Failure 2024-04-03 #3533

Closed
jdangerx opened this issue Apr 3, 2024 · 14 comments · Fixed by #3541
Labels
bug Things that are just plain broken. nightly-builds Anything having to do with nightly builds or continuous deployment.

Comments

@jdangerx
Member

jdangerx commented Apr 3, 2024

Overview

What do you think happened, if anything?

The _out_ferc714__hourly_demand_matrix job ran into an OOM kill.

From the logs we can see that it started at 07:10:43 UTC and got the kill at 07:11:46 UTC.

Here are the VM instance RAM metrics over the 07:00–07:20 timeframe. You can see that when the job started we were already using ~35 GB of RAM, and we quickly jumped to 60+ GB before getting killed at 07:12.

[chart: VM instance RAM usage, 07:00–07:20 UTC]

We also saw this same issue on 2024-04-02, but didn't make an issue for it.

Next steps

What next steps do we need to do to understand or remediate the issue?

We could maybe force this asset to run non-concurrently.

We could maybe reduce the RAM usage of this asset.

Verify that everything is fixed!

Once you've applied any necessary fixes, make sure that the nightly build outputs are all in their right places.
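If we go the non-concurrent route, Dagster's multiprocess executor can enforce it through run config. A minimal sketch, assuming we add a hypothetical `memory-use: high` op tag to the asset (neither the tag nor these numbers exist in PUDL today):

```
# Sketch only -- the "memory-use" tag is hypothetical and would need to be
# attached to the asset via op_tags before this limit has any effect.
execution:
  config:
    multiprocess:
      max_concurrent: 8          # overall step parallelism (assumed value)
      tag_concurrency_limits:
        - key: "memory-use"
          value: "high"
          limit: 1               # at most one high-memory step at a time
```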


Relevant logs

link to build logs from internal distribution bucket

2024-04-03 07:10:03 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 1779 - _out_ferc714__hourly_demand_matrix - STEP_WORKER_STARTING - Launching subprocess for "_out_ferc714__hourly_demand_matrix".
2024-04-03 07:10:40 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 23579 - _out_ferc714__hourly_demand_matrix - STEP_WORKER_STARTED - Executing step "_out_ferc714__hourly_demand_matrix" in subprocess.
2024-04-03 07:10:42 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 23579 - _out_ferc714__hourly_demand_matrix - RESOURCE_INIT_STARTED - Starting initialization of resources [io_manager].
2024-04-03 07:10:42 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 23579 - _out_ferc714__hourly_demand_matrix - RESOURCE_INIT_SUCCESS - Finished initialization of resources [io_manager].
2024-04-03 07:10:43 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 23579 - _out_ferc714__hourly_demand_matrix - STEP_START - Started execution of step "_out_ferc714__hourly_demand_matrix".
2024-04-03 07:11:46 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 1779 - _out_ferc714__hourly_demand_matrix - ENGINE_EVENT - Multiprocess executor: child process for step _out_ferc714__hourly_demand_matrix was terminated by signal 9 (SIGKILL). This usually indicates that the process was killed by the operating system due to running out of memory. Possible solutions include increasing the amount of memory available to the run, reducing the amount of memory used by the ops in the run, or configuring the executor to run fewer ops concurrently.

There are several other concurrent tasks running during these two minutes:

> ag --no-numbers '07:1[01].*STEP' 2024-04-03-0602-c28abb2be-main-pudl-etl.log | cut -d ' ' -f15 | sort | uniq 
_core_eia860__boilers
_core_eia860__ownership
_core_eia860__plants
_core_eia860__utilities
_core_eia923__boiler_fuel
_core_eia923__coalmine
_core_eia923__cooling_system_information
_core_eia923__fgd_operation_maintenance
_out_ferc1__yearly_steam_plants_sched402_with_plant_ids.ferc_to_ferc.merge_steam_fuel_dfs
_out_ferc714__hourly_demand_matrix
core_demand_side_management_eia861
core_eia861__yearly_balancing_authority
core_eia861__yearly_dynamic_pricing
extract_phmsagas
raw_eia930__balance
raw_eia930__interchange
raw_eia930__subregion
all the dagster steps:

```
> ag --no-numbers '07:1[01].*STEP' 2024-04-03-0602-c28abb2be-main-pudl-etl.log | cut -d ' ' -f1 -f2 -f15 -f17
2024-04-03 07:07:11 core_demand_side_management_eia861 STEP_WORKER_STARTED
2024-04-03 07:07:11 core_eia861__yearly_dynamic_pricing STEP_WORKER_STARTING
2024-04-03 07:10:00 extract_phmsagas STEP_WORKER_STARTING
2024-04-03 07:10:00 _core_eia860__ownership STEP_WORKER_STARTED
2024-04-03 07:10:02 _core_eia860__plants STEP_WORKER_STARTED
2024-04-03 07:10:03 _out_ferc714__hourly_demand_matrix STEP_WORKER_STARTING
2024-04-03 07:10:03 _core_eia860__ownership STEP_START
2024-04-03 07:10:03 _core_eia860__ownership STEP_INPUT
2024-04-03 07:10:04 _core_eia860__utilities STEP_WORKER_STARTED
2024-04-03 07:10:04 _core_eia860__plants STEP_START
2024-04-03 07:10:05 _core_eia860__plants STEP_INPUT
2024-04-03 07:10:06 _core_eia860__utilities STEP_START
2024-04-03 07:10:07 _core_eia860__utilities STEP_INPUT
2024-04-03 07:10:11 _core_eia860__utilities STEP_OUTPUT
2024-04-03 07:10:17 _out_ferc1__yearly_steam_plants_sched402_with_plant_ids.ferc_to_ferc.merge_steam_fuel_dfs STEP_WORKER_STARTING
2024-04-03 07:10:28 _core_eia860__plants STEP_OUTPUT
2024-04-03 07:10:29 _core_eia860__ownership STEP_OUTPUT
2024-04-03 07:10:29 _core_eia860__ownership STEP_SUCCESS
2024-04-03 07:10:30 extract_phmsagas STEP_WORKER_STARTED
2024-04-03 07:10:31 _core_eia860__plants STEP_SUCCESS
2024-04-03 07:10:36 raw_eia930__balance STEP_WORKER_STARTING
2024-04-03 07:10:37 raw_eia930__interchange STEP_WORKER_STARTING
2024-04-03 07:10:37 extract_phmsagas STEP_START
2024-04-03 07:10:40 _out_ferc714__hourly_demand_matrix STEP_WORKER_STARTED
2024-04-03 07:10:41 extract_phmsagas STEP_INPUT
2024-04-03 07:10:41 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:42 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:42 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:43 _core_eia860__boilers STEP_OUTPUT
2024-04-03 07:10:43 _out_ferc714__hourly_demand_matrix STEP_START
2024-04-03 07:10:43 _out_ferc714__hourly_demand_matrix STEP_INPUT
2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:44 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:44 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:44 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:47 raw_eia930__subregion STEP_WORKER_STARTING
2024-04-03 07:10:50 _out_ferc1__yearly_steam_plants_sched402_with_plant_ids.ferc_to_ferc.merge_steam_fuel_dfs STEP_WORKER_STARTED
2024-04-03 07:10:51 core_eia861__yearly_balancing_authority STEP_WORKER_STARTING
2024-04-03 07:10:57 raw_eia930__balance STEP_WORKER_STARTED
2024-04-03 07:10:57 raw_eia930__interchange STEP_WORKER_STARTED
2024-04-03 07:10:59 raw_eia930__balance STEP_START
2024-04-03 07:10:59 raw_eia930__interchange STEP_START
2024-04-03 07:11:00 _core_eia923__boiler_fuel STEP_WORKER_STARTING
2024-04-03 07:11:05 raw_eia930__subregion STEP_WORKER_STARTED
2024-04-03 07:11:08 raw_eia930__subregion STEP_START
2024-04-03 07:11:11 core_eia861__yearly_balancing_authority STEP_WORKER_STARTED
2024-04-03 07:11:18 core_eia861__yearly_balancing_authority STEP_START
2024-04-03 07:11:19 core_eia861__yearly_balancing_authority STEP_INPUT
2024-04-03 07:11:22 _core_eia923__boiler_fuel STEP_WORKER_STARTED
2024-04-03 07:11:23 _core_eia923__coalmine STEP_WORKER_STARTING
2024-04-03 07:11:24 _core_eia923__boiler_fuel STEP_START
2024-04-03 07:11:26 _core_eia923__boiler_fuel STEP_INPUT
2024-04-03 07:11:46 _out_ferc714__hourly_demand_matrix STEP_FAILURE
2024-04-03 07:11:46 _core_eia923__cooling_system_information STEP_WORKER_STARTING
2024-04-03 07:11:52 raw_eia930__interchange STEP_INPUT
2024-04-03 07:11:52 raw_eia930__interchange STEP_OUTPUT
2024-04-03 07:11:52 _core_eia923__boiler_fuel STEP_OUTPUT
2024-04-03 07:11:53 _core_eia923__coalmine STEP_WORKER_STARTED
2024-04-03 07:11:53 raw_eia930__balance STEP_INPUT
2024-04-03 07:11:53 raw_eia930__balance STEP_OUTPUT
2024-04-03 07:11:55 _core_eia923__coalmine STEP_START
2024-04-03 07:11:57 raw_eia930__subregion STEP_INPUT
2024-04-03 07:11:57 raw_eia930__subregion STEP_OUTPUT
2024-04-03 07:11:57 raw_eia930__subregion STEP_SUCCESS
2024-04-03 07:11:57 _core_eia923__fgd_operation_maintenance STEP_WORKER_STARTING
2024-04-03 07:11:58 _core_eia923__coalmine STEP_INPUT
```
@jdangerx jdangerx added bug Things that are just plain broken. nightly-builds Anything having to do with nightly builds or continuous deployment. labels Apr 3, 2024
@bendnorman
Member

If we're confident _out_ferc714__hourly_demand_matrix is the asset that is blowing up memory, we can limit concurrency when the asset is run. We do this for the core_epacems__hourly_emissions asset because using all CPUs locally destroys our computers.

Side note: I don't know if we're using all CPUs for processing EPA CEMS on the VM. Is this intentional? In gcp_pudl_etl.sh we're using the default number of workers for the --epacems-workers argument, which is 2.

@zaneselvans
Member

With 64 GB of memory in the nightly builds, we could definitely be using more cores on EPA CEMS, but we're approaching 400 assets, and it's not deterministic which ones will be running at the same time. Right now the FERC-714 imputations and the EPA CEMS are the long poles. I really wish there were some way to chunk the EPA CEMS into a smaller memory footprint. The state-year chunks we had before were very manageable and let us do a lot more parallelism.

I wish there were some way for Dagster to explicitly manage memory usage as well as CPU usage.

@zaneselvans
Member

Short-term, reducing the concurrency for the internally parallelized work of the FERC-714 asset seems like the simplest fix!

@jdangerx
Member Author

jdangerx commented Apr 3, 2024

It's definitely simple, though it's possible some other task is also taking up a bunch of RAM and this just happened to get killed both times. Worth a shot while we figure out a better strategy re: memory usage.

@zaneselvans
Member

I would be tempted to blame the extremely parallelized extraction of spreadsheets, which we've just added a bunch of additional work to, but that happens right at the beginning of the ETL, and this crash isn't happening for 5 hours!

@bendnorman
Member

Sounds good! Are there other suspects in the list of tasks that we were running when we got the OOM error?

@zaneselvans
Member

TBH I am also very confused how the ETL can take 5 hours to get to this point. With 16 CPUs and 64 GB of memory, it's 5 hours into doing something that takes my laptop with 10 CPUs and 32 GB of memory 2 hours tops.

@zaneselvans zaneselvans moved this from New to In progress in Catalyst Megaproject Apr 3, 2024
@zaneselvans
Member

The only other thing in there that I could imagine being unusually resource intensive is the FERC to FERC matching step, but I don't know if that particular merge is a big memory hog.

@jdangerx
Member Author

jdangerx commented Apr 3, 2024

> TBH I am also very confused how the ETL can take 5 hours to get to this point. With 16 CPUs and 64 GB of memory, it's 5 hours into doing something that takes my laptop with 10 CPUs and 32 GB of memory 2 hours tops.

The ETL returns with a failure code 5 hours in, but if you look at the timestamps we see that the ETL starts at 06:08 UTC:

2024-04-03 06:08:39 +0000 - dagster - DEBUG - ferc_to_sqlite_job - 20634ea8-e16f-4b96-b880-c208266b0a27 - 102 - RUN_START - Started execution of run for "ferc_to_sqlite_job".

And then the OOM happens an hour in, at 07:11:

2024-04-03 07:11:46 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 1779 - _out_ferc714__hourly_demand_matrix - ENGINE_EVENT - Multiprocess executor: child process for step _out_ferc714__hourly_demand_matrix was terminated by signal 9 (SIGKILL). This usually indicates that the process was killed by the operating system due to running out of memory. Possible solutions include increasing the amount of memory available to the run, reducing the amount of memory used by the ops in the run, or configuring the executor to run fewer ops concurrently.

And then finally the run fails:

2024-04-03 11:00:06 +0000 - dagster - ERROR - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 1779 - RUN_FAILURE - Execution of run for "etl_job" failed. Steps failed: ['_out_ferc714__hourly_demand_matrix'].

@zaneselvans
Member

Oooh, interesting. So it's just the one asset materialization that fails, and everything else keeps running -- the VM doesn't crash. I wonder if we could just add a retry to this asset?

@bendnorman
Member

On second thought, I might be misunderstanding dagster's concurrency limit feature:

> Limits can be specified for all ops/assets with a specific tag key or key-value pair. If any limit is exceeded by launching an op/asset, then the op/asset will be queued.

I think limiting pudl.etl.epacems_assets.process_single_year using tags works because we are running this function multiple times. Our default value for CEMS is 2. Does the tag limiting capability restrict the number of processes available to the specific op, or for all ops trying to run in the job? For example, when pudl.etl.epacems_assets.process_single_year is running, can other ops be running?
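My reading of that doc (worth verifying against Dagster's actual behavior) is that a tag concurrency limit only gates ops that carry the matching tag; untagged ops keep launching freely. A toy model of that semantics, with hypothetical tag names:

```python
from collections import Counter


def admissible(running, candidate, limits):
    """Toy model of tag-based concurrency limits (my reading of the docs,
    not Dagster's code): a limit only queues ops that carry the matching
    tag key/value; ops without the tag are never blocked by it.

    running: list of tag dicts for currently executing ops
    candidate: tag dict for the op we want to launch
    limits: {(key, value): max_concurrent}
    """
    counts = Counter()
    for tags in running:
        for kv in tags.items():
            counts[kv] += 1
    for kv, limit in limits.items():
        if kv in candidate.items() and counts[kv] >= limit:
            return False
    return True
```

Under this model, two running `datasource: epacems` ops would block a third tagged op but would not block anything untagged.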

@jdangerx
Member Author

jdangerx commented Apr 4, 2024

This happened again on 2024-04-04, which is to be expected because #3541 wasn't merged yet. Just noting that down here for posterity.

@zaneselvans
Member

@jdangerx do you think that #3541 would have any effect? There's already only 1 CPU being used in the asset that's getting killed.

This problem seems to be happening reliably in the nightly builds, but I've still never seen it running the full ETL locally on a machine with half as much memory. If it's just that a couple of high-memory assets happen to be running at the same time, could we use an op retry to just keep attempting to materialize it until there's enough free memory for it to complete? Or does the fact that it's getting killed rather than raising an exception mean that won't work? Locally this asset takes ~3 minutes to run and peaks at ~4 GB of memory usage.
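Dagster's RetryPolicy could drive the retries themselves; the "wait until there's enough free memory" check would have to be our own code. A rough Linux-only sketch in plain Python (the function name and thresholds are made up for illustration, not anything in PUDL):

```python
import os
import time


def run_when_memory_free(fn, min_free_gb, retries=5, delay=60.0):
    """Call fn() once at least min_free_gb of physical RAM is available.

    Hypothetical sketch: polls free memory via os.sysconf (Linux-only
    keys) and sleeps between checks, raising if memory never frees up.
    """
    page_size = os.sysconf("SC_PAGE_SIZE")
    for _ in range(retries):
        free_gb = os.sysconf("SC_AVPHYS_PAGES") * page_size / 1e9
        if free_gb >= min_free_gb:
            return fn()
        time.sleep(delay)
    raise MemoryError(
        f"available RAM never reached {min_free_gb} GB in {retries} checks"
    )
```

This sidesteps the SIGKILL question: the parent process does the checking, so the child never launches into a memory-starved machine in the first place.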

@bendnorman
Member

I think the ETL fails more often on the VM because GCP VMs do not have swap enabled by default:

> Caution: You must ensure that tmpfs volumes on your instance do not use up all of your instance memory. By default, Compute Engine instances do not have any swap space. If you use all of the available instance memory in a RAM disk, the instance can crash and potentially cause you to lose your data.

I don't think it's trivial to enable it.
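For reference, the standard swapfile recipe on a Linux VM is only a few commands, though whether we want the builds silently swapping (and slowing down) instead of failing fast is a separate question. A sketch with an assumed size and path, run as root on the VM:

```
# Sketch of standard Linux swapfile setup; 32G and /swapfile are
# assumptions, not existing PUDL infrastructure.
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Persist across reboots:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```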

@github-project-automation github-project-automation bot moved this from In progress to Done in Catalyst Megaproject Apr 5, 2024