-
-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nightly Build Failure 2024-04-03 #3533
Comments
If we're confident Side note: I don't know if we're are using all CPUs for processing EPA CEMS on the VM. Is this intentional? In |
With 64MB of memory in the nightly builds, we could definitely be using more cores on EPA CEMS, but we're approaching 400 assets, and it's not deterministic which ones will be running at the same time. Right now the FERC-714 imputations and the EPA CEMS are the long poles. I really wish there were some way to chunk the EPA CEMS into a smaller memory footprint. The state-year chunks we had before were very manageable and let us do a lot more parallelism. I wish there were some way for Dagster to explicitly manage memory usage as well as CPU usage. |
Short-term, reducing the concurrency for the internally parallelized work of the FERC-714 asset seems like the simplest fix! |
It's definitely simple, though it's possible some other task is also taking up a bunch of RAM and this just happened to get killed both times. Worth a shot while we figure out a better strategy re: memory usage. |
I would be tempted to blame the extremely parallelized extraction of spreadsheets, which we've just added a bunch of additional work to, but that happens right at the beginning of the ETL, and this crash isn't happening for 5 hours! |
Sounds good! Are there other suspects in the list of tasks that we were running when we got the OOM error? |
TBH I am also very confused how the ETL can take 5 hours to get to this point. With 16 CPUs and 64 GB of memory, it's 5 hours into doing something that takes my laptop with 10 CPUs and 32 GB of memory 2 hours tops. |
The only other thing in there that I could imagine being unusually resource intensive is the FERC to FERC matching step, but I don't know if that particular merge is a big memory hog. |
The ETL returns with a failure code 5 hours in, but if you look at the timestamps we see that the ETL starts at 06:08 UTC:
And then the OOM happens an hour in, at 07:11:
And then finally the run fails:
|
Oooh, interesting. So it's just the one asset materialization that fails, and everything else keeps running -- the VM doesn't crash. I wonder if we could just add a retry to this asset? |
On second thought, I might be misunderstanding dagster's concurrency limit feature:
I think limiting |
This happened again on 2024-04-04, which is to be expected because #3541 wasn't merged yet. Just noting that down here for posterity. |
@jdangerx do you think that #3541 would have any effect? There's already only 1 CPU being used in the asset that's getting killed. This problem seems to be happening reliably in the nightly builds, but I've still never seen it running the full ETL locally on a machine with half as much memory. If it's just that a couple of high memory assets happen to be running at the same time could we use an op retry to just keep attempting to materialize it until there's enough free memory for it to complete? Or does the fact that it's getting killed rather than raising an exception mean that won't work. Locally this asset takes ~3 minutes to run and peaks at ~4GB of memory usage. |
I think the ETL fails more often on the VM because GCP VMs do not have swap enabled by default:
I don't think it's trivial to enable it. |
Overview
What do you think happened, if anything?
The
_out_ferc714__hourly_demand_matrix
job ran into an OOM kill.From the logs we can see that it started at 07:10:43 UTC and got the kill at 07:11:46 UTC.
here are the VM instance RAM metrics over the 07:00 - 07:20 timeframe, you can see that when the job started we were using ~35GB of RAM already and then we quickly jumped up to 60+ before getting killed at 07:12.
We also saw this same issue 2024-04-02, but didn't make an issue for it.
Next steps
What next steps do we need to do to understand or remediate the issue?
We could maybe force this asset to run non-concurrently.
We could maybe reduce the RAM usage of this asset.
Verify that everything is fixed!
Once you've applied any necessary fixes, make sure that the nightly build outputs are all in their right places.
Tasks
Relevant logs
link to build logs from internal distribution bucket
There are several other concurrent tasks running during these two minutes:
all the dagster steps
``` > ag --no-numbers '07:1[01].*STEP' 2024-04-03-0602-c28abb2be-main-pudl-etl.log | cut -d ' ' -f1 -f2 -f15 -f17 2024-04-03 07:07:11 core_demand_side_management_eia861 STEP_WORKER_STARTED 2024-04-03 07:07:11 core_eia861__yearly_dynamic_pricing STEP_WORKER_STARTING 2024-04-03 07:10:00 extract_phmsagas STEP_WORKER_STARTING 2024-04-03 07:10:00 _core_eia860__ownership STEP_WORKER_STARTED 2024-04-03 07:10:02 _core_eia860__plants STEP_WORKER_STARTED 2024-04-03 07:10:03 _out_ferc714__hourly_demand_matrix STEP_WORKER_STARTING 2024-04-03 07:10:03 _core_eia860__ownership STEP_START 2024-04-03 07:10:03 _core_eia860__ownership STEP_INPUT 2024-04-03 07:10:04 _core_eia860__utilities STEP_WORKER_STARTED 2024-04-03 07:10:04 _core_eia860__plants STEP_START 2024-04-03 07:10:05 _core_eia860__plants STEP_INPUT 2024-04-03 07:10:06 _core_eia860__utilities STEP_START 2024-04-03 07:10:07 _core_eia860__utilities STEP_INPUT 2024-04-03 07:10:11 _core_eia860__utilities STEP_OUTPUT 2024-04-03 07:10:17 _out_ferc1__yearly_steam_plants_sched402_with_plant_ids.ferc_to_ferc.merge_steam_fuel_dfs STEP_WORKER_STARTING 2024-04-03 07:10:28 _core_eia860__plants STEP_OUTPUT 2024-04-03 07:10:29 _core_eia860__ownership STEP_OUTPUT 2024-04-03 07:10:29 _core_eia860__ownership STEP_SUCCESS 2024-04-03 07:10:30 extract_phmsagas STEP_WORKER_STARTED 2024-04-03 07:10:31 _core_eia860__plants STEP_SUCCESS 2024-04-03 07:10:36 raw_eia930__balance STEP_WORKER_STARTING 2024-04-03 07:10:37 raw_eia930__interchange STEP_WORKER_STARTING 2024-04-03 07:10:37 extract_phmsagas STEP_START 2024-04-03 07:10:40 _out_ferc714__hourly_demand_matrix STEP_WORKER_STARTED 2024-04-03 07:10:41 extract_phmsagas STEP_INPUT 2024-04-03 07:10:41 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:42 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:42 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:43 _core_eia860__boilers STEP_OUTPUT 2024-04-03 07:10:43 _out_ferc714__hourly_demand_matrix STEP_START 2024-04-03 07:10:43 _out_ferc714__hourly_demand_matrix STEP_INPUT 2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:44 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:44 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:44 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT 2024-04-03 07:10:47 raw_eia930__subregion STEP_WORKER_STARTING 2024-04-03 07:10:50 _out_ferc1__yearly_steam_plants_sched402_with_plant_ids.ferc_to_ferc.merge_steam_fuel_dfs STEP_WORKER_STARTED 2024-04-03 07:10:51 core_eia861__yearly_balancing_authority STEP_WORKER_STARTING 2024-04-03 07:10:57 raw_eia930__balance STEP_WORKER_STARTED 2024-04-03 07:10:57 raw_eia930__interchange STEP_WORKER_STARTED 2024-04-03 07:10:59 raw_eia930__balance STEP_START 2024-04-03 07:10:59 raw_eia930__interchange STEP_START 2024-04-03 07:11:00 _core_eia923__boiler_fuel STEP_WORKER_STARTING 2024-04-03 07:11:05 raw_eia930__subregion STEP_WORKER_STARTED 2024-04-03 07:11:08 raw_eia930__subregion STEP_START 2024-04-03 07:11:11 core_eia861__yearly_balancing_authority STEP_WORKER_STARTED 2024-04-03 07:11:18 core_eia861__yearly_balancing_authority STEP_START 2024-04-03 07:11:19 core_eia861__yearly_balancing_authority STEP_INPUT 2024-04-03 07:11:22 _core_eia923__boiler_fuel STEP_WORKER_STARTED 2024-04-03 07:11:23 _core_eia923__coalmine STEP_WORKER_STARTING 2024-04-03 07:11:24 _core_eia923__boiler_fuel STEP_START 2024-04-03 07:11:26 _core_eia923__boiler_fuel STEP_INPUT 2024-04-03 07:11:46 _out_ferc714__hourly_demand_matrix STEP_FAILURE 2024-04-03 07:11:46 _core_eia923__cooling_system_information STEP_WORKER_STARTING 2024-04-03 07:11:52 raw_eia930__interchange STEP_INPUT 2024-04-03 07:11:52 raw_eia930__interchange STEP_OUTPUT 2024-04-03 07:11:52 _core_eia923__boiler_fuel STEP_OUTPUT 2024-04-03 07:11:53 _core_eia923__coalmine STEP_WORKER_STARTED 2024-04-03 07:11:53 raw_eia930__balance STEP_INPUT 2024-04-03 07:11:53 raw_eia930__balance STEP_OUTPUT 2024-04-03 07:11:55 _core_eia923__coalmine STEP_START 2024-04-03 07:11:57 raw_eia930__subregion STEP_INPUT 2024-04-03 07:11:57 raw_eia930__subregion STEP_OUTPUT 2024-04-03 07:11:57 raw_eia930__subregion STEP_SUCCESS 2024-04-03 07:11:57 _core_eia923__fgd_operation_maintenance STEP_WORKER_STARTING 2024-04-03 07:11:58 _core_eia923__coalmine STEP_INPUT ```The text was updated successfully, but these errors were encountered: