Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry hash file allocation #33565

Merged
merged 7 commits into from
Oct 16, 2023
Merged

Conversation

HaoranYi
Copy link
Contributor

@HaoranYi HaoranYi commented Oct 6, 2023

Problem

Recently, on Oct 6, 20023, we saw a validator crash on one of the canaries boxes.

[2023-10-06T06:29:48.605280562Z INFO  solana_metrics::metrics] datapoint: accounts_db_active shrink=0i
[2023-10-06T06:29:48.653273411Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solAccountsLo03" one=1i message="panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 12, kind: OutOfMemory, message: \"Cannot allocate memory\" }', accounts-db/src/accounts_hash.rs:102:34" location="accounts-db/src/accounts_hash.rs:102:34" version="1.18.0 (src:e0091d69; feat:1091887072, client:SolanaLabs)"

The crash happens when allocating account hash file during hash dedup.

[2023-10-06T06:29:31.721389079Z INFO  solana_metrics::metrics] datapoint: memory-stats total=269907476480i swap_total=0i free_percent=3.6710358146504354 used_bytes=39749242880i avail_percent=85.27301155255498 buffers_percent=0.6657022085615106 cached_percent=79.78760422961368 swap_free_percent=0
[2023-10-06T06:29:36.744595335Z INFO  solana_metrics::metrics] datapoint: memory-stats total=269907476480i swap_total=0i free_percent=5.081942011716148 used_bytes=35619844096i avail_percent=86.80294278597377 buffers_percent=0.6663881443585271 cached_percent=79.90525739437273 swap_free_percent=0
[2023-10-06T06:29:41.745857461Z INFO  solana_metrics::metrics] datapoint: memory-stats total=269907476480i swap_total=0i free_percent=5.645556638416836 used_bytes=35985874944i avail_percent=86.66732933325522 buffers_percent=0.6665914970063151 cached_percent=79.21095985788381 swap_free_percent=0
[2023-10-06T06:29:46.768160791Z INFO  solana_metrics::metrics] datapoint: memory-stats total=269907476480i swap_total=0i free_percent=0.6511670469158839 used_bytes=37269983232i avail_percent=86.19157063818434 buffers_percent=0.6666142603624108 cached_percent=83.64288057827423 swap_free_percent=0

The used memory increase by 2G and the free memory drops to 0.65 percent.

It appears that the box still has enough physical memory. However, the allocation fails. Probably due to memory defragmentation and multiple threads allocate mmaps at the same time.

Summary of Changes

Account hash calculation is highly parallel. It seems that when all threads start allocation files to store account hashes at the same time, the system may hit OOM for some of those allocations. To mitigate that, we add retry for account hash file allocation. If a thread fails its allocation, it will sleep for some time and then retry. This will help to stagger the memory allocation from all parallel threads.

Hopefully, the kernel has time to defrag the memory in-between the retries, and the later retry for allocation will succeed.

Fixes #

@HaoranYi HaoranYi changed the title retry hash file allocation Retry hash file allocation Oct 6, 2023
@brooksprumo
Copy link
Contributor

Hopefully, the kernel has defrag the memory and retry allocation will succeed.

I'm not familiar with this part of the kernel. How likely is it that (1) fragmentation was the issue, and (2) that a defrag will run before we've hit our retry limit?

@brooksprumo
Copy link
Contributor

Is there any way to test this? I can't think of anything deterministic off the top of my head. But that would be nice...

@codecov
Copy link

codecov bot commented Oct 6, 2023

Codecov Report

Merging #33565 (afe0c36) into master (7afb11f) will decrease coverage by 0.1%.
Report is 26 commits behind head on master.
The diff coverage is 57.6%.

@@            Coverage Diff            @@
##           master   #33565     +/-   ##
=========================================
- Coverage    81.7%    81.7%   -0.1%     
=========================================
  Files         807      807             
  Lines      218287   218305     +18     
=========================================
- Hits       178524   178501     -23     
- Misses      39763    39804     +41     

@jeffwashington
Copy link
Contributor

This seems reasonable to me. I'd like to see a data point published if we actually have to retry.

@HaoranYi HaoranYi force-pushed the retry_hash_file branch 3 times, most recently from 2a5a57f to 5a20ee7 Compare October 10, 2023 13:36
data.write_all(&[0]).unwrap();
data.rewind().unwrap();
data.flush().unwrap();
// Retry 5 times for allocation the AccountHashFile. The memory maybe fragmented and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean AccountHashesFile?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original:

Retry 5 times for allocation the..

suggestion:
Retry 5 times to allocate the...

Maybe more grammar than typo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. done.

num_retries
);
}
datapoint_info!("account_hash_file_allocation_retry", ("retry", 1, i64),);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trailing , to remove since there's another typo above.

@HaoranYi HaoranYi marked this pull request as ready for review October 10, 2023 17:46
accounts-db/src/accounts_hash.rs Outdated Show resolved Hide resolved
);
}
datapoint_info!("retry_account_hashes_file_allocation", ("retry", 1, i64));
thread::sleep(time::Duration::from_millis(num_retries * 100));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100 milliseconds is a long time. Maybe too long? I'm not sure. How was this constant chosen?

Copy link
Contributor Author

@HaoranYi HaoranYi Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. I think this should not happen very often, so choose 100ms to be conservative.

HaoranYi and others added 2 commits October 10, 2023 14:30
Co-authored-by: Brooks <brooks@prumo.org>
@HaoranYi
Copy link
Contributor Author

It seems that a similar crash happened again on mc3 last Friday.
@jeffwashington @brooksprumo Do you mind if we merge this PR and give it a try on the canary box?

Copy link
Contributor

@jeffwashington jeffwashington left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@HaoranYi HaoranYi merged commit 167dac2 into solana-labs:master Oct 16, 2023
16 checks passed
@HaoranYi HaoranYi added the v1.17 PRs that should be backported to v1.17 label Oct 28, 2023
mergify bot pushed a commit that referenced this pull request Oct 28, 2023
* retry hash file allocation

* add sleep

* submit a datapoint for retry

* typo

* more typos

* Update accounts-db/src/accounts_hash.rs

Co-authored-by: Brooks <brooks@prumo.org>

* fmt

---------

Co-authored-by: HaoranYi <haoran.yi@solana.com>
Co-authored-by: Brooks <brooks@prumo.org>
(cherry picked from commit 167dac2)

# Conflicts:
#	accounts-db/src/accounts_hash.rs
HaoranYi added a commit that referenced this pull request Oct 30, 2023
* Retry hash file allocation (#33565)

* retry hash file allocation

* add sleep

* submit a datapoint for retry

* typo

* more typos

* Update accounts-db/src/accounts_hash.rs

Co-authored-by: Brooks <brooks@prumo.org>

* fmt

---------

Co-authored-by: HaoranYi <haoran.yi@solana.com>
Co-authored-by: Brooks <brooks@prumo.org>
(cherry picked from commit 167dac2)

# Conflicts:
#	accounts-db/src/accounts_hash.rs

* fix conflicts

---------

Co-authored-by: HaoranYi <haoran.yi@gmail.com>
Co-authored-by: HaoranYi <haoran.yi@solana.com>
anwayde pushed a commit to firedancer-io/solana that referenced this pull request Nov 16, 2023
…lana-labs#33918)

* Retry hash file allocation (solana-labs#33565)

* retry hash file allocation

* add sleep

* submit a datapoint for retry

* typo

* more typos

* Update accounts-db/src/accounts_hash.rs

Co-authored-by: Brooks <brooks@prumo.org>

* fmt

---------

Co-authored-by: HaoranYi <haoran.yi@solana.com>
Co-authored-by: Brooks <brooks@prumo.org>
(cherry picked from commit 167dac2)

# Conflicts:
#	accounts-db/src/accounts_hash.rs

* fix conflicts

---------

Co-authored-by: HaoranYi <haoran.yi@gmail.com>
Co-authored-by: HaoranYi <haoran.yi@solana.com>
anwayde pushed a commit to firedancer-io/solana that referenced this pull request Nov 16, 2023
…lana-labs#33918)

* Retry hash file allocation (solana-labs#33565)

* retry hash file allocation

* add sleep

* submit a datapoint for retry

* typo

* more typos

* Update accounts-db/src/accounts_hash.rs

Co-authored-by: Brooks <brooks@prumo.org>

* fmt

---------

Co-authored-by: HaoranYi <haoran.yi@solana.com>
Co-authored-by: Brooks <brooks@prumo.org>
(cherry picked from commit 167dac2)

# Conflicts:
#	accounts-db/src/accounts_hash.rs

* fix conflicts

---------

Co-authored-by: HaoranYi <haoran.yi@gmail.com>
Co-authored-by: HaoranYi <haoran.yi@solana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v1.17 PRs that should be backported to v1.17
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants