Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry hash file allocation #33565

Merged
merged 7 commits into from
Oct 16, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
68 changes: 53 additions & 15 deletions accounts-db/src/accounts_hash.rs
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ use {
atomic::{AtomicU64, AtomicUsize, Ordering},
Arc,
},
thread, time,
},
tempfile::tempfile_in,
};
Expand Down Expand Up @@ -87,22 +88,59 @@ impl AccountHashesFile {
if self.writer.is_none() {
// we have hashes to write but no file yet, so create a file that will auto-delete on drop

let mut data = tempfile_in(&self.dir_for_temp_cache_files).unwrap_or_else(|err| {
panic!(
"Unable to create file within {}: {err}",
self.dir_for_temp_cache_files.display()
)
});
let get_file = || -> Result<_, std::io::Error> {
let mut data = tempfile_in(&self.dir_for_temp_cache_files).unwrap_or_else(|err| {
panic!(
"Unable to create file within {}: {err}",
self.dir_for_temp_cache_files.display()
)
});

// Theoretical performance optimization: write a zero to the end of
// the file so that we won't have to resize it later, which may be
// expensive.
assert!(self.capacity > 0);
data.seek(SeekFrom::Start((self.capacity - 1) as u64))?;
data.write_all(&[0])?;
data.rewind()?;
data.flush()?;
Ok(data)
};

// Retry 5 times to allocate the AccountHashesFile. The memory might be fragmented and
// causes memory allocation failure. Therefore, let's retry after failure. Hoping that the
// kernel has the chance to defrag the memory between the retries, and retries succeed.
let mut num_retries = 0;
let data = loop {
num_retries += 1;

match get_file() {
HaoranYi marked this conversation as resolved.
Show resolved Hide resolved
Ok(data) => {
break data;
}
Err(err) => {
info!(
"Unable to create account hashes file within {}: {}, retry counter {}",
self.dir_for_temp_cache_files.display(),
err,
num_retries
);

// Theoretical performance optimization: write a zero to the end of
// the file so that we won't have to resize it later, which may be
// expensive.
assert!(self.capacity > 0);
data.seek(SeekFrom::Start((self.capacity - 1) as u64))
.unwrap();
data.write_all(&[0]).unwrap();
data.rewind().unwrap();
data.flush().unwrap();
if num_retries > 5 {
panic!(
"Unable to create account hashes file within {}: after {} retries",
self.dir_for_temp_cache_files.display(),
num_retries
);
}
datapoint_info!(
"retry_account_hashes_file_allocation",
("retry", num_retries, i64)
);
thread::sleep(time::Duration::from_millis(num_retries * 100));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100 milliseconds is a long time. Maybe too long? I'm not sure. How was this constant chosen?

Copy link
Contributor Author

@HaoranYi HaoranYi Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. I think this should not happen very often, so choose 100ms to be conservative.

}
}
};

//UNSAFE: Required to create a Mmap
let map = unsafe { MmapMut::map_mut(&data) };
Expand Down