fix: stop infinite process of presignature message missing triples #560
Conversation
I've checked the logs, and there are no messages (no new messages received) without signature requests. There is no infinite loop of messages, but maybe they stay in the loop for a long time.
We should choose one approach here and use it in the triple, presignature, and signature generation protocols, since the problem is the same.
node/src/types.rs (Outdated)
@@ -29,6 +29,9 @@ pub const FAILED_TRIPLES_TIMEOUT: Duration = Duration::from_secs(120 * 60);
/// Default invalidation time for taken triples and presignatures. 120 mins
pub const TAKEN_TIMEOUT: Duration = Duration::from_secs(120 * 60);

/// Default invalidation time for requested presignatures. 120 mins
Let's not specify the same numbers in the comments. They can be changed in the future, which would confuse future devs.
The message is
So for the most part, we're seeing these logs while we're waiting for triples to get generated. It mostly occurs when we don't have a stockpile of triples yet and the presignature manager is waiting for specific triples. So we should be distinguishing between TripleIsMissing and TripleIsGenerating sorts of errors.
The more problematic case is when a node does not have a triple and is not going to generate it (for example, if the datastore somehow did not persist triples successfully, or the node goes offline at an unfortunate time), but it keeps processing the message asking it to generate a presignature with that triple. Such a message will stay in the queue forever. TripleIsGenerating makes sense. We could differentiate between the two errors, add a timestamp to the messages, and skip a message once it has timed out.
Agree we should use the same approach.
Force-pushed from a00b871 to 566e3eb (Compare)
removed requested altogether.
@@ -83,3 +84,16 @@ pub fn get_triple_timeout() -> Duration {
        .unwrap_or_default()
        .unwrap_or(crate::types::PROTOCOL_TRIPLE_TIMEOUT)
}

pub fn is_elapsed_longer_than_timeout(timestamp_sec: u64, timeout: Duration) -> bool {
We've used a simpler pattern before:
if timestamp.elapsed() < DEFAULT_TIMEOUT { ...
That elapsed works with Instant, which records a timestamp relative to the local system, not an absolute one. Since we are sending the timestamp as a field in the messages, it needs to be meaningful on every node's system, hence the u64 UNIX timestamp to represent the time.
@@ -43,6 +44,8 @@ pub struct TripleMessage {
    pub epoch: u64,
    pub from: Participant,
    pub data: MessageData,
    // UNIX timestamp as seconds since the epoch
    pub timestamp: u64,
Let's add this field to all message types. Let's make it a standard. We want to delete all types of messages from the queue when they expire. Message trait may be a good idea here.
            + chrono::Duration::nanoseconds(timeout.subsec_nanos() as i64);
        elapsed_duration > timeout
    } else {
        false
If we return false in case of failure, such a message will never be deleted from the queue.
Seems like the case where it's not Single only happens in the following: "Returns MappedLocalTime::None on out-of-range number of seconds and/or invalid nanosecond, otherwise always returns MappedLocalTime::Single." So we should be fine for the most part, since our timestamp should never be out of range given that we created it with Utc::now().timestamp() in the first place. But maybe we should add some guardrails to the type itself later when we get the chance, just in case someone ends up setting timestamp to something not produced by the chrono crate.
Hmmm, I realized adding the timestamp as a required field may make messages not backward compatible. It should be an optional field instead. Will change.
@ppca currently we are updating all the nodes at the same time. We'd better make it required now than deal with Optional later.
The problem is that partner testnet is not: they update within an hour, but not at the same time.
Ok, but let's delete Optional after the update.
Currently, we push a presignature message to the leftover messages if both of its triples are not found. This means we will likely process these messages over and over and never get rid of them.
This change times such a message out once it is older than 1 min (the presignature timeout threshold), so it won't be retained in the queue.