
PVF: re-preparing artifact on failed runtime construction #3187

Conversation

maksimryndin
Contributor

@maksimryndin maksimryndin commented Feb 2, 2024

resolve #3139

@maksimryndin
Contributor Author

@s0me0ne-unkn0wn before I continue, could you please have a look? Just to confirm whether this is the right direction.

@s0me0ne-unkn0wn
Contributor

Looks good, thank you! TBH, I had a feeling that it would require a more complex design, but it looks like you managed to keep it simple.

Considering the complexity of PVF host internals, we could definitely use a test to check the behavior is correct, but probably creating it is not as easy as it sounds 🤔

@s0me0ne-unkn0wn s0me0ne-unkn0wn added the T0-node This PR/Issue is related to the topic “node”. label Feb 3, 2024
@maksimryndin
Contributor Author

@s0me0ne-unkn0wn hi! :) I believe the T13-documentation label should be added.

I was wrong about the ordering of the reply on the result channel of ValidationBackend versus the artifact removal in the RuntimeConstruction case. They are concurrent, but since the artifact name is random, there is no race issue (and in practice, the artifact is removed first in the majority of cases; I ran some tests).
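Since the argument above hinges on artifact names being random, here is a minimal sketch (the `artifact_path` helper is hypothetical, not the PR's actual naming scheme) of why concurrent removal of the old artifact and writing of the re-prepared one cannot collide:

```rust
/// Hypothetical helper: artifact paths embed a per-preparation random
/// nonce, so a re-prepared artifact is written to a fresh path and the
/// concurrent removal of the old artifact cannot race with it.
fn artifact_path(code_hash: &str, nonce: u64) -> String {
    format!("artifact-{code_hash}-{nonce:016x}")
}

fn main() {
    // Same validation code hash, two preparations: distinct on-disk paths.
    let old = artifact_path("0xabc", 0x1234);
    let new = artifact_path("0xabc", 0x5678);
    assert_ne!(old, new);
}
```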

@maksimryndin maksimryndin marked this pull request as ready for review February 5, 2024 17:12
@s0me0ne-unkn0wn
Contributor

Still struggling to find time to review it myself and run some local tests, maybe other guys will be faster than me. Overall, looks very good! I just need to dive a bit deeper into it.

Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn left a comment


Looks really good! I've done some local tests, and it works even better than I expected; it catches some cases of on-disk artifact corruption that were only supposed to be addressed by checking hashes.

Left some nits, comments, and questions below, but none are blockers.

Comment on lines 74 to 75
/// A possibly transient runtime instantion error happend during the execution; maybe retried
/// with preparation
Contributor


Suggested change
/// A possibly transient runtime instantion error happend during the execution; maybe retried
/// with preparation
/// A possibly transient runtime instantiation error happened during the execution; may be retried
/// with re-preparation

Contributor Author


My bad :) thank you! Fixed at a10f16d

) -> Result<(), Fatal> {
let (artifact_id, path) = artifacts
.remove(artifact_id)
.expect("artifact sent by the execute queue for removal exists; qed");
Contributor


Is it necessarily true? Could the pruning procedure remove it in the meantime? I'm just thinking out loud here, not asserting. Probably it's not a valid concern, as the execution should mark the artifact as recently used, and pruning shouldn't touch it. But how bad could it be if we silently ignored the fact that we tried to remove an artifact that didn't even exist in the cache? Sounds safe to me.

Contributor Author


Agreed that it is safe and there is no need for the qed assertion :). Fixed at a10f16d
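The agreed fix can be sketched as a cache removal that tolerates an already-missing entry (simplified types; the real `Artifacts` table in the PR is more involved):

```rust
use std::collections::HashMap;
use std::path::PathBuf;

struct Artifacts {
    inner: HashMap<String, PathBuf>,
}

impl Artifacts {
    /// Remove an artifact from the in-memory cache. Returns `None` if it
    /// was already gone (e.g. pruned concurrently); the caller can safely
    /// ignore that instead of panicking with an `expect`/qed assertion.
    fn remove(&mut self, artifact_id: &str) -> Option<(String, PathBuf)> {
        self.inner.remove_entry(artifact_id)
    }
}

fn main() {
    let mut artifacts = Artifacts { inner: HashMap::new() };
    artifacts.inner.insert("a1".into(), PathBuf::from("/tmp/a1"));
    assert!(artifacts.remove("a1").is_some());
    // A second removal finds nothing: no panic, execution just proceeds.
    assert!(artifacts.remove("a1").is_none());
}
```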

@@ -238,6 +238,14 @@ impl Artifacts {
.is_none());
}

/// Remove artifact by its id.
pub fn remove(&mut self, artifact_id: ArtifactId) -> Option<(ArtifactId, PathBuf)> {
Contributor


Do I get it right, we don't currently remove anything from disk, either here or in the unused artifact pruning procedure? Are they always removed from the memory cache table only, and the disk is cleaned up only on node startup? It's totally okay for now, but we shouldn't forget that if we ever decide to re-enable artifact persistence.

Contributor


Oh, it seems like I indeed missed the moment when artifact names became really random, so right now, it's not a concern at all. Never mind.

Contributor Author


Yeah, the logic of `remove` is the same as with pruning, i.e. it affects only the in-memory cache.
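To illustrate the split described in this thread (cache-only removal at runtime, disk cleanup only at node startup), a minimal sketch with a hypothetical `clean_artifact_dir` helper:

```rust
use std::fs;
use std::path::Path;

/// Hypothetical startup step: wipe the artifact cache directory so any
/// files left over from a previous run disappear. During normal operation,
/// removal and pruning only touch the in-memory table, never the disk.
fn clean_artifact_dir(dir: &Path) -> std::io::Result<()> {
    if dir.exists() {
        fs::remove_dir_all(dir)?;
    }
    fs::create_dir_all(dir)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("pvf-artifacts-demo");
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("stale-artifact"), b"junk")?;
    clean_artifact_dir(&dir)?;
    // The directory exists again, but the stale file is gone.
    assert!(fs::read_dir(&dir)?.next().is_none());
    Ok(())
}
```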

@maksimryndin
Contributor Author


@s0me0ne-unkn0wn thank you for the catch with artifacts' stale cache and a potential panic because of it!

Contributor

@AndreiEres AndreiEres left a comment


Good job, thank you!

loop {
// Stop retrying if we exceeded the timeout.
if total_time_start.elapsed() + retry_delay > exec_timeout {
break
}

let mut wait_retry_delay = true;
Contributor


What do you think about `retry_immediately`? I would appreciate a brief comment on why we need it.

Contributor Author


Sure, thank you! Fixed d722ccf
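As a hedged sketch of the `retry_immediately` idea discussed in this thread (the error enum is hypothetical, not the PR's actual types): a failed runtime construction triggers a re-preparation, and since that already addresses the suspected cause, the retry loop can skip the usual back-off delay:

```rust
#[allow(dead_code)]
enum ExecError {
    /// Runtime construction failed; the artifact gets re-prepared.
    RuntimeConstruction,
    /// Some other transient failure; back off before retrying.
    Transient,
    /// Not retriable at all.
    Fatal,
}

/// After re-preparation there is nothing to wait for: the suspected cause
/// (a corrupt or stale artifact) has just been rebuilt, so retry at once
/// instead of sleeping for `retry_delay` first.
fn retry_immediately(err: &ExecError) -> bool {
    matches!(err, ExecError::RuntimeConstruction)
}

fn main() {
    assert!(retry_immediately(&ExecError::RuntimeConstruction));
    assert!(!retry_immediately(&ExecError::Transient));
    assert!(!retry_immediately(&ExecError::Fatal));
}
```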

@s0me0ne-unkn0wn s0me0ne-unkn0wn added this pull request to the merge queue Feb 28, 2024
Merged via the queue into paritytech:master with commit 4261366 Feb 28, 2024
129 of 130 checks passed
ordian added a commit that referenced this pull request Mar 1, 2024
…head-data

* origin/master:
  Fix call enum's metadata regression (#3513)
  Enable elastic scaling node feature in local testnets genesis (#3509)
  update development setup in sdk-docs (#3506)
  Fix accidental no-shows on node restart (#3277)
  Remove `AssignmentProviderConfig` and use parameters from `HostConfiguration` instead (#3181)
  [Deprecation] Remove sp_weights::OldWeight (#3491)
  Fixup multi-collator parachain transition to async backing (#3510)
  Multi-Block-Migrations, `poll` hook and new System callbacks (#1781)
  Snowbridge - Extract Ethereum Chain ID (#3501)
  PVF: re-preparing artifact on failed runtime construction (#3187)
  Add documentation around FRAME Offchain workers (#3463)
  [prdoc] Optional SemVer bumps and Docs (#3441)
  rpc-v2/tx/tests: Add transaction broadcast tests and check propagated tx status (#3193)
ordian added a commit that referenced this pull request Mar 1, 2024
…data

* ao-collator-parent-head-data:
  Fix call enum's metadata regression (#3513)
  Enable elastic scaling node feature in local testnets genesis (#3509)
  update development setup in sdk-docs (#3506)
  Fix accidental no-shows on node restart (#3277)
  Remove `AssignmentProviderConfig` and use parameters from `HostConfiguration` instead (#3181)
  [Deprecation] Remove sp_weights::OldWeight (#3491)
  Fixup multi-collator parachain transition to async backing (#3510)
  Multi-Block-Migrations, `poll` hook and new System callbacks (#1781)
  Snowbridge - Extract Ethereum Chain ID (#3501)
  PVF: re-preparing artifact on failed runtime construction (#3187)
  Add documentation around FRAME Offchain workers (#3463)
  [prdoc] Optional SemVer bumps and Docs (#3441)
  rpc-v2/tx/tests: Add transaction broadcast tests and check propagated tx status (#3193)
Successfully merging this pull request may close these issues.

PVF: Consider re-preparing artifact on failed runtime construction
3 participants