Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect usage of tracing span in async functions. #10663

Merged
merged 2 commits into from
Mar 8, 2024

Conversation

robin-near
Copy link
Contributor

See https://docs.rs/tracing/latest/tracing/struct.Span.html#method.enter

It is incorrect to do this:

async fn f() {
  let _ = tracing::span!(...).entered();
  ...
}

because when the function yields, the tracing span remains entered. This causes incorrect span nesting, and we have observed a stack overflow as a result, when the system is under load.

@robin-near
Copy link
Contributor Author

@nagisa Any idea how to make this lintable? It's pretty subtle and I don't imagine this is very common knowledge. On the other hand, it's not such a huge deal because we don't use a lot async functions outside of networking code. Still, one day someone is going to do this again.

Copy link

codecov bot commented Feb 27, 2024

Codecov Report

Attention: Patch coverage is 65.32258% with 43 lines in your changes are missing coverage. Please review.

Project coverage is 71.71%. Comparing base (374287d) to head (4babb51).

Files Patch % Lines
chain/network/src/peer/peer_actor.rs 65.32% 37 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10663      +/-   ##
==========================================
+ Coverage   71.67%   71.71%   +0.03%     
==========================================
  Files         756      756              
  Lines      152394   152413      +19     
  Branches   152394   152413      +19     
==========================================
+ Hits       109231   109300      +69     
+ Misses      38184    38156      -28     
+ Partials     4979     4957      -22     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.42% <0.00%> (-0.01%) ⬇️
integration-tests 36.80% <52.41%> (+0.01%) ⬆️
linux 70.49% <57.25%> (+0.06%) ⬆️
linux-nightly 71.15% <65.32%> (+0.01%) ⬆️
macos 54.79% <0.00%> (-0.02%) ⬇️
pytests 1.65% <0.00%> (-0.01%) ⬇️
sanity-checks 1.43% <0.00%> (-0.01%) ⬇️
unittests 67.57% <46.77%> (+0.02%) ⬆️
upgradability 0.29% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@akhi3030
Copy link
Collaborator

I wonder if it is possible to check statically if you are inside an async function. If it possible to do that, then we could introduce that check around the call to span. Would something like that work?

@nagisa
Copy link
Collaborator

nagisa commented Feb 27, 2024

I don't think there are easy ways to prevent tracing::Spans from existing across yield points without (breaking) changes to tracing itself.

One way would be to have Span not be Send+Sync, which with our threaded use would make the futures !Send !Sync as well.

More reasonably, this wouldn't have occurred if we chosen to use the #[instrument] macro instead of span! everywhere. It transparently does the correct thing based on the function type, and any use of other stuff would appear pretty out-of-place, which would suggest reviewers to pay a closer attention.

I would suggest doing exactly that in this instance instead of manually unrolling async {}s.


It's pretty subtle and I don't imagine this is very common knowledge.

The warning to that effect is alluded to within the first page of the tracing crate. It is pretty hard for the project to do much else to draw awareness to the issue. I can only say that it is not uncommon to see people asking about invalid traces, but it is also not particularly common in absolute terms, because in most cases people just use #[instrument] and in other cases async use in a project is so pervasive, that it is hardly possible to not be aware of the gotcha.

All that said this is my first time seeing this sort of misuse becoming a stack overflow. I would consider sending a bug report to the upstream project(s) with a minimal reproducer they could experiment with. If you have the time for it, of course.

.instrument(tracing::trace_span!(
target: "network",
"receive_routed_message",
"type" = body_type,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case #[instrument] application would need to use an initially empty field and then record it later as demonstrated in the last example here.

@nagisa
Copy link
Collaborator

nagisa commented Feb 27, 2024

As for diagnostics: rust-lang/rust#83310

Copy link
Collaborator

@akhi3030 akhi3030 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to unblock.

@robin-near
Copy link
Contributor Author

I'm merging this to unblock the stack overflow issue, but not improving upon it as I don't have the bandwidth right now. Thanks @nagisa for your thoughts!

@robin-near robin-near added this pull request to the merge queue Mar 8, 2024
Merged via the queue into near:master with commit a4c2b30 Mar 8, 2024
28 of 30 checks passed
@robin-near robin-near deleted the async-fix2 branch March 8, 2024 19:18
nagisa added a commit to nagisa/nearcore that referenced this pull request Mar 13, 2024
Pushing people towards the `tracing::*_span!` has caused us some
significant grief, especially in near#10663
so we better allow and encourage `#[instrument]`.
github-merge-queue bot pushed a commit that referenced this pull request Mar 13, 2024
Pushing people towards the `tracing::*_span!` has caused us some
significant grief, especially in
#10663 so we better allow and
encourage `#[instrument]`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants