tracing: ensmallerate assembly generated by macro expansions #943

Merged: 9 commits into master on Aug 21, 2020

Conversation

@hawkw (Member) commented Aug 18, 2020

Motivation

Currently, tracing's macros generate a lot of code at the expansion
site, most of which is not hidden behind function calls. Additionally,
several functions called in the generated code are inlined when they
probably shouldn't be. This results in a really gross disassembly for
code that calls a tracing macro.

Inlining code can be faster, since it avoids function calls. However,
inlining this much code has some negative impacts. Obviously, the output
binary is much larger. Perhaps more importantly, though, this code may
optimize much less well --- for example, functions that contain tracing
macros are themselves unlikely to be inlined, since they become quite large.

Solution

This branch "outlines" most of the macro code, with the exception of the
two hottest paths for skipping a disabled callsite (the global max level
and the per-callsite cache). Actually recording the span or event,
performing runtime filtering via Subscriber::enabled, and registering
the callsite for the first time are now behind a function call. This means
that the fast paths still don't require a stack frame, but the amount of
code they have to jump over is significantly smaller.
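
As a rough illustration of the shape this takes, here's a minimal, generic sketch of the outlining pattern (the function names, levels, and atomics are illustrative placeholders, not tracing's actual internals): only the cheap checks stay inline at the callsite, and everything expensive sits behind a single non-inlined call.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Placeholder globals standing in for the max-level hint and the
// per-callsite interest cache.
static MAX_LEVEL: AtomicUsize = AtomicUsize::new(0);
static CALLSITE_ENABLED: AtomicUsize = AtomicUsize::new(0);

#[inline(always)]
fn fast_path_enabled(level: usize) -> bool {
    // Hot path for disabled callsites: two relaxed loads and a couple of
    // compares, no stack frame required.
    level <= MAX_LEVEL.load(Ordering::Relaxed)
        && CALLSITE_ENABLED.load(Ordering::Relaxed) != 0
}

#[inline(never)]
fn record_outlined(message: &str) {
    // Cold path: runtime filtering, callsite registration, and actually
    // recording the event would all live behind this one call.
    println!("event: {}", message);
}

pub fn emit(level: usize, message: &str) {
    if fast_path_enabled(level) {
        record_outlined(message);
    }
}

fn main() {
    MAX_LEVEL.store(3, Ordering::Relaxed);
    CALLSITE_ENABLED.store(1, Ordering::Relaxed);
    emit(2, "hello world");
}
```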

Also, while working on this, I noticed that the event macro expansions
were hitting the dispatcher thread-local two separate times for enabled
events. That's not super great. I folded these together into one
LocalKey::with call, so that we don't access the thread-local twice.
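
A generic sketch of that change (not the actual macro expansion; `ENABLED` here is just a stand-in for the dispatcher thread-local): both the check and the recording run inside a single `LocalKey::with` call.

```rust
use std::cell::Cell;

thread_local! {
    // Stand-in for the dispatcher thread-local.
    static ENABLED: Cell<bool> = Cell::new(true);
}

fn record(message: &str) {
    // One thread-local access: the check and the recording both happen
    // inside the same `with` closure.
    ENABLED.with(|enabled| {
        if enabled.get() {
            println!("recorded: {}", message);
        }
    });
}

fn main() {
    record("hello world");
}
```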

Building ValueSets to actually record a span or event is the path that
generates the most code at the callsite currently. In order to put this
behind a function call, it was necessary to capture the local variables
that comprise the ValueSet using a closure. While closures have
impacted build times in the past, my guess is that this may be offset a
bit by generating way less code overall. It was also necessary to add
new dispatcher-accessing functions that take a FnOnce rather than a
FnMut. This is because fields are sometimes moved into the
ValueSet by value rather than by ref, when they are inside of a call
to tracing::debug or display that doesn't itself borrow the value.
Not supporting this would be a breaking change, so we had to make it
work. Capturing the FnOnce in an &mut Option and taking it inside the
FnMut seemed inefficient, so I added the new FnOnce-based functions instead.
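
For reference, the new FnOnce-taking accessor appears in the disassembly below as `tracing_core::dispatcher::get_current`. A minimal usage sketch, assuming `tracing-core` 0.1.15+ as a direct dependency (the function is `#[doc(hidden)]`, so its exact signature and semantics shouldn't be relied on):

```rust
use tracing_core::dispatcher;

fn dispatcher_note() -> Option<String> {
    // The closure moves `owned` out when called, so it only implements
    // FnOnce; the existing FnMut-based `get_default` could not accept it.
    let owned = String::from("accessed the current dispatcher");
    dispatcher::get_current(move |_dispatch| owned)
}

fn main() {
    // Prints Some(...) normally, or None if the dispatcher state can't be
    // accessed (for example, during thread teardown).
    println!("{:?}", dispatcher_note());
}
```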

Before & After

Here's the disassembly of the following function:

#[inline(never)]
pub fn event() {
    tracing::info!("hello world");
}
On master:
event:
; pub fn event() {
               	push	r15
               	push	r14
               	push	r12
               	push	rbx
               	sub	rsp, 184
;         Relaxed => intrinsics::atomic_load_relaxed(dst),
               	lea	rax, [rip + 227515]
               	mov	rax, qword ptr [rax]
;             Self::ERROR_USIZE => Self::ERROR,
               	add	rax, -3
               	cmp	rax, 3
               	jb	395 <event+0x1b1>
               	lea	rbx, [rip + 227427]
               	mov	qword ptr [rsp + 96], rbx
;         Acquire => intrinsics::atomic_load_acq(dst),
               	mov	rax, qword ptr [rip + 227431]
;         self.state_and_queue.load(Ordering::Acquire) == COMPLETE
               	cmp	rax, 3
;         if self.is_completed() {
               	jne	36 <event+0x63>
               	mov	qword ptr [rsp], rbx
;         Relaxed => intrinsics::atomic_load_relaxed(dst),
               	mov	rax, qword ptr [rip + 227398]
;             if interest.is_always() {
               	test	rax, -3
               	je	86 <event+0xa8>
               	mov	rdi, rsp
;             crate::dispatcher::get_default(|current| current.enabled(self.meta))
               	call	-922 <tracing_core::dispatcher::get_default::h8c0486e541dd0400>
;     tracing::info!("hello world");
               	test	al, al
               	jne	84 <event+0xb2>
               	jmp	334 <event+0x1b1>
               	lea	rax, [rsp + 96]
;         let mut f = Some(f);
               	mov	qword ptr [rsp + 136], rax
               	lea	rax, [rsp + 136]
;         self.call_inner(false, &mut |_| f.take().unwrap()());
               	mov	qword ptr [rsp], rax
               	lea	rdi, [rip + 227357]
               	lea	rcx, [rip + 219174]
               	mov	rdx, rsp
               	xor	esi, esi
               	call	qword ptr [rip + 226083]
               	mov	qword ptr [rsp], rbx
;         Relaxed => intrinsics::atomic_load_relaxed(dst),
               	mov	rax, qword ptr [rip + 227312]
;             if interest.is_always() {
               	test	rax, -3
               	jne	-86 <event+0x52>
;                 0 => Interest::never(),
               	cmp	rax, 2
;     tracing::info!("hello world");
               	jne	255 <event+0x1b1>
;             &self.meta
               	mov	rbx, qword ptr [rip + 227295]
;     tracing::info!("hello world");
               	lea	r14, [rip + 2528]
               	mov	rdi, rbx
               	call	r14
               	lea	r12, [rsp + 136]
               	mov	rdi, r12
               	mov	rsi, rax
               	call	qword ptr [rip + 225926]
               	mov	rdi, rbx
               	call	r14
               	mov	r14, rax
               	mov	r15, rsp
               	mov	rdi, r15
               	mov	rsi, r12
               	call	qword ptr [rip + 225926]
;             Some(val) => val,
               	cmp	qword ptr [rsp + 8], 0
               	je	194 <event+0x1c0>
               	mov	rax, qword ptr [rsp + 32]
               	mov	qword ptr [rsp + 128], rax
               	movups	xmm0, xmmword ptr [rsp]
               	movups	xmm1, xmmword ptr [rsp + 16]
               	movaps	xmmword ptr [rsp + 112], xmm1
               	movaps	xmmword ptr [rsp + 96], xmm0
;         Arguments { pieces, fmt: None, args }
               	lea	rax, [rip + 218619]
               	mov	qword ptr [rsp], rax
               	mov	qword ptr [rsp + 8], 1
               	mov	qword ptr [rsp + 16], 0
               	lea	rax, [rip + 165886]
               	mov	qword ptr [rsp + 32], rax
               	mov	qword ptr [rsp + 40], 0
               	lea	rax, [rsp + 96]
;     tracing::info!("hello world");
               	mov	qword ptr [rsp + 72], rax
               	mov	qword ptr [rsp + 80], r15
               	lea	rax, [rip + 218570]
               	mov	qword ptr [rsp + 88], rax
               	lea	rax, [rsp + 72]
;         ValueSet {
               	mov	qword ptr [rsp + 48], rax
               	mov	qword ptr [rsp + 56], 1
               	mov	qword ptr [rsp + 64], r14
               	lea	rax, [rsp + 48]
;         Event {
               	mov	qword ptr [rsp + 136], rax
               	mov	qword ptr [rsp + 144], rbx
               	mov	qword ptr [rsp + 152], 1
               	lea	rdi, [rsp + 136]
;         crate::dispatcher::get_default(|current| {
               	call	-1777 <tracing_core::dispatcher::get_default::h32aa5a3d270b4ae2>
; }
               	add	rsp, 184
               	pop	rbx
               	pop	r12
               	pop	r14
               	pop	r15
               	ret
;             None => expect_failed(msg),
               	lea	rdi, [rip + 165696]
               	lea	rdx, [rip + 218426]
               	mov	esi, 34
               	call	qword ptr [rip + 226383]
               	ud2
And on this branch:
event:
; pub fn event() {
               	sub	rsp, 24
;         Relaxed => intrinsics::atomic_load_relaxed(dst),
               	lea	rax, [rip + 219189]
               	mov	rax, qword ptr [rax]
;             Self::ERROR_USIZE => Self::ERROR,
               	add	rax, -3
               	cmp	rax, 3
               	jb	59 <event+0x53>
;         Relaxed => intrinsics::atomic_load_relaxed(dst),
               	mov	rcx, qword ptr [rip + 219105]
;                 0 => Interest::never(),
               	test	rcx, rcx
               	je	47 <event+0x53>
               	mov	al, 1
               	cmp	rcx, 1
               	je	8 <event+0x34>
               	cmp	rcx, 2
               	jne	38 <event+0x58>
               	mov	al, 2
               	lea	rcx, [rip + 219077]
               	mov	qword ptr [rsp + 16], rcx
               	mov	byte ptr [rsp + 15], al
               	lea	rdi, [rsp + 15]
               	lea	rsi, [rsp + 16]
;             tracing_core::dispatcher::get_current(|current| {
               	call	-931 <tracing_core::dispatcher::get_current::he31e3ce09e66e5fa>
; }
               	add	rsp, 24
               	ret
;                 _ => self.register(),
               	lea	rdi, [rip + 219041]
               	call	qword ptr [rip + 218187]
;             InterestKind::Never => true,
               	test	al, al
;     tracing::info!("hello world");
               	jne	-53 <event+0x34>
               	jmp	-24 <event+0x53>
               	nop	dword ptr [rax + rax]

So, uh...that seems better.

@hawkw changed the title from "[WIP] shrink code inlined by macro expansions" to "tracing: ensmallerate assembly generated by macro expansions" on Aug 20, 2020
@hawkw marked this pull request as ready for review on August 20, 2020 01:22
@hawkw requested a review from a team as a code owner on August 20, 2020 01:22
@Mark-Simulacrum commented

Are you piping the assembly through sed or something like that? It looks... weird in some places, e.g. [rsp, +, 15] has commas in the middle...

But this looks great. Let me know if you want to get this run by perf.rlo, I am happy to help out with that.

@hawkw (Member Author) commented Aug 20, 2020

> Are you piping the assembly through sed or something like that? It looks... weird in some places, e.g. [rsp, +, 15] has commas in the middle...
>
> But this looks great. Let me know if you want to get this run by perf.rlo, I am happy to help out with that.

This is from `cargo asm`; its formatter is... not great.

@iximeow commented Aug 20, 2020

yeah. there's some "good" reason cargo asm does that (i've looked into it and decided fixing it was more effort than i could put in at the time) but it's well in "plumbing data from the right apis" territory

@hawkw (Member Author) commented Aug 20, 2020

Clearly I should have just used objdump. This saved me from having to recompile in a separate command, and I didn't notice the commas until you pointed them out. Now it's driving me crazy. :)

@hawkw (Member Author) commented Aug 20, 2020

> Let me know if you want to get this run by perf.rlo, I am happy to help out with that.

A perf run would be great. Our microbenchmarks were not terribly helpful here, and I imagine the impact is a lot more noticeable when there's actual non-tracing code involved.

///
/// [dispatcher]: ../dispatcher/struct.Dispatch.html
#[cfg(feature = "std")]
#[doc(hidden)]
@hawkw (Member Author) commented on the lines above

Adding this was necessary since constructing the `ValueSet` in a closure means we needed a `FnOnce` so that values can be moved into the closure (which happens primarily when users wrap values in `field::display`/`field::debug` without borrowing). Not supporting this would break at least some code. I opted to add a new function that takes a `FnOnce`, rather than capturing an `Option<FnOnce>` into the `FnMut` and `take`ing it, because the latter seemed significantly less efficient.

We could probably commit to making this a public API, and even deprecate `get_default` (the version that takes a `FnMut`), since this is probably the more useful API of the two. I didn't make it public here, though, because introducing a new public API seemed like a job for another PR.

Also, in 0.2, the `Value` changes will make `Value::debug` and `Value::display` borrow rather than move. So, the `FnOnce` will no longer be necessary, and we could, alternatively, remove this function in 0.2 rather than deprecating the existing `get_default`.

So, there should probably be an actual design discussion... :)
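
To make the move case concrete, here's a small sketch (assuming `tracing` 0.1 as a dependency; not code from this PR): `field::display` takes its argument by value, so an owned value passed without a borrow has to be moved into whatever closure builds the `ValueSet`, and that closure can only be `FnOnce`.

```rust
use tracing::field;

fn log_login(user: String) {
    // `user` is moved into the display wrapper (no borrow), so the
    // closure-based expansion needs a FnOnce to move it in turn.
    tracing::info!(user = field::display(user), "logged in");
}

fn main() {
    // Without a subscriber installed this is a no-op, but it compiles and runs.
    log_login(String::from("eliza"));
}
```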

@hawkw (Member Author) commented Aug 20, 2020

I replaced the cargo asm assembly with llvm-objdump disassembly (with source lines! yay!) because everyone whined about cargo asm's bad formatter. :P

@LucioFranco (Member) left a comment

LGTM! This was a really neat PR to read :) Left a nit and a question, but nothing blocking.

pub fn register(&'static self) -> Interest {
self.registration
.call_once(|| crate::callsite::register(self));
match self.interest.load(Ordering::Relaxed) {
@LucioFranco (Member) commented on the lines above

Is it actually worse to call the inlined interest fn here?

@hawkw (Member Author) replied

It would be fine for it to be inlined into this function, since this function shouldn't be inlined. The reason it doesn't is that the interest function calls register (this function) if the cached value is not 0, 1, or 2. That means that if we register the callsite and the cached interest value is still not one of those, we could recurse again, potentially infinitely.

This shouldn't happen ever, but if there's a bug elsewhere, I'd rather just assume the cached value is invalid and return Interest::sometimes() (which is what this method does), rather than overflowing the stack.
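
In other words, the cached value is decoded defensively. A minimal sketch of that fallback, with made-up types and cache values rather than tracing-core's actual encoding:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

#[derive(Debug)]
enum Interest {
    Never,
    Sometimes,
    Always,
}

fn decode_interest(cache: &AtomicUsize) -> Interest {
    match cache.load(Ordering::Relaxed) {
        0 => Interest::Never,
        2 => Interest::Always,
        // 1, or any unexpected value (e.g. a bug elsewhere corrupted the
        // cache): assume "sometimes" instead of re-registering, which could
        // otherwise recurse without bound.
        _ => Interest::Sometimes,
    }
}

fn main() {
    let cache = AtomicUsize::new(7); // an "impossible" cached value
    println!("{:?}", decode_interest(&cache)); // Sometimes
}
```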

@hawkw merged commit 2692560 into master on Aug 21, 2020
hawkw added a commit that referenced this pull request Aug 21, 2020
Fixed

- When combining `Interest` from multiple subscribers, if the interests
  differ, the current subscriber is now always asked if a callsite
  should be enabled (#927)

Added

- Internal API changes to support optimizations in the `tracing` crate
  (#943)
- **docs**: Multiple fixes and improvements (#913, #941)
hawkw added a commit that referenced this pull request Aug 24, 2020
Changed

- Significantly reduced assembly generated by macro invocations (#943)
- Updated `tracing-core` to 0.1.15 (#943)

Added

- Documented minimum supported Rust version policy (#941)
hawkw added a commit that referenced this pull request Sep 25, 2020
## Motivation

In PR #943, the construction of `ValueSet`s for events and spans was
moved out of the code expanded at the callsite and into a closure, in
order to reduce the amount of assembly generated in functions containing
tracing macros. However, this introduced an accidental breaking change
for some dependent crates. Borrowing values inside a closure meant that
when a field of a struct, array, or tuple was used as a field value, the
closure must borrow the _entire_ struct, array, or tuple, rather than
the individual field. This broke code where another unrelated field of
that struct, array, or tuple would then be mutably borrowed or moved
elsewhere. 
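
As a sketch of the kind of code this broke (an assumed shape, not the actual report from #954): moving one field of a tuple into `field::display` inside the generated closure forced the whole tuple to be captured, so the other field could no longer be used afterwards.

```rust
use tracing::field;

fn process(body: Vec<u8>) -> usize {
    body.len()
}

fn handle(pair: (String, Vec<u8>)) {
    // `pair.0` is moved into the display wrapper. Under the closure-based
    // expansion, the closure captured (and therefore moved) *all* of `pair`,
    // not just `pair.0`...
    tracing::info!(name = field::display(pair.0), "got request");
    // ...so this use of the other field stopped compiling. With ValueSet
    // construction back at the callsite, only `pair.0` is moved.
    let _ = process(pair.1);
}

fn main() {
    handle((String::from("alice"), b"hello".to_vec()));
}
```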

## Solution

This branch fixes the breaking change by moving `ValueSet` construction
back out of a closure and into the code expanded at the callsite. This
_does_ regress the amount of assembly generated a bit: a function
containing a single `event!` macro generates 32 instructions in release
mode on master, and after this change, it generates 83 instructions.
However, this is better than reverting PR #943 entirely, which generates
103 instructions for the same function. This change allows us to
continue to benefit from *some* of the changes made in #943, although we
no longer can benefit from the most significant one.

Rather than trying to further optimize the macro expanded code now, I
think we should wait for the `ValueSet` rework that will land in
`tracing` 0.2, where we could potentially generate significantly less
code by virtue of constructing a `ValueSet` with a much simpler array
literal (no `FieldSet` iteration required).

Fixes #954 
Closes #986

Signed-off-by: Eliza Weisman <eliza@buoyant.io>