
tracing: enable always-on tracing by default #58897

Merged

Conversation

irfansharif
Contributor

@irfansharif irfansharif commented Jan 12, 2021

This is follow-up work from #58712, where we measured the overhead for
always-on tracing and found it to be minimal/acceptable. Let's switch
this on by default to shake out the implications of doing so. We can
reasonably expect two kinds of fallout:

  1. An unexpected blow-up in memory usage due to resource leakage (which
     can be a problem now that we're always maintaining open spans in an
     internal registry; see #58721: spans that are never explicitly
     Finish()-ed)

  2. Performance degradation due to per-request tracing overhead
     (something #58712 was spot-checking for)

For 1 we'll introduce a test in a separate, future PR. For 2, we'll
monitor roachperf over the next few weeks.


Also moved some of the documentation for the cluster setting into a
comment above it. Looking at what's rendered for our other cluster
settings (SHOW ALL CLUSTER SETTINGS), we default to a very pithy,
unwrapped description.
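
For illustration, the pattern looks roughly like this (the setting name below is made up, and I'm assuming the usual settings.RegisterBoolSetting helper; this is a sketch, not the actual setting touched in this PR):

package tracing

import "github.com/cockroachdb/cockroach/pkg/settings"

// backgroundTracingEnabled controls whether spans are created and registered
// for every operation. The longer discussion (memory implications, the
// overhead measured in #58712, and so on) lives here in the comment; the
// string below is what SHOW ALL CLUSTER SETTINGS renders, so it stays short
// and unwrapped.
var backgroundTracingEnabled = settings.RegisterBoolSetting(
	"trace.background.enable", // hypothetical key, for illustration only
	"if set, tracing data is collected for all operations",
	true,
)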

Release note: None

@irfansharif irfansharif requested review from knz and tbg January 12, 2021 23:59
@cockroach-teamcity
Member

This change is Reviewable

@irfansharif irfansharif force-pushed the 210112.enable-tracing-by-default branch 2 times, most recently from 01de71b to 4a5c52f Compare January 13, 2021 00:03
@tbg
Member

tbg commented Jan 13, 2021

Wow, so much red in CI! Curious what this is all about.

@irfansharif
Contributor Author

Probably the span leak we found in #58902. I'm not sure what to do about it yet.

@tbg
Member

tbg commented Jan 13, 2021

How can that be? We're not checking for span leaks in this PR.

@irfansharif
Contributor Author

Was a stab in the dark, but I was thinking it could be due to some memory blow up. I'll take a closer look in the morning.

@irfansharif
Contributor Author

Looks like it was something else, but I'm not really sure what to do about it. With always-on tracing, even spans with "nothing in them" generate recordings. Because it's always on, we'll at least generate the portion of the trace with the operation name:

0.000ms      0.000ms    === operation:table reader

With the legacy tracing mode, because we weren't explicitly recording, we used to return nil traces when requested:

if m == modeLegacy && s.recordingType() == RecordingOff {
	// In legacy tracing (pre always-on), we avoid allocations when the
	// Span is not actively recording.
	//
	// TODO(tbg): we could consider doing the same when background tracing
	// is on but the current span contains "nothing of interest".
	return nil
}

Now that this is not nil, we'll attach it to our processor's trailingMeta here:

if trace := pb.span.GetRecording(); trace != nil {
	pb.trailingMeta = append(pb.trailingMeta, execinfrapb.ProducerMetadata{TraceData: trace})
}

Most of the failing tests aren't expecting this:

metas = ignoreLeafTxnState(metas)
metas = ignoreMetricsMeta(metas)
if len(metas) != 0 {
	t.Errorf("unexpected metadata: %v", metas)
}

I'm not sure what to do here. At first I figured that, if it's just tests, we could ignore the metas with trace data, using something like:

// ignoreTraceData takes a slice of metadata and returns the entries
// excluding the ones with trace data.
func ignoreTraceData(metas []execinfrapb.ProducerMetadata) []execinfrapb.ProducerMetadata {
	res := make([]execinfrapb.ProducerMetadata, 0)
	for _, m := range metas {
		if m.TraceData == nil {
			res = append(res, m)
		}
	}
	return res
}
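
and then, in the affected tests, chain it with the existing filters:

metas = ignoreLeafTxnState(metas)
metas = ignoreMetricsMeta(metas)
metas = ignoreTraceData(metas) // newly silence the TraceData-only metas
if len(metas) != 0 {
	t.Errorf("unexpected metadata: %v", metas)
}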

But I'm not sure that's right, and it also doesn't help TestCSVImportCanBeResumed, which fails legitimately (the import process doesn't actually resume with always-on tracing). I think it's because we use trailingMeta in our control flow?

// trailingMeta always has priority; it seems like a good idea because it
// causes metadata to be sent quickly after it is produced (e.g. the error
// passed to MoveToDraining()).
if len(pb.trailingMeta) > 0 {
	return pb.popTrailingMeta()
}

I don't know what the right patch here would be. Should we not include the trace data in trailing meta if there are no remote or child spans (i.e. behave like before)? Should we change our handling of trailingMeta to account for these "spans with nothing in them"? Perhaps @asubiotto or @yuzefovich have ideas; I'm lost.
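
To make the first option concrete, it would amount to something like this rough sketch (made-up types, not the real tracing/execinfra APIs): only ship the recording as trailing metadata when it contains more than a bare, empty root span.

type recordedSpan struct {
	childSpanCount int
	logs           []string
}

// worthShipping reports whether a recording carries anything beyond the
// local root span with "nothing in it"; if not, we'd behave like legacy
// tracing and skip attaching it to trailingMeta.
func worthShipping(rec []recordedSpan) bool {
	if len(rec) > 1 {
		return true // remote or child spans made it into the recording
	}
	return len(rec) == 1 && (rec[0].childSpanCount > 0 || len(rec[0].logs) > 0)
}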

@yuzefovich
Member

yuzefovich commented Jan 16, 2021

I don't see that we need to do anything special at this moment (other than silencing, in tests, these TraceData metas as you suggested). TestCSVImportCanBeResumed fails legitimately because there is a bug in the implementation of the import processor; try this diff:

diff --git a/pkg/ccl/importccl/import_processor.go b/pkg/ccl/importccl/import_processor.go
index 9da93ace86..9fdb3ed043 100644
--- a/pkg/ccl/importccl/import_processor.go
+++ b/pkg/ccl/importccl/import_processor.go
@@ -135,16 +135,15 @@ func (idp *readImportDataProcessor) Next() (rowenc.EncDatumRow, *execinfrapb.Pro
        // Once the import is done, send back to the controller the serialized
        // summary of the import operation. For more info see roachpb.BulkOpSummary.
        countsBytes, err := protoutil.Marshal(idp.summary)
+       idp.MoveToDraining(err)
        if err != nil {
-               idp.MoveToDraining(err)
                return nil, idp.DrainHelper()
        }
 
-       idp.MoveToDraining(nil /* err */)
        return rowenc.EncDatumRow{
                rowenc.DatumToEncDatum(types.Bytes, tree.NewDBytes(tree.DBytes(countsBytes))),
                rowenc.DatumToEncDatum(types.Bytes, tree.NewDBytes(tree.DBytes([]byte{}))),
-       }, idp.DrainHelper()
+       }, nil
 }

The contract of the RowSource.Next method is that only one of row and meta is non-nil, and it was broken here.
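
To spell the contract out (a simplified sketch with made-up types, not the actual execinfra interfaces): each call to Next returns a row, or metadata, or neither (when the source is exhausted), but never both at once.

type row []int
type meta struct{ traceData []byte }

type rowSource interface {
	Next() (row, *meta)
}

func consumeAll(src rowSource) {
	for {
		r, m := src.Next()
		switch {
		case r != nil && m != nil:
			panic("contract violation: both row and meta are non-nil")
		case m != nil:
			_ = m // handle metadata (trace data, errors, ...)
		case r != nil:
			_ = r // handle the row
		default:
			return // both nil: the source is exhausted
		}
	}
}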

@irfansharif irfansharif requested review from a team and pbardea and removed request for a team and pbardea January 18, 2021 20:44
importccl: account for the RowSource return constraint

...in the import processor. The contract of `RowSource.Next` requires
that at most one of the return values will be non-empty. This wasn't the
case here, and caused opaque failures in cockroachdb#58897. cockroachdb#58897 tries to enable
background tracing by default, which for us means that the trailing meta
can be non-empty (they'll contain span recordings). That behaviour ends
up tickling this bug, tripping up TestCSVImportCanBeResumed.

Release note: None
tracing: enable always-on tracing by default

This is follow-up work from cockroachdb#58712, where we measured the overhead for
always-on tracing and found it to be minimal/acceptable. Let's switch
this on by default to shake out the implications of doing so. We can
reasonably expect two kinds of fallout:

(1) An unexpected blow-up in memory usage due to resource leakage (which
can be a problem now that we're always maintaining open spans in an
internal registry, see cockroachdb#58721)

(2) Performance degradation due to per-request tracing overhead
(something cockroachdb#58712 was spot-checking for)

For (1) we'll introduce a test in a separate, future PR. For (2), we'll
monitor roachperf over the next few weeks.

---

Also moved some of the documentation for the cluster setting into a
comment above it. Looking at what's rendered for our other cluster
settings (`SHOW ALL CLUSTER SETTINGS`), we default to a very pithy,
unwrapped description.

Release note: None
@irfansharif irfansharif force-pushed the 210112.enable-tracing-by-default branch from c76be2a to 17ea476 Compare January 18, 2021 22:22
@irfansharif
Contributor Author

irfansharif commented Jan 19, 2021

Thanks for the tip, sent it out as #59115. @tbg / @knz, mind rubber-stamping the second commit here?

craig bot pushed a commit that referenced this pull request Jan 19, 2021
58995: util/log: stop syncing writes excessively r=itsbilal a=knz

Requested by @bdarnell 

Fixes #58025 

Release note (cli change): Previously, for certain log files
CockroachDB would both flush individual writes (i.e. propagate them
from within the `cockroach` process to the OS) and also synchronize
writes (i.e. ask the OS to confirm the log data was written to
disk). The per-write synchronization was unnecessary and, in
fact, found to be possibly detrimental to performance and operating
cost, so it was removed. Meanwhile, the log data continues to be
flushed as before, and CockroachDB also periodically (every 30s)
requests synchronization, as before.

Release note (cli change): The parameter `sync-writes` for file sink
configurations has been removed. (This is not a backward-incompatible
change because the configuration feature is new in v21.1.)


Release note (cli change): The parameter `buffered-writes` for file
sink configurations has been added. It is set to `true` (writes are
buffered) by default, and to `false` (i.e. avoid buffering and
flush every log entry) when the `auditable` flag is requested.
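
To illustrate the flush/sync distinction in Go terms (a rough sketch, not CockroachDB's actual logging code): flushing hands buffered bytes from the process to the OS and still happens for every write, while syncing asks the OS to persist the file to disk and is now only requested periodically.

package logsketch

import (
	"bufio"
	"os"
	"time"
)

// flushEntry propagates a log entry from the process's buffer to the OS.
func flushEntry(w *bufio.Writer, entry string) error {
	if _, err := w.WriteString(entry + "\n"); err != nil {
		return err
	}
	return w.Flush()
}

// syncPeriodically replaces the old per-write sync: it asks the OS to write
// the file to stable storage every 30 seconds rather than on every entry.
func syncPeriodically(f *os.File, stop <-chan struct{}) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			_ = f.Sync()
		case <-stop:
			return
		}
	}
}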

59115: importccl: account for the RowSource return constraint r=irfansharif a=irfansharif

...in the import processor. The contract of `RowSource.Next` requires
that at most one of the return values will be non-empty. This wasn't the
case here, and caused opaque failures in #58897. #58897 tries to enable
background tracing by default, which for us means that the trailing meta
can be non-empty (they'll contain span recordings). That behaviour ends
up tickling this bug, tripping up TestCSVImportCanBeResumed.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
@irfansharif
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jan 19, 2021

Build failed:

@irfansharif
Contributor Author

curl: (22) The requested URL returned error: 504 GATEWAY_TIMEOUT

bors r+

@craig
Contributor

craig bot commented Jan 19, 2021

Build succeeded:

@craig craig bot merged commit debeb0b into cockroachdb:master Jan 19, 2021
@irfansharif irfansharif deleted the 210112.enable-tracing-by-default branch January 19, 2021 21:30
craig bot pushed a commit that referenced this pull request Jan 28, 2021
59132: sql: introduce sql.statement_stats.sample_rate to sample execution stats r=RaduBerinde,dhartunian a=asubiotto

Depends on #58897 
Depends on #59103 

This PR puts the "always-on" into always-on EXPLAIN ANALYZE. Take a look at separate commits for details. What actually goes on is that we're taking the slightly safer route of introducing a cluster setting which defines a sample rate for execution stats. These execution stats are collected and propagated using background tracing, which is cheaper than verbose tracing. This allows us to power the new DB Console statement stats views. Currently we still need user input in order to turn up this sample rate.

The sample rate is 0 by default in this PR for safety reasons. I'd like to discuss the default value of this cluster setting or whether we need it at all separately before the 21.1 release, but this gives us a nice escape hatch if for whatever reason stats collection results in poor performance.
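
For reference, the sampling decision itself boils down to something like this (illustrative sketch only, made-up function name, not the actual implementation):

package sampling

import "math/rand"

// shouldSampleExecStats decides, per statement, whether to collect execution
// stats given a sample rate in [0, 1]; with the default rate of 0 nothing is
// ever sampled.
func shouldSampleExecStats(rate float64, rng *rand.Rand) bool {
	return rate > 0 && rng.Float64() < rate
}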

Closes #54556 

59536: colexec,bazel: pin the `types` dependency in generated files r=irfansharif a=irfansharif

This is a workaround for bazel auto-generated code. goimports does not
automatically pick up the right packages when run within the bazel
sandbox, so we have to pin it by hand.

Release note: None

Co-authored-by: Alfonso Subiotto Marques <alfonso@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>