log/logconfig: change exit-on-error to default to true for stderr #62763
Conversation
This change unearths runtime out-of-memory errors. Before this change, a runtime-detected failure to allocate memory would not make it to the logs. This was tested by injecting a rapid memory leak into the process and determining first that the stack traces were not present in the logs, and then, with this change, that they were present in the stderr log.

I'm omitting a release note primarily because I do not know how to frame one.

Relates to debugging cockroachdb#62320.

Release note: None
cc @tbg
This is ... confusing. The title of the commit contradicts the change. Also, there is no test. How do we know that this change is indeed what helps the particular case you mention in the commit description? Without understanding the particular mechanism, I'd say it's possible there is an underlying common cause. How about detailing the commit message with your observations? Before/after and what was the problem being solved. This also vaguely seems related to #62768, which could suggest that some sinks do not get adequately flushed, which would explain further. Also, a few existing tests would need to be updated accordingly.
Ack regarding the tests. Thought I'd post something given I had something that changed behavior; I should have left it as a draft. Happy to update the tests. If you'd like to address this in a different way, let me know. I thought about posting the different log files but it didn't seem that interesting. I still have them and will attach in a minute.
The following diff was applied to cockroach as a leak:

--- a/pkg/server/server_sql.go
+++ b/pkg/server/server_sql.go
@@ -13,6 +13,7 @@ package server
import (
"context"
"math"
+ "math/rand"
"net"
"os"
"path/filepath"
@@ -73,6 +74,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/util/metric"
"github.com/cockroachdb/cockroach/pkg/util/mon"
"github.com/cockroachdb/cockroach/pkg/util/netutil"
+ "github.com/cockroachdb/cockroach/pkg/util/randutil"
"github.com/cockroachdb/cockroach/pkg/util/stop"
"github.com/cockroachdb/cockroach/pkg/util/syncutil"
"github.com/cockroachdb/cockroach/pkg/util/timeutil"
@@ -887,6 +889,15 @@ func (s *SQLServer) preStart(
scheduledjobs.ProdJobSchedulerEnv,
)
+ s.stopper.RunAsyncTask(ctx, "leak-memory", func(ctx context.Context) {
+ const blockSize = 1 << 20
+ var leaked []interface{}
+ r := rand.New(rand.NewSource(1))
+ for {
+ data := randutil.RandBytes(r, blockSize)
+ leaked = append(leaked, data)
+ }
+ })
return nil
}

Before:
After: cockroach-stderr.ajwerner-test-0001.ubuntu.2021-03-29T22_16_53Z.012005.log
It must be that when … (cockroach/pkg/util/log/clog.go, lines 383 to 394 in 8451341) … and in particular we don't call …

I assume we end up in cockroach/pkg/util/log/flags.go, line 201 in a3d35c8, but that just means …
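To make the suspected failure mode concrete, here is a minimal sketch in Go, under stated assumptions: exitOnErrorSink and failingWriter are hypothetical names, not CockroachDB's actual types, and the failing write merely simulates an I/O error under memory pressure. The point is that a failed log write calls os.Exit, which can preempt the Go runtime's own out-of-memory report on fd 2.

// Hypothetical sketch of the suspected mechanism; not CockroachDB's code.
package main

import (
	"fmt"
	"io"
	"os"
)

// exitOnErrorSink treats any write failure as fatal for the process,
// mimicking a log sink configured with exit-on-error: true.
type exitOnErrorSink struct {
	w io.Writer
}

func (s exitOnErrorSink) Write(p []byte) (int, error) {
	n, err := s.w.Write(p)
	if err != nil {
		// Under memory pressure this path can fire while the runtime is
		// still trying to emit its own fatal error, and the process dies
		// before that report reaches fd 2.
		fmt.Fprintf(os.Stderr, "log: write error: %v; exiting\n", err)
		os.Exit(2)
	}
	return n, nil
}

// failingWriter simulates a sink whose writes fail under memory pressure.
type failingWriter struct{}

func (failingWriter) Write(p []byte) (int, error) {
	return 0, fmt.Errorf("simulated write failure")
}

func main() {
	s := exitOnErrorSink{w: failingWriter{}}
	fmt.Fprintln(s, "this log write fails and takes the process down")
}

If this is roughly what happens, the runtime's trace simply loses the race with os.Exit, which would match the before/after logs attached above.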
I looked at the attached files: none of the logging stuff is responsible for whether the Go runtime errors get written to this file or not. That's what the "fd2 capture" mechanism is about: as soon as … The only explanation I see is the following: …

Conversely, when "exit on err" is false, the logging error at step 2 above does not cause the process to call `exit()`. This is why in this case we should also have a look at the other log files generated to see what's in there. Also, I am thankful for the repro because it will help us study this situation further.
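For readers unfamiliar with the fd2 capture mechanism, here is a minimal sketch of the idea in Go, assuming a platform where syscall.Dup2 exists (e.g. Linux/amd64; other platforms need Dup3 or an equivalent); the file name and function name are illustrative, not CockroachDB's actual implementation.

// Illustrative fd2-capture sketch; not CockroachDB's actual code.
package main

import (
	"os"
	"syscall"
)

// captureFd2 duplicates the log file's descriptor onto file descriptor 2.
// Afterwards, anything written to fd 2 -- including the Go runtime's
// fatal errors and panic stack traces -- lands in the log file.
func captureFd2(logFile *os.File) error {
	return syscall.Dup2(int(logFile.Fd()), 2)
}

func main() {
	f, err := os.Create("cockroach-stderr.log") // illustrative name
	if err != nil {
		panic(err)
	}
	if err := captureFd2(f); err != nil {
		panic(err)
	}
	// From here on, runtime crash output goes to cockroach-stderr.log.
	panic("this stack trace ends up in the log file")
}

The dup is needed because the runtime writes fatal errors with the raw write syscall on fd 2; reassigning os.Stderr alone would not redirect them.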
In the attached zip, find the logs dir from the node that produced these files, as copied using scp (thus the symlinks are now files).
@ajwerner thanks for that zip file. That's an instance failure where the Go runtime error did make its way to the stderr log.
Both are in that zip. Notice the two distinct timestamps. The ones around 21_37_22 are with a build with the default setting.
oh thanks.
Poking at this a bit since every other day we're looking at opaquely dead nodes in roachtest.

diff --git a/pkg/cli/start.go b/pkg/cli/start.go
index 80718edb50..273f736d8a 100644
--- a/pkg/cli/start.go
+++ b/pkg/cli/start.go
@@ -688,6 +688,11 @@ If problems persist, please see %s.`
fmt.Print(msgS.StripMarkers())
}
+ stopper.RunAsyncTask(ctx, "leak-memory", func(ctx context.Context) {
+ const blockSize = 50000 * 1 << 20 // 50gb
+ leaked := make([]byte, blockSize)
+ _ = leaked
+ })
return nil
}(); err != nil {
errChan <- err
This is intentionally set up to trigger a Go runtime out-of-memory error.

Perhaps unsurprisingly, it does make it to the stderr log.
It has to be something like Raphael suggested: that we are calling `exit()` before the runtime error makes it out.
Hmm, I tried this a few times and it always worked (with Andrew's diff above, pretty much). I probably messed something up; going to drop this for now.
@tbg do you mean that you get the data in the log file with Andrew's PR (this PR)? Or that you get it without the PR? Do we need to change the criticality of stderr at all?
I was testing without Andrew's PR (but with the intentional leak), i.e. I wanted to reproduce the original issue. I'm pretty sure nothing got fixed in the meantime, but it probably depends on the specifics of the experiment. I imagine Andrew was running load against the cluster; I was not.
63472: roachprod: use exit-on-error:false for crdb log cfg r=ajwerner a=tbg

See #62763. We seem to frequently miss the runtime errors resulting from out-of-memory conditions in the stderr logs. We don't understand exactly why yet, but it is very likely that with `exit-on-error` (which is true by default) we are hitting errors outputting to the sink, which then kill the process before the runtime errors bubble up. While we develop a proper fix, avoid the problematic configuration on roachprod clusters, which notably includes roachtests.

v20.2 did not have the `--log` flag yet, so we only do this when starting v21.1 (i.e. `master` at time of writing).

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
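For concreteness, the workaround amounts to a logging configuration along these lines, assuming v21.1's YAML syntax for the --log flag (treat this as an illustrative sketch rather than the exact roachprod invocation):

# Disable exit-on-error for the stderr sink so that a failed stderr write
# cannot terminate the process before the Go runtime's own out-of-memory
# report is emitted.
sinks:
  stderr:
    exit-on-error: false

Passed inline, this would look something like cockroach start --log="sinks: {stderr: {exit-on-error: false}}".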
Closing in favor of #63472.
That just works around the issue and I'm not even sure it works all of the time. I filed an issue so we don't lose track: #63910