log/logconfig: change exit-on-error to default to true for stderr #62763

Conversation

ajwerner
Contributor

This change unearths runtime out-of-memory errors. Before this change, a
runtime-detected failure to allocate memory would not make it to the logs.

This was tested by injecting a rapid memory leak into the process and
determining first that the stack traces were not present in the logs, and
then, with this change, that they were present in the stderr log.

I'm omitting a release note primarily because I do not know how to frame one.

Relates to debugging #62320.

Release note: None

@ajwerner ajwerner requested a review from knz March 29, 2021 22:28
@cockroach-teamcity
Member

This change is Reviewable

@ajwerner
Contributor Author

cc @tbg

@knz
Contributor

knz commented Mar 30, 2021

This is ... confusing. The title of the commit contradicts the change.

Also there is no test. How do we know that this change is indeed what helps the particular case you mention in the commit description? Without understanding the particular mechanism, I'd say it's possible there is an underlying common cause.

How about expanding the commit message with your observations? Before/after, and what problem was being solved.

This also vaguely seems related to #62768, which could suggest that some sinks do not get adequately flushed; that would explain more.

Also a few existing tests would need to be updated accordingly.

@ajwerner
Contributor Author

Ack regarding the tests. I thought I'd post something since I had a change that altered the behavior; I should have left it as a draft. Happy to update the tests. If you'd like to address this in a different way, let me know. I thought about posting the different log files, but it didn't seem that interesting. I still have them and will attach them in a minute.

@ajwerner ajwerner marked this pull request as draft March 30, 2021 02:22
@ajwerner
Contributor Author

The following diff was applied to cockroach as a leak:

--- a/pkg/server/server_sql.go
+++ b/pkg/server/server_sql.go
@@ -13,6 +13,7 @@ package server
 import (
        "context"
        "math"
+       "math/rand"
        "net"
        "os"
        "path/filepath"
@@ -73,6 +74,7 @@ import (
        "github.com/cockroachdb/cockroach/pkg/util/metric"
        "github.com/cockroachdb/cockroach/pkg/util/mon"
        "github.com/cockroachdb/cockroach/pkg/util/netutil"
+       "github.com/cockroachdb/cockroach/pkg/util/randutil"
        "github.com/cockroachdb/cockroach/pkg/util/stop"
        "github.com/cockroachdb/cockroach/pkg/util/syncutil"
        "github.com/cockroachdb/cockroach/pkg/util/timeutil"
@@ -887,6 +889,15 @@ func (s *SQLServer) preStart(
                scheduledjobs.ProdJobSchedulerEnv,
        )
 
+       s.stopper.RunAsyncTask(ctx, "leak-memory", func(ctx context.Context) {
+               const blockSize = 1 << 20
+               var leaked []interface{}
+               r := rand.New(rand.NewSource(1))
+               for {
+                       data := randutil.RandBytes(r, blockSize)
+                       leaked = append(leaked, data)
+               }
+       })
        return nil
 }

Before:
cockroach-stderr.ajwerner-test-0001.ubuntu.2021-03-29T21_37_22Z.011204.log

After: cockroach-stderr.ajwerner-test-0001.ubuntu.2021-03-29T22_16_53Z.012005.log

@tbg
Member

tbg commented Mar 30, 2021

It must be that when exit-on-error is true, we hit the bottom branch here:

if !s.criticality {
	// An error on this sink is not critical. Just report
	// the error and move on.
	l.reportErrorEverywhereLocked(context.Background(), err)
} else {
	// This error is critical. We'll have to terminate the
	// process below.
	if outputErr == nil {
		outputErrExitCode = s.sink.exitCode()
	}
	outputErr = errors.CombineErrors(outputErr, err)
}

and in particular we don't call reportErrorEverywhereLocked, which would shout the message and do a best-effort sync.

I assume we end up in outputLogEntry because... I don't know, something tries to log as a result of the panic? We do take over stderr here:

if err := fileSink.takeOverInternalStderr(secLogger); err != nil {

but that just means os.Stderr points at a file; this wouldn't invoke the above code. I'm also confused: why would writes to that file get lost if the program panics? I always assumed that when you write to a file and the program crashes (but not the OS), the writes still appear. Is that not true?
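
For context, here is a minimal, self-contained sketch of the exit-on-error ("criticality") semantics that the branch quoted above implements; it is illustrative only, not the actual CockroachDB logging code. A write error on a critical sink terminates the process, while an error on a non-critical sink is just reported and logging carries on.

package main

import (
	"errors"
	"fmt"
	"os"
)

type sink struct {
	name        string
	criticality bool // corresponds to exit-on-error in the log config
	write       func([]byte) error
}

func output(sinks []sink, msg []byte) {
	for _, s := range sinks {
		err := s.write(msg)
		if err == nil {
			continue
		}
		if !s.criticality {
			// Non-critical: report the error and move on.
			fmt.Fprintf(os.Stderr, "log: error on sink %s: %v\n", s.name, err)
			continue
		}
		// Critical: exit right here. If this happens while the process is
		// already running out of memory, it can die before the runtime's
		// own "out of memory" report ever reaches the stderr log.
		fmt.Fprintf(os.Stderr, "log: fatal error on sink %s: %v\n", s.name, err)
		os.Exit(1)
	}
}

func main() {
	failing := func([]byte) error { return errors.New("disk full") }
	output([]sink{{name: "file", criticality: true, write: failing}}, []byte("hello"))
	fmt.Println("not reached: the critical sink error already exited the process")
}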

@knz
Contributor

knz commented Mar 30, 2021

I looked at the attached files: none of the logging code is responsible for whether the Go runtime errors get written to this file or not.

That's what the "fd2 capture" mechanism is about: as soon as cockroach-stderr.log is opened, its file descriptor is passed to the Go runtime. The Go runtime writes to it without buffering, and yes, that guarantees the data gets there if the process crashes.

The only explanation I see is the following:

  1. there is some other logging output going on around the time of the event (Andrew, we'd like to see the other log files around that time).
  2. one of the logging events encounters a write error - either a soft out-of-resources error or something else.
  3. because "criticality" (exit-on-err) is true (the current default), the logging code calls os.Exit upon seeing the logging error. This ensures that the process terminates before the Go runtime error is encountered.
  4. as a result, the cockroach-stderr.log file appears empty.

Conversely, when "exit on err" is false, the logging error at step 2 above does not cause the process to call os.Exit, and the process then trips up a bit later with the actual Go runtime error.

This is why in this case we should also have a look at the other log files generated to see what's in there. Also I am thankful for the repro because it will help us study this situation further.
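
For context, a minimal sketch of the "fd2 capture" idea described above; this is not the actual CockroachDB implementation, and it assumes a Unix platform where syscall.Dup2 is available (e.g. linux/amd64). Once the stderr log file is open, its descriptor is duplicated over file descriptor 2, so the Go runtime's crash output (written straight to fd 2, unbuffered) lands in the file even if the process dies immediately afterwards.

package main

import (
	"os"
	"syscall"
)

func captureStderrInto(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		return nil, err
	}
	// Point fd 2 at the log file. Runtime panics and fatal errors go
	// through fd 2 without buffering, so they survive a crash.
	if err := syscall.Dup2(int(f.Fd()), 2); err != nil {
		f.Close()
		return nil, err
	}
	os.Stderr = f
	return f, nil
}

func main() {
	if _, err := captureStderrInto("cockroach-stderr.log"); err != nil {
		panic(err)
	}
	// From here on, runtime crash output (e.g. "fatal error: runtime:
	// out of memory") goes to cockroach-stderr.log rather than the terminal.
	panic("boom")
}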

@ajwerner
Contributor Author

In the attached zip find the logs dir from the node that produced these files as copied using scp (thus the symlinks are now files).

logs.zip

@knz
Contributor

knz commented Mar 30, 2021

@ajwerner thanks for that zip file. That's an instance of the failure where the Go runtime error did make its way to cockroach-stderr.log. Do you also have an instance of when it was missing?

@ajwerner
Contributor Author

Both are in that zip. Notice the two distinct timestamps. The ones around 21_37_22 are from a build with the default setting.

@knz
Contributor

knz commented Mar 30, 2021

oh thanks.

@ajwerner ajwerner changed the title log/logconfig: change exit-on-error to default to true for stderr log/logconfig: change exit-on-error to default to false for stderr Apr 5, 2021
@tbg tbg changed the title log/logconfig: change exit-on-error to default to false for stderr log/logconfig: change exit-on-error to default to true for stderr Apr 12, 2021
@tbg
Member

tbg commented Apr 12, 2021

Poking at this a bit since every other day we're looking at opaquely dead nodes in roachtest.

diff --git a/pkg/cli/start.go b/pkg/cli/start.go
index 80718edb50..273f736d8a 100644
--- a/pkg/cli/start.go
+++ b/pkg/cli/start.go
@@ -688,6 +688,11 @@ If problems persist, please see %s.`
 				fmt.Print(msgS.StripMarkers())
 			}
 
+			stopper.RunAsyncTask(ctx, "leak-memory", func(ctx context.Context) {
+				const blockSize = 50000 * 1 << 20 // 50gb
+				leaked := make([]byte, blockSize)
+				_ = leaked
+			})
 			return nil
 		}(); err != nil {
 			errChan <- err

This is intentionally set up to trigger panic("out of memory") from the runtime locally when invoked with

rm -rf cockroach-data ; make buildshort && ./cockroach start-single-node --insecure

Perhaps unsurprisingly, it does make it to cockroach-stderr.log just fine.

@tbg
Member

tbg commented Apr 12, 2021

It has to be something like Raphael suggested: we are calling exitLocked "prematurely", but it's hard to figure out the details. You can't recover from panic(out of memory) because it's on the system stack, so how are writes to a sink failing in a way that is recoverable? Either way, we should be able to verify that this is the failure mode, because the exit status of CRDB will tell us something.
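
To make the non-recoverability point concrete, here is a tiny standalone illustration of Go runtime behavior (not CockroachDB code): an ordinary panic can be caught with recover(), but a runtime out-of-memory abort is a fatal throw that bypasses deferred recover() entirely.

package main

import "fmt"

func ordinaryPanic() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // ordinary panics are recoverable
		}
	}()
	panic("ordinary panic")
}

func main() {
	ordinaryPanic()
	// A runtime out-of-memory condition is different: the runtime prints
	// "fatal error: runtime: out of memory" directly to fd 2 and aborts the
	// process; deferred recover() never runs, which is why the stderr log
	// file is the only place that output can land.
	fmt.Println("done")
}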

@tbg
Member

tbg commented Apr 12, 2021

# on roachprod vm
$ ./cockroach start-single-node --insecure --store=path=/mnt/data1/cockroach --log-dir=${HOME}/logs --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce,region=us-east1,zone=us-east1-b --advertise-host=$(hostname -I | awk '{print $1}')
[...]
[waiting]
ubuntu@tobias-oom-0001:~$ echo $?
2
$ grep 'out of memory' logs/cockroach-stderr.log 
fatal error: runtime: out of memory

Hmm, I tried this a few times and it always worked (with pretty much Andrew's diff above). I probably messed something up; going to drop this for now.

@knz
Contributor

knz commented Apr 12, 2021

@tbg do you mean that you get the data in the log file with Andrew's PR (this PR)? Or that you get it without the PR? Do we need to change the criticality of stderr at all?

@tbg
Member

tbg commented Apr 12, 2021

I was testing without Andrew's PR (but with the intentional leak), i.e. I wanted to reproduce the original issue. I'm pretty sure nothing got fixed in the meantime, but it probably depends on the intrinsics of the experiment. I imagine Andrew was running load against the cluster; I was not.

tbg added a commit to tbg/cockroach that referenced this pull request Apr 12, 2021
See cockroachdb#62763.

We seem to frequently miss the runtime errors resulting from
out-of-memory conditions in the stderr logs. We don't understand
exactly why yet, but it is very likely that with `exit-on-error`
(which is true by default) we are hitting errors outputting to
the sink which then kill the process before the runtime errors
bubble up.

While we develop a proper fix, avoid the problematic configuration
on roachprod clusters, which notably includes roachtests.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Apr 14, 2021
See cockroachdb#62763.

We seem to frequently miss the runtime errors resulting from
out-of-memory conditions in the stderr logs. We don't understand
exactly why yet, but it is very likely that with `exit-on-error`
(which is true by default) we are hitting errors outputting to
the sink which then kill the process before the runtime errors
bubble up.

While we develop a proper fix, avoid the problematic configuration
on roachprod clusters, which notably includes roachtests.

v20.2 did not have the `--log` flag yet, so we only do this when
starting v21.1 (i.e. `master` at time of writing).

Release note: None
craig bot pushed a commit that referenced this pull request Apr 14, 2021
63472: roachprod: use exit-on-error:false for crdb log cfg r=ajwerner a=tbg

See #62763.

We seem to frequently miss the runtime errors resulting from
out-of-memory conditions in the stderr logs. We don't understand
exactly why yet, but it is very likely that with `exit-on-error`
(which is true by default) we are hitting errors outputting to
the sink which then kill the process before the runtime errors
bubble up.

While we develop a proper fix, avoid the problematic configuration
on roachprod clusters, which notably includes roachtests.

Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@ajwerner
Contributor Author

Closing in favor of #63472.

@tbg
Member

tbg commented Apr 20, 2021 via email
