This document describes the logging guidelines for the control plane of NGINX Gateway Fabric (NGF).
Data plane logging is not covered here: it is owned by the NGINX developers, and NGF developers don't have control over it.
Before delving into the guidelines, we must first establish the high-level requirements for logging to ensure a clear understanding of what these guidelines intend to address:
We don't anticipate these high-level requirements to change for the duration of the project. If changes occur, we will need to update this document. However, we don't need to document enhancements based on these requirements, unless they directly impact the guidelines.
There are two distinct personas of NGF:
- User, who uses NGF. For logging, the user persona maps to the cluster operator from the Gateway API personas.
- Developer, who develops NGF.
Below are common user stories for both personas and also persona-specific stories.
As a user/developer, I want to:
- Have confidence the control plane works properly by looking at the logs.
- Control the amount/level of detail of logged information.
- Use logs to troubleshoot an issue with the control plane, either in real-time or by looking at the log history.
- In case of an error, see what event or Kubernetes resource caused it.
As a user, I want to:
- Export the logs into my logging infrastructure and parse them for further processing (like querying).
- Share the logs with developers to ask for help or report a bug.
As a developer, I want to:
- See logs in a human-readable (text) format as opposed to a machine-readable (JSON) format.
- Use logs to troubleshoot an issue collected from a user.
To make the logs easily parseable/processable by other programs, we use structured logging - i.e. logging data in a structured way (like JSON) as opposed to a non-structured way (like a long string from a `printf` invocation).
The code logs via the logr library, which abstracts away the logger implementation. This library covers our needs, is well adopted, and many logger implementations implement its interfaces.
As an implementation, we use zap, which:
- supports logr via https://github.com/go-logr/zapr
- supports both JSON and text format
- supports log levels
- demonstrates good performance
- is stable and well-adopted.
We initialize zap via controller-runtime framework helpers.
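For illustration, below is a minimal sketch of how a zap logger is adapted to the logr interface via zapr; in NGF this wiring is done for us by the controller-runtime helpers described later in this document, so the snippet is only meant to show the relationship between the libraries.

```go
package main

import (
	"github.com/go-logr/zapr"
	"go.uber.org/zap"
)

func main() {
	// Build a production zap logger (JSON output, info level by default).
	zapLogger, err := zap.NewProduction()
	if err != nil {
		panic(err)
	}
	defer zapLogger.Sync() //nolint:errcheck // flushing on exit is best-effort

	// Adapt it to the logr.Logger interface that the control plane code consumes.
	logger := zapr.NewLogger(zapLogger)
	logger.Info("Hello from zap through logr")
}
```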
The table below describes the log verbosity levels. A level includes all messages of the levels above.
Level | Contents | Target persona |
---|---|---|
error | errors | user |
info (default) | information about how the control plane operates | user |
debug | extra information about how the control plane operates to help troubleshoot an issue | developer |
The levels match the levels of the controller-runtime framework without any custom levels.
We log two kinds of messages (this is the extent of the logr API):
- Error, which describes a non-critical error in the control plane (for which it will not terminate) that requires the user's attention. Logging an error is a way of handling it. If you handle the error differently, log it as info, or don't log it at all. See the What not to Log section about handling critical errors.
- Info, which describes anything that's not an error.
  - Messages logged at the `info` level target the user persona.
  - Messages logged at the `debug` level target the developer persona, and we don't expect the user to fully understand them.
The message kind is not the same as the log level.
In addition to the actual message, the log message should include the context:
- Timestamp (handled by the logging library) - to make the user/developer aware when something occurs.
- Kubernetes resource (if any) - to correlate to the corresponding Kubernetes resource.
- Component name - to correlate with the component that logs the message (for example, `eventHandler`).
- Relevant extra context - additional context depending on the location of the log line in the code.
Additionally, specifically for the developer persona:
- Stacktrace - to pinpoint where in the code the message was logged.
The next section shows how to add context.
- A simple info message with a timestamp with the default `info` level:

  ```go
  logger.Info("Info message")
  ```

  ```json
  {"level":"info","ts":"2023-07-20T15:10:03-04:00","msg":"Info message"}
  ```
- An info message with the `debug` level:

  ```go
  logger.V(1).Info("Info message")
  ```

  ```json
  {"level":"debug","ts":"2023-07-20T15:10:03-04:00","msg":"Info message"}
  ```
- An error:

  ```go
  logger.Error(errors.New("test error"), "Error message")
  ```

  ```json
  {"level":"error","ts":"2023-07-20T15:10:03-04:00","msg":"Error message","error":"test error","stacktrace":"main.main\n\t/<REDACTED for the document>.go:73\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
  ```

  An error message includes the stack trace by default.
- A message with a Kubernetes resource:

  ```go
  // hr implements client.Object (controller-runtime)
  // hr is *v1.HTTPRoute
  logger.Info(
      "Processed resource",
      "resource", hr,
  )
  ```

  ```json
  {"level":"info","ts":"2023-07-20T15:10:03-04:00","msg":"Processed resource","resource":{"apiVersion":"gateway.networking.k8s.io/v1","kind":"HTTPRoute","namespace":"test","name":"hr-1"}}
  ```

  The resource must include `TypeMeta`, otherwise its `apiVersion` and `kind` will not be printed.
- A message with a component name:

  ```go
  logger = logger.WithName("component-a")
  logger.Info("Hello")
  ```

  ```json
  {"level":"info","ts":"2023-07-20T15:10:03-04:00","logger":"component-a","msg":"Hello"}
  ```
- A message with extra context:

  ```go
  logger.Info(
      "Info message",
      "status", "ok",
      "generation", 123,
  )
  ```

  ```json
  {"level":"info","ts":"2023-07-20T15:10:03-04:00","msg":"Info message","status":"ok","generation":123}
  ```
- An info message with a stack trace configured for the `info` level (done separately during the logger initialization):

  ```go
  logger.Info("Info message")
  ```

  ```json
  {"level":"info","ts":"2023-07-20T15:10:03-04:00","msg":"Info message","stacktrace":"main.main\n\t/<REDACTED for the document>.go:100\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
  ```
- Start a message with an uppercase letter. For example, `"Reloaded nginx"`.
- Do not put dynamic values in the message; add them as key-value pairs (extra context):

  DO NOT:

  ```go
  logger.Info(fmt.Sprintf("Got response with status %q", "ok"))
  ```

  DO:

  ```go
  logger.Info(
      "Got response",
      "status", "ok",
  )
  ```
- For readability, put each key-value pair on a separate line, as in the previous example.
- Use camelCase for key names.
- When dealing with a Kubernetes resource, use `resource` as a key:

  ```go
  logger.Info(
      "Processed resource",
      "resource", hr,
  )
  ```
- Don't put structs as values except for Kubernetes resources, to prevent messy and potentially unbounded output.
By default, log messages are formatted as JSON, so that external systems can easily parse them.
For the developer persona, log messages are formatted as text strings (except key-value pairs), which is easier for humans to read. For example:
2023-07-21T12:41:37.640-0400 INFO Processed resource {"resource": {"apiVersion": "gateway.networking.k8s.io/v1", "kind": "HTTPRoute", "namespace": "test", "name": "hr-1"}}
The formatting is controlled during the logger initialization.
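As a sketch of how that switch can look with the controller-runtime zap helpers: `UseDevMode(true)` selects the human-friendly console (text) encoder, while the default keeps JSON. The `newLogger` helper below is hypothetical; `UseDevMode` is a real controller-runtime option.

```go
package main

import (
	"github.com/go-logr/logr"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

// newLogger is a hypothetical helper: UseDevMode(true) switches to the
// human-friendly console (text) encoder, while false keeps the JSON encoder.
func newLogger(development bool) logr.Logger {
	return zap.New(zap.UseDevMode(development))
}

func main() {
	logger := newLogger(true)
	logger.Info("Hello in text format")
}
```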
Below are some examples of when to log a message:
- Lifecycle events
- The control plane starts or shuts down
- Lifecycle of the components.
- Handling a transaction/request
- The main transaction in the control plane is processing a batch of events related to Kubernetes resources taken from the main event loop.
- Additional transactions could be out of the event loop requests to auxiliary control plane endpoints (APIs).
- Interaction with external components. When the control plane interacts with an external component, log the stages of that interaction as well as the result of it.
For those cases:
- Log an error only if the code doesn't handle it (perform some action in response to it). If it is handled, but it is still important to record the fact of that error, log it as an info message.
- Consider including performance-related information, for example, the duration of an operation (see the sketch after this list).
- Log extra details (or steps) to facilitate troubleshooting under the `debug` level.

It is important to not overload the `info` level with too many details and messages.
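Below is a minimal sketch of the duration suggestion above; the component name, messages, and key names are illustrative, not prescribed by these guidelines.

```go
package main

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	logger := zap.New().WithName("eventHandler")

	start := time.Now()

	// ... process a batch of events from the main event loop ...

	// Step-by-step details belong at the debug level.
	logger.V(1).Info("Regenerated nginx configuration")

	// The result of the transaction, including its duration, fits the info level.
	logger.Info(
		"Processed batch",
		"duration", time.Since(start),
	)
}
```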
- Sensitive information, including the contents of the secrets, passwords, etc.
- Returned errors. If a function returns an error, do not log it inside that function, because it will be logged or handled up the stack (see the sketch after this list).
- Critical errors, which cause the control plane to terminate:
  - For non-recoverable or programming errors, we panic.
  - An error that is not handled will propagate to the `main` package, where the control plane will print it and immediately terminate.
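A sketch of the returned-errors guideline referenced above (the function names are hypothetical): the function that fails returns the error without logging it, and only the caller that handles it logs.

```go
package main

import (
	"errors"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

// runReloadCommand is a hypothetical stand-in for the real operation that can fail.
func runReloadCommand() error {
	return errors.New("nginx is not running")
}

// reloadNginx returns the error instead of logging it; it will be logged
// (or otherwise handled) up the stack.
func reloadNginx() error {
	if err := runReloadCommand(); err != nil {
		return fmt.Errorf("failed to reload nginx: %w", err)
	}
	return nil
}

func main() {
	logger := zap.New().WithName("eventHandler")

	// The caller handles the error, so the caller is the one that logs it.
	if err := reloadNginx(); err != nil {
		logger.Error(err, "Failed to update data plane configuration")
	}
}
```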
Logging too frequently can lead to poor performance, because of:
- Too many I/O operations.
- Too many allocations related to preparing the log message.
If we diligently follow the guidelines from this document, the `error` and `info` log levels will not produce a high volume of messages (like many thousands per second). At the same time, such a case might still be possible, for example, in the case of a bug.

To prevent that, the default settings in the controller-runtime helpers for zap configure sampling. The sampling is based on the message (the `msg` parameter) of the log-related methods. Read more about sampling here.
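To illustrate the idea, here is a sketch of message-based sampling using zap directly; the tick and counter values are illustrative, not the values configured by the controller-runtime defaults.

```go
package main

import (
	"os"
	"time"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func main() {
	encoder := zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig())
	core := zapcore.NewCore(encoder, zapcore.AddSync(os.Stdout), zapcore.InfoLevel)

	// Within each one-second tick, keep the first 100 entries with a given
	// message and then only every 100th entry with that same message.
	sampled := zapcore.NewSamplerWithOptions(core, time.Second, 100, 100)

	logger := zap.New(sampled)
	for i := 0; i < 10000; i++ {
		logger.Info("Repeated message") // most of these are dropped by the sampler
	}
}
```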
The logger should be initialized in the `main` package and passed to the components as `logr.Logger`. In the `main` package we set the parameters of the logger, such as the level, stack trace level, and format. The control plane components should not change them.
For asynchronous components, inject the logger during the component creation, setting the name of the logger to match the name of the component:

```go
asyncComp := newComponent(logger.WithName("async-component"))
```
For synchronous components (meaning another component will call it synchronously), inject the logger during the method call, adding extra context when necessary:
```go
l := logger.WithName("sync-component").WithValues("operationID", 1)
syncComp.Process(l)
```
The Reconciler gets its logger from the controller-runtime. We use it (as opposed to injecting ours) because the runtime adds context to that logger with the group, kind, namespace, and name of the resource, and a few more key-value pairs. Note that the runtime creates that logger from the one we inject into it during the runtime initialization (see External Libraries). Also note that this logger is compatible with our logging guidelines.
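For illustration, a Reconciler typically retrieves that logger from the request context via `log.FromContext`; the reconciler type below is a hypothetical sketch.

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

type httpRouteReconciler struct{}

func (r *httpRouteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The runtime-provided logger already carries the group, kind, namespace,
	// and name of the resource, plus a few more key-value pairs.
	logger := log.FromContext(ctx)

	logger.Info("Reconciling resource")

	return ctrl.Result{}, nil
}
```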
In unit tests, we don't test what we log.
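One way to honor this in unit tests (an illustrative pattern, not a requirement of these guidelines) is to pass a no-op logger, such as `logr.Discard()`, to the component under test; the component type here is a hypothetical stand-in.

```go
package component

import (
	"testing"

	"github.com/go-logr/logr"
)

// component is a hypothetical stand-in for an NGF component that logs.
type component struct {
	logger logr.Logger
}

func newComponent(logger logr.Logger) *component {
	return &component{logger: logger}
}

func (c *component) Process() {
	c.logger.Info("Processing")
}

func TestProcess(t *testing.T) {
	// logr.Discard() drops all messages, so the test exercises behavior
	// without asserting on the log output.
	comp := newComponent(logr.Discard())
	comp.Process()
}
```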
To initialize a logger, use `zap.New()` from the controller-runtime helpers.
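A minimal sketch of that initialization in the `main` package, assuming the standard controller-runtime zap flag options; the exact options NGF sets may differ.

```go
package main

import (
	"flag"

	"go.uber.org/zap/zapcore"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	opts := zap.Options{
		// The level, stack trace level, and format (Development selects the
		// text encoder) are set here, in the main package, not by components.
		Level:           zapcore.InfoLevel,
		StacktraceLevel: zapcore.ErrorLevel,
		Development:     false,
	}
	// Optionally expose the settings as command-line flags (--zap-log-level, etc.).
	opts.BindFlags(flag.CommandLine)
	flag.Parse()

	logger := zap.New(zap.UseFlagOptions(&opts))
	// Register the logger with controller-runtime so that the framework uses it too.
	ctrl.SetLogger(logger)

	logger.Info("Control plane starting")
}
```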
There are two critical libraries for NGF that log:
- controller-runtime.
  - The log levels are compatible with our project.
  - We inject the logger into the library as `logr.Logger` from logr.
  - See also the logging guidelines of that project.
- client-go.
When adding a new library, evaluate whether it logs and how. If possible, configure the library's logging to be compatible with NGF logging. If not, document the logs in the user documentation for control plane logging, so that users are prepared for them.
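For example (a sketch, not necessarily how NGF wires it), client-go logs through klog, which can be pointed at our logr-based logger so that its output matches the rest of the control plane logs.

```go
package main

import (
	"k8s.io/klog/v2"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	logger := zap.New()

	// Route klog (and therefore client-go) output through our logger so that
	// library logs share the same format and destination as NGF logs.
	klog.SetLogger(logger.WithName("client-go"))
}
```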
As NGF evolves, we might change the logging. For example:
- Change the meaning of levels.
- Change the default key-value pairs and their representation.
Such changes should be considered a breaking change and handled according to our release process because they will require users to update their log processing pipelines.
At the same time, changes to individual log messages are not breaking changes.