-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New command: consul debug #4754
Conversation
ff43f60
to
fffe8e7
Compare
0ae460e
to
66f541e
Compare
I think if IPs are potentially sensitive (do we have existing customers who are reluctant to give us logs with IPs in?), we should (optionally) anonymise them instead - just take the first 8 bytes of sha256(IP) in hex or something instead so they are unique identifiers w.h.p but don't leak network information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks super awesome 👍
I'll answer some more of your questions following in the other tab...
api/agent.go
Outdated
@@ -268,6 +268,23 @@ func (a *Agent) Self() (map[string]map[string]interface{}, error) { | |||
return out, nil | |||
} | |||
|
|||
// Host is used to retrieve information about the host the | |||
// agent is running on such as CPU, memory, and disk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe note the ACL requirement?
command/debug/debug.go
Outdated
c.flags.Var((*flags.AppendSliceValue)(&c.capture), "capture", | ||
"One or more types of information to capture. This can be used "+ | ||
"to capture a subset of information, and defaults to capturing "+ | ||
"everything available. Possible information for capture: "+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this missing the possible values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes thanks, fixed!
command/debug/debug.go
Outdated
|
||
// Write profiles to disk | ||
for output, v := range pprofOutputs { | ||
err = ioutil.WriteFile(fmt.Sprintf("%s/%s.prof", timestampDir, output), v, 0755) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 644 seems better for a file that's not an executable. (755 for dirs since they need to be executable to cd into them).
command/debug/debug.go
Outdated
written to the relative directory. | ||
|
||
If ACLs are enabled, an agent token must be supplied in order to perform | ||
this operation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest we make it a token with operator:read
. It's not clear in any case what an "agent token" is as there are a few different token configs related to agents.
command/debug/debug.go
Outdated
$ consul debug | ||
|
||
Flags can be used to customize the duration and interval of the | ||
operation. Note that the duration must be longer than the interval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you mention what duration and interval mean here maybe?
command/debug/debug.go
Outdated
$ consul debug -interval=20s -duration=1m | ||
|
||
By default, the archive containing the debugging information is | ||
saved to the relative directory as a .tar.gz file. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I's s/relative/current/ if I understand correctly?
command/debug/debug.go
Outdated
$ consul debug -output=/foo/bar/my-debugging -archive=false | ||
|
||
Note: Information collected by this command has the potential | ||
to be highly sensitive. We strongly recommend review of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we qualify that a bit more? If I read this I'd worry it means we are dumping ACL tokens and root keys and panic a bit about even trying it to see. I think it's great we are super conscious of the sensitivity of telemetry and network info but maybe we can give a better hint here to avoid unnecessary alarm!
I think this is OK as it is. So far I'm not aware of customers who have specifically redacted IPs and similar from logs provided (correct me if I'm wrong) they are transmitted securely and treated as confidential under contract etc. I think if it comes up an option to anonymise IP addresses and/or hostnames with hashing makes sense but I don't think it's a show stopper here.
I think you split out reusable chunks into packages already. The bulk of the code is the command logic and I'm not sure it does make sense to have that elsewhere. If the main go file in unweildy it might be reasonable to break out the actual execution of the tests into a separate file but I don't think this is too bad.
Honestly it felt a little strange on first reading but mostly because we typically access those directly via On that note, I'd love to add
Yeah I think it's reasonable as it is.
I mentioned inline but I think this should be |
01296d8
to
1bc0d0f
Compare
My preference would be to collect as much data as possible - and then be filter it manually. In some cases I would submit the data securely to the Consul team by email or other off-github issue |
@jippi That will be the default behavior with this tool -- if you wanted to look through it you might run @banks Thanks for the review and all of your questions/suggestions, I believe I addressed all of it other than the agent address question. |
command/debug/debug.go
Outdated
|
||
Flags can be used to customize the duration and interval of the | ||
operation. Duration is the total time to capture data for from the target | ||
agent, whereas how often to capture dyanmic data for the length of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: dyanmic. I'd consider rephrasing that to be two simple sentences:
Duration is the total time to capture data for from the target agent
Interval controls how often to dynamic data such as metrics are scraped
Also, I totally missed this distinction on first reading.
Would it make more sense to make duration
optional and just keep capturing until the operator kills the process? Or maybe at least default to 24 hours or something in case they actually forget and leave it running? I know if I was trying to repro a bug I wouldn't want to guess how long it might take and would rather just leave this running until I'm done? If we handled SIGINT and stopped the collection and wrote out final file that seems nicer than making them pick a duration up front.
Also does that mean that Ctrl-C on a debug process that's been running for an hour currently just looses all the output? Maybe not I didn't read closely enough but if some of this is considered already maybe mentioning that right here would be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense to make duration optional and just keep capturing until the operator kills the process? Or maybe at least default to 24 hours or something in case they actually forget and leave it running?
Totally possible. I think in busy clusters you risk disk usage with too long of a default. I do think the current default is maybe far too low though, so perhaps I should bump it to something like 10m?
Also does that mean that Ctrl-C on a debug process that's been running for an hour currently just looses all the output?
It is graceful right now, clarified in the usage section...but if you ctrl-c it will archive (if configured to, default yes) the output before exiting at the current or configured path. It is all written throughout to that same path per interval (and initially on the static stuff). So even if you killed it abruptly or the machine lost power etc. it would all exist on disk in unarchived form.
@@ -98,6 +98,7 @@ type TestServerConfig struct { | |||
VerifyOutgoing bool `json:"verify_outgoing,omitempty"` | |||
EnableScriptChecks bool `json:"enable_script_checks,omitempty"` | |||
Connect map[string]interface{} `json:"connect,omitempty"` | |||
EnableDebug bool `json:"enable_debug,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now it is not possible to provide a flag or environment variable to set enable_debug
. I would like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@i0rek Unfortunately it requires restarting the agent with that setting otherwise the application doesn't begin profiling. I'm going to consider adding to this PR us always registering the handlers and just gating it on standard ACLs.
@banks Given you brought this up in your review too, do you see any issues with doing an operator:read
on the pprof endpoints (believe we'd need to duplicate and have handlers where we then call the pprof handler), and just always making them available without enable_debug
being necessary at all? Or maybe just gate them on enable_debug public to all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pearkes I meant that I would like to enable debug not only from a config file. If you want to turn it on real quick, an environment variable is easier. or commandline flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://golang.org/doc/diagnostics.html
Can I profile my production services?
Yes. It is safe to profile programs in production, but enabling some profiles (e.g. the CPU profile) adds cost. You should expect to see performance downgrade. The performance penalty can be estimated by measuring the overhead of the profiler before turning it on in production.
(my emphasis).
Note though that as far as I can tell that is only degrading performance while the profile is running. Actually registering the pprof handlers and including http/pprof doesn't do anything at all so I think it'd reasonable to always add the handlers and then gate access to them.
I guess we'd need to make a little wrapper http.HandlerFunc
that checked the config and optionally also checks for operator:read
ACL in a header/query param and then if OK calls the pprof Handler directly.
We'd still need to default to off in no ACL case somehow though.
We could require that you have ACLs setup and operator:read OR config enabled I guess and then people with no ACLs (hopefully not too many real customers) still have the same experience as now (changing config on disk) but at least a HUP not a start/stop is sufficient. While people with ACLs can have the tool work automatically since they run this with an operator token and it's enabled by default then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
For a full list of options and examples, please see the Consul | ||
documentation. | ||
` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. Is it important to note that for pprof stuff it needs to be enabled or do we just show a useful error if it's not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll just show a useful WARN
message if not, and skip pprof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Either way we resolve the pprof enable stuff is good with me. Smoke tests should sanity check that performance is not affected but I'm 99% sure it is fine just to register the handlers of nothing s calling them.
This will allow us to avoid a restart of a target agent for profiling by always registering the pprof handlers. Given this is a potentially sensitive path, it is protected with an operator:read ACL and enable debug being set to true on the target agent. enable_debug still requires a restart. If ACLs are disabled, enable_debug is sufficient.
More in line with golang docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New pprof stuff looks good to me!
Minor testing suggestion but not blocking merge.
I didn't see any changes related to agent config reload - did you test whether reload will correctly flip the enable_debug
? I feel like it might not since it would be racy to write to the actual runtime config while things are reading from it.
Do we want to do that now? It can be punted as the experience with ACL off is no worse than before this PR but it would be could now we always register pprof that a HUP would allow debugging without full restart?
@banks so I went through and added all the areas I thought would enable reload -- but unfortunately couldn't thread it through where I needed. In the interest of helping with other stuff I'm going to ask for a final look on the latest commit on testing but otherwise merge this as-is and come back around to reloadables. Here's the WIP patch for future someone. diff --git a/agent/agent.go b/agent/agent.go
index 97e14d73b..bbbd93a5c 100644
--- a/agent/agent.go
+++ b/agent/agent.go
@@ -657,7 +657,7 @@ func (a *Agent) listenHTTP() ([]*HTTPServer, error) {
blacklist: NewBlacklist(a.config.HTTPBlockEndpoints),
proto: proto,
}
- srv.Server.Handler = srv.handler(a.config.EnableDebug)
+ srv.Server.Handler = srv.handler()
// This will enable upgrading connections to HTTP/2 as
// part of TLS negotiation.
@@ -3328,6 +3328,10 @@ func (a *Agent) loadLimits(conf *config.RuntimeConfig) {
a.config.RPCMaxBurst = conf.RPCMaxBurst
}
+func (a *Agent) loadDebug(conf *config.RuntimeConfig) {
+ a.config.EnableDebug = conf.EnableDebug
+}
+
func (a *Agent) ReloadConfig(newCfg *config.RuntimeConfig) error {
// Bulk update the services and checks
a.PauseSync()
@@ -3370,6 +3374,8 @@ func (a *Agent) ReloadConfig(newCfg *config.RuntimeConfig) error {
a.loadLimits(newCfg)
+ a.loadDebug(newCfg)
+
// create the config for the rpc server/client
consulCfg, err := a.consulConfig()
if err != nil {
diff --git a/agent/agent_endpoint_test.go b/agent/agent_endpoint_test.go
index f353360bb..7761544a5 100644
--- a/agent/agent_endpoint_test.go
+++ b/agent/agent_endpoint_test.go
@@ -766,6 +766,7 @@ func TestAgent_Reload(t *testing.T) {
t.Parallel()
dc1 := "dc1"
a := NewTestAgent(t.Name(), `
+ enable_debug = false
acl_enforce_version_8 = false
services = [
{
@@ -800,6 +801,7 @@ func TestAgent_Reload(t *testing.T) {
node_id = "` + string(a.Config.NodeID) + `"
node_name = "` + a.Config.NodeName + `"
+ enable_debug = true
acl_enforce_version_8 = false
services = [
{
@@ -828,6 +830,10 @@ func TestAgent_Reload(t *testing.T) {
t.Fatalf("RPC max burst not set correctly. Got %v. Want 200", a.config.RPCMaxBurst)
}
+ if a.config.EnableDebug != true {
+ t.Fatalf("enable_debug not set correctly. Got %v. Want true", a.config.EnableDebug)
+ }
+
for _, wp := range a.watchPlans {
if !wp.IsStopped() {
t.Fatalf("Reloading configs should stop watch plans of the previous configuration")
diff --git a/agent/http.go b/agent/http.go
index 5cb6d1bfa..50f0e3d9f 100644
--- a/agent/http.go
+++ b/agent/http.go
@@ -106,7 +106,7 @@ func (w *wrappedMux) ServeHTTP(resp http.ResponseWriter, req *http.Request) {
}
// handler is used to attach our handlers to the mux
-func (s *HTTPServer) handler(enableDebug bool) http.Handler {
+func (s *HTTPServer) handler() http.Handler {
mux := http.NewServeMux()
// handleFuncMetrics takes the given pattern and handler and wraps to produce
@@ -157,7 +157,7 @@ func (s *HTTPServer) handler(enableDebug bool) http.Handler {
// If enableDebug is not set, and ACLs are disabled, write
// an unauthorized response
- if !enableDebug {
+ if !s.agent.config.EnableDebug {
if s.checkACLDisabled(resp, req) {
return
} |
* agent/debug: add package for debugging, host info * api: add v1/agent/host endpoint * agent: add v1/agent/host endpoint * command/debug: implementation of static capture * command/debug: tests and only configured targets * agent/debug: add basic test for host metrics * command/debug: add methods for dynamic data capture * api: add debug/pprof endpoints * command/debug: add pprof * command/debug: timing, wg, logs to disk * vendor: add gopsutil/disk * command/debug: add a usage section * website: add docs for consul debug * agent/host: require operator:read * api/host: improve docs and no retry timing * command/debug: fail on extra arguments * command/debug: fixup file permissions to 0644 * command/debug: remove server flags * command/debug: improve clarity of usage section * api/debug: add Trace for profiling, fix profile * command/debug: capture profile and trace at the same time * command/debug: add index document * command/debug: use "clusters" in place of members * command/debug: remove address in output * command/debug: improve comment on metrics sleep * command/debug: clarify usage * agent: always register pprof handlers and protect This will allow us to avoid a restart of a target agent for profiling by always registering the pprof handlers. Given this is a potentially sensitive path, it is protected with an operator:read ACL and enable debug being set to true on the target agent. enable_debug still requires a restart. If ACLs are disabled, enable_debug is sufficient. * command/debug: use trace.out instead of .prof More in line with golang docs. * agent: fix comment wording * agent: wrap table driven tests in t.run()
Document /v1/agent/host endpoint which is used by `consul debug`. Originally added in #4754.
Document /v1/agent/host endpoint which is used by `consul debug`. Originally added in #4754.
Document /v1/agent/host endpoint which is used by `consul debug`. Originally added in #4754.
Document /v1/agent/host endpoint which is used by `consul debug`. Originally added in #4754.
This introduces a new
consul debug
command.Providing support for complex issues encountered by Consul operators often requires a large amount of debugging information to be retrieved in order to properly debug the issue. This tool aims to shortcut that information gathering effort and provide a simple workflow for accessing data from the target Consul agent and environment necessary to help resolve incidents and debug issues faster.
Questions
Should the log capture have a size limit, given they are in memory?Example
TODO
Sanitize memberlist output by removing the first n octets of host IPsenable_debug
being disabled and warning appropriately (and not capturing pprof)