-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write pidfile only if it doesn't exist #2358
Conversation
cmd/telegraf/telegraf.go
Outdated
@@ -271,7 +271,7 @@ func reloadLoop(stop chan struct{}, s service.Service) { | |||
log.Printf("I! Tags enabled: %s", c.ListTags()) | |||
|
|||
if *fPidfile != "" { | |||
f, err := os.Create(*fPidfile) | |||
f, err := os.OpenFile(*fPidfile, os.O_EXCL|os.O_CREATE|os.O_WRONLY, 0666) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mode 0666 means that anyone on the box can write to the file. Thus if you have a service manager that is starting telegraf, and someone changes the contents of the pid file to some other system service, they can trick the service manager into signaling (and killing) the wrong service.
O_CREATE|O_EXCL also has the issue that if the pid file is ever abandoned for any reason (such as telegraf crash), telegraf will refuse to start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0666
is what the previous one did too; that's the behaviour of Create
. But I agree; 644
is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
O_CREATE|O_EXCL also has the issue that if the pid file is ever abandoned for any reason (such as telegraf crash), telegraf will refuse to start.
The alternative is checking if the pid in the pidfile refers to a live process, and if not, reclaiming the file for ourselves. Seems like a lot of pid-fiddling, but I'm happy to go that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative is checking if the pid in the pidfile refers to a live process, and if not, reclaiming the file for ourselves.
The problem is that with things like PID reuse, that can't be relied upon. It might not be a telegraf process sitting on that PID. And checking that the pid exists is a little difficult to do in a good cross-platform way. The best I know of is to send a signal 0 to the PID, and check response code. But even that can be misleading as you can't signal processes that you don't have perms on. So it gets messy really fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it gets messy really fast.
Agreed. :) I thought about this when I started, and out of lazinesscomplexity avoidance, I went the easier, "operator intervention required" route.
I'm sure there are other apps that won't boot if there's lingering lock after a crash (apt
), but I don't know if that's a good pattern to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that the pidfile should be used as a process lock. There are quite a few references on how to handle pidfiles, none seem very canonical, this might be the best one to go with (uses O_RDWR | O_CREAT
): http://www.man7.org/tlpi/code/online/dist/filelock/create_pid_file.c.html
cmd/telegraf/telegraf.go
Outdated
if *fPidfile != "" { | ||
err := os.Remove(*fPidfile) | ||
if err != nil { | ||
log.Fatalf("E! Unable to remove pidfile: %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this really need to be fatal? We should already be shutting down, so I don't think we don't need to exit early and abort the shutdown process.
And what if the issue is that the file was already removed? We were just going to remove it anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, it doesn't. Warning at best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the pidfile removal in a defer
function just after it is created
I'll give this a little more thought in the morning, going to leave it for now. For context, I'm running I'm willing to admit I need to do more looking into the monit side. But I don't think overwriting the pidfile blindly is the best idea. |
Shouldn't it just spawn 1 telegraf? The new process will write a new pid file, which monit will see and go "ok telegraf is now running since the pid file contains a valid pid". I've never used monit, so I'm just asking mostly out of curiosity. |
One would hope so, but that's not happening here. I need to look into it more, but considering I can't repro that problem, it might be a while.
Yes, but I was intending on implementing the stale pidfile checking in a launcher script. Even something really stupid like |
cmd/telegraf/telegraf.go
Outdated
if *fPidfile != "" { | ||
err := os.Remove(*fPidfile) | ||
if err != nil { | ||
log.Fatalf("E! Unable to remove pidfile: %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the pidfile removal in a defer
function just after it is created
cmd/telegraf/telegraf.go
Outdated
@@ -271,7 +271,7 @@ func reloadLoop(stop chan struct{}, s service.Service) { | |||
log.Printf("I! Tags enabled: %s", c.ListTags()) | |||
|
|||
if *fPidfile != "" { | |||
f, err := os.Create(*fPidfile) | |||
f, err := os.OpenFile(*fPidfile, os.O_EXCL|os.O_CREATE|os.O_WRONLY, 0666) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that the pidfile should be used as a process lock. There are quite a few references on how to handle pidfiles, none seem very canonical, this might be the best one to go with (uses O_RDWR | O_CREAT
): http://www.man7.org/tlpi/code/online/dist/filelock/create_pid_file.c.html
Agree. As such, I'd say the concept behind the PR is flawed. Thanks for all your comments |
I'm going to reopen because I think there are two things from this that we need to fix about the current behavior:
|
4fa0fc8
to
7f064ba
Compare
Also, ensure pidfile perms are 644
7f064ba
to
1a79b61
Compare
I've updated the PR to do this |
thanks @nfirvine |
Also, ensure pidfile perms are 644
And remove pidfile at end of run
I discovered a scenario where if run a second instance telegraf, it will happily write its pid to the pidfile, even if the pidfile contains a pid of a running process, and even if the second instance crashes immediately (due to a port in use for example).
This fixes it by requiring that the pidfile not exist beforehand, and deletes its pidfile when it closes regularly.