[Auditbeat] Login metricset #9327
Conversation
Pinging @elastic/secops
// Save new state to disk
if len(loginRecords) > 0 {
	err := ms.utmpReader.saveStateToDisk()
Mentioning it mostly for awareness: if I'm not wrong, Auditbeat's report.Event() doesn't guarantee at-least-once delivery. If Elasticsearch is down, for example, it will retry 3 times and then drop the event. This means that we can potentially lose login events, despite saving the state. I think we could solve this after the Filebeat refactoring, but I suggest documenting this as a limitation for now.
Hm, that's a shame. I guess this affects all metricsets, and we might lose process starts, socket connections, password changes, audit events, FIM events, etc.
Doesn't the internal queue keep events until they can be sent (or it overflows)? That would seem desirable.
It does, and one can also use spooling to disk to make the queue larger. But Filebeat offers an extra guarantee by waiting for the ACK and only afterwards updating the state on disk. This is not currently possible in Auditbeat. That's the difference I wanted to highlight.
Right, maybe at some point report.Event() could return false if it couldn't write to the queue or otherwise cannot guarantee eventually sending the event - so we can retry later.
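A minimal sketch of what that retry-on-failure idea could look like on the metricset side. This is hypothetical: it assumes a reporter whose Event() reliably returns false when an event cannot be handed off, which is not what the current interface promises, and reportAndSave/saveStateToDisk are illustrative names, not the PR's code.

```go
package login

// Hypothetical sketch only: assumes Event() returns false whenever an event
// cannot be handed off for delivery.

type event map[string]interface{}

type reporter interface {
	Event(event) bool
}

// reportAndSave keeps any events that could not be enqueued so they can be
// retried on the next Fetch, and only persists state once everything was accepted.
func reportAndSave(r reporter, events []event, saveStateToDisk func() error) []event {
	var pending []event
	for _, e := range events {
		if !r.Event(e) {
			pending = append(pending, e)
		}
	}
	if len(pending) == 0 {
		if err := saveStateToDisk(); err != nil {
			_ = err // a real implementation would log or report this
		}
	}
	return pending
}
```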
We should think through what the experience is on non-Linux systems. I'm thinking two things:
Is the first point already happening? I couldn't test because of the
That should be what's happening (in login_other.go). I've added some extra
It shouldn't - there's an if/else around it in the
I opened #9368 to address that - it compiles for me after applying that. I've changed to using
LGTM. Thanks for addressing the comments.
This will be a useful metricset. I left a few comments that need to be addressed now, and some others that are just suggestions. If you have any questions let me know.
if utAddrV6[1] != 0 || utAddrV6[2] != 0 || utAddrV6[3] != 0 {
	// IPv6
	b := make([]byte, 16)
	binary.LittleEndian.PutUint32(b[:4], utAddrV6[0])
What does this mean for big-endian architectures? Will this code work?
Hm, good point and I don't know. Do we support any big-endian systems?
I'm surprised Go doesn't have a binary.MachineEndian.
package main

import (
	"encoding/binary"
	"fmt"
	"unsafe"
)

var MachineEndian = getByteOrder()

func getByteOrder() binary.ByteOrder {
	var b [2]byte
	*((*uint16)(unsafe.Pointer(&b[0]))) = 1
	if b[0] == 1 {
		return binary.LittleEndian
	}
	return binary.BigEndian
}

func main() {
	var b [4]byte
	MachineEndian.PutUint32(b[:], 0x12345678)
	fmt.Printf("%x %x %x %x\n", b[0], b[1], b[2], b[3])
}
Just revisiting this. I think it's not very likely that this would be used on a big-endian system that uses UTMP login records. In the interest of keeping the code simple I'd leave it as is; if there's a need to change things, we always can.
@adriansr getByteOrder() looks like a useful function that Go itself should have, indeed. Or maybe we should have it in a shared location within Beats so code like this can use it easily.
I think it's likely that someone will use it on big-endian. We've had several contributions to go-libaudit to make auditbeat work on big-endian.
Ok, I've added support for it. Thanks @adriansr for the code snippet!
// but will return -1 instead.
func lookupUsername(username string) int {
	if username != "" {
		user, err := user.Lookup(username)
This is probably fine for low volumes. But we have seen from the auditd module that this call is expensive. Some caching could be a useful future enhancement.
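As a rough illustration of that caching suggestion, a minimal sketch. The userCache type and its never-evict policy are assumptions; a real version might want expiry so that UID changes are eventually picked up.

```go
package login

import (
	"os/user"
	"strconv"
	"sync"
)

type userCache struct {
	mu     sync.Mutex
	byName map[string]int
}

func newUserCache() *userCache {
	return &userCache{byName: make(map[string]int)}
}

// lookupUID returns the UID for username, caching results so repeated logins
// by the same user do not hit the comparatively expensive user database.
func (c *userCache) lookupUID(username string) int {
	if username == "" {
		return -1
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if uid, ok := c.byName[username]; ok {
		return uid
	}
	uid := -1
	if u, err := user.Lookup(username); err == nil {
		if parsed, err := strconv.Atoi(u.Uid); err == nil {
			uid = parsed
		}
	}
	c.byName[username] = uid
	return uid
}
```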
	IP        *net.IP
	Timestamp time.Time
	Origin    string
}
Same: a long name, and many exported symbols that we might not need anywhere else.
LoginRecord is encoded and decoded by gob, so it needs to be exported, unfortunately.
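A small standalone example of why gob forces the exported fields: unexported struct fields are silently skipped, so a record with lowercase fields would round-trip incomplete. The field set here is illustrative, not the PR's actual LoginRecord.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"time"
)

type LoginRecord struct {
	Username  string
	Timestamp time.Time
	origin    string // unexported: gob will NOT encode this field
}

func main() {
	in := LoginRecord{Username: "alice", Timestamp: time.Now(), origin: "/var/log/wtmp"}

	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		panic(err)
	}

	var out LoginRecord
	if err := gob.NewDecoder(&buf).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Printf("username=%q origin=%q\n", out.Username, out.origin) // origin is lost
}
```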
// FileRecord represents a UTMP file at a point in time.
type FileRecord struct {
	Inode Inode
In filebeat we also store the device id.
I think for now it's unlikely that UTMP files would be stored on separate file systems. I.e. it would mean different parts of /var/log are on different devices. I'm not even sure that's possible.
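If storing the device id ever did become necessary, a minimal sketch of how it could be read alongside the inode on Linux. The fields come from syscall.Stat_t; the inodeAndDevice helper name is illustrative.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// inodeAndDevice returns the inode and device id of a file, which together
// identify it uniquely even across file systems. Linux-only (syscall.Stat_t).
func inodeAndDevice(path string) (inode, device uint64, err error) {
	fi, err := os.Stat(path)
	if err != nil {
		return 0, 0, err
	}
	stat, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, 0, fmt.Errorf("no Stat_t for %v", path)
	}
	return uint64(stat.Ino), uint64(stat.Dev), nil
}

func main() {
	inode, device, err := inodeAndDevice("/var/log/wtmp")
	if err != nil {
		fmt.Println("stat failed:", err)
		return
	}
	fmt.Printf("inode=%d device=%d\n", inode, device)
}
```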
}

// ReadNew returns any new UTMP entries in any files matching the configured pattern.
func (r *UtmpFileReader) ReadNew() ([]LoginRecord, error) {
How big do we expect these files to be? Looks like this function will load all available logs into memory on first run.
So far I've assumed they're usually quite small, but that might not be true in all cases. The /etc/logrotate.conf on my Ubuntu VM rotates it monthly. On that VM, which gets a number of daily logins/logouts, they're <100K. On a real-world machine with a lot of logins I assume it could be more; I wonder how many records it would take to be significant.
Nevertheless, I guess we could use a channel between Fetch and ReadNew and output records as events as they come in.
This is more or less what Filebeat does as well: have one goroutine per file and forward event by event. That is, the module/input is more or less active all the time - not sure how well this fits the 'Fetch' semantics in the Metricbeat framework. But a bounded channel can indeed help keep memory usage in check. Given the complexity this introduces, I wonder if that's really a problem.
I've changed to using channels now, and storing offsets of the files (instead of re-reading until the last read record). So hopefully that will make it scalable even if the files are very large.
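For context, a rough sketch of the bounded-channel approach discussed above. LoginRecord, readNew, and readFile here are illustrative stand-ins, not the PR's actual API.

```go
package login

type LoginRecord struct {
	Username string
	Origin   string
}

// readNew streams parsed records into a channel instead of accumulating whole
// files in memory. The small buffer bounds memory usage; the consumer (Fetch)
// applies backpressure by reading at its own pace.
func readNew(paths []string, readFile func(path string, out chan<- LoginRecord) error) (<-chan LoginRecord, <-chan error) {
	records := make(chan LoginRecord, 64)
	errs := make(chan error, 1)
	go func() {
		defer close(records)
		defer close(errs)
		for _, path := range paths {
			if err := readFile(path, records); err != nil {
				errs <- err
				return
			}
		}
	}()
	return records, errs
}
```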
r.log.Warnf("Unexpectedly, the file %v did not contain the saved login record %v - reading whole file.",
	path, *lastKnownRecord)

return r.readAfter(path, nil)
Looks like we are ignoring all messages read so far if we end up here (e.g. we read while a copytruncate-style log rotation happens). Do we still want to publish the known records? Alternatively, we can just error and return here, relying on Fetch to reset the file offset and read the new contents the next time it is executed.
Good point - maybe I was a bit overzealous in trying to make sure we never miss an event here, and just returning is better.
However, I'm curious about copytruncate: as I understand it, the logfile is copied (e.g. cp /var/log/wtmp /var/log/wtmp.1) and then truncated. If that happens, I suspect the metricset as it is at the moment (with the file pattern being /var/log/wtmp* by default) would read all of /var/log/wtmp.1 (because it'll have a different inode) - even though it has already read all of those records. Does Filebeat handle this in some way?
But can wtmp be copy-truncated? I thought it's guaranteed to be rotated by rename/move. I wouldn't want to bring all the complexity of Filebeat's corner cases in here :)
Rename/move should be ok, but then we don't get into this code path.
Instead of having a tail-recursive call we could use a for loop, reset some state, and use continue to jump back to the beginning of the reader and not drop events. But then again, it's an edge case that normally should not occur.
The code is now using offsets, and if there is some problem jumping to the stored offset it will jump back to the beginning and re-read the file. This is maybe a bit overzealous, but at least for now I'd prefer it if events were sent twice rather than not at all. If there are any problems we can always change it.
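A minimal sketch of the loop-with-reset idea under the behaviour described here: if positioning at the stored offset fails, fall back to re-reading the whole file, possibly duplicating events rather than dropping them. Helper names are illustrative, not the PR's code.

```go
package login

import (
	"io"
	"os"
)

// readFromOffset reads path starting at offset; if seeking to the saved offset
// fails it resets to zero and re-reads the whole file instead of recursing.
func readFromOffset(path string, offset int64, parse func(io.Reader) error) error {
	for {
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		if offset > 0 {
			if _, err := f.Seek(offset, io.SeekStart); err != nil {
				// Could not jump to the saved offset: start over from the
				// beginning rather than dropping records.
				f.Close()
				offset = 0
				continue
			}
		}
		err = parse(f)
		f.Close()
		return err
	}
}
```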
LGTM, suggesting minor refactor, feel free to do it on a separate PR.
var byteOrder = getByteOrder()

func getByteOrder() binary.ByteOrder {
Turns out gosigar has a GetEndian() method for this.
func GetEndian() binary.ByteOrder {
Oh nice, thanks!
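Roughly, switching to gosigar's helper could look like the sketch below. The import path is an assumption - double-check where GetEndian() actually lives in gosigar - and putIPv6Words is an illustrative helper, not the PR's code.

```go
package login

import (
	"github.com/elastic/gosigar/sys" // assumed location of GetEndian()
)

var byteOrder = sys.GetEndian()

// putIPv6Words writes the four 32-bit words of a UTMP IPv6 address using the
// machine's native byte order.
func putIPv6Words(utAddrV6 [4]uint32) []byte {
	b := make([]byte, 16)
	for i, word := range utAddrV6 {
		byteOrder.PutUint32(b[i*4:(i+1)*4], word)
	}
	return b
}
```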
Adds the login metricset to the Auditbeat system module as the last of the six initial metricsets. It only works on Linux, and detects not just user logins and logouts, but also system boots and shutdowns. It works by reading the /var/log/wtmp and /var/log/btmp files (and rotated files) present on Linux systems. In reading a file, it is similar to Filebeat, except that UTMP is a binary format, so reading happens using a binary Go reader. (cherry picked from commit 1566e66)
This adds the login metricset to the Auditbeat system module. It's the last of the six initial metricsets. It only works on Linux, and detects not just user logins and logouts, but also system boots and shutdowns.

It works by reading the /var/log/wtmp and /var/log/btmp files (and rotated files) present on Linux systems. In reading a file, it is similar to Filebeat, except that UTMP is a binary format, so reading happens using a binary Go reader. See utmp(5) for the format of that file.

The logic is roughly as follows:

- login.utmp_file_pattern and login.btmp_file_pattern contain the patterns matching the wtmp (good logins, as well as system shutdowns and boots) and btmp (bad/failed logins) files and rotated files (if desired). The defaults are /var/log/wtmp* and /var/log/btmp*. These are expanded using filepath.Glob and the files are sorted lexicographically in reverse order (i.e. /var/log/wtmp.1 will come before /var/log/wtmp) so that we read older login records first - reading in order is required for matching login and logout records, see the next steps. (A short sketch of this expansion and ordering follows at the end of this description.)
- On every Fetch, it checks for new entries: any new files are read from the beginning, while known files are read from a saved offset. To that purpose, the last offset per file is saved and persisted to disk in beat.db. A new file is one that has an unknown inode, but files are also read completely if their newSize < oldSize for some reason (that should make it work with any potential inode reuse - very unlikely since this will never read a lot of files, but still possible).
- New entries are turned into login events (LoginRecord in the code). Boot and shutdown events are fairly straightforward; user login and logout events have to be matched using their tty - so there is a loginSessions map that stores logins to enrich the logouts, and is also persisted to disk.

Note: This dataset also introduces event.origin, containing the file the event came from, e.g. /var/log/wtmp.1. In other cases, it would be something like procfs or netlink. It's useful to know where information comes from, e.g. to know how reliable it is.
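As referenced in the first step above, a minimal standalone sketch of the pattern expansion and reverse lexicographic ordering. The filesToRead helper name is illustrative.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
)

// filesToRead expands a glob pattern and sorts the matches in reverse
// lexicographic order, so rotated files (e.g. /var/log/wtmp.1) are read
// before the current file (/var/log/wtmp).
func filesToRead(pattern string) ([]string, error) {
	paths, err := filepath.Glob(pattern)
	if err != nil {
		return nil, err
	}
	sort.Sort(sort.Reverse(sort.StringSlice(paths)))
	return paths, nil
}

func main() {
	paths, err := filesToRead("/var/log/wtmp*")
	if err != nil {
		panic(err)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```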