-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Auditbeat] System module: Add entity_id fields #10500
Conversation
Pinging @elastic/secops |
auditbeat/docs/fields.asciidoc
Outdated
-- | ||
type: keyword | ||
|
||
ID uniquely identifying the package. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say it would be worth providing more details for theentity_id
(not only this one, but for all). Either say how it is computed or what property it guarantees (i.e. unique value per package installation on one host). The former will be more clear, but the latter would be more future proof (we might change how we compute it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about: This ID uniquely identifies a package on a host. It is computed as a hash of the host ID, package name, and package version.
?
And similar wording for the others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that a bit more info should be provided. I think it's important to mention that these IDs are stable across multiple collections of information, and should change only if there's a significant change. Package version above is a good example, PID change if same process dies and restarts, & so on.
But I don't think we need to mention more implementation details than "it's a hash of ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That definition should be fine for users. But if we want other developers to be able to send data with compatible entity IDs we'll need to be much more specific about data encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your proposed wording is a step forward and is enough for now. Too many details would suggest that changing it is a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm adding a sentence to each of them saying It is computed as a SHA-256 hash of the host ID, {something else}, and {something else}.
.
I believe the description only shows up in the docs, so we're more flexible in changing it later if we want.
binary.Write(h, binary.LittleEndian, []byte(hostID)) | ||
binary.Write(h, binary.LittleEndian, []byte(pkg.Name)) | ||
binary.Write(h, binary.LittleEndian, []byte(pkg.Version)) | ||
binary.Write(h, binary.LittleEndian, int64(pkg.InstallTime.Nanosecond())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: Should not include InstallTime
for the same reasons as outlined in #10508 (comment).
Also, should different versions of a package be counted as two entities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed InstallTime
. Leaving version
as part of the entity_id
, so two versions of a package will count as two. I think this makes sense because there are cases where two versions are actually installed - e.g. Python 2.6 and 3.x - and they should both be counted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that seems reasonable. A re-installed package on the same version is really the same thing, IMO.
// entityID creates an ID that uniquely identifies this package across machines. | ||
func (pkg Package) entityID(hostID string) string { | ||
h := system.NewEntityHash() | ||
binary.Write(h, binary.LittleEndian, []byte(hostID)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strings are not represented in an endian specific order so you should be able to write h.Write([]byte(hostID))
which will save a few reflection calls.
This happens a few places throughout the code, but I'm just leaving one comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I've hopefully caught all of them.
auditbeat/docs/fields.asciidoc
Outdated
-- | ||
type: keyword | ||
|
||
ID uniquely identifying the package. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That definition should be fine for users. But if we want other developers to be able to send data with compatible entity IDs we'll need to be much more specific about data encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, some of which are borderline blockers. A few of my suggestions can be addressed later, however.
auditbeat/docs/fields.asciidoc
Outdated
-- | ||
type: keyword | ||
|
||
ID uniquely identifying the package. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that a bit more info should be provided. I think it's important to mention that these IDs are stable across multiple collections of information, and should change only if there's a significant change. Package version above is a good example, PID change if same process dies and restarts, & so on.
But I don't think we need to mention more implementation details than "it's a hash of ..."
@@ -40,6 +40,8 @@ func getConfig() map[string]interface{} { | |||
} | |||
|
|||
func TestProcessEvent(t *testing.T) { | |||
ms := mbtest.NewReportingMetricSetV2(t, getConfig()).(*MetricSet) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would have loved to see an explicit test of a few cases of calculating the entity ID, and the expected result. For example:
- same PID, different collection timestamps => same entity ID (that's the whole point)
- different PID, exact same process start timestamp (unlikely but possible) => different entity ID
- same PID, different collection timestamps, different process start timestamps (stretching it, but possible) => different entity ID
I'm ok if you create an issue to add this as a separate step, given the time pressure to get this in, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same suggestion goes for each flavor of entity_id, too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this needs some unit testing, but I'm fine with it coming in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the code is fairly straightforward (in each case one method with <10 lines), so we'd mostly be checking if Golang's hashing works correctly.
But I agree a test would still be nice, and I'll take it down as a follow-up.
auditbeat/docs/fields.asciidoc
Outdated
-- | ||
type: keyword | ||
|
||
ID uniquely identifying the user. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a warning that this isn't related to centralized user management. But would rather detect a user being deleted, then recreated with the same name/home/etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let me change to ID uniquely identifying the user on a host.
. That should make it clear this is not across hosts.
h := system.NewEntityHash() | ||
binary.Write(h, binary.LittleEndian, []byte(hostID)) | ||
binary.Write(h, binary.LittleEndian, []byte(pkg.Name)) | ||
binary.Write(h, binary.LittleEndian, []byte(pkg.Version)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider the same version being uninstalled and reinstalled as a different entity (installtime)?
If we want to be really strict I think we should. But I'm not sure how helpful that would be. May actually help detect package repo poisoning.
We can always improve that later, however. Perhaps a follow-up improvement after FF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, we only have a reliable InstallTime
for RPM packages. dpkg does not seem to store it (though at some point we might parse it out of /var/log/dpkg.log
), and it is unreliable for Homebrew (directory mod time which can easily change). So that's why I'm not using it to form entity IDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I'm good with this.
I think it's actually valid to have a more solid entity_id that considers the installtime for rpms than for dpkg & brew. The hashes of different families are never meant to be compared to one another.
But I agree this is for later, in any case.
h.Write(s.LocalIP) | ||
h.Write(s.RemoteIP) | ||
binary.Write(h, binary.LittleEndian, int64(s.LocalPort)) | ||
binary.Write(h, binary.LittleEndian, int64(s.RemotePort)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you compared with what goes into the community_id
hash, for this one?
I don't think we should use community_id. I'm just suggesting as a sanity check on what gets considered to calculate our entity_id.
I think this is fine, though. I like the inclusion of the inode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can see the community ID is using source and dest IP/port, and protocol. For us, we would want to be host specific though, since even if we have both sockets of a connection in the data, we would still want them to have different entity IDs (since they are different entities, owned by different processes and users). That said, we should probably add community ID to the socket fields, to allow correlation between flows and sockets. I think @andrewkroh has suggested it in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the community ID populated wherever possible will make correlation so much easier. So +1 to adding it for the socket dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is extra goodness, not a breaking change, though. So up to @cwurm to include now vs wait for 7.1? I think both are fine, it's just a question of time management. I think it's pretty late to add community_id, unless code reuse is more straightforward than I assume :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'll put it in the backlog.
h := system.NewEntityHash() | ||
binary.Write(h, binary.LittleEndian, []byte(hostID)) | ||
binary.Write(h, binary.LittleEndian, []byte(u.Name)) | ||
binary.Write(h, binary.LittleEndian, []byte(u.UID)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would absolutely include user creation date here.
UID can be specified when creating a user. So delete user, re-create with same name/uid needs to be detected. I think user creation date should be enough. Actually do you have access to that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have the user creation date, unfortunately. Same as for dpkg and its /var/lib/dpkg/status
, Linux does not store the creation time in /etc/passwd
. The only way I can see of getting it is through parsing /var/log/auth.log
. Something for the future though.
@@ -58,8 +60,8 @@ def test_metricset_process(self): | |||
process metricset collects information about processes running on a system. | |||
""" | |||
|
|||
fields = ["process.pid", "process.ppid", "process.name", "process.executable", "process.args", | |||
"process.start", "process.working_directory", "user.id", "user.group.id"] | |||
fields = ["process.entity_id", "process.pid", "process.ppid", "process.name", "process.executable", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You calculate it for Windows too? 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we have the data :)
Thanks @andrewkroh and @webmat. I've made a code change for hashing strings, and answered hopefully all comments. If there are no more code changes, I would rather merge this now, and address any changes to the docs in a follow-up PR. What do you think? |
I’m good with plan. ^ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
c486cd5
to
d6b2a49
Compare
d6b2a49
to
4ac4a66
Compare
Implements `{entity}.entity_id` as a SHA-256 hash as proposed in elastic#10463. Closes elastic#10463. (cherry picked from commit c047ef7)
Implements
{entity}.entity_id
as a SHA-256 hash as proposed in #10463.The new fields and what goes in the hash:
system.audit.package.entity_id
host.id + name + version
process.entity_id
host.id + PID + StartTime
socket.entity_id
host.id + inode + LocalIP + RemoteIP + LocalPort + RemotePort
user.entity_id
host.id + UID + username
Note:
socket
is a net new top-level object, I just didn't see where else to put it. Open to suggestions.host.id
is retrieved when the system module is created and stored so the individual datasets don't have to re-fetch it. It's exposed to all through a newSystemMetricSet
.Closes #10463.