Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Auditbeat] System module: Add entity_id fields #10500

Merged
merged 11 commits into from
Feb 5, 2019

Conversation

cwurm
Copy link
Contributor

@cwurm cwurm commented Feb 1, 2019

Implements {entity}.entity_id as a SHA-256 hash as proposed in #10463.

The new fields and what goes in the hash:

Field Hash components
system.audit.package.entity_id host.id + name + version
process.entity_id host.id + PID + StartTime
socket.entity_id host.id + inode + LocalIP + RemoteIP + LocalPort + RemotePort
user.entity_id host.id + UID + username

Note: socket is a net new top-level object, I just didn't see where else to put it. Open to suggestions.

host.id is retrieved when the system module is created and stored so the individual datasets don't have to re-fetch it. It's exposed to all through a new SystemMetricSet.

Closes #10463.

@cwurm cwurm added review needs_backport PR is waiting to be backported to other branches. Auditbeat SecOps labels Feb 1, 2019
@cwurm cwurm requested a review from a team February 1, 2019 19:08
@elasticmachine
Copy link
Collaborator

Pinging @elastic/secops

@cwurm cwurm requested a review from a team as a code owner February 1, 2019 19:08
--
type: keyword

ID uniquely identifying the package.
Copy link
Contributor

@tsg tsg Feb 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say it would be worth providing more details for theentity_id (not only this one, but for all). Either say how it is computed or what property it guarantees (i.e. unique value per package installation on one host). The former will be more clear, but the latter would be more future proof (we might change how we compute it).

Copy link
Contributor Author

@cwurm cwurm Feb 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about: This ID uniquely identifies a package on a host. It is computed as a hash of the host ID, package name, and package version.?

And similar wording for the others.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that a bit more info should be provided. I think it's important to mention that these IDs are stable across multiple collections of information, and should change only if there's a significant change. Package version above is a good example, PID change if same process dies and restarts, & so on.

But I don't think we need to mention more implementation details than "it's a hash of ..."

Copy link
Member

@andrewkroh andrewkroh Feb 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That definition should be fine for users. But if we want other developers to be able to send data with compatible entity IDs we'll need to be much more specific about data encoding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your proposed wording is a step forward and is enough for now. Too many details would suggest that changing it is a breaking change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding a sentence to each of them saying It is computed as a SHA-256 hash of the host ID, {something else}, and {something else}..

I believe the description only shows up in the docs, so we're more flexible in changing it later if we want.

binary.Write(h, binary.LittleEndian, []byte(hostID))
binary.Write(h, binary.LittleEndian, []byte(pkg.Name))
binary.Write(h, binary.LittleEndian, []byte(pkg.Version))
binary.Write(h, binary.LittleEndian, int64(pkg.InstallTime.Nanosecond()))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: Should not include InstallTime for the same reasons as outlined in #10508 (comment).

Also, should different versions of a package be counted as two entities?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed InstallTime. Leaving version as part of the entity_id, so two versions of a package will count as two. I think this makes sense because there are cases where two versions are actually installed - e.g. Python 2.6 and 3.x - and they should both be counted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that seems reasonable. A re-installed package on the same version is really the same thing, IMO.

// entityID creates an ID that uniquely identifies this package across machines.
func (pkg Package) entityID(hostID string) string {
h := system.NewEntityHash()
binary.Write(h, binary.LittleEndian, []byte(hostID))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strings are not represented in an endian specific order so you should be able to write h.Write([]byte(hostID)) which will save a few reflection calls.

This happens a few places throughout the code, but I'm just leaving one comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've hopefully caught all of them.

--
type: keyword

ID uniquely identifying the package.
Copy link
Member

@andrewkroh andrewkroh Feb 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That definition should be fine for users. But if we want other developers to be able to send data with compatible entity IDs we'll need to be much more specific about data encoding.

Copy link
Contributor

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, some of which are borderline blockers. A few of my suggestions can be addressed later, however.

--
type: keyword

ID uniquely identifying the package.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that a bit more info should be provided. I think it's important to mention that these IDs are stable across multiple collections of information, and should change only if there's a significant change. Package version above is a good example, PID change if same process dies and restarts, & so on.

But I don't think we need to mention more implementation details than "it's a hash of ..."

@@ -40,6 +40,8 @@ func getConfig() map[string]interface{} {
}

func TestProcessEvent(t *testing.T) {
ms := mbtest.NewReportingMetricSetV2(t, getConfig()).(*MetricSet)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would have loved to see an explicit test of a few cases of calculating the entity ID, and the expected result. For example:

  • same PID, different collection timestamps => same entity ID (that's the whole point)
  • different PID, exact same process start timestamp (unlikely but possible) => different entity ID
  • same PID, different collection timestamps, different process start timestamps (stretching it, but possible) => different entity ID

I'm ok if you create an issue to add this as a separate step, given the time pressure to get this in, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same suggestion goes for each flavor of entity_id, too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this needs some unit testing, but I'm fine with it coming in a follow-up PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the code is fairly straightforward (in each case one method with <10 lines), so we'd mostly be checking if Golang's hashing works correctly.

But I agree a test would still be nice, and I'll take it down as a follow-up.

--
type: keyword

ID uniquely identifying the user.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a warning that this isn't related to centralized user management. But would rather detect a user being deleted, then recreated with the same name/home/etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let me change to ID uniquely identifying the user on a host.. That should make it clear this is not across hosts.

h := system.NewEntityHash()
binary.Write(h, binary.LittleEndian, []byte(hostID))
binary.Write(h, binary.LittleEndian, []byte(pkg.Name))
binary.Write(h, binary.LittleEndian, []byte(pkg.Version))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider the same version being uninstalled and reinstalled as a different entity (installtime)?

If we want to be really strict I think we should. But I'm not sure how helpful that would be. May actually help detect package repo poisoning.

We can always improve that later, however. Perhaps a follow-up improvement after FF?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we only have a reliable InstallTime for RPM packages. dpkg does not seem to store it (though at some point we might parse it out of /var/log/dpkg.log), and it is unreliable for Homebrew (directory mod time which can easily change). So that's why I'm not using it to form entity IDs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I'm good with this.

I think it's actually valid to have a more solid entity_id that considers the installtime for rpms than for dpkg & brew. The hashes of different families are never meant to be compared to one another.

But I agree this is for later, in any case.

h.Write(s.LocalIP)
h.Write(s.RemoteIP)
binary.Write(h, binary.LittleEndian, int64(s.LocalPort))
binary.Write(h, binary.LittleEndian, int64(s.RemotePort))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you compared with what goes into the community_id hash, for this one?

I don't think we should use community_id. I'm just suggesting as a sanity check on what gets considered to calculate our entity_id.

I think this is fine, though. I like the inclusion of the inode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can see the community ID is using source and dest IP/port, and protocol. For us, we would want to be host specific though, since even if we have both sockets of a connection in the data, we would still want them to have different entity IDs (since they are different entities, owned by different processes and users). That said, we should probably add community ID to the socket fields, to allow correlation between flows and sockets. I think @andrewkroh has suggested it in the past.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the community ID populated wherever possible will make correlation so much easier. So +1 to adding it for the socket dataset.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extra goodness, not a breaking change, though. So up to @cwurm to include now vs wait for 7.1? I think both are fine, it's just a question of time management. I think it's pretty late to add community_id, unless code reuse is more straightforward than I assume :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll put it in the backlog.

h := system.NewEntityHash()
binary.Write(h, binary.LittleEndian, []byte(hostID))
binary.Write(h, binary.LittleEndian, []byte(u.Name))
binary.Write(h, binary.LittleEndian, []byte(u.UID))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would absolutely include user creation date here.

UID can be specified when creating a user. So delete user, re-create with same name/uid needs to be detected. I think user creation date should be enough. Actually do you have access to that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have the user creation date, unfortunately. Same as for dpkg and its /var/lib/dpkg/status, Linux does not store the creation time in /etc/passwd. The only way I can see of getting it is through parsing /var/log/auth.log. Something for the future though.

@@ -58,8 +60,8 @@ def test_metricset_process(self):
process metricset collects information about processes running on a system.
"""

fields = ["process.pid", "process.ppid", "process.name", "process.executable", "process.args",
"process.start", "process.working_directory", "user.id", "user.group.id"]
fields = ["process.entity_id", "process.pid", "process.ppid", "process.name", "process.executable",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You calculate it for Windows too? 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we have the data :)

@cwurm
Copy link
Contributor Author

cwurm commented Feb 4, 2019

Thanks @andrewkroh and @webmat. I've made a code change for hashing strings, and answered hopefully all comments. If there are no more code changes, I would rather merge this now, and address any changes to the docs in a follow-up PR. What do you think?

@andrewkroh
Copy link
Member

I’m good with plan. ^

Copy link
Contributor

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cwurm cwurm force-pushed the system_module_entity_id branch from c486cd5 to d6b2a49 Compare February 5, 2019 09:52
@cwurm cwurm force-pushed the system_module_entity_id branch from d6b2a49 to 4ac4a66 Compare February 5, 2019 12:08
@cwurm cwurm merged commit c047ef7 into elastic:master Feb 5, 2019
@cwurm cwurm removed the needs_backport PR is waiting to be backported to other branches. label Feb 5, 2019
cwurm pushed a commit to cwurm/beats that referenced this pull request Feb 5, 2019
Implements `{entity}.entity_id` as a SHA-256 hash as proposed in elastic#10463.

Closes elastic#10463.

(cherry picked from commit c047ef7)
@cwurm cwurm added the v6.7.0 label Feb 5, 2019
@cwurm cwurm deleted the system_module_entity_id branch February 5, 2019 14:45
cwurm pushed a commit that referenced this pull request Feb 5, 2019
Implements `{entity}.entity_id` as a SHA-256 hash as proposed in #10463.

Closes #10463.

(cherry picked from commit c047ef7)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants