
[RFC] E-Mail #999

Merged
merged 7 commits into from
Nov 30, 2020

Conversation

Contributor

@jamiehynds jamiehynds commented Oct 5, 2020

  • Have you signed the contributor license agreement?
  • Have you followed the contributor guidelines?
  • For proposing substantial changes or additions to the schema, have you reviewed the RFC process?
  • If submitting code/script changes, have you verified all tests pass locally using make test?
  • If submitting schema/fields updates, have you generated new artifacts by running make and committed those changes?
  • Is your pull request against master? Unless there is a good reason otherwise, we prefer pull requests against master and will backport as needed.
  • Have you added an entry to the CHANGELOG.next.md?

RFC Preview

@jamiehynds jamiehynds changed the title Create 0008-email.md E-Mail RFC Oct 5, 2020
@ebeahan ebeahan added the RFC label Oct 5, 2020
@ebeahan ebeahan changed the title E-Mail RFC [RFC] E-Mail RFC Oct 5, 2020
@ebeahan ebeahan changed the title [RFC] E-Mail RFC [RFC] E-Mail Oct 5, 2020
@vpiserchia vpiserchia left a comment

A series of questions

| `email.process` | keyword | Name of the executable that carried out the transaction, e.g. outlook, sendmail |
| `email.protocol` | keyword | The email protocol used, e.g. SMTP, IMAP |
| `email.reply_to.address` | keyword | Reply-to address |
| `object.return.address` | keyword | The return address for the message |

Shouldn't this be email.return.address?

| `email.cc.address` | keyword | Addresses of Cc's |
| `email.cc.domain` | keyword | Domains of Cc addresses |
| `email.cipher` | keyword | Cipher used e.g. TLS |
| `email.file.count` | value | Number of attachments included in the message |

Why not allow multiple files, as in an array?

I would use the term 'attachment' instead of 'file'.
Schema should support multiple attachments.

file already exists as an ECS field; it would be great to reuse it.

Member

Agreed that attachment is more idiomatic when referring to email than file. Also agreed we will need to consider that a single email can have multiple attachments.

The file.* fieldset in ECS is defined around a file that is created or exists on a filesystem, so many of the fields underneath file.* will probably not apply to an email attachment. Though this will probably vary based on the source of the events.

Contributor

I agree with the name "attachments", and that it should support capturing multiple attachments. Note that I think it should be pluralized, since there can be many. This would be in line with how we named dns.answers.*

Here are a few suggestions on changes we could make to the attachment fields as we currently have them:

  • Rename all mentions of email.file. to email.attachments.
  • Rename email.file.count to email.attachments_count, because everything after the . will be present once for each entry in the array (for each attachment)
  • Make email.attachments_count a long
  • Make email.attachments.size a long
  • Make email.attachments.name a wildcard

This will be an array of objects. For Elasticsearch to index it so that we can query on multiple attachment attributes at once (e.g. extension:mp3 AND size < 100000) and get the expected results, we would need to make email.attachments a nested field.

Querying nested fields is slightly different than normal fields, however (API, KQL).

It looks to me like it has good support across the stack all the way to KQL, which is great. But since this will be the first use of this type in ECS, and since querying these fields is a bit different than usual, I think this would be worth a mention in the Concerns section.

I would still adjust the field listing assuming we'll use nested, since it looks like the best approach. So let's add an extra row here for it:

| `email.attachments` | nested | Array of objects containing information about each email attachment. |

I like @vpiserchia's suggestion of reusing file here. That could be an option, given that we can now reuse as another name (reuse file as email.attachments). However I agree with @ebeahan, that this would bring in way too many fields that are not useful in this context (consider the fields reused under file on top of this). So I think we could err on the side of defining the sub-fields explicitly for now, and trying to keep them consistent with their cousins as defined under file.*. I do think it's worth capturing this as another point in the Concerns section, though.
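
To illustrate why the nested type matters for the `extension:mp3 AND size < 100000` query mentioned above, here is a minimal Python sketch. It does not call Elasticsearch; it only mimics how an array of objects behaves when its sub-fields are flattened into per-field value lists (the default) versus when each object is matched on its own (nested). The sample attachments and the helper names are made up for illustration, and the `extension` sub-field is an assumption taken from the query example.

```python
# Hypothetical sample: two attachments on a single email.
attachments = [
    {"name": "song.mp3", "extension": "mp3", "size": 5_000_000},
    {"name": "note.txt", "extension": "txt", "size": 1_200},
]


def matches_flattened(items, extension, max_size):
    """Mimic default object arrays: each sub-field becomes one flat list of
    values, so the two conditions can be met by different attachments."""
    extensions = [item["extension"] for item in items]
    sizes = [item["size"] for item in items]
    return extension in extensions and any(size < max_size for size in sizes)


def matches_nested(items, extension, max_size):
    """Mimic nested fields: both conditions must hold for the same attachment."""
    return any(
        item["extension"] == extension and item["size"] < max_size
        for item in items
    )


# The mp3 is large and the small file is a txt, so no single attachment matches.
flattened_hit = matches_flattened(attachments, "mp3", 100_000)
nested_hit = matches_nested(attachments, "mp3", 100_000)
```

In this sketch the flattened check reports a match even though no single attachment is both an mp3 and under 100000 bytes, which is the kind of unexpected result the nested type is meant to avoid.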

| `email.size` | keyword | Total size of the message, in bytes, including attachments |
| `email.subject` | keyword | Subject of the message |
| `email.to` | keyword | Recipieint address |
| `email.to.domain` | keyword | Recipient domain |

email.to is already a field of type keyword, so how can you define email.to as an object with domain as a subfield? Sorry for the silly question.

Member

Good catch! 👀

I believe email.to will be corrected to email.to.address.
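
The conflict raised here (a field that is both a leaf value and the parent of other fields) can be checked mechanically. The following is a small sketch, not part of the ECS tooling; the function name is hypothetical, and it deliberately ignores multi-fields, which are a separate mapping mechanism.

```python
def find_leaf_object_conflicts(field_names):
    """Return field names that are declared as a leaf and also used as a parent."""
    declared = set(field_names)
    conflicts = set()
    for name in declared:
        parts = name.split(".")
        # Every proper prefix of a dotted name has to be an object in the mapping.
        for i in range(1, len(parts)):
            prefix = ".".join(parts[:i])
            if prefix in declared:
                conflicts.add(prefix)
    return sorted(conflicts)


# Field names as listed in the proposal, then with the suggested correction.
proposed = ["email.to", "email.to.domain", "email.from.address", "email.from.domain"]
corrected = ["email.to.address", "email.to.domain", "email.from.address", "email.from.domain"]

conflicts_before = find_leaf_object_conflicts(proposed)
conflicts_after = find_leaf_object_conflicts(corrected)
```

With the proposed names the check flags `email.to`; after renaming it to `email.to.address` the list of conflicts is empty.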

| `email.from.domain` | keyword | Senders domain |
| `email.latency` | keyword | The time, in milliseconds, the delivery attempt took |
| `email.message_id` | keyword | Internet message ID of the message |
| `email.process` | keyword | Name of the executable that carried out the transaction, e.g. outlook, sendmail |

MTA?

Contributor

Good call out.

Although I wonder what the intent is, with this field? Is it:

  1. to capture simply the name of the event source under management (e.g. I'm parsing sendmail logs, therefore email.process is hardcoded to "sendmail" in my pipeline),
  2. or to capture the series of agents that took part in transmission of the email (e.g. from each of the Received headers).

I'm curious what folks would like to have here. Would both of these fields be useful?

Why add a field for this? IMO this can be captured in the process.* fields, specifically the process.name field.

This would be in line with the statement on email.action

Member

@P1llus P1llus Nov 3, 2020

I agree with @SHolzhauer on this, I think the name of the process should follow the current ECS format.

When it comes to question number 2 from @webmat it is something that might be useful in a separate field.

| `object.return.address` | keyword | The return address for the message |
| `email.size` | keyword | Total size of the message, in bytes, including attachments |
| `email.subject` | keyword | Subject of the message |
| `email.to` | keyword | Recipieint address |

Contributor

email.to.address


<!-- An RFC should link to the PRs for each of it stage advancements. -->

* Stage 0: https://github.com/elastic/ecs/pull/NNN

Contributor

Suggested change
* Stage 0: https://github.com/elastic/ecs/pull/NNN
* Stage 0: https://github.com/elastic/ecs/pull/999

| `object.return.address` | keyword | The return address for the message |
| `email.size` | keyword | Total size of the message, in bytes, including attachments |
| `email.subject` | keyword | Subject of the message |
| `email.to` | keyword | Recipieint address |

@jeffrysleddens jeffrysleddens Oct 6, 2020

I think 'recipient' is a more common term when it comes to email than 'to'.
Also the schema should support multiple recipients. Could also add a 'type' to 'recipient' to indicate if recipient was added to 'to', 'cc' or 'bcc'.

This really depends on whether you want to capture the semantics of the envelope/smtp, the email headers, or both. This applies to other fields as well (to, cc, bcc, from).

Contributor

I agree there are nuances in recipients, and in how closely we want to represent each protocol.

After looking at a few raw emails, I'm starting to think we should keep the design of this field set pretty high level.

If capturing each protocol's nuances is desired or necessary, then I think we should consider working on a specific breakdown per protocol, as a separate step.

So I think the guiding principle we could use here is to capture the commonalities in email.*, even if it means merging some concepts, sometimes.

One point you're raising is pretty interesting, however. Should we capture each "type" of recipient in a different field, or in one parent field with an additional label to indicate which type of recipient this is?

The current proposal takes the former approach:

{ "email": {
  "to": [ {"address": "alice@example.com", ...} ],
  "cc": [ {"address": "bob@example.com", ...} ],
} }

The "recipient" suggestion would look like:

{ "email": {
  "recipients": [
    { "type": "to", "address": "alice@example.com", ...},
    { "type": "cc", "address": "bob@example.com", ...},
  ]
} }

🤔
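
For comparison, the two layouts above carry the same information and can be converted into each other. The sketch below is only an illustration: the function names are hypothetical, and the `recipients`/`type` keys come from the suggestion in this thread, not from any agreed schema.

```python
def to_recipient_list(email):
    """Convert the per-type layout (to/cc/bcc arrays) into one typed list."""
    recipients = []
    for kind in ("to", "cc", "bcc"):
        for entry in email.get(kind, []):
            recipients.append({"type": kind, **entry})
    return {"recipients": recipients}


def to_per_type(email):
    """Convert the typed "recipients" list back into per-type arrays."""
    result = {}
    for entry in email.get("recipients", []):
        fields = {key: value for key, value in entry.items() if key != "type"}
        result.setdefault(entry["type"], []).append(fields)
    return result


per_type = {
    "to": [{"address": "alice@example.com"}],
    "cc": [{"address": "bob@example.com"}],
}
typed = to_recipient_list(per_type)
round_trip = to_per_type(typed)
```

Since the conversion is lossless in both directions, the choice between the two is mostly about how the fields will be queried and visualized, not about what can be represented.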

@vpiserchia vpiserchia Oct 27, 2020

I really like this one:

{ "email": { "to": [ {"address": "alice@example.com", ...} ], "cc": [ {"address": "bob@example.com", ...} ], } }

This avoids the need for nested objects. It also opens the door to a new entry in the "related" field:

related.emails: ["alice@example.com", "bob@example.com"]

Member

From an ingest perspective and a visualization perspective, I think it's better to keep the "to, cc, bcc" field structure over type.
The point from @vpiserchia is a good one, because once we start having lists of objects we will have issues with the internal structure when it comes to visualizations, parsing and aggregations.

| `email.process` | keyword | Name of the executable that carried out the transaction, e.g. outlook, sendmail |
| `email.protocol` | keyword | The email protocol used, e.g. SMTP, IMAP |
| `email.reply_to.address` | keyword | Reply-to address |
| `object.return.address` | keyword | The return address for the message |

.address exists for email.reply_to and object.return, but .domain does not. These should be consistent and have .domain as an additional field.

| `email.file.name` | keyword | File name of attachements |
| `email.file.size` | keyword | Total size of all attachements in bytes |
| `email.direction` | keyword | Direction of the message based on the sending and receving domains |
| `email.from.address` | keyword | Senders email address |

The schema should support and distinguish both the envelope/smtp 'from' and the header/mime 'from'.

Contributor

@jeffrysleddens Do you think it would be worth considering a specific section in the schema for a breakdown of protocols like SMTP and POP3?

Or do you think a general purpose email.* field set is enough?

On a related note, I think we should also consider adding email.reply_to.address and email.reply_to.domain, which can be different than the "from" address.

Member

How would we distinguish which hash is for which file, and the same with size?

Looking at the comment above about changing file to attachments, we will still need, for example, a list of objects if we want to keep track of the size, hash and extension belonging to a single file/attachment.
This kinda goes against the current features we normally support (lists have very limited possibilities in terms of visualizations and SIEM/Alert rules)

Are we thinking all of these fields should just be an array?

The goal here is to research and understand the impact of these changes on users in the community and development teams across Elastic. 2-5 sentences each.
-->

## Concerns

@BenB196 BenB196 Oct 7, 2020

When looking at the current fields provided, one of my concerns is it appears that they don't fit well with the rest of ECS. I think this can be partially fixed with the use of aliases, though, I don't believe aliases are standard/common in ECS.

Examples:

email.from -> source
email.to|cc|bcc -> destination
email.latency -> event.duration
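
One detail worth noting for the last pairing: the proposal describes `email.latency` in milliseconds, while ECS defines `event.duration` in nanoseconds, so a plain alias would not be enough and the value would need converting at ingest time. A minimal sketch of that conversion follows; the function name and the sample value are hypothetical.

```python
NANOS_PER_MILLI = 1_000_000


def latency_ms_to_event_duration(latency_ms):
    """Convert a latency in milliseconds (number or numeric string) to the
    nanosecond value expected by event.duration."""
    return int(round(float(latency_ms) * NANOS_PER_MILLI))


# Hypothetical source value of 250 ms, logged as a string.
event = {"event": {"duration": latency_ms_to_event_duration("250")}}
```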

Contributor

latency might be redundant depending on the specific action being recorded, but I wouldn't equate email.to|from with source and destination (or client/server) network entities

Contributor

Agreed, I think email has enough subtleties in both the "senders" (sender, reply_to, return_path) and the receivers (to, cc, bcc) that I don't think it makes sense to put them in [source|destination].user.email. Also, these aren't array fields anyway, so we couldn't capture everything from the get go.

If an email server logs the source.ip from the MX that sent them an email though, that's totally appropriate to capture in that field.

# 0008: Email
<!-- Leave this ID at 0000. The ECS team will assign a unique, contiguous RFC number upon merging the initial stage of this RFC. -->

- Stage: **1** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->

Member

Suggested change
- Stage: **1** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
- Stage: **1 (proposal)** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->

Not required, but including the stage name with the stage number has become an informal RFC convention 😄

- Stage: **1** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
- Date: **Oct 5th 2020** <!-- The ECS team sets this date at merge time. This is the date of the latest stage advancement. -->

This RFC proposes a new top-level field to facilitate email use cases.

Member

Email has similar challenges to web browsing when you start looking at how to model it. Obviously, what is generally just referred to as "email" is actually a complex system of processes and protocols: SMTP, IMAP, POP3, SPF, DMARC, DKIM, DNS, TLS, x509, etc. etc.

I think having the top-level fieldset for email.* makes sense as the starting point. However, I can also see need for additional top-level fields depending on the data source (e.g. if Packetbeat/Elastic Agent added support for SMTP, perhaps a smtp.* field becomes necessary).

I mostly bring this up to set the expectation that, to accurately capture the different facets of email in ECS, we'll likely be adding several new fieldsets over time. 😄

Contributor

I'd like us to capture two aspects of Eric's point in the Concerns section for now, so let's add two subsections there:

1 - There are many types of legitimate "email events" that could be captured. This RFC is currently focusing on a subset. But we'll want to keep in mind the whole list of possible types of email events, to make sure there's room in the schema for all. Here's the list:

  • spam filter events
  • email server log events
  • deliverability events from an MTA (deferred, delivered, bounce, dropped, etc)
  • engagement events (opens, clicks, unsubscribe, spam report)
  • Email infrastructure monitoring (think dmarcian)

2 - One design decision we'll have to make is whether we also introduce fields for the 3 main "email protocols" (SMTP, IMAP and POP3), or do we try to fit most things under email.*?

Member

@P1llus P1llus Nov 3, 2020

@webmat Would there be any reason to discuss having the protocols under email rather than a top field level?

We could, for example, treat smtp, imap and pop3 as we do user.*, where it is expected to be nested under email.* but could meet other use cases moving forward.

Contributor

Quick question about these comments. Is email authentication (i.e. SPF/DKIM/DMARC) in scope for this RFC? Mainly wondering because they're really a product of alignment on Return-Path (MAIL FROM), From, and DKIM-Signature mail headers. Not that I recommend all implementations that can capture these headers attempt to validate SPF/DKIM/DMARC, but it's totally possible given only the message content (and some fun DNS lookups).

I think that they fall under a pretty different realm than say spam/reputation scoring, analytics reports, MTA actions, or even email protocols that were mentioned.

| `email.file.size` | keyword | Total size of all attachements in bytes |
| `email.direction` | keyword | Direction of the message based on the sending and receving domains |
| `email.from.address` | keyword | Senders email address |
| `email.from.domain` | keyword | Senders domain |

Member

Just making a note that as we refine and iterate on this list of fields, we will want to consider which would be good candidates for using wildcard over keyword (the .name and .domain fields are a couple of places).

Contributor

@webmat webmat left a comment

Excellent start, @jamiehynds.

Thanks everyone so far for chiming in. Looks like there's a lot of interest in getting support for email in ECS 😃

I think there's a lot of straightforward suggestions already made (in my review and prior reviews) that we could apply to the current RFC.

There's also many big discussions that don't have an obvious answer. Those should be captured in the "Concerns" section, so that these problem statements are part of the RFC document itself, and that we carry to the next RFC stages. I've tried to point them out in my review.

| field | type | description |
| --- | --- | --- |
| `email.action` | keyword | Action take by the source device, e.g. delivered, blocked, quarantined, deleted |

Contributor

I would rather have integrations use event.action than introduce a new action field.

Member

I agree with this as well, we should reuse the ECS fields we already have when we can.

Comment on lines 27 to 30
| `email.bcc.address` | keyword | Addresses of Bcc's |
| `email.bcc.domain` | keyword | Domains of the Bcc's |
| `email.cc.address` | keyword | Addresses of Cc's |
| `email.cc.domain` | keyword | Domains of Cc addresses |

Contributor

This is a good first start, for capturing each array of emails. I would make all of them plural, however.

Looking at this, I wonder if we should consider the nested type here as well? I could see a desire to break down the domains some more, like elsewhere in ECS. domain => registered_domain => top_level_domain. I'm not 100% sure if such a breakdown per email address is actually useful, though *.

If not, perhaps two distinct arrays of keywords are totally fine, like

  "email": {
    "cc": {
      "addresses": ["alice@example.com", "bob@example.com", "cam@example2.com"],
      "domains": ["example.com", "example2.com"]
    }
  }

* My feeling is that in most cases, the sender's email address would be a suspicious one we want to break down all the way, but recipients are not "suspicious", they're the unwitting recipients for the email :-)
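
As a rough illustration of the "two distinct arrays" option, the `domains` array could be derived from `addresses` at ingest time. This is only a sketch: the function name is hypothetical, and lowercasing and de-duplicating the domains are assumptions, not something the RFC specifies.

```python
def unique_domains(addresses):
    """Return the domains of the given addresses, lowercased, without
    duplicates, in order of first appearance."""
    seen = []
    for address in addresses:
        # rpartition splits on the last "@", which also copes with the rare
        # quoted local part that contains an "@" itself.
        _, separator, domain = address.rpartition("@")
        domain = domain.lower()
        if separator and domain and domain not in seen:
            seen.append(domain)
    return seen


cc_addresses = ["alice@example.com", "bob@example.com", "cam@example2.com"]
cc = {"addresses": cc_addresses, "domains": unique_domains(cc_addresses)}
```

Note that once the domains are de-duplicated, the two arrays no longer line up index by index, which is one of the trade-offs of this layout compared to nested objects.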

Member

@P1llus P1llus Nov 3, 2020

@webmat I think it would make sense to have the structure you have above, but in the sense of looking for a specific address or domain, we should also accompany this with related.email and related.domain

Member

Under the current guidance with the related.* fields, email addresses are captured in related.user and domains in related.hosts.

Whether we continue with this guidance or adjust and add additional related.* fields is a good conversation to have. 🤔
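
Under the current guidance described above, populating the related.* fields from an email event could look roughly like the sketch below. The input shape reuses the `addresses`/`domains` arrays suggested earlier in this thread, and the function name is hypothetical.

```python
def build_related(email):
    """Collect email addresses into related.user and domains into related.hosts,
    keeping first-seen order and dropping duplicates."""
    users, hosts = [], []
    for kind in ("from", "to", "cc", "bcc"):
        group = email.get(kind, {})
        for address in group.get("addresses", []):
            if address not in users:
                users.append(address)
        for domain in group.get("domains", []):
            if domain not in hosts:
                hosts.append(domain)
    return {"user": users, "hosts": hosts}


# Hypothetical event using the plural array layout.
email = {
    "to": {"addresses": ["alice@example.com"], "domains": ["example.com"]},
    "cc": {
        "addresses": ["bob@example.com", "cam@example2.com"],
        "domains": ["example.com", "example2.com"],
    },
}
related = build_related(email)
```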

| `email.latency` | keyword | The time, in milliseconds, the delivery attempt took |
| `email.message_id` | keyword | Internet message ID of the message |
| `email.process` | keyword | Name of the executable that carried out the transaction, e.g. outlook, sendmail |
| `email.protocol` | keyword | The email protocol used, e.g. SMTP, IMAP |

Contributor

I'm thinking we should favor network.protocol here instead.

Member

+1 for this.

| `email.from.address` | keyword | Senders email address |
| `email.from.domain` | keyword | Senders domain |
| `email.latency` | keyword | The time, in milliseconds, the delivery attempt took |
| `email.message_id` | keyword | Internet message ID of the message |

Contributor

Message IDs can be pretty creative. For example one of the message IDs for this PR's email notifications was <elastic/ecs/pull/999/review/503143839@github.com>.

So I would make this one wildcard.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevertheless, the message_id captures the uniqueness of an email.
Different mail servers have specific ways of building this Message ID, and capturing that behaviour could be useful for identification purposes and for spotting anomalies. With that said, a multi-field mapping would make sense here:

| `email.message_id` | keyword | Internet message ID of the message |
| `email.message_id.text` | text | Internet message ID of the message for full text search |
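
As a rough illustration of the multi-field suggestion above, the two rows could translate into an Elasticsearch mapping fragment along these lines. This is only a sketch: the constant name `EMAIL_MESSAGE_ID_MAPPING` is hypothetical, and the fragment is expressed as a Python dict rather than taken from the RFC itself.

```python
import json

# Hypothetical mapping fragment (not from the RFC): `keyword` for exact
# matching and aggregations, plus a `text` sub-field for full-text search.
EMAIL_MESSAGE_ID_MAPPING = {
    "properties": {
        "email": {
            "properties": {
                "message_id": {
                    "type": "keyword",
                    "fields": {
                        # Addressable in queries as `email.message_id.text`.
                        "text": {"type": "text"},
                    },
                }
            }
        }
    }
}

if __name__ == "__main__":
    # Print the fragment as JSON, e.g. to paste into an index template.
    print(json.dumps(EMAIL_MESSAGE_ID_MAPPING, indent=2))
```

With a mapping like this, exact lookups would target `email.message_id`, while tokenized searches would target `email.message_id.text`.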

| `object.return.address` | keyword | The return address for the message |
| `email.size` | keyword | Total size of the message, in bytes, including attachments |
| `email.subject` | keyword | Subject of the message |
| `email.to` | keyword | Recipient address |
Contributor

I agree there are nuances in recipients, and in how closely we want to represent each protocol.

After looking at a few raw emails, I'm starting to think we should keep the design of this field set pretty high level.

If capturing each protocol's nuances is desired or necessary, then I think we should consider working on a specific breakdown per protocol, as a separate step.

So I think the guiding principle we could use here is to capture the commonalities in email.*, even if it means merging some concepts, sometimes.

One point you're raising is pretty interesting, however. Should we capture each "type" of recipient in a different field, or in one parent field with an additional label to indicate which type of recipient this is?

The current proposal takes the former approach:

{ "email": {
  "to": [ {"address": "alice@example.com", ...} ],
  "cc": [ {"address": "bob@example.com", ...} ],
} }

The "recipient" suggestion would look like:

{ "email": {
  "recipients": [
    { "type": "to", "address": "alice@example.com", ...},
    { "type": "cc", "address": "bob@example.com", ...},
  ]
} }

🤔
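
To make the trade-off between the two shapes above more concrete, here is a minimal sketch showing that they carry the same information and can be converted into one another. The function names and the `RECIPIENT_TYPES` tuple are hypothetical and only serve the illustration.

```python
from typing import Dict, List

# Recipient kinds considered in this sketch (an assumption for illustration).
RECIPIENT_TYPES = ("to", "cc", "bcc")


def to_typed_recipients(email: Dict[str, List[dict]]) -> List[dict]:
    """Flatten per-type lists into one list with a `type` label on each entry."""
    recipients = []
    for kind in RECIPIENT_TYPES:
        for entry in email.get(kind, []):
            recipients.append({"type": kind, **entry})
    return recipients


def to_per_type_fields(recipients: List[dict]) -> Dict[str, List[dict]]:
    """Group a typed `recipients` list back into one field per recipient type."""
    grouped: Dict[str, List[dict]] = {}
    for entry in recipients:
        rest = {key: value for key, value in entry.items() if key != "type"}
        grouped.setdefault(entry["type"], []).append(rest)
    return grouped


if __name__ == "__main__":
    per_type = {
        "to": [{"address": "alice@example.com"}],
        "cc": [{"address": "bob@example.com"}],
    }
    typed = to_typed_recipients(per_type)
    print(typed)
    print(to_per_type_fields(typed) == per_type)
```

Since the conversion is lossless in both directions, the choice is mostly about which shape is easier to query and aggregate on.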

Stage 0: Provide a high level summary of the premise of these changes. Briefly describe the nature, purpose, and impact of the changes. ~2-5 sentences.
-->

## Fields
Member

Another item that came to mind, and I think makes sense to capture for further discussion later:

Would email be worth considering as an additional allowed value for event.category?

Member

I would say +1 on this.

@ebeahan if we create this new event.category, we should also update which event.type values can be combined with it, and maybe see whether we need new ones.

Contributor

Let's mention this in the RFC "Fields" section.

@ebeahan ebeahan mentioned this pull request Oct 30, 2020
5 tasks
@P1llus
Member

P1llus commented Nov 9, 2020

@ebeahan @webmat

Just added the first iteration of the changes we discussed. Hopefully this commit covers all of the above comments for now, some of which we could investigate later.
I added one more field to the list, attachments_count, which sits outside the flattened attachments fields. This will help with visualisations and parser writing.

Could you please let me know if my commit covered all of the topics we wanted to cover for now?

| `email.sender.top_level_domain` | keyword | Senders email address |
| `email.message_id` | keyword | Internet message ID of the message |
| `email.reply_to.address` | wildcard | Reply-to address |
| `email.return.address` | wildcard | The return address for the message |
Contributor

What about renaming this to return_path -- it's a bit more descriptive of what I think you're actually going for here.

Member

I agree on this point as well, will update it in the next commit.

Member

@ebeahan ebeahan left a comment

Thanks @P1llus!

I think we should still list some of the discussed points in the Concerns section. We don't have to have resolutions at this point. We simply want to capture them now for further discussion later:

  • The two items Mat captured in an earlier review thread: [RFC] E-Mail #999 (comment)
  • Possibility of needing an email event.category and possibly new event.type values.
  • Maybe worth noting that if the flattened data type is used for email.* fields, it would be the first time using this type in ECS?

@P1llus
Member

P1llus commented Nov 10, 2020

Adding in the latest changes @ebeahan 👍

| `email.size` | keyword | Total size of the message, in bytes, including attachments |
| `email.subject` | wildcard | Subject of the message |
| `email.recipients.addresses` | keyword | Recipient addresses |
| `email.domains` | keyword | domains related to the email |

This field really feels like it should be part of the related fields. Something like related.domains (though it currently doesn't exist, so it might be worth keeping here)

Member

Yeah, this field is the outcome of an ongoing discussion. Instead of having separate domain fields for bcc, cc, recipients, etc., we decided for now to keep them all as an array under one field. This might change in the upcoming stages. Thanks for the pointer, always happy to get feedback.
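
As a sketch of what populating a single array of domains could look like in a parser, the helper below gathers the domain part of every address, whatever its role, into one de-duplicated list. The function name `collect_domains` and the shape of its input are assumptions made for illustration, not part of the RFC.

```python
from typing import Dict, List


def collect_domains(addresses_by_role: Dict[str, List[str]]) -> List[str]:
    """Return the sorted, de-duplicated domains of all addresses, regardless of role."""
    domains = set()
    for addresses in addresses_by_role.values():
        for address in addresses:
            # Skip values that do not look like an address at all.
            if "@" in address:
                domains.add(address.rpartition("@")[2].lower())
    return sorted(domains)


if __name__ == "__main__":
    print(collect_domains({
        "sender": ["a@Example.com"],
        "recipients": ["b@example.com", "c@example.org"],
        "bcc": [],
    }))
```

Note that the role of each domain (sender, cc, bcc and so on) is lost in the resulting list, which is the directionality trade-off raised in this thread.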

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my main concern with a related field is that you lose the directionality of the value, which might be useful for some use cases.

Member

We do have email.direction, though. Would that be sufficient? We would calculate the direction before moving the different domains into email.domains, for example.

| `email.cc.addresses` | wildcard | Addresses of Cc's |
| `email.attachments_count` | long | A field outside the flattened structure to control how many attachments are included in the email |
| `email.attachments` | flattened | A flattened field for anything related to attachments. This allows objects being stored with all information for each file when you have multiple attachments |
| `email.direction` | keyword | Direction of the message based on the sending and receiving domains |
Contributor

What do we foresee as values for this, and which address fields (to, cc, bcc) would the categorization be based on?
It seems to me like something that could potentially be difficult to implement, and I'm not sure of the value for visualizations (but I could easily be missing something obvious, it's been one of those days...)

Contributor

Good question, @dainperkins.

I assume the allowed values in there should be "inbound" and "outbound". Perhaps also "unknown" in the case of relays? Actually just like network.direction, "internal" is another class of emails that has a different threat profile. I wonder if there's a need for the value "external" (as in, I'm just an exchange, relaying between Yahoo and Gmail)?

I agree populating this consistently may not be obvious in all scenarios.

I don't think that, as a third party, our solutions can distinguish between "inbound", "outbound" and "internal" without specific configuration that says which domains are "my domains".

But once we know that, I assume the heuristic is pretty straightforward:

  • direction = inbound when from is not one of "my domains"
  • direction = outbound when from = "my domains" and at least one receiver (to, cc, bcc) contains addresses not in "my domains"
  • direction = internal when from = "my domains" and all receivers are "my domains"
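
The heuristic in the list above can be sketched roughly as follows. This assumes the all-internal case is labelled "internal", as the surrounding discussion suggests, and that the set of "my domains" is supplied through configuration. The function name `classify_direction` and its parameters are hypothetical.

```python
from typing import Iterable, Set


def classify_direction(
    from_domain: str,
    recipient_domains: Iterable[str],
    my_domains: Set[str],
) -> str:
    """Classify an email as inbound, outbound or internal (sketch only).

    `recipient_domains` is assumed to cover all receivers (to, cc, bcc).
    """
    mine = {domain.lower() for domain in my_domains}

    # Sender is not one of "my domains": treated as inbound, which also
    # covers pure relays in this simplified sketch.
    if from_domain.lower() not in mine:
        return "inbound"

    # Sender is one of "my domains" and at least one receiver is external.
    if any(domain.lower() not in mine for domain in recipient_domains):
        return "outbound"

    # Sender and all receivers are "my domains" (an empty receiver list
    # also ends up here).
    return "internal"


if __name__ == "__main__":
    mine = {"example.com"}
    print(classify_direction("gmail.com", ["example.com"], mine))
    print(classify_direction("example.com", ["example.com", "yahoo.com"], mine))
    print(classify_direction("example.com", ["example.com"], mine))
```

Values such as "unknown" or "external" are left out on purpose, since whether they are needed is still an open question here.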

So I'm +1 on adding the field. I think it makes sense. And unless I'm missing something, I think the heuristics are reasonable; and actually, perhaps some of the email-related event sources already provide such values? It's certainly useful for a spam filter to know which emails to filter. Not sure if it shows up in their logs though.

Action item for the RFC, though: let's start listing expected values for this field. I'm providing ideas above as a strawperson, based on what we have in network.direction. But if email data sources have other values for this, let's bring them to the table as well.

Contributor

@webmat webmat left a comment

This RFC has by far surpassed the criteria for stage 1. We have a new author (thanks for picking it up, Marius) and a sponsor. And the fields of the proposal are looking really good already.

Of course on such an interesting subject, I couldn't resist providing some detailed technical feedback below.

But I think we should merge this as a stage 1 after a few minor adjustments which I'll outline in the body of this review, and we should address the rest of the technical feedback in a stage 2 PR.

The only points I'd like to address before merging this at stage 1 are:

  • adjusting the "People" heading
  • making the list of people into a bullet list
  • adjusting the "stage 1" link at the very bottom
  • and populating the "references" section

Everything else in my review can be addressed in the stage 2 PR. If you have time to incorporate more feedback in your next session on this, all the better. But feel free to just address the points above so we can move forward.

| `email.bcc.addresses` | wildcard | Addresses of Bcc's |
| `email.cc.addresses` | wildcard | Addresses of Cc's |
| `email.attachments_count` | long | A field outside the flattened structure to control how many attachments are included in the email |
| `email.attachments` | flattened | A flattened field for anything related to attachments. This allows objects being stored with all information for each file when you have multiple attachments |
Contributor

The .attachments.* fields should follow the file.* fields. We can state this approach in the description for now.

We can see later about the implementation, whether it's full reuse, or explicitly defining the fields that make sense for attachments.

| `email.attachments_count` | long | A field outside the flattened structure to control how many attachments are included in the email |
| `email.attachments` | flattened | A flattened field for anything related to attachments. This allows objects being stored with all information for each file when you have multiple attachments |
| `email.direction` | keyword | Direction of the message based on the sending and receiving domains |
| `email.sender.address` | wildcard | Senders email address |
Contributor

If I remember correctly, address will contain the full Person Name <person@example.com>.

We're defining the domain breakdown fields here because the sender is potentially a threat, and this is where we'll be looking for known bad domains/TLDs and so on.
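
As an illustration of the breakdown described above, the sketch below splits a raw `Person Name <person@example.com>` value into its parts using Python's standard `email.utils.parseaddr`. The function name `split_sender` and the keys of the returned dict are hypothetical, and taking the last label as the top-level domain is a deliberately naive assumption (multi-label suffixes such as `co.uk` would need a public suffix list).

```python
from email.utils import parseaddr


def split_sender(raw: str) -> dict:
    """Split a raw sender header value into display name, address and domain parts."""
    display_name, address = parseaddr(raw)

    # Everything after the last "@" is treated as the domain.
    domain = address.rpartition("@")[2].lower() if "@" in address else ""

    # Naive assumption: the top-level domain is the last dot-separated label.
    top_level_domain = domain.rpartition(".")[2] if domain else ""

    return {
        "display_name": display_name,
        "address": address,
        "domain": domain,
        "top_level_domain": top_level_domain,
    }


if __name__ == "__main__":
    print(split_sender("Person Name <person@example.com>"))
```

The same helper could be applied to reply-to and return-path values if breakdown fields turn out to be needed for them as well.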

But looking at the fields, I wonder if we should do the same with email.reply_to.address and email.return_path.address? They're also relevant to the sender.

We can hold off on adding them for now, but I'm floating the idea to get feedback on whether there's a need for them.

Member

I think it's worth exploring in the upcoming stage for sure, if that is appropriate.

The goal here is to research and understand the impact of these changes on users in the community and development teams across Elastic. 2-5 sentences each.
-->

## Concerns
Contributor

Agreed, I think email has enough subtleties in both the "senders" (sender, reply_to, return_path) and the receivers (to, cc, bcc) that I don't think it makes sense to put them in [source|destination].user.email. Also, these aren't array fields anyway, so we couldn't capture everything from the get go.

If an email server logs the source.ip from the MX that sent them an email though, that's totally appropriate to capture in that field.

@webmat webmat added the 1.8.0 label Nov 23, 2020
@P1llus
Member

P1llus commented Nov 26, 2020

Added the required changes you mentioned, plus a few more tweaks from the items that could have waited for stage 2, @webmat. Anything else we should do at this point?

webmat
webmat previously approved these changes Nov 30, 2020
Contributor

@webmat webmat left a comment

Alright, thanks for the adjustments, @P1llus! I'll adjust the date and merge.

@webmat webmat merged commit 0b47844 into elastic:master Nov 30, 2020
@ebeahan ebeahan removed the 1.8.0 label Dec 14, 2020
@peasead peasead mentioned this pull request Aug 2, 2021
8 tasks
10 participants