[RFC] E-Mail #999

jamiehynds · 2020-10-05T14:21:48Z

Have you signed the contributor license agreement?
Have you followed the contributor guidelines?
For proposing substantial changes or additions to the schema, have you reviewed the RFC process?
If submitting code/script changes, have you verified all tests pass locally using make test?
If submitting schema/fields updates, have you generated new artifacts by running make and committed those changes?
Is your pull request against master? Unless there is a good reason otherwise, we prefer pull requests against master and will backport as needed.
Have you added an entry to the CHANGELOG.next.md?

vpiserchia

A series of questions

vpiserchia · 2020-10-06T16:21:47Z

rfcs/text/0008-email.md

+| `email.process` | keyword | Name of the executable that carried out the transaction, e.g. outlook, sendmail |
+| `email.protocol` | keyword | The email protocol used, e.g. SMTP, IMAP |
+| `email.reply_to.address` | keyword | Reply-to address |
+| `object.return.address` | keyword | The return address for the message |


Shouldn't this be email.return.address?

vpiserchia · 2020-10-06T16:22:16Z

rfcs/text/0008-email.md

+| `email.cc.address` | keyword | Addresses of Cc's |
+| `email.cc.domain` | keyword | Domains of Cc addresses |
+| `email.cipher` | keyword | Cipher used e.g. TLS |
+| `email.file.count` | value | Number of attachments included in the message |


Why not allow multiple files, as in an array?

I would use the term 'attachment' instead of 'file'.
Schema should support multiple attachments.

file already exists as an ECS field; it would be great to reuse it.

Agreed that attachment is more idiomatic when referring to email than file. Also agreed we will need to consider that a single email can have multiple attachments.

The file.* fieldset in ECS is defined around a file that is created or exists on a filesystem, so many of the fields underneath file.* will probably not apply to an email attachment. Though this will probably vary based on the source of the events.

I agree with the name "attachments", and that it should support capturing multiple attachments. Note that I think it should be pluralized, since there can be many. This would be in line with how we named dns.answers.*

Here's a few suggestions on changes we could make to the attachment fields as we currently have them:

Rename all mentions of email.file. to email.attachments.

Rename email.file.count to email.attachments_count, because everything after the . will be present once for each entry in the array (for each attachment)

Make email.attachments_count a long

Make email.attachments.size a long

Make email.attachments.name a wildcard

This will be an array of objects. In order for Elasticsearch to index it so that we can query on multiple attachment attributes at a time (e.g. extension:mp3 AND size < 100000) and get the expected results, we would need to make email.attachments a nested field.

Querying nested fields is slightly different than normal fields, however (API, KQL).
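
To illustrate why the nested type matters for the example query above, here is a minimal Python sketch. It does not talk to Elasticsearch; it only mimics how an array of objects loses per-object boundaries when indexed as a plain object field. The attachment values, the helper names, and the `email.attachments.extension` sub-field are assumptions made for illustration.

```python
from typing import Any

# Hypothetical email with two attachments: a large mp3 and a small pdf.
attachments = [
    {"extension": "mp3", "size": 5_000_000},
    {"extension": "pdf", "size": 2_000},
]


def flatten(objects: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Mimic a non-nested object array: values are pooled per field,
    so the link between one attachment's extension and its size is lost."""
    pooled: dict[str, list[Any]] = {}
    for obj in objects:
        for key, value in obj.items():
            pooled.setdefault(key, []).append(value)
    return pooled


def matches_flattened(objects: list[dict[str, Any]]) -> bool:
    """extension:mp3 AND size < 100000, evaluated against pooled values."""
    pooled = flatten(objects)
    return "mp3" in pooled.get("extension", []) and any(
        size < 100_000 for size in pooled.get("size", [])
    )


def matches_nested(objects: list[dict[str, Any]]) -> bool:
    """Same condition, but both clauses must hold for the same attachment."""
    return any(
        obj["extension"] == "mp3" and obj["size"] < 100_000 for obj in objects
    )


# Sketch of a nested query body for the same condition (field names assumed).
nested_query = {
    "query": {
        "nested": {
            "path": "email.attachments",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"email.attachments.extension": "mp3"}},
                        {"range": {"email.attachments.size": {"lt": 100000}}},
                    ]
                }
            },
        }
    }
}

print(matches_flattened(attachments))  # pooled values match across attachments
print(matches_nested(attachments))     # per-attachment matching does not
```

The pooled version reports a match even though no single attachment is a small mp3, which is the unexpected result a nested mapping avoids.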

It looks to me like it has good support across the stack all the way to KQL, which is great. But since this will be the first use of this type in ECS, and since querying these fields is a bit different than usual, I think this would be worth a mention in the Concerns section.

I would still adjust the field listing assuming we'll use nested, since it looks like the best approach. So let's add an extra row here for it:

| `email.attachments` | nested | Array of objects containing information about each email attachment. |

I like @vpiserchia's suggestion of reusing file here. That could be an option, given that we can now reuse as another name (reuse file as email.attachments). However I agree with @ebeahan, that this would bring in way too many fields that are not useful in this context (consider the fields reused under file on top of this). So I think we could err on the side of defining the sub-fields explicitly for now, and trying to keep them consistent with their cousins as defined under file.*. I do think it's worth capturing this as another point in the Concerns section, though.

vpiserchia · 2020-10-06T16:23:24Z

rfcs/text/0008-email.md

+| `email.size` | keyword | Total size of the message, in bytes, including attachments |
+| `email.subject` | keyword | Subject of the message |
+| `email.to` | keyword | Recipieint address |
+| `email.to.domain` | keyword | Recipient domain |


email.to is already a field of type keyword, so how can you also define email.to as an object with domain as a subfield? Sorry for the silly question.

Good catch! 👀

I believe email.to will be corrected to email.to.address.

vpiserchia · 2020-10-06T16:24:46Z

rfcs/text/0008-email.md

+| `email.from.domain` | keyword | Senders domain |
+| `email.latency` | keyword | The time, in milliseconds, the delivery attempt took |
+| `email.message_id` | keyword | Internet message ID of the message |
+| `email.process` | keyword | Name of the executable that carried out the transaction, e.g. outlook, sendmail |


Good call out.

Although I wonder what the intent is, with this field? Is it:

to capture simply the name of the event source under management (e.g. I'm parsing sendmail logs, therefore email.process is hardcoded to "sendmail" in my pipeline),

or to capture the series of agents that took part in transmission of the email (e.g. from each of the Received headers).

I'm curious what folks would like to have here. Would both of these fields be useful?

Why add a field for this? IMO this can be captured in the process.* fields, specifically the process.name field.

This would be in line with the statement on email.action

I agree with @SHolzhauer on this, I think the name of the process should follow the current ECS format.

When it comes to question number 2 from @webmat it is something that might be useful in a separate field.

webmat · 2020-10-06T17:01:48Z

rfcs/text/0008-email.md

+| `object.return.address` | keyword | The return address for the message |
+| `email.size` | keyword | Total size of the message, in bytes, including attachments |
+| `email.subject` | keyword | Subject of the message |
+| `email.to` | keyword | Recipieint address |


email.to.address

webmat · 2020-10-06T17:02:07Z

rfcs/text/0008-email.md

+
+<!-- An RFC should link to the PRs for each of it stage advancements. -->
+
+* Stage 0: https://github.com/elastic/ecs/pull/NNN


Suggested change

* Stage 0: https://github.com/elastic/ecs/pull/NNN

* Stage 0: https://github.com/elastic/ecs/pull/999

jeffrysleddens · 2020-10-06T18:25:53Z

rfcs/text/0008-email.md

+| `object.return.address` | keyword | The return address for the message |
+| `email.size` | keyword | Total size of the message, in bytes, including attachments |
+| `email.subject` | keyword | Subject of the message |
+| `email.to` | keyword | Recipieint address |


I think 'recipient' is a more common term when it comes to email than 'to'.
Also the schema should support multiple recipients. Could also add a 'type' to 'recipient' to indicate if recipient was added to 'to', 'cc' or 'bcc'.

This really depends on whether you want to capture the semantics of the envelope/SMTP and/or the email headers. This applies to other fields as well (to, cc, bcc, from).

I agree there's nuances in recipients, and in how closely we want to represent each protocol.

After looking at a few raw emails, I'm starting to think we should keep the design of this field set pretty high level.

If capturing each protocol's nuances is desired or necessary, then I think we should consider working on a specific breakdown per protocol, as a separate step.

So I think the guiding principle we could use here is to capture the commonalities in email.*, even if it means merging some concepts, sometimes.

One point you're raising is pretty interesting, however. Should we capture each "type" of recipient in a different field, or in one parent field with an additional label to indicate which type of recipient this is?

The current proposal takes the former approach:

{ "email": { "to": [ {"address": "alice@example.com", ...} ], "cc": [ {"address": "bob@example.com", ...} ], } }

The "recipient" suggestion would look like:

{ "email": { "recipients": [ { "type": "to", "address": "alice@example.com", ...}, { "type": "cc", "address": "bob@example.com", ...}, ] } }

🤔
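
For comparison, the two shapes shown above carry the same information and can be converted into each other. The following is a minimal sketch, assuming only `to`, `cc` and `bcc` as recipient types; the function names are hypothetical and not part of the RFC.

```python
from collections import defaultdict
from typing import Any

RECIPIENT_TYPES = ("to", "cc", "bcc")  # assumed set of recipient types


def group_by_type(recipients: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Turn the single 'recipients' array (with a 'type' label)
    into one array per recipient type."""
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for recipient in recipients:
        entry = {k: v for k, v in recipient.items() if k != "type"}
        grouped[recipient["type"]].append(entry)
    return dict(grouped)


def tag_with_type(email: dict[str, list[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Turn per-type arrays back into a single labelled 'recipients' array."""
    return [
        {"type": kind, **entry}
        for kind in RECIPIENT_TYPES
        for entry in email.get(kind, [])
    ]


recipients = [
    {"type": "to", "address": "alice@example.com"},
    {"type": "cc", "address": "bob@example.com"},
]

per_type = group_by_type(recipients)
print(per_type)
print(tag_with_type(per_type))
```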

I really like this one:

{ "email": { "to": [ {"address": "alice@example.com", ...} ], "cc": [ {"address": "bob@example.com", ...} ], } }

This avoids the need for nested objects. It also opens the door to a new entry in the "related" fields:

related.emails: ["alice@example.com", "bob@example.com"]

From an ingest perspective and a visualization perspective I think it's better to keep the "to, cc, bcc" field structure over type.
The point from @vpiserchia is a good one, because once we start having lists of objects we will have issues with the internal structure when it comes to visualizations, parsing and aggregations.

BenB196 · 2020-10-06T21:11:20Z

rfcs/text/0008-email.md

+| `email.process` | keyword | Name of the executable that carried out the transaction, e.g. outlook, sendmail |
+| `email.protocol` | keyword | The email protocol used, e.g. SMTP, IMAP |
+| `email.reply_to.address` | keyword | Reply-to address |
+| `object.return.address` | keyword | The return address for the message |


.address exists for email.reply_to and object.return, but .domain does not. This should be consistent, with .domain as an additional field.

jeffrysleddens · 2020-10-07T05:28:04Z

rfcs/text/0008-email.md

+| `email.file.name` | keyword | File name of attachements |
+| `email.file.size` | keyword | Total size of all attachements in bytes |
+| `email.direction` | keyword | Direction of the message based on the sending and receving domains |
+| `email.from.address` | keyword | Senders email address |


The schema should support and distinguish both the envelope/smtp 'from' and the header/mime 'from'.

@jeffrysleddens Do you think it would be worth considering a specific section in the schema for a breakdown of protocols like SMTP and POP3?

Or do you think a general purpose email.* field set is enough?

On a related note, I think we should also consider adding email.reply_to.address and email.reply_to.domain, which can be different than the "from" address.

How would we distinguish which hash is for which file, and the same with size?

Looking at the comment above about changing file to attachments, we will still need, for example, a list of objects if we want to keep track of the size, hash and extension belonging to a single file/attachment.
This kind of goes against the features we normally support (lists have very limited possibilities in terms of visualizations and SIEM/alert rules).

Are we thinking all of these fields should just be an array?

BenB196 · 2020-10-07T12:22:20Z

rfcs/text/0008-email.md

+The goal here is to research and understand the impact of these changes on users in the community and development teams across Elastic. 2-5 sentences each.
+-->
+
+## Concerns


When looking at the current fields provided, one of my concerns is it appears that they don't fit well with the rest of ECS. I think this can be partially fixed with the use of aliases, though, I don't believe aliases are standard/common in ECS.

Examples:

email.from -> source
email.to|cc|bcc -> destination
email.latency -> event.duration
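
One detail to keep in mind for the last mapping: the RFC describes email.latency in milliseconds, while ECS defines event.duration in nanoseconds, so a unit conversion would be needed. A minimal sketch, with a hypothetical function name:

```python
NANOS_PER_MILLI = 1_000_000


def latency_ms_to_event_duration(latency_ms: float) -> int:
    """Convert a latency in milliseconds to nanoseconds,
    the unit ECS uses for event.duration."""
    return round(latency_ms * NANOS_PER_MILLI)


print(latency_ms_to_event_duration(250))
```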

latency might be redundant depending on the specific action being recorded, but I wouldn't equate email.to|from with source and destination (or client/server) network entities

Agreed, I think email has enough subtleties in both the "senders" (sender, reply_to, return_path) and the receivers (to, cc, bcc) that I don't think it makes sense to put them in [source|destination].user.email. Also, these aren't array fields anyway, so we couldn't capture everything from the get go.

If an email server logs the source.ip from the MX that sent them an email though, that's totally appropriate to capture in that field.

ebeahan · 2020-10-16T19:33:57Z

rfcs/text/0008-email.md

+# 0008: Email
+<!-- Leave this ID at 0000. The ECS team will assign a unique, contiguous RFC number upon merging the initial stage of this RFC. -->
+
+- Stage: **1** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->


Suggested change

- Stage: **1** 

- Stage: **1 (proposal)** 

Not required, but including the stage name with the stage number has become an informal RFC convention 😄

ebeahan · 2020-10-16T20:27:31Z

rfcs/text/0008-email.md

+- Stage: **1** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
+- Date: **Oct 5th 2020** <!-- The ECS team sets this date at merge time. This is the date of the latest stage advancement. -->
+
+This RFC proposes a new top-level field to facilitate email use cases. 


Email has similar challenges to web browsing when you start looking at how to model it. Obviously, what is generally just referred to as "email" is actually a complex system of processes and protocols: SMTP, IMAP, POP3, SPF, DMARC, DKIM, DNS, TLS, x509, etc. etc.

I think having the top-level fieldset for email.* makes sense as the starting point. However, I can also see need for additional top-level fields depending on the data source (e.g. if Packetbeat/Elastic Agent added support for SMTP, perhaps a smtp.* field becomes necessary).

I mostly bring this up to set the expectation that, to accurately capture the different facets of email in ECS, we'll likely be adding several new fieldsets over time. 😄

I'd like us to capture two aspects of Eric's point in the Concerns section for now, so let's add two subsections there:

1 - There are many types of legitimate "email events" that could be captured. This RFC is currently focusing on a subset. But we'll want to keep in mind the whole list of possible types of email events, to make sure there's room in the schema for all. Here's the list:

spam filter events

email server log events

deliverability events from an MTA (deferred, delivered, bounce, dropped, etc)

engagement events (opens, clicks, unsubscribe, spam report)

Email infrastructure monitoring (think dmarcian)

2 - One design decision we'll have to make is whether we also introduce fields for the 3 main "email protocols" (SMTP, IMAP and POP3), or do we try to fit most things under email.*?

@webmat Would there be any reason to discuss having the protocols under email rather than a top field level?

We could, for example, treat smtp, imap and pop3 the way we treat user.*: expected to be nested under email.*, but able to serve other use cases moving forward.

Quick question about these comments. Is email authentication (i.e. SPF/DKIM/DMARC) in scope for this RFC? Mainly wondering because they're really a product of alignment on Return-Path (MAIL FROM), From, and DKIM-Signature mail headers. Not that I recommend all implementations that can capture these headers attempt to validate SPF/DKIM/DMARC, but it's totally possible given only the message content (and some fun DNS lookups).

I think that they fall under a pretty different realm than say spam/reputation scoring, analytics reports, MTA actions, or even email protocols that were mentioned.

ebeahan · 2020-10-16T20:39:21Z

rfcs/text/0008-email.md

+| `email.cc.address` | keyword | Addresses of Cc's |
+| `email.cc.domain` | keyword | Domains of Cc addresses |
+| `email.cipher` | keyword | Cipher used e.g. TLS |
+| `email.file.count` | value | Number of attachments included in the message |


ebeahan · 2020-10-16T20:50:19Z

rfcs/text/0008-email.md

+| `email.file.size` | keyword | Total size of all attachements in bytes |
+| `email.direction` | keyword | Direction of the message based on the sending and receving domains |
+| `email.from.address` | keyword | Senders email address |
+| `email.from.domain` | keyword | Senders domain |


Just making a note that as we refine and iterate on this list of fields, we will want to consider which would be good candidates for using wildcard over keyword (the .name and .domain fields are a couple of places).

ebeahan · 2020-10-16T20:58:36Z

rfcs/text/0008-email.md

+| `email.size` | keyword | Total size of the message, in bytes, including attachments |
+| `email.subject` | keyword | Subject of the message |
+| `email.to` | keyword | Recipieint address |
+| `email.to.domain` | keyword | Recipient domain |


webmat

Excellent start, @jamiehynds.

Thanks everyone so far for chiming in. Looks like there's a lot of interest in getting support for email in ECS 😃

I think there's a lot of straightforward suggestions already made (in my review and prior reviews) that we could apply to the current RFC.

There's also many big discussions that don't have an obvious answer. Those should be captured in the "Concerns" section, so that these problem statements are part of the RFC document itself, and that we carry to the next RFC stages. I've tried to point them out in my review.

webmat · 2020-10-26T20:22:37Z

rfcs/text/0008-email.md

+- Stage: **1** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
+- Date: **Oct 5th 2020** <!-- The ECS team sets this date at merge time. This is the date of the latest stage advancement. -->
+
+This RFC proposes a new top-level field to facilitate email use cases. 


webmat · 2020-10-26T20:37:42Z

rfcs/text/0008-email.md

+
+| field | type | description |
+| --- | --- | --- |
+| `email.action` | keyword | Action take by the source device, e.g. delivered, blocked, quarantined, deleted |


I would rather have integrations use event.action than introduce a new action field.

I agree with this as well, we should reuse the ECS fields we already have when we can.

webmat · 2020-10-27T18:35:52Z

rfcs/text/0008-email.md

+| `email.cc.address` | keyword | Addresses of Cc's |
+| `email.cc.domain` | keyword | Domains of Cc addresses |
+| `email.cipher` | keyword | Cipher used e.g. TLS |
+| `email.file.count` | value | Number of attachments included in the message |


webmat · 2020-10-27T18:46:55Z

rfcs/text/0008-email.md

+| `email.bcc.address` | keyword | Addresses of Bcc's |
+| `email.bcc.domain` | keyword | Domains of the Bcc's |
+| `email.cc.address` | keyword | Addresses of Cc's |
+| `email.cc.domain` | keyword | Domains of Cc addresses |


This is a good first start, for capturing each array of emails. I would make all of them plural, however.

Looking at this, I wonder if we should consider the nested type here as well? I could see a desire to break down the domains some more, like elsewhere in ECS. domain => registered_domain => top_level_domain. I'm not 100% sure if such a breakdown per email address is actually useful, though *.

If not, perhaps two distinct arrays of keywords are totally fine, like

"email": { "cc": { "addresses": ["alice@example.com", "bob@example.com", "cam@example2.com"], "domains": ["example.com", "example2.com"] } }

* My feeling is that in most cases, the sender's email address would be a suspicious one we want to break down all the way, but recipients are not "suspicious", they're the unwitting recipients for the email :-)
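
A minimal sketch of how the two-array layout above could be populated from a list of addresses. The function name is hypothetical, and the domain is taken naively as everything after the last `@`, lowercased.

```python
def summarize_addresses(addresses: list[str]) -> dict[str, list[str]]:
    """Build the 'addresses' and 'domains' arrays for one recipient field.
    Domains are de-duplicated while keeping first-seen order."""
    domains: list[str] = []
    for address in addresses:
        if "@" not in address:
            continue  # skip values that do not look like an email address
        domain = address.rsplit("@", 1)[-1].lower()
        if domain not in domains:
            domains.append(domain)
    return {"addresses": list(addresses), "domains": domains}


cc = summarize_addresses(
    ["alice@example.com", "bob@example.com", "cam@example2.com"]
)
print({"email": {"cc": cc}})
```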

@webmat I think it would make sense to have the structure you have above, but in the sense of looking for a specific address or domain, we should also accompany this with related.email and related.domain

Under the current guidance with the related.* fields, email addresses are captured in related.user and domains in related.hosts.

Whether we continue with this guidance or adjust and add additional related.* fields is a good conversation to have. 🤔

webmat · 2020-10-27T20:20:03Z

rfcs/text/0008-email.md

+| `email.file.name` | keyword | File name of attachements |
+| `email.file.size` | keyword | Total size of all attachements in bytes |
+| `email.direction` | keyword | Direction of the message based on the sending and receving domains |
+| `email.from.address` | keyword | Senders email address |


webmat · 2020-10-27T20:36:13Z

rfcs/text/0008-email.md

+| `email.from.domain` | keyword | Senders domain |
+| `email.latency` | keyword | The time, in milliseconds, the delivery attempt took |
+| `email.message_id` | keyword | Internet message ID of the message |
+| `email.process` | keyword | Name of the executable that carried out the transaction, e.g. outlook, sendmail |


webmat · 2020-10-27T20:36:49Z

rfcs/text/0008-email.md

+| `email.latency` | keyword | The time, in milliseconds, the delivery attempt took |
+| `email.message_id` | keyword | Internet message ID of the message |
+| `email.process` | keyword | Name of the executable that carried out the transaction, e.g. outlook, sendmail |
+| `email.protocol` | keyword | The email protocol used, e.g. SMTP, IMAP |


I'm thinking we should favor network.protocol here instead.

+1 for this.

webmat · 2020-10-27T20:41:16Z

rfcs/text/0008-email.md

+| `email.from.address` | keyword | Senders email address |
+| `email.from.domain` | keyword | Senders domain |
+| `email.latency` | keyword | The time, in milliseconds, the delivery attempt took |
+| `email.message_id` | keyword | Internet message ID of the message |


Message IDs can be pretty creative. For example one of the message IDs for this PR's email notifications was <elastic/ecs/pull/999/review/503143839@github.com>.

So I would make this one wildcard.

Nevertheless, the message_id captures the uniqueness of an email.
I can see that different mail servers have specific ways of building this Message ID, and it could be interesting (for identification purposes) to capture such behaviour (and spot the anomalies). With this said, a multi-field mapping would make sense here:

| `email.message_id` | keyword | Internet message ID of the message |
| `email.message_id.text` | text | Internet message ID of the message for full text search |
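
As a sketch of what such a multi-field could look like in an Elasticsearch mapping, written here as a Python dict so it can be serialized and inspected. This is an assumed shape for illustration, not the mapping ECS would generate.

```python
import json

# Assumed mapping fragment: a keyword field with a 'text' multi-field.
message_id_mapping = {
    "properties": {
        "email": {
            "properties": {
                "message_id": {
                    "type": "keyword",
                    "fields": {"text": {"type": "text"}},
                }
            }
        }
    }
}

print(json.dumps(message_id_mapping, indent=2))
```

With this shape, exact matches and aggregations would use `email.message_id`, while full text search would target `email.message_id.text`.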

webmat · 2020-10-27T21:12:20Z

rfcs/text/0008-email.md

+| `object.return.address` | keyword | The return address for the message |
+| `email.size` | keyword | Total size of the message, in bytes, including attachments |
+| `email.subject` | keyword | Subject of the message |
+| `email.to` | keyword | Recipieint address |


ebeahan · 2020-10-30T20:20:52Z

rfcs/text/0008-email.md

+Stage 0: Provide a high level summary of the premise of these changes. Briefly describe the nature, purpose, and impact of the changes. ~2-5 sentences.
+-->
+
+## Fields


Another item that came to mind, and I think makes sense to capture for further discussion later:

Would email be worth considering as an additional allowed value for event.category?

I would say +1 on this.

@ebeahan if we create this new event.category, we should also update which event.type values can be combined with it, and maybe see whether we need new ones.

Let's mention this in the RFC "Fields" section.

P1llus · 2020-11-09T16:21:59Z

@ebeahan @webmat

Just added the first iteration of the changes we discussed. Hopefully this commit covers all the above comments for now, some of which we could investigate later.
I added one more to the list, which was attachments_count, outside of the attachments flattened fields. This will help for visualisations and parser writing.

Could you please let me know if my commit covered all of the topics we wanted to cover for now?

andrewstucki · 2020-11-09T20:33:01Z

rfcs/text/0008-email.md

+| `email.sender.top_level_domain` | keyword | Senders email address |
+| `email.message_id` | keyword | Internet message ID of the message |
+| `email.reply_to.address` | wildcard | Reply-to address |
+| `email.return.address` | wildcard | The return address for the message |


What about renaming this to return_path -- it's a bit more descriptive of what I think you're actually going for here.

I agree on this point as well, will update it in the next commit.

ebeahan

Thanks @P1llus!

I think we should still list some of the discussed points in the Concerns section. We don't have to have resolutions at this point. We simply want to capture them now for further discussion later:

The two items Mat captured in an earlier review thread: [RFC] E-Mail #999 (comment)
Possibility of needing an email event.category and possibly new event.type values.
Maybe worth noting that if the flattened data type is used for email.* fields, it would be the first time using this type in ECS?

P1llus · 2020-11-10T20:56:13Z

Adding in the latest changes @ebeahan 👍

BenB196 · 2020-11-11T01:23:47Z

rfcs/text/0008-email.md

+| `email.size` | keyword | Total size of the message, in bytes, including attachments |
+| `email.subject` | wildcard | Subject of the message |
+| `email.recipients.addresses` | keyword | Recipient addresses |
+| `email.domains` | keyword | domains related to the email |


This field really feels like it should be part of the related fields. Something like related.domains (though it currently doesn't exist, so it might be worth keeping here)

Yeah this field is the outcome of a current discussion we have. Instead of having domain fields for bcc, cc, recipients etc, we decided currently to have them all as an array under one field. This might change in the upcoming stages. Thanks for the pointer, always happy to get feedback

Yeah, my concern mainly with a related field is that you lose the directionality of the value. Which might be useful for some use-cases.

We do have the email.direction though, would that be sufficient? We would calculate the direction before moving the different domains into email.domains for example.

dainperkins · 2020-11-16T23:43:42Z

rfcs/text/0008-email.md

+| `email.cc.addresses` | wildcard | Addresses of Cc's |
+| `email.attachments_count` | long | A field outside the flattened structure to control how many attachments are included in the email |
+| `email.attachments` | flattened | A flattened field for anything related to attachments. This allows objects being stored with all information for each file when you have multiple attachments |
+| `email.direction` | keyword | Direction of the message based on the sending and receving domains |


What do we foresee as values for this, and which address fields (to, cc, bcc) would it be categorized on?
It seems to me like something that could be difficult to implement, and I'm not sure of its value for visualizations (but I could easily be missing something obvious, it's been one of those days...)

Good question, @dainperkins.

I assume the allowed values in there should be "inbound" and "outbound". Perhaps also "unknown" in the case of relays? Actually just like network.direction, "internal" is another class of emails that has a different threat profile. I wonder if there's a need for the value "external" (as in, I'm just an exchange, relaying between Yahoo and Gmail)?

I agree populating this consistently may not be obvious in all scenarios.

I don't think that, as a third party, our solutions can distinguish between "inbound", "outbound" and "internal" without specific configuration that says what "my domains" are.

But once we know that, I assume the heuristic is pretty straightforward:

direction = inbound when from is not one of "my domains"

direction = outbound when from = "my domains" and at least one receiver (to, cc, bcc) contains addresses not in "my domains"

direction = internal when from = "my domains" and all receivers are "my domains"
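
The heuristic above could be sketched roughly as follows. This is only an illustration, assuming the all-"my domains" case is labelled "internal" (as floated earlier in this comment) and that `my_domains` comes from user configuration; the function names are hypothetical.

```python
from email.utils import parseaddr


def domain_of(address: str) -> str:
    """Return the lower-cased domain of an address such as 'Name <user@example.com>'."""
    _name, addr = parseaddr(address)
    return addr.rpartition("@")[2].lower()


def email_direction(sender: str, recipients: list, my_domains: set) -> str:
    """Classify a message as 'inbound', 'outbound' or 'internal'.

    `my_domains` is the user-supplied set of domains that count as "mine";
    nothing in the message itself can provide it.
    """
    mine = {d.lower() for d in my_domains}
    if domain_of(sender) not in mine:
        return "inbound"
    # Sender is one of ours: any outside receiver (to, cc, bcc) makes it outbound.
    if any(domain_of(r) not in mine for r in recipients):
        return "outbound"
    return "internal"
```

Here `recipients` is meant to be the combined to, cc and bcc addresses. A relay that owns neither side would classify everything as "inbound" under this sketch, which is where an extra value such as "external" or "unknown" might come in.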

So I'm +1 on adding the field. I think it makes sense. And unless I'm missing something, I think the heuristics are reasonable; and actually, perhaps some of the email-related event sources already provide such values? It's certainly useful for a spam filter to know which emails to filter. Not sure if it shows up in their logs though.

Action item for the RFC, though: let's start listing expected values for this field. I'm providing ideas above as a strawperson, based on what we have in network.direction. But if email data sources have other values for this, let's bring them to the table as well.

webmat

This RFC has by far surpassed the criteria for stage 1. We have a new author (thanks for picking it up, Marius) and a sponsor. And the fields of the proposal are looking really good already.

Of course on such an interesting subject, I couldn't resist providing some detailed technical feedback below.

But I think we should merge this as a stage 1 after a few minor adjustments which I'll outline in the body of this review, and we should address the rest of the technical feedback in a stage 2 PR.

The only points I'd like to address before merging this at stage 1 are:

adjusting the "People" heading
making the list of people into a bullet list
adjusting the "stage 1" link at the very bottom
and populating the "references" section

Everything else in my review can be addressed in the stage 2 PR. If you have time to incorporate more feedback in your next session on this, all the better. But feel free to just address the points above so we can move forward.

webmat · 2020-11-20T17:53:55Z

rfcs/text/0008-email.md

+| `email.bcc.addresses` | wildcard | Addresses of Bcc's |
+| `email.cc.addresses` | wildcard | Addresses of Cc's |
+| `email.attachments_count` | long | A field outside the flattened structure to control how many attachments are included in the email |
+| `email.attachments` | flattened | A flattened field for anything related to attachments. This allows objects being stored with all information for each file when you have multiple attachments |


The .attachments.* fields should follow the file.* fields. We can state this approach in the description for now.

We can see later about the implementation, whether it's full reuse, or explicitly defining the fields that make sense for attachments.
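
As a rough illustration of attachments following the file.* fields, here is a sketch that builds one object per attachment. The nesting under `file` and the helper names are assumptions for discussion, not something the RFC has settled; `name`, `size`, `mime_type` and `hash.sha256` are borrowed from the existing file.* fieldset.

```python
import hashlib
import mimetypes


def describe_attachment(filename: str, payload: bytes) -> dict:
    """Describe one attachment using a subset of file.*-style fields (assumed shape)."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return {
        "file": {
            "name": filename,
            "size": len(payload),
            # Fall back to a generic type when the extension is unknown.
            "mime_type": mime_type or "application/octet-stream",
            "hash": {"sha256": hashlib.sha256(payload).hexdigest()},
        }
    }


def attachments_fields(files: list) -> dict:
    """Build the email.attachments* fields from (filename, bytes) pairs."""
    attachments = [describe_attachment(name, data) for name, data in files]
    return {
        "email": {
            "attachments_count": len(attachments),
            "attachments": attachments,
        }
    }
```

Fields such as file.path or file.owner are left out on purpose, since they describe a file on a filesystem rather than an attachment.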

webmat · 2020-11-20T18:11:31Z

rfcs/text/0008-email.md

+| `email.cc.addresses` | wildcard | Addresses of Cc's |
+| `email.attachments_count` | long | A field outside the flattened structure to control how many attachments are included in the email |
+| `email.attachments` | flattened | A flattened field for anything related to attachments. This allows objects being stored with all information for each file when you have multiple attachments |
+| `email.direction` | keyword | Direction of the message based on the sending and receving domains |


Good question, @dainperkins.

I assume the allowed values in there should be "inbound" and "outbound". Perhaps also "unknown" in the case of relays? Actually just like network.direction, "internal" is another class of emails that has a different threat profile. I wonder if there's a need for the value "external" (as in, I'm just an exchange, relaying between Yahoo and Gmail)?

I agree populating this consistently may not be obvious in all scenarios.

I don't think that, as a third party, our solutions can distinguish between "inbound", "outbound" and "internal" without specific configuration that says what "my domains" are.

But once we know that, I assume the heuristic is pretty straightforward:

direction = inbound when from is not one of "my domains"

direction = outbound when from = "my domains" and at least one receiver (to, cc, bcc) contains addresses not in "my domains"

direction = internal when from = "my domains" and all receivers are "my domains"

So I'm +1 on adding the field. I think it makes sense. And unless I'm missing something, I think the heuristics are reasonable; and actually, perhaps some of the email-related event sources already provide such values? It's certainly useful for a spam filter to know which emails to filter. Not sure if it shows up in their logs though.

Action item for the RFC, though: let's start listing expected values for this field. I'm providing ideas above as a strawperson, based on what we have in network.direction. But if email data sources have other values for this, let's bring them to the table as well.

webmat · 2020-11-20T18:27:55Z

rfcs/text/0008-email.md

+| `email.attachments_count` | long | A field outside the flattened structure to control how many attachments are included in the email |
+| `email.attachments` | flattened | A flattened field for anything related to attachments. This allows objects being stored with all information for each file when you have multiple attachments |
+| `email.direction` | keyword | Direction of the message based on the sending and receving domains |
+| `email.sender.address` | wildcard | Senders email address |


If I remember correctly, address will contain the full Person Name <person@example.com>.

We're defining the domain breakdown fields here because the sender is potentially a threat, and this is where we'll be looking for known bad domains/TLDs and so on.

But looking at the fields, I wonder if we should do the same with email.reply_to.address and email.return_path.address? They're also relevant to the sender.

We can hold off on adding them for now, but I'm floating the idea to get feedback on whether there's a need for them.
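
For reference, a minimal sketch of the breakdown being discussed, assuming the raw value looks like `Person Name <person@example.com>`. The output keys are illustrative only and are not proposed field names.

```python
from email.utils import parseaddr


def split_sender(raw: str) -> dict:
    """Split a raw sender value into display name, address and domain parts."""
    name, address = parseaddr(raw)
    # Only treat the value as having a domain when an "@" is present.
    domain = address.rsplit("@", 1)[1].lower() if "@" in address else ""
    return {
        "display_name": name,
        "address": address,
        "domain": domain,
        # Naive: last label only. Multi-label suffixes such as "co.uk"
        # would need a public suffix list to handle properly.
        "top_level_domain": domain.rpartition(".")[2],
    }
```

The same function would apply unchanged to reply_to and return_path values if we decide to break those down too.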

I think it's worth exploring in the upcoming stage, for sure, whether that is appropriate.

webmat · 2020-11-20T18:34:25Z

rfcs/text/0008-email.md

+The goal here is to research and understand the impact of these changes on users in the community and development teams across Elastic. 2-5 sentences each.
+-->
+
+## Concerns


Agreed, I think email has enough subtleties in both the "senders" (sender, reply_to, return_path) and the receivers (to, cc, bcc) that I don't think it makes sense to put them in [source|destination].user.email. Also, these aren't array fields anyway, so we couldn't capture everything from the get go.

If an email server logs the source.ip from the MX that sent them an email though, that's totally appropriate to capture in that field.

webmat · 2020-11-20T19:03:23Z

rfcs/text/0008-email.md

+Stage 0: Provide a high level summary of the premise of these changes. Briefly describe the nature, purpose, and impact of the changes. ~2-5 sentences.
+-->
+
+## Fields


Let's mention this in the RFC "Fields" section.

P1llus · 2020-11-26T05:21:09Z

@webmat, I added the required changes you mentioned, plus a few more tweaks from the feedback that could have waited for stage 2. Anything else we should do at this point?

webmat

Alright, thanks for the adjustments, @P1llus! I'll adjust the date and merge.

Create 0008-email.md

9b4d64d

jamiehynds changed the title ~~Create 0008-email.md~~ E-Mail RFC Oct 5, 2020

ebeahan added the RFC label Oct 5, 2020

ebeahan changed the title ~~E-Mail RFC~~ [RFC] E-Mail RFC Oct 5, 2020

ebeahan changed the title ~~[RFC] E-Mail RFC~~ [RFC] E-Mail Oct 5, 2020

vpiserchia reviewed Oct 6, 2020

webmat reviewed Oct 6, 2020

jeffrysleddens reviewed Oct 6, 2020

webmat mentioned this pull request Oct 6, 2020

[meta] Add support for email in ECS #939

Closed

BenB196 reviewed Oct 6, 2020

jeffrysleddens reviewed Oct 7, 2020

BenB196 reviewed Oct 7, 2020

ebeahan reviewed Oct 16, 2020

webmat reviewed Oct 27, 2020

ebeahan reviewed Oct 30, 2020

ebeahan mentioned this pull request Oct 30, 2020

Categorization Fields GA #942

Closed

5 tasks

ebeahan assigned P1llus Nov 3, 2020

updating the RFC based on PR comments

1a4c2fd

andrewstucki reviewed Nov 9, 2020

ebeahan reviewed Nov 9, 2020

adding different topics under concerns

6a8af8f

adding event.start and event.end

a92a9d3

BenB196 reviewed Nov 11, 2020

BenB196 reviewed Nov 11, 2020

updating field type and adding new domain fields for sender

f11dea0

dainperkins reviewed Nov 16, 2020

webmat reviewed Nov 20, 2020

webmat added the 1.8.0 label Nov 23, 2020

updating rfc based on comments from @webmat to push it into stage 2

fe4e18b

webmat previously approved these changes Nov 30, 2020

webmat reviewed Nov 30, 2020

Set the merge date

f10cf03

webmat dismissed their stale review via f10cf03 November 30, 2020 15:57

webmat merged commit 0b47844 into elastic:master Nov 30, 2020

ebeahan removed the 1.8.0 label Dec 14, 2020

peasead mentioned this pull request Aug 2, 2021

[RFC] Email - Stage 1 Proposal #1219

Merged

8 tasks


		<!-- An RFC should link to the PRs for each of it stage advancements. -->

		* Stage 0: https://github.com/elastic/ecs/pull/NNN

	- Stage: 1 <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
	- Stage: 1 (proposal) <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
