[RFC] E-Mail #999
A series of questions
rfcs/text/0008-email.md
Outdated
| `email.process` | keyword | Name of the executable that carried out the transaction, e.g. outlook, sendmail |
| `email.protocol` | keyword | The email protocol used, e.g. SMTP, IMAP |
| `email.reply_to.address` | keyword | Reply-to address |
| `object.return.address` | keyword | The return address for the message |
Shouldn't this be `email.return.address`?
rfcs/text/0008-email.md
Outdated
| `email.cc.address` | keyword | Addresses of Cc's |
| `email.cc.domain` | keyword | Domains of Cc addresses |
| `email.cipher` | keyword | Cipher used e.g. TLS |
| `email.file.count` | value | Number of attachments included in the message |
Why not allow multiple files, as in an array?
I would use the term 'attachment' instead of 'file'.
Schema should support multiple attachments.
`file` already exists as an ECS field; it would be great to reuse it.
Agreed that `attachment` is more idiomatic when referring to email than `file`. Also agreed we will need to consider that a single email can have multiple attachments.
The `file.*` fieldset in ECS is defined around a file that is created or exists on a filesystem, so many of the fields underneath `file.*` will probably not apply to an email attachment. Though this will probably vary based on the source of the events.
I agree with the name "attachments", and that it should support capturing multiple attachments. Note that I think it should be pluralized, since there can be many. This would be in line with how we named `dns.answers.*`.
Here are a few suggestions on changes we could make to the attachment fields as we currently have them:
- Rename all mentions of `email.file.` to `email.attachments.`
- Rename `email.file.count` to `email.attachments_count`, because everything after the `.` will be present once for each entry in the array (for each attachment)
- Make `email.attachments_count` a `long`
- Make `email.attachments.size` a `long`
- Make `email.attachments.name` a `wildcard`
This will be an array of objects. In order for Elasticsearch to index it so that we can query on multiple attachment attributes at a time (e.g. `extension:mp3 AND size < 100000`) and get the expected results, we would need to make `email.attachments` into a nested field.
Querying nested fields is slightly different than normal fields, however (API, KQL).
It looks to me like it has good support across the stack all the way to KQL, which is great. But since this will be the first use of this type in ECS, and since querying these fields is a bit different than usual, I think this would be worth a mention in the Concerns section.
I would still adjust the field listing assuming we'll use `nested`, since it looks like the best approach. So let's add an extra row here for it:
| `email.attachments` | nested | Array of objects containing information about each email attachment. |
I like @vpiserchia's suggestion of reusing `file` here. That could be an option, given that we can now reuse a field set under another name (reuse `file` as `email.attachments`). However I agree with @ebeahan that this would bring in way too many fields that are not useful in this context (consider the fields reused under `file` on top of this). So I think we could err on the side of defining the sub-fields explicitly for now, and trying to keep them consistent with their cousins as defined under `file.*`. I do think it's worth capturing this as another point in the Concerns section, though.
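To illustrate why the nested type matters for the query mentioned above, here is a minimal Python sketch. It does not call Elasticsearch; it only mimics the matching semantics, on the understanding that a plain object array is indexed as flat multi-valued fields while a nested field keeps each object separate. The sample attachments and both function names are hypothetical.

```python
# Hypothetical attachment data for a single email event.
attachments = [
    {"extension": "mp3", "size": 5_000_000},
    {"extension": "txt", "size": 2_000},
]


def flattened_match(items, extension, max_size):
    """Mimic a plain object array: each sub-field becomes one flat,
    multi-valued field, so the link between values of one object is lost."""
    extensions = [item["extension"] for item in items]
    sizes = [item["size"] for item in items]
    return extension in extensions and any(size < max_size for size in sizes)


def nested_match(items, extension, max_size):
    """Mimic a nested field: both conditions must hold on the same object."""
    return any(
        item["extension"] == extension and item["size"] < max_size
        for item in items
    )


# "extension:mp3 AND size < 100000" against the sample event.
flat_hit = flattened_match(attachments, "mp3", 100_000)
nested_hit = nested_match(attachments, "mp3", 100_000)
print(flat_hit, nested_hit)
```

The flattened variant matches because the `mp3` extension and the small size come from two different attachments, which is the unexpected result the nested type avoids.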
rfcs/text/0008-email.md
Outdated
| `email.size` | keyword | Total size of the message, in bytes, including attachments |
| `email.subject` | keyword | Subject of the message |
| `email.to` | keyword | Recipieint address |
| `email.to.domain` | keyword | Recipient domain |
`email.to` is already a field of type keyword. How can you define `email.to` as an object having `domain` as a subfield? Sorry for the silly question.
Good catch! 👀
I believe `email.to` will be corrected to `email.to.address`.
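The conflict raised here (a field that is both a leaf and the parent of other fields) can be caught mechanically. The following is a small sketch, not part of the ECS tooling; the function name `find_leaf_parent_conflicts` and the field lists are illustrative assumptions.

```python
def find_leaf_parent_conflicts(field_names):
    """Return (parent, child) pairs where `parent` is declared as a leaf
    field but is also the object prefix of another declared field."""
    names = set(field_names)
    conflicts = []
    for name in sorted(names):
        parts = name.split(".")
        # Check every proper prefix of the dotted path.
        for i in range(1, len(parts)):
            parent = ".".join(parts[:i])
            if parent in names:
                conflicts.append((parent, name))
    return conflicts


# Field names as they appear in the reviewed table.
proposed = ["email.to", "email.to.domain", "email.from.address", "email.from.domain"]
# The same list with the correction suggested in this thread.
corrected = ["email.to.address", "email.to.domain", "email.from.address", "email.from.domain"]

print(find_leaf_parent_conflicts(proposed))
print(find_leaf_parent_conflicts(corrected))
```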
rfcs/text/0008-email.md
Outdated
| `email.from.domain` | keyword | Senders domain |
| `email.latency` | keyword | The time, in milliseconds, the delivery attempt took |
| `email.message_id` | keyword | Internet message ID of the message |
| `email.process` | keyword | Name of the executable that carried out the transaction, e.g. outlook, sendmail |
MTA?
Good call out.
Although I wonder what the intent is with this field? Is it:
- to simply capture the name of the event source under management (e.g. I'm parsing sendmail logs, therefore `email.process` is hardcoded to "sendmail" in my pipeline),
- or to capture the series of agents that took part in transmission of the email (e.g. from each of the `Received` headers).
I'm curious what folks would like to have here. Would both of these fields be useful?
Why add a field for this? IMO this can be captured in the `process.*` fields, specifically the `process.name` field.
This would be in line with the statement on `email.action`.
I agree with @SHolzhauer on this; I think the name of the process should follow the current ECS format.
As for question number 2 from @webmat, it is something that might be useful in a separate field.
rfcs/text/0008-email.md
Outdated
| `object.return.address` | keyword | The return address for the message |
| `email.size` | keyword | Total size of the message, in bytes, including attachments |
| `email.subject` | keyword | Subject of the message |
| `email.to` | keyword | Recipieint address |
email.to.address
rfcs/text/0008-email.md
Outdated
<!-- An RFC should link to the PRs for each of it stage advancements. -->

* Stage 0: https://github.com/elastic/ecs/pull/NNN
Suggested change:
* Stage 0: https://github.com/elastic/ecs/pull/NNN
becomes
* Stage 0: https://github.com/elastic/ecs/pull/999
rfcs/text/0008-email.md
Outdated
| `object.return.address` | keyword | The return address for the message |
| `email.size` | keyword | Total size of the message, in bytes, including attachments |
| `email.subject` | keyword | Subject of the message |
| `email.to` | keyword | Recipieint address |
I think 'recipient' is a more common term when it comes to email than 'to'.
Also the schema should support multiple recipients. Could also add a 'type' to 'recipient' to indicate if recipient was added to 'to', 'cc' or 'bcc'.
This really depends on whether you want to capture the semantics of the envelope/SMTP and/or the email headers. This applies to other fields as well (to, cc, bcc, from).
I agree there are nuances in recipients, and in how closely we want to represent each protocol.
After looking at a few raw emails, I'm starting to think we should keep the design of this field set pretty high level.
If capturing each protocol's nuances is desired or necessary, then I think we should consider working on a specific breakdown per protocol, as a separate step.
So I think the guiding principle we could use here is to capture the commonalities in `email.*`, even if it means merging some concepts, sometimes.
One point you're raising is pretty interesting, however. Should we capture each "type" of recipient in a different field, or in one parent field with an additional label to indicate which type of recipient this is?
The current proposal takes the former approach:
{ "email": {
"to": [ {"address": "alice@example.com", ...} ],
"cc": [ {"address": "bob@example.com", ...} ],
} }
The "recipient" suggestion would look like:
{ "email": {
"recipients": [
{ "type": "to", "address": "alice@example.com", ...},
{ "type": "cc", "address": "bob@example.com", ...},
]
} }
🤔
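The two shapes carry the same information, which a short conversion sketch makes concrete. The helper names `to_recipients` and `from_recipients` are hypothetical, and only the `address` sub-field from the examples above is used.

```python
RECIPIENT_TYPES = ("to", "cc", "bcc")


def to_recipients(email):
    """Convert per-type arrays (email.to, email.cc, ...) into one
    `recipients` array whose entries are labelled with a `type`."""
    recipients = []
    for kind in RECIPIENT_TYPES:
        for entry in email.get(kind, []):
            recipients.append({"type": kind, **entry})
    return recipients


def from_recipients(recipients):
    """Convert a labelled `recipients` array back into per-type arrays."""
    email = {}
    for recipient in recipients:
        entry = {k: v for k, v in recipient.items() if k != "type"}
        email.setdefault(recipient["type"], []).append(entry)
    return email


per_type = {
    "to": [{"address": "alice@example.com"}],
    "cc": [{"address": "bob@example.com"}],
}
labelled = to_recipients(per_type)
print(labelled)
print(from_recipients(labelled) == per_type)
```

Since the conversion is lossless in both directions, the choice between the two comes down to how each shape is indexed and queried rather than to what it can express.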
I really like this one:
{ "email": { "to": [ {"address": "alice@example.com", ...} ], "cc": [ {"address": "bob@example.com", ...} ], } }
This avoids the need for nested objects. It also opens the door to a new one in the "related" field:
related.emails: ["alice@example.com", "bob@example.com"]
From an ingest perspective and a visualization perspective I think it's better to keep the "to, cc, bcc" field structure over type.
The point from @vpiserchia is a good one, because once we start having lists of objects we will have issues with the internal structure when it comes to visualizations, parsing and aggregations.
rfcs/text/0008-email.md
Outdated
| `email.process` | keyword | Name of the executable that carried out the transaction, e.g. outlook, sendmail |
| `email.protocol` | keyword | The email protocol used, e.g. SMTP, IMAP |
| `email.reply_to.address` | keyword | Reply-to address |
| `object.return.address` | keyword | The return address for the message |
`.address` exists for `email.reply_to` and `object.return`, but not `.domain`. This should be consistent, with `.domain` as an additional field.
rfcs/text/0008-email.md
Outdated
| `email.file.name` | keyword | File name of attachements |
| `email.file.size` | keyword | Total size of all attachements in bytes |
| `email.direction` | keyword | Direction of the message based on the sending and receving domains |
| `email.from.address` | keyword | Senders email address |
The schema should support and distinguish both the envelope/smtp 'from' and the header/mime 'from'.
@jeffrysleddens Do you think it would be worth considering a specific section in the schema for a breakdown of protocols like SMTP and POP3?
Or do you think a general purpose `email.*` field set is enough?
On a related note, I think we should also consider adding `email.reply_to.address` and `email.reply_to.domain`, which can be different than the "from" address.
How would we distinguish which hash is for which file, and the same with size?
Looking at the comment above about changing file to attachments, we would still need, for example, a list of objects if we want to keep track of the size, hash, and extension belonging to a single file/attachment.
This kind of goes against the features we normally support (lists have very limited possibilities in terms of visualizations and SIEM/alert rules).
Are we thinking all of these fields should just be arrays?
The goal here is to research and understand the impact of these changes on users in the community and development teams across Elastic. 2-5 sentences each.
-->

## Concerns
When looking at the current fields provided, one of my concerns is that they don't appear to fit well with the rest of ECS. I think this can be partially fixed with the use of aliases, though I don't believe aliases are standard/common in ECS.
Examples:
email.from -> source
email.to|cc|bcc -> destination
email.latency -> event.duration
Latency might be redundant depending on the specific action being recorded, but I wouldn't equate email.to|from with source and destination (or client/server) network entities.
Agreed, I think email has enough subtleties in both the "senders" (sender, reply_to, return_path) and the receivers (to, cc, bcc) that it doesn't make sense to put them in `[source|destination].user.email`. Also, these aren't array fields anyway, so we couldn't capture everything from the get go.
If an email server logs the `source.ip` from the MX that sent them an email though, that's totally appropriate to capture in that field.
rfcs/text/0008-email.md
Outdated
# 0008: Email
<!-- Leave this ID at 0000. The ECS team will assign a unique, contiguous RFC number upon merging the initial stage of this RFC. -->

- Stage: **1** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
Suggested change:
- Stage: **1** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
becomes
- Stage: **1 (proposal)** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
Not required, but including the stage name with the stage number has become an informal RFC convention 😄
rfcs/text/0008-email.md
Outdated
- Stage: **1** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
- Date: **Oct 5th 2020** <!-- The ECS team sets this date at merge time. This is the date of the latest stage advancement. -->

This RFC proposes a new top-level field to facilitate email use cases.
Email has similar challenges to web browsing when you start looking at how to model it. Obviously, what is generally just referred to as "email" is actually a complex system of processes and protocols: SMTP, IMAP, POP3, SPF, DMARC, DKIM, DNS, TLS, x509, etc. etc.
I think having the top-level fieldset for `email.*` makes sense as the starting point. However, I can also see the need for additional top-level fields depending on the data source (e.g. if Packetbeat/Elastic Agent added support for SMTP, perhaps an `smtp.*` field becomes necessary).
I mostly bring this up to set the expectation that to accurately capture the different facets of email in ECS, we'll likely be adding several new fieldsets over time. 😄
I'd like us to capture two aspects of Eric's point in the Concerns section for now, so let's add two subsections there:
1 - There are many types of legitimate "email events" that could be captured. This RFC is currently focusing on a subset. But we'll want to keep in mind the whole list of possible types of email events, to make sure there's room in the schema for all. Here's the list:
- spam filter events
- email server log events
- deliverability events from an MTA (deferred, delivered, bounce, dropped, etc)
- engagement events (opens, clicks, unsubscribe, spam report)
- Email infrastructure monitoring (think dmarcian)
2 - One design decision we'll have to make is whether we also introduce fields for the 3 main "email protocols" (SMTP, IMAP and POP3), or do we try to fit most things under `email.*`?
@webmat Would there be any reason to discuss having the protocols under email rather than as top-level fields?
We could for example treat `smtp`, `imap` and `pop3` as we do with `user.*`, where it is expected to be nested under `email.*` but could meet other use cases moving forward.
Quick question about these comments. Is email authentication (i.e. SPF/DKIM/DMARC) in scope for this RFC? Mainly wondering because they're really a product of alignment on Return-Path (MAIL FROM), From, and DKIM-Signature mail headers. Not that I recommend all implementations that can capture these headers attempt to validate SPF/DKIM/DMARC, but it's totally possible given only the message content (and some fun DNS lookups).
I think that they fall under a pretty different realm than say spam/reputation scoring, analytics reports, MTA actions, or even email protocols that were mentioned.
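As a rough illustration of the header alignment mentioned above, the sketch below compares the domains found in `Return-Path`, `From`, and the `d=` tag of `DKIM-Signature` using Python's standard `email` package. It only checks strict (exact-match) identifier alignment; it does not validate SPF or DKIM, performs no DNS lookups, and ignores relaxed alignment. The raw message, the placeholder signature values, and the function names are all made up for the example.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical raw message; the DKIM values are placeholders, not a real signature.
RAW = """\
Return-Path: <bounce@mail.example.com>
From: Alice <alice@example.com>
DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=sel1; bh=placeholder; b=placeholder
Subject: hello

body
"""


def domain_of(header_value):
    """Extract the lowercased domain from an address header value."""
    address = parseaddr(header_value or "")[1]
    return address.rpartition("@")[2].lower()


def dkim_domain(header_value):
    """Extract the signing domain (d= tag) from a DKIM-Signature header."""
    match = re.search(r"(?:^|;)\s*d=([^;\s]+)", header_value or "")
    return match.group(1).lower() if match else ""


def strict_alignment(raw):
    """Report whether the Return-Path and DKIM d= domains exactly match From."""
    msg = message_from_string(raw)
    from_domain = domain_of(msg["From"])
    return {
        "from_domain": from_domain,
        "spf_aligned": bool(from_domain) and domain_of(msg["Return-Path"]) == from_domain,
        "dkim_aligned": bool(from_domain) and dkim_domain(msg["DKIM-Signature"]) == from_domain,
    }


result = strict_alignment(RAW)
print(result)
```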
rfcs/text/0008-email.md
Outdated
| `email.file.size` | keyword | Total size of all attachements in bytes |
| `email.direction` | keyword | Direction of the message based on the sending and receving domains |
| `email.from.address` | keyword | Senders email address |
| `email.from.domain` | keyword | Senders domain |
Just making a note that as we refine and iterate on this list of fields, we will want to consider which would be good candidates for using `wildcard` over `keyword` (the `.name` and `.domain` fields are a couple of places).
Excellent start, @jamiehynds.
Thanks everyone so far for chiming in. Looks like there's a lot of interest in getting support for email in ECS 😃
I think there are a lot of straightforward suggestions already made (in my review and prior reviews) that we could apply to the current RFC.
There are also many big discussions that don't have an obvious answer. Those should be captured in the "Concerns" section, so that these problem statements are part of the RFC document itself, and that we carry them to the next RFC stages. I've tried to point them out in my review.
rfcs/text/0008-email.md
Outdated
| field | type | description |
| --- | --- | --- |
| `email.action` | keyword | Action take by the source device, e.g. delivered, blocked, quarantined, deleted |
I would rather have integrations use `event.action` than introduce a new action field.
I agree with this as well; we should reuse the ECS fields we already have when we can.
rfcs/text/0008-email.md
Outdated
| `email.bcc.address` | keyword | Addresses of Bcc's |
| `email.bcc.domain` | keyword | Domains of the Bcc's |
| `email.cc.address` | keyword | Addresses of Cc's |
| `email.cc.domain` | keyword | Domains of Cc addresses |
This is a good first start for capturing each array of emails. I would make all of them plural, however.
Looking at this, I wonder if we should consider the `nested` type here as well? I could see a desire to break down the domains some more, like elsewhere in ECS: `domain` => `registered_domain` => `top_level_domain`. I'm not 100% sure if such a breakdown per email address is actually useful, though *.
If not, perhaps two distinct arrays of keywords are totally fine, like
"email": {
"cc": {
"addresses": ["alice@example.com", "bob@example.com", "cam@example2.com"],
"domains": ["example.com", "example2.com"]
}
}
* My feeling is that in most cases, the sender's email address would be a suspicious one we want to break down all the way, but recipients are not "suspicious", they're the unwitting recipients for the email :-)
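A pipeline could derive the two keyword arrays from a raw list of recipient addresses along these lines. This is only a sketch: the helper name `summarize_addresses` is hypothetical, and lowercasing the domain while leaving the local part untouched is an assumption about the desired normalization.

```python
def summarize_addresses(addresses):
    """Build de-duplicated `addresses` and `domains` arrays, preserving
    first-seen order, from a list of email addresses."""
    unique_addresses = []
    unique_domains = []
    for address in addresses:
        address = address.strip()
        if address and address not in unique_addresses:
            unique_addresses.append(address)
        # Everything after the last "@" is treated as the domain.
        local, sep, domain = address.rpartition("@")
        domain = domain.lower()
        if sep and domain and domain not in unique_domains:
            unique_domains.append(domain)
    return {"addresses": unique_addresses, "domains": unique_domains}


cc = summarize_addresses(
    ["alice@example.com", "bob@example.com", "cam@example2.com"]
)
print(cc)
```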
@webmat I think it would make sense to have the structure you have above, but for the purpose of looking for a specific address or domain, we should also accompany this with `related.email` and `related.domain`.
Under the current guidance for the `related.*` fields, email addresses are captured in `related.user` and domains in `related.hosts`.
Whether we continue with this guidance or adjust and add additional `related.*` fields is a good conversation to have. 🤔
rfcs/text/0008-email.md
Outdated
| `email.latency` | keyword | The time, in milliseconds, the delivery attempt took |
| `email.message_id` | keyword | Internet message ID of the message |
| `email.process` | keyword | Name of the executable that carried out the transaction, e.g. outlook, sendmail |
| `email.protocol` | keyword | The email protocol used, e.g. SMTP, IMAP |
I'm thinking we should favor `network.protocol` here instead.
+1 for this.
| `email.from.address` | keyword | Senders email address | | ||
| `email.from.domain` | keyword | Senders domain | | ||
| `email.latency` | keyword | The time, in milliseconds, the delivery attempt took | | ||
| `email.message_id` | keyword | Internet message ID of the message | |
Message IDs can be pretty creative. For example, one of the message IDs for this PR's email notifications was `<elastic/ecs/pull/999/review/503143839@github.com>`.
So I would make this one `wildcard`.
Nevertheless, the message_id captures the uniqueness of a mail.
I can see that different mail servers have specific ways of building this Message ID, and it could be interesting (for identification purposes) to capture such behaviour (and spot the anomalies). With that said, a multi-field mapping would make sense here:
| `email.message_id` | keyword | Internet message ID of the message |
| `email.message_id.text` | text | Internet message ID of the message for full text search |
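As a rough sketch, the multi-field suggested above could look like this in an Elasticsearch index mapping, written here as a Python dict. The nesting under an `email` object is an assumption for illustration only; the RFC has not settled the final layout.

```python
import json

# Hypothetical mapping fragment: `email.message_id` stays a keyword for exact
# matching, and gains a `.text` multi-field for full-text search.
message_id_mapping = {
    "properties": {
        "email": {
            "properties": {
                "message_id": {
                    "type": "keyword",
                    "fields": {"text": {"type": "text"}},
                }
            }
        }
    }
}

print(json.dumps(message_id_mapping, indent=2))
```

With a mapping like this, exact lookups would target `email.message_id` and tokenized searches would target `email.message_id.text`.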
rfcs/text/0008-email.md
Outdated
| `object.return.address` | keyword | The return address for the message | | ||
| `email.size` | keyword | Total size of the message, in bytes, including attachments | | ||
| `email.subject` | keyword | Subject of the message | | ||
| `email.to` | keyword | Recipieint address | |
I agree there's nuances in recipients, and in how closely we want to represent each protocol.
After looking at a few raw emails, I'm starting to think we should keep the design of this field set pretty high level.
If capturing each protocol's nuances is desired or necessary, then I think we should consider working on a specific breakdown per protocol, as a separate step.
So I think the guiding principle we could use here is to capture the commonalities in `email.*`, even if it means merging some concepts, sometimes.
One point you're raising is pretty interesting, however. Should we capture each "type" of recipient in a different field, or in one parent field with an additional label to indicate which type of recipient this is?
The current proposal takes the former approach:
{ "email": {
"to": [ {"address": "alice@example.com", ...} ],
"cc": [ {"address": "bob@example.com", ...} ],
} }
The "recipient" suggestion would look like:
{ "email": {
"recipients": [
{ "type": "to", "address": "alice@example.com", ...},
{ "type": "cc", "address": "bob@example.com", ...},
]
} }
🤔
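To make the trade-off between the two shapes concrete, here is a minimal sketch that converts one into the other. The helper names and the fixed list of recipient types are assumptions made for this illustration; neither is part of the proposal.

```python
from typing import Dict, List

# Recipient types assumed for this illustration.
RECIPIENT_TYPES = ("to", "cc", "bcc")


def to_recipients_list(email: Dict[str, List[dict]]) -> List[dict]:
    """Flatten per-type fields (email.to, email.cc, ...) into one typed list."""
    recipients = []
    for rtype in RECIPIENT_TYPES:
        for entry in email.get(rtype, []):
            recipients.append({"type": rtype, **entry})
    return recipients


def to_per_type_fields(recipients: List[dict]) -> Dict[str, List[dict]]:
    """Group a typed recipients list back into one field per recipient type."""
    grouped: Dict[str, List[dict]] = {}
    for entry in recipients:
        rest = {k: v for k, v in entry.items() if k != "type"}
        grouped.setdefault(entry["type"], []).append(rest)
    return grouped


per_type = {
    "to": [{"address": "alice@example.com"}],
    "cc": [{"address": "bob@example.com"}],
}
flat = to_recipients_list(per_type)
print(flat)
print(to_per_type_fields(flat) == per_type)
```

The two shapes carry the same information in this sketch, so the choice mostly comes down to how each one behaves when queried and mapped.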
Stage 0: Provide a high level summary of the premise of these changes. Briefly describe the nature, purpose, and impact of the changes. ~2-5 sentences. | ||
--> | ||
|
||
## Fields |
Another item that came to mind, and I think makes sense to capture for further discussion later:
Would `email` be worth considering as an additional allowed value for `event.category`?
I would say +1 on this.
@ebeahan if we would create this new `event.category` we should also update which `event.type` we can combine with it and maybe see if we need new ones?
Let's mention this in the RFC "Fields" section.
Just added the first iteration of the changes we discussed. Hopefully this commit covers all the above comments for now, some of which we could investigate later. Could you please let me know if my commit covered all of the topics we wanted to cover for now?
rfcs/text/0008-email.md
Outdated
| `email.sender.top_level_domain` | keyword | Senders email address | | ||
| `email.message_id` | keyword | Internet message ID of the message | | ||
| `email.reply_to.address` | wildcard | Reply-to address | | ||
| `email.return.address` | wildcard | The return address for the message | |
What about renaming this to `return_path` -- it's a bit more descriptive of what I think you're actually going for here.
I agree on this point as well, will update it in the next commit.
Thanks @P1llus!
I think we should still list some of the discussed points in the `Concerns` section. We don't have to have resolutions at this point. We simply want to capture them now for further discussion later:
- The two items Mat captured in an earlier review thread: [RFC] E-Mail #999 (comment)
- Possibility of needing an `email` `event.category` and possibly new `event.type` values.
- Maybe worth noting that if the flattened data type is used for `email.*` fields, it would be the first time using this type in ECS?
Adding in the latest changes @ebeahan 👍
rfcs/text/0008-email.md
Outdated
| `email.size` | keyword | Total size of the message, in bytes, including attachments | | ||
| `email.subject` | wildcard | Subject of the message | | ||
| `email.recipients.addresses` | keyword | Recipient addresses | | ||
| `email.domains` | keyword | domains related to the email | |
This field really feels like it should be part of the related fields. Something like related.domains (though it currently doesn't exist, so it might be worth keeping here)
Yeah this field is the outcome of a current discussion we have. Instead of having domain fields for bcc, cc, recipients etc, we decided currently to have them all as an array under one field. This might change in the upcoming stages. Thanks for the pointer, always happy to get feedback
Yeah, my concern mainly with a related field is that you lose the directionality of the value. Which might be useful for some use-cases.
We do have the `email.direction` though, would that be sufficient? We would calculate the direction before moving the different domains into `email.domains`, for example.
| `email.cc.addresses` | wildcard | Addresses of Cc's | | ||
| `email.attachments_count` | long | A field outside the flattened structure to control how many attachments are included in the email | | ||
| `email.attachments` | flattened | A flattened field for anything related to attachments. This allows objects being stored with all information for each file when you have multiple attachments | | ||
| `email.direction` | keyword | Direction of the message based on the sending and receving domains | |
What do we foresee as values for this, and from which address fields (to, cc, bcc) would it be categorized?
It seems to me like something that could potentially be difficult to implement, and I'm not sure of the value for visualizations (but I could easily be missing something obvious, it's been one of those days...)
Good question, @dainperkins.
I assume the allowed values in there should be "inbound" and "outbound". Perhaps also "unknown" in the case of relays? Actually just like `network.direction`, "internal" is another class of emails that has a different threat profile. I wonder if there's a need for the value "external" (as in, I'm just an exchange, relaying between Yahoo and Gmail)?
I agree populating this consistently may not be obvious in all scenarios.
I don't think as a third party, our solutions can determine between "inbound", "outbound" and "internal" without specific configuration that says what are "my domains".
But once we know that, I assume the heuristic is pretty straightforward:
- direction = inbound when from is not one of "my domains"
- direction = outbound when from = "my domains" and at least one receiver (to, cc, bcc) contains addresses not in "my domains"
- direction = internal when from = "my domains" and all receivers are "my domains"
So I'm +1 on adding the field. I think it makes sense. And unless I'm missing something, I think the heuristics are reasonable; and actually, perhaps some of the email-related event sources already provide such values? It's certainly useful for a spam filter to know which emails to filter. Not sure if it shows up in their logs though.
Action item for the RFC, though: let's start listing expected values for this field. I'm providing ideas above as a strawperson, based on what we have in `network.direction`. But if email data sources have other values for this, let's bring them to the table as well.
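The heuristic could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name and parameters are hypothetical, the case where sender and all receivers are in "my domains" is labelled "internal" (assuming that is what the third bullet intends), and the relay cases ("external", "unknown") floated above are not handled.

```python
from typing import Iterable, Set


def classify_direction(
    from_domain: str,
    recipient_domains: Iterable[str],
    my_domains: Set[str],
) -> str:
    """Classify an email relative to a configured set of "my domains"."""
    mine = {d.lower() for d in my_domains}
    sender_is_mine = from_domain.lower() in mine
    # Receivers are the domains of all to, cc and bcc addresses.
    recipients = [d.lower() for d in recipient_domains]

    if not sender_is_mine:
        return "inbound"
    if recipients and all(d in mine for d in recipients):
        return "internal"
    # Sender is ours and at least one receiver is not. An empty receiver
    # list also lands here, which is an arbitrary choice for this sketch.
    return "outbound"


my_domains = {"corp.example"}
print(classify_direction("other.example", ["corp.example"], my_domains))
print(classify_direction("corp.example", ["corp.example", "other.example"], my_domains))
print(classify_direction("corp.example", ["corp.example"], my_domains))
```

As noted in the comment, none of this works without configuration that says which domains count as "my domains".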
This RFC has by far surpassed the criteria for stage 1. We have a new author (thanks for picking it up, Marius) and a sponsor. And the fields of the proposal are looking really good already.
Of course on such an interesting subject, I couldn't resist providing some detailed technical feedback below.
But I think we should merge this as a stage 1 after a few minor adjustments which I'll outline in the body of this review, and we should address the rest of the technical feedback in a stage 2 PR.
The only points I'd like to address before merging this at stage 1 are:
- adjusting the "People" heading
- making the list of people into a bullet list
- adjusting the "stage 1" link at the very bottom
- and populating the "references" section
Everything else in my review can be addressed in the stage 2 PR. If you have time to incorporate more feedback in your next session on this, all the better. But feel free to just address the points above so we can move forward.
| `email.bcc.addresses` | wildcard | Addresses of Bcc's | | ||
| `email.cc.addresses` | wildcard | Addresses of Cc's | | ||
| `email.attachments_count` | long | A field outside the flattened structure to control how many attachments are included in the email | | ||
| `email.attachments` | flattened | A flattened field for anything related to attachments. This allows objects being stored with all information for each file when you have multiple attachments | |
The `.attachments.*` fields should follow the `file.*` fields. We can state this approach in the description for now.
We can see later about the implementation, whether it's full reuse, or explicitly defining the fields that make sense for attachments.
| `email.attachments_count` | long | A field outside the flattened structure to control how many attachments are included in the email | | ||
| `email.attachments` | flattened | A flattened field for anything related to attachments. This allows objects being stored with all information for each file when you have multiple attachments | | ||
| `email.direction` | keyword | Direction of the message based on the sending and receving domains | | ||
| `email.sender.address` | wildcard | Senders email address | |
If I remember correctly, address will contain the full `Person Name <person@example.com>`.
We're defining the domain breakdown fields here because the sender is potentially a threat, and this is where we'll be looking for known bad domains/TLDs and so on.
But looking at the fields, I wonder if we should do the same with `email.reply_to.address` and `email.return_path.address`? They're also relevant to the sender.
We can hold off on adding them for now, but I'm floating the idea to get feedback on whether there's a need for them.
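For illustration, a sender breakdown along these lines could be derived with Python's standard library as below. The function name and output keys are hypothetical, and taking the last label of the domain as the top-level domain is a naive assumption; a real implementation would more likely rely on a public suffix list.

```python
from email.utils import parseaddr


def sender_breakdown(raw: str) -> dict:
    """Split a raw `Name <user@domain>` value into candidate sub-fields."""
    display_name, address = parseaddr(raw)
    # Everything after the last "@" is treated as the domain.
    domain = address.rpartition("@")[2] if "@" in address else ""
    # Naive top-level domain: the last dot-separated label of the domain.
    top_level_domain = domain.rpartition(".")[2] if "." in domain else ""
    return {
        "name": display_name,
        "address": address,
        "domain": domain,
        "top_level_domain": top_level_domain,
    }


print(sender_breakdown("Person Name <person@example.com>"))
```

The same helper would apply unchanged to reply-to and return-path values, if breakdown fields are added for them later.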
I think it's worth exploring in the upcoming stage for sure if that is appropriate.
The goal here is to research and understand the impact of these changes on users in the community and development teams across Elastic. 2-5 sentences each. | ||
--> | ||
|
||
## Concerns |
Agreed, I think email has enough subtleties in both the "senders" (sender, reply_to, return_path) and the receivers (to, cc, bcc) that I don't think it makes sense to put them in `[source|destination].user.email`. Also, these aren't array fields anyway, so we couldn't capture everything from the get go.
If an email server logs the `source.ip` from the MX that sent them an email though, that's totally appropriate to capture in that field.
Added the required changes you mentioned + a few more tweaks based on some changes that could wait for stage 2, @webmat. Anything else we should do at this point?
Alright, thanks for the adjustments, @P1llus! I'll adjust the date and merge.