[Security Solution][Telemetry] Concurrent telemetry requests #73558
Pinging @elastic/endpoint-data-visibility-team (Team:Endpoint Data Visibility)
latestEndpointEvent,
lastCheckin,
dailyActiveCount
try {
This looks more aggressive than it is. I just moved the contents of the for loop into a try/catch.
import { EndpointUsage, getEndpointTelemetryFromFleet } from './endpoints';

export type RegisterCollector = (deps: CollectorDependencies) => void;
export interface UsageData {
-  detections: DetectionsUsage;
+  detections: DetectionsUsage | {};
I think this will always be DetectionsUsage, right?
Yea, forgot to remove this when I updated. It's out now. Thanks!
ab764a7 to 2422ed5
@elasticmachine merge upstream
@michaelolo24 this is labeled with 7.9. Does it need to be backported to 7.9, or is 7.10 fine?
@@ -76,9 +76,14 @@ export const registerCollector: RegisterCollector = ({
isReady: () => kibanaIndex.length > 0,
fetch: async (callCluster: LegacyAPICaller): Promise<UsageData> => {
  const savedObjectsClient = await getInternalSavedObjectsClient(core);
  const [detections, endpoints] = await Promise.allSettled([
If one of the requests fails, should we log any errors? Or will that already happen?
It doesn't already happen. We can throw a log in here, but:
- I don't know how likely we are to see those logs later on, since telemetry runs in the background. Do we contact a user if, for whatever reason, we don't get telemetry back?
- I worry about logging any errors in case there's PII in there. I don't expect that, but I'd rather avoid the possibility.
That makes sense. Might be worth asking some others on the telemetry team what the general guidance is.
I guess my thinking was that if we release a new stack version and notice that we're not getting any telemetry for some reason, I believe we collect the logs of our cloud deployments, so we could poke around and at least see if one of the requests is failing or something 🤷.
Yea, as far as I can tell, it sounds like there aren't any established patterns. For now we'll just see an empty object if anything fails, but I think we can work with the telemetry team to put some better logic in here for 7.10.
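For readers following the thread, here is a minimal sketch of the pattern being discussed: start both telemetry branches at once with Promise.allSettled and substitute a fallback for any branch that rejects. The interfaces, the default values, and the names settledOr and fetchUsage are hypothetical stand-ins for illustration, not the code in this PR.

```typescript
// Hypothetical, trimmed-down usage shapes; the real interfaces have more fields.
interface DetectionsUsage {
  rulesEnabled: number;
}
interface EndpointUsage {
  totalInstalled: number;
}

// Hypothetical fallbacks reported when a branch fails.
const defaultDetections: DetectionsUsage = { rulesEnabled: 0 };
const defaultEndpoints: EndpointUsage = { totalInstalled: 0 };

// Return the fulfilled value, or the fallback when the branch rejected.
function settledOr<T>(result: PromiseSettledResult<T>, fallback: T): T {
  return result.status === 'fulfilled' ? result.value : fallback;
}

async function fetchUsage(
  fetchDetections: () => Promise<DetectionsUsage>,
  fetchEndpoints: () => Promise<EndpointUsage>
): Promise<{ detections: DetectionsUsage; endpoints: EndpointUsage }> {
  // Both requests are started before either is awaited, so they run concurrently.
  // allSettled waits for both and never rejects, so one failure cannot hide the other result.
  const [detections, endpoints] = await Promise.allSettled([
    fetchDetections(),
    fetchEndpoints(),
  ]);
  return {
    detections: settledOr(detections, defaultDetections),
    endpoints: settledOr(endpoints, defaultEndpoints),
  };
}
```

If logging is added later, the rejected branch exposes its error as result.reason, so that would be the place to decide what, if anything, is safe to log.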
    }
  }
} catch (error) {
Should we log an error/warning here?
See previous comment :). I'll also check with the Telemetry team to confirm how they handle errors.
@@ -76,9 +76,14 @@ export const registerCollector: RegisterCollector = ({
isReady: () => kibanaIndex.length > 0,
fetch: async (callCluster: LegacyAPICaller): Promise<UsageData> => {
  const savedObjectsClient = await getInternalSavedObjectsClient(core);
  const [detections, endpoints] = await Promise.allSettled([
ℹ️ nice use of .allSettled
@@ -23,7 +23,7 @@ interface DetectionsMetric {

const isElasticRule = (tags: string[]) => tags.includes(`${INTERNAL_IMMUTABLE_KEY}:true`);

-const initialRulesUsage: DetectionRulesUsage = {
+export const initialRulesUsage: DetectionRulesUsage = {
❔ Doc comment on exports
will add, thanks
uniqueHostIds.add(host.id);
const agentId = elastic?.agent?.id;
osTracker = updateEndpointOSTelemetry(os, osTracker);
❔ Could you add a comment here about where the error was being thrown? Moving the try up makes it safer, but it may make it harder to understand where exceptions could come from.
added
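As an illustration of the try/catch placement discussed in this thread, here is a minimal sketch in which the whole loop body sits inside the try, with a comment naming the statements that can throw. The AgentRecord shape and the countUsableAgents helper are hypothetical and much simpler than the collector in this PR.

```typescript
// Hypothetical, simplified agent record; the real saved-object shape is richer.
interface AgentRecord {
  localMetadata?: { host?: { id?: string; hostname?: string } };
}

function countUsableAgents(agents: AgentRecord[]): { counted: number; skipped: number } {
  let counted = 0;
  let skipped = 0;
  for (const agent of agents) {
    try {
      // Can throw here: destructuring fails with a TypeError when localMetadata is undefined.
      const { host } = agent.localMetadata as NonNullable<AgentRecord['localMetadata']>;
      // Can throw here: a record without a host id is treated as unusable.
      if (!host?.id) {
        throw new Error('agent record has no host id');
      }
      counted += 1;
    } catch (error) {
      // One malformed record is skipped without aborting the rest of the loop.
      skipped += 1;
    }
  }
  return { counted, skipped };
}
```

Keeping the try inside the loop means a single bad record costs one data point rather than the whole endpoint branch.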
@@ -34,7 +34,7 @@ const initialRulesUsage: DetectionRulesUsage = {
  },
};

-const initialMlJobsUsage: MlJobsUsage = {
+export const initialMlJobsUsage: MlJobsUsage = {
❔ Docs on exports
added
Left a few ❔s
8d8c6be to d3f9042
d3f9042 to 537aa12
const { last_checkin: lastCheckin, local_metadata: localMetadata } = metadataAttributes;
const { host, os, elastic } = localMetadata as AgentLocalMetadata;

// Although not perfect, the goal is to dedupe hosts to get the most recent data for a host
fyi @jonathan-buttner per our conversation
// Although not perfect, the goal is to dedupe hosts to get the most recent data for a host
// An agent re-installed on the same host will have all the same id, name, and kernel details
// A cloned VM will have the same id, but "may" have the same name and kernel, but it's really up to the user.
const compoundUniqueId = `${host?.id}-${host?.hostname}-${os?.kernel}`;
Hmm, actually, thinking about this more: what would happen if a user updates their computer to a new OS version? I think os.kernel would probably change, right? I think we'd want to treat that telemetry as coming from the same user.
I wonder if we should just stick with ${host?.id}-${host?.hostname}?
True, good point. Yea, the OS information would be tricky in update scenarios. I'll simplify it to just those two. Thanks!
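To make the agreed key concrete, here is a small sketch of de-duplicating on `${host?.id}-${host?.hostname}` while keeping the most recent check-in per key. The Checkin shape and the latestPerHost helper are assumptions for illustration; only the key format comes from this thread.

```typescript
interface HostInfo {
  id?: string;
  hostname?: string;
}

// Hypothetical record: lastCheckin is assumed to be an ISO 8601 timestamp.
interface Checkin {
  host?: HostInfo;
  lastCheckin: string;
}

// Compound key from the thread: host id plus hostname, with no OS or kernel details.
function compoundHostId(host?: HostInfo): string {
  return `${host?.id}-${host?.hostname}`;
}

// Keep only the most recent check-in for each compound key.
function latestPerHost(checkins: Checkin[]): Map<string, Checkin> {
  const latest = new Map<string, Checkin>();
  for (const checkin of checkins) {
    const key = compoundHostId(checkin.host);
    const seen = latest.get(key);
    if (!seen || Date.parse(checkin.lastCheckin) > Date.parse(seen.lastCheckin)) {
      latest.set(key, checkin);
    }
  }
  return latest;
}
```

With this key, a cloned VM that was given a new hostname counts as a separate host, while an OS upgrade on the same machine does not create a second entry. Clones that keep the original hostname still collapse into one entry.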
💚 Build Succeeded
Summary
This PR handles a couple of things:
- [Deduplication]: Because cloned VMs share the same host id, de-duplication is now based on a compound id of [hostId]-[hostName]. Before this change, all clones of a VM were treated as a single host. Clones can still be missed if the hostname is the same across VMs, but this is currently the best known alternative that does not risk adding false positives.
- [Performance & error handling]: I don't expect us to run into performance issues at the moment, but to guard against that, the two sides of our telemetry now run concurrently and independently. If one of our telemetry branches (endpoint or detections) fails for any reason, we still get data back from the other branch if it succeeds.
Checklist