# Progress remaining O11y rule types to FAAD #169867
## Comments
Pinging @elastic/response-ops (Team:ResponseOps)
This issue isn't prioritized for 8.12, so I added an 8.13 label to have it as a candidate. We can backlog the issue for now.
cc @maryam-saeidi we're using this issue to track the remaining O11y rules that need to onboard the framework alerts-as-data APIs. It's not likely that we'll have this prioritized in 8.13, but we're more than happy to let someone else drive this (or part of it) with our help.
cc: @vinaychandrasekhar, per our recent discussion about AAD
@mikecote Thanks for pinging me here; my point in the meeting was a suggestion about the possible meaning of that item. I'll ping @paulb-elastic regarding prioritization.
Sounds good, no specific prioritization ask from us at this time, so it's ok if you don't have capacity 👍 but if you want to pick up the issue, we're more than happy to help!
@maryam-saeidi fyi, we plan to make some progress in this area in 8.14. |
Should we add
@ersin-erdal yes, good catch, please add it to the description 🙏
@mikecote can you point us to docs / etc where folks can read about what FAAD is? That stands for "Framework Alerts-as-Data", is that right? Thanks! |
@jasonrhodes You can read more about Framework Alerts-as-Data here: https://github.com/elastic/response-ops-team/issues/95. It describes the various phases through which we are unifying the architecture so that the framework provides everything.
Towards: #169867

This PR onboards the Inventory Metric Threshold rule type with FAAD.

## To verify

I used [data-generator](https://github.com/ersin-erdal/data-generator) to generate metric data, then created an Inventory Threshold rule with actions (alert and recovered) and the condition: `For Hosts, When CPU usage is above 10`.

Inventory Threshold uses the following formula to calculate the result:

(`system.cpu.user.pct` + `system.cpu.system.pct`) / `system.cpu.cores`

Set `system.cpu.user.pct` = 1, `system.cpu.system.pct` = 1, and `system.cpu.cores` = 4 in [cpu-001](https://github.com/ersin-erdal/data-generator/blob/main/src/indexers/metrics/docs/cpu-001.json). This makes the CPU usage (1 + 1) / 4 = 0.5 (50%) for `host-1`. Then run the generator with `./generate metrics`.

Your rule should create an alert and save it in `.internal.alerts-observability.metrics.alerts-default-000001`.

Then set `system.cpu.user.pct` = 0 and `system.cpu.system.pct` = 0. The alert should recover, and the AAD document in the above index should be updated with `kibana.alert.status: recovered`.
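As a quick sketch for checking the result from Dev Tools (assuming the default space, so the concrete index name matches the one above), the active alert document can be fetched with:

```
GET .internal.alerts-observability.metrics.alerts-default-000001/_search
{
  "query": {
    "term": { "kibana.alert.status": "active" }
  }
}
```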
Towards: #169867

This PR onboards the Log Threshold rule type with FAAD.

### To verify

Create a log threshold rule. Example:

```
POST kbn:/api/alerting/rule
{
  "params": {
    "logView": {
      "logViewId": "Default",
      "type": "log-view-reference"
    },
    "timeSize": 5,
    "timeUnit": "m",
    "count": {
      "value": -1,
      "comparator": "more than"
    },
    "criteria": [
      {
        "field": "log.level",
        "comparator": "equals",
        "value": "error"
      }
    ]
  },
  "consumer": "alerts",
  "schedule": { "interval": "1m" },
  "tags": [],
  "name": "test",
  "rule_type_id": "logs.alert.document.count",
  "notify_when": "onActionGroupChange",
  "actions": []
}
```

Your rule should create an alert and save it in `.internal.alerts-observability.logs.alerts-default-000001`. Example:

```
GET .internal.alerts-*/_search
```

Then set `count.value: 75`. The alert should recover, and the AAD document in the above index should be updated with `kibana.alert.status: recovered`.
Towards: #169867 This PR onboards "SLO burn rate" rule type with FAAD. ## To verify Create an SLO by using a test index (create a dataview for it), use very low `budget consumed %` The rule bound to the SLO should create an alert and save it under `.internal.alerts-observability.slo.alerts-default-000001`
Towards: #169867 This PR onboards "Custom Threshold" rule type with FAAD. ## To verify Create a Custom Threshold rule by using a test index and DW. Set the `Role visibility` `metrics`. When the rule runs, it generates an alert and saves it under `.internal.alerts-observability.threshold.alerts-default`. The alert should be visible on `Observability > alerts` page as well. --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Towards: #169867

This PR onboards the Latency Threshold rule type with FAAD.

### To verify

1. Run the following script to generate APM data:
```
node scripts/synthtrace simple_trace.ts --local --live
```
2. Create a latency threshold rule. Example:
```
POST kbn:/api/alerting/rule
{
  "params": {
    "aggregationType": "avg",
    "environment": "ENVIRONMENT_ALL",
    "threshold": 400,
    "windowSize": 5,
    "windowUnit": "m"
  },
  "consumer": "alerts",
  "schedule": { "interval": "1m" },
  "tags": [],
  "name": "testinggg",
  "rule_type_id": "apm.transaction_duration",
  "notify_when": "onActionGroupChange",
  "actions": []
}
```
3. Your rule should create an alert and save it in `.internal.alerts-observability.apm.alerts-default-000001`. Example:
```
GET .internal.alerts-*/_search
```
4. Set `threshold: 10000`.
5. The alert should recover, and the AAD document in the above index should be updated with `kibana.alert.status: recovered`.
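If other rule types are also writing alerts locally, a query along these lines narrows the result to this rule type; `kibana.alert.rule.rule_type_id` is a standard alerts-as-data field:

```
GET .internal.alerts-observability.apm.alerts-default-000001/_search
{
  "query": {
    "term": { "kibana.alert.rule.rule_type_id": "apm.transaction_duration" }
  }
}
```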
Towards: #169867

This PR onboards the Error Count Threshold rule type with FAAD.

### To verify

1. Run the following script to generate APM data:
```
node scripts/synthtrace many_errors.ts --local --live
```
2. Create an error count threshold rule. Example:
```
POST kbn:/api/alerting/rule
{
  "params": {
    "threshold": 25,
    "windowSize": 5,
    "windowUnit": "m",
    "environment": "ENVIRONMENT_ALL"
  },
  "consumer": "alerts",
  "schedule": { "interval": "1m" },
  "tags": [],
  "name": "testinggg",
  "rule_type_id": "apm.error_rate",
  "notify_when": "onActionGroupChange",
  "actions": []
}
```
3. Your rule should create an alert and save it in `.internal.alerts-observability.apm.alerts-default-000001`. Example:
```
GET .internal.alerts-*/_search
```
4. Recover the alert by setting `threshold: 10000`.
5. The alert should recover, and the AAD document in the above index should be updated with `kibana.alert.status: recovered`.
Towards: #169867

This PR onboards the APM Anomaly rule type with FAAD. I am having trouble getting this rule to create an alert. If there is an easy way to verify, please let me know!
Towards: #169867

This PR onboards the Transaction Error Rate rule type with FAAD.

### To verify

1. Run the following script to generate APM data:
```
node scripts/synthtrace many_errors.ts --local --live
```
2. Create a transaction error rate rule. Example:
```
POST kbn:/api/alerting/rule
{
  "params": {
    "threshold": 0,
    "windowSize": 5,
    "windowUnit": "m",
    "environment": "ENVIRONMENT_ALL"
  },
  "consumer": "alerts",
  "schedule": { "interval": "1m" },
  "tags": [],
  "name": "test",
  "rule_type_id": "apm.transaction_error_rate",
  "notify_when": "onActionGroupChange",
  "actions": []
}
```
3. Your rule should create an alert and save it in `.internal.alerts-observability.apm.alerts-default-000001`. Example:
```
GET .internal.alerts-*/_search
```
4. Recover the alert by setting `threshold: 200`.
5. The alert should recover, and the AAD document in the above index should be updated with `kibana.alert.status: recovered`.
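To confirm the recovery in step 5, a status-filtered query is a quick sketch (assuming the default space, so the index name matches the one above):

```
GET .internal.alerts-observability.apm.alerts-default-000001/_search
{
  "query": {
    "term": { "kibana.alert.status": "recovered" }
  }
}
```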
Towards: #169867

This PR onboards the Uptime rule types (TLS, Duration Anomaly, and Monitor Status) with FAAD.

We are deprecating the rule-registry plugin and onboarding the rule types with the new alertsClient to manage alerts-as-data. There is no new feature: all the rule types should work as they did before and save alerts with all the existing fields.

## To verify:

- Switch to Kibana 8.9.0 in your local repo. (In this version the Uptime rules are not deprecated.)
- Run your ES with: `yarn es snapshot -E path.data=../local-es-data`
- Run your Kibana
- Create Uptime rules with an active and a recovered action (you can run Heartbeat locally if needed; [follow the instructions](https://www.elastic.co/guide/en/beats/heartbeat/current/heartbeat-installation-configuration.html))
- Stop your ES and Kibana
- Switch to this branch and run your ES with `yarn es snapshot -E path.data=../local-es-data` again
- Run your Kibana
- Modify the Uptime rule type code to force it to create an alert. Example: mock [availabilityResults in status_check](https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/uptime/server/legacy_uptime/lib/alerts/status_check.ts#L491) with the data below:
```
availabilityResults = [
  {
    monitorId: '1',
    up: 1,
    down: 0,
    location: 'location',
    availabilityRatio: 0.5,
    monitorInfo: {
      timestamp: '',
      monitor: {
        id: '1',
        status: 'down',
        type: 'type',
        check_group: 'default',
      },
      docId: 'docid',
    },
  },
];
```

It should create an alert. The alert should be saved under the `.alerts-observability.uptime.alerts` index and be visible on the Observability alerts page.

Then remove the mock; the alert should recover.
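For the last two checks, a status-filtered search against the Uptime alerts indices is a quick way to see both states (a sketch; the wildcard covers the concrete backing index):

```
GET .alerts-observability.uptime.alerts*/_search
{
  "query": {
    "terms": { "kibana.alert.status": ["active", "recovered"] }
  }
}
```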
# Backport

This will backport the following commits from `main` to `8.14`:
- [Onboard Uptime rule types with FAAD (#179493)](#179493)

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Ersin Erdal <92688503+ersin-erdal@users.noreply.github.com>
Towards: #169867

This PR onboards the Synthetics Monitor Status rule type with FAAD.

### To verify

I can't get the rule to alert, so I modified the status check to report the monitor as down. If you know of an easier way, please let me know 🙂

1. Create a [monitor](http://localhost:5601/app/synthetics/monitors); by default, creating a monitor creates a rule.
2. Click on the monitor and grab the id and locationId from the URL.
3. Go to [the status check code](https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/synthetics/server/queries/query_monitor_status.ts#L208) and replace the object that is returned with the following, using the id and locationId you got from the monitor:
```
{
  up: 0,
  down: 1,
  pending: 0,
  upConfigs: {},
  pendingConfigs: {},
  downConfigs: {
    '${id}-${locationId}': {
      configId: '${id}',
      monitorQueryId: '${id}',
      status: 'down',
      locationId: '${locationId}',
      ping: {
        '@timestamp': new Date().toISOString(),
        state: {
          id: 'test-state',
        },
        monitor: {
          name: 'test-monitor',
        },
        observer: {
          name: 'test-monitor',
        },
      } as any,
      timestamp: new Date().toISOString(),
    },
  },
  enabledMonitorQueryIds: ['${id}'],
};
```
4. Your rule should create an alert and save it in `.internal.alerts-observability.uptime.alerts-default-000001`. Example:
```
GET .internal.alerts-*/_search
```
5. Recover by repeating step 3 using:
```
{
  up: 1,
  down: 0,
  pending: 0,
  downConfigs: {},
  pendingConfigs: {},
  upConfigs: {
    '${id}-${locationId}': {
      configId: '${id}',
      monitorQueryId: '${id}',
      status: 'down',
      locationId: '${locationId}',
      ping: {
        '@timestamp': new Date().toISOString(),
        state: {
          id: 'test-state',
        },
        monitor: {
          name: 'test-monitor',
        },
        observer: {
          name: 'test-monitor',
        },
      } as any,
      timestamp: new Date().toISOString(),
    },
  },
  enabledMonitorQueryIds: ['${id}'],
};
```
6. The alert should recover, and the AAD document in the above index should be updated with `kibana.alert.status: recovered`.
Now that we've successfully onboarded our first O11y rule type to use FAAD (#164220), we should start onboarding the remaining rule types as well.

The list of rule types includes:
- APM
- Infra
- Logs
- SLO
- Uptime
## Definition of Done