Skip to content
This repository has been archived by the owner on Jan 7, 2020. It is now read-only.

Support marking APIs as healthy/unhealthy #812

Merged
merged 3 commits into from
Feb 26, 2019
Merged

Support marking APIs as healthy/unhealthy #812

merged 3 commits into from
Feb 26, 2019

Conversation

amdprophet
Copy link
Member

Description

Adds the ability to mark an API as healthy/unhealthy. Uchiwa will only make requests to APIs marked as healthy. APIs marked as unhealthy will be watched be a goroutine and marked as healthy once they respond with an HTTP 2xx error.

Related Issue

Fixes #811.

Motivation and Context

See #811.

How Has This Been Tested?

I've tested with the following configuration:

{
  "sensu": [
    {
      "name": "Site 1",
      "host": "127.0.0.1",
      "port": 4567
    },
    {
      "name": "Site 2",
      "host": "127.0.0.1",
      "port": 4567
    },
    {
      "name": "Site 2",
      "host": "10.0.1.11",
      "port": 4567
    }
  ],
  "uchiwa": {
    "host": "0.0.0.0",
    "port": 3000,
      "refresh": 5,
      "loglevel": "warn"
  }
}

I've tested Uchiwa with one or both APIs down, one API up and one API down, and then starting the APIs that were down to see APIs marked as healthy.

Screenshots (if appropriate):

screen shot 2019-02-25 at 3 43 05 pm

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Signed-off-by: Justin Kolberg <amd.prophet@gmail.com>
Signed-off-by: Justin Kolberg <amd.prophet@gmail.com>
Signed-off-by: Justin Kolberg <amd.prophet@gmail.com>
@amdprophet
Copy link
Member Author

@cwjohnston had mentioned that we should mark the API as unhealthy if the stats from GET /info show an unhealthy state. I'm not sure how we can do this as the JSON deserialization and call to determineHealth() happens far after we call GET /info. @palourde do you have any ideas on how we can implement this without deserializing twice?

@codecov-io
Copy link

codecov-io commented Feb 26, 2019

Codecov Report

Merging #812 into master will decrease coverage by 0.13%.
The diff coverage is 4.34%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #812      +/-   ##
==========================================
- Coverage   20.93%   20.79%   -0.14%     
==========================================
  Files          28       28              
  Lines        2532     2553      +21     
==========================================
+ Hits          530      531       +1     
- Misses       1937     1957      +20     
  Partials       65       65
Impacted Files Coverage Δ
uchiwa/daemon/subscriptions.go 83.33% <0%> (ø) ⬆️
uchiwa/daemon/helpers.go 22.22% <0%> (ø) ⬆️
uchiwa/daemon/daemon.go 5.86% <0%> (-0.44%) ⬇️
uchiwa/main.go 44.64% <100%> (+1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f83c003...79440f3. Read the comment docs.

@palourde
Copy link
Contributor

@amdprophet So determineHealth() is called before checkAPIHealth() right? Therefore, before we return here, maybe we could just mark the API as unhealthly?

if d.snapshot.Info != nil {
if !d.snapshot.Info.Redis.Connected {
return structs.SensuHealth{Output: "Not connected to Redis", Status: 1}
}
if !d.snapshot.Info.Transport.Connected && d.snapshot.Info != nil {
return structs.SensuHealth{Output: "Not connected to the transport", Status: 1}
}
}

@amdprophet
Copy link
Member Author

@palourde the problem is trying to determine which API is unhealthy but I just realized that /info would mean the whole datacenter is unhealthy. Perhaps we shouldn't mark APIs as unhealthy based on the contents of the /info response else we should mark all APIs within the DC as unhealthy.

@cwjohnston
Copy link
Contributor

I just realized that /info would mean the whole datacenter is unhealthy

@amdprophet can you elaborate on this?

In my experience the /info API will return a HTTP 500 (internal server error) when that particular API host is disconnected from either Redis or RabbitMQ. Because /info only returns information about the API host being queried, I don’t think we can reasonably consider a whole datacenter down based on a a single API host’s state.

On the other hand, the /health endpoint could be used to determine the health of a whole datacenter, as the combination of queue depth and number of consumers reflects the overall state of the datacenter. That said, I think that reporting on /health results is outside the scope of what we’re trying to do here.

@palourde
Copy link
Contributor

At this point, like Cameron mentioned, I feel like it's a different requirement and probably a different solution too.

The health of Sensu itself (number of consumers, quantity of messages, etc.) shouldn't impact how Uchiwa deal with its datacenter APIs. However, we could definitely improve how Uchiwa reports the health of a datacenter with these metrics.

@amdprophet
Copy link
Member Author

@cwjohnston I was confusing the functionality of /info with /health and I agree with what has been said.

I think this PR is in a state where it can be reviewed and merged if everything looks good.

@amdprophet amdprophet changed the title [WIP] Support marking APIs as healthy/unhealthy Support marking APIs as healthy/unhealthy Feb 26, 2019
Copy link
Contributor

@palourde palourde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amdprophet amdprophet merged commit d76dc26 into master Feb 26, 2019
@amdprophet amdprophet deleted the health-daemon branch February 26, 2019 21:00
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lack of health checking leads to loss of situational awareness
4 participants