Notify Slack when stage nodes aren't updating #6922

Merged Dec 16, 2023 (3 commits)
Changes from 2 commits
102 changes: 102 additions & 0 deletions .circleci/config.yml
@@ -335,6 +335,99 @@ jobs:
git checkout -b release-client-v${VERSION}
git push -u origin release-client-v${VERSION}

  notify-stuck-stage-nodes-job:
    resource_class: small
    docker:
      - image: cimg/base:2023.01
    steps:
      - run:
          name: Alert Slack of stuck stage nodes
          command: |
            handle_error() {
              # Construct failure Slack message
              failure_content="{ \"blocks\": ["
              failure_content+="{ \"type\": \"section\", \"text\": { \"type\": \"plain_text\", \"text\": \"Encountered error while checking for stuck staging nodes\n\" } }"
              failure_content+="]}"
              echo "Sending error message to Slack: $failure_content"

              # Send Slack failure message
              curl -f -X POST -H 'Content-type: application/json' \
                --data "$failure_content" \
                "$SLACK_DAILY_DEPLOY_WEBHOOK"
            }

            # Print the endpoints listed at $1, or FETCH_ERROR followed by the fallback list ($2) if the fetch fails
            fetchEndpoints() {
              url=$1
              fallback=$2
              fetchedEndpoints=$(curl -s "$url" | jq -r '.data[]' 2>/dev/null)

              if [ -z "$fetchedEndpoints" ]; then
                echo "FETCH_ERROR"
                echo "$fallback"
              else
                echo "$fetchedEndpoints"
              fi
            }

            (
              # Note: bash ignores `set -e` inside a subshell used as the left-hand side of `||`
              # (as this one is), so only an explicit `exit` or the status of the last command
              # in the subshell triggers handle_error.
              set -e

              # Fetch the latest version from the GitHub repository (assume Content and Discovery have the same latest versions)
              versionUrl="https://raw.githubusercontent.com/AudiusProject/audius-protocol/main/packages/discovery-provider/.version.json"
              VERSION=$(curl -s "$versionUrl" | jq -r '.version')

              # jq -r prints the string "null" when the key is missing, so check for that as well
              if [ -z "$VERSION" ] || [ "$VERSION" = "null" ]; then
                echo "Failed to fetch version data"
                exit 1
              fi

              contentFallbackEndpoints=("https://creatornode5.staging.audius.co" "https://creatornode6.staging.audius.co" "https://creatornode7.staging.audius.co" "https://creatornode8.staging.audius.co" "https://creatornode9.staging.audius.co" "https://creatornode10.staging.audius.co" "https://creatornode11.staging.audius.co" "https://creatornode12.staging.audius.co")
Member

This feels worse to me tbh. I'd like to avoid putting anything like this in our CI... why do we need a fallback?

Contributor Author

like for when the api is down. i don't feel strongly if you want it to just fail and not try the hardcoded list - can make that change now

Contributor Author

ok fallback is removed completely now. this is what the happy and sad paths look like (i'll unregister that testing node)

[Image: Screenshot 2023-12-15 at 3 24 20 PM]

Member

:D
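
The diff shown on this page still contains the fallback lists that this thread says were later removed. A minimal sketch of what a fallback-free fetch could look like is given here; it is not the code merged in this PR. The function name fetch_endpoints_or_fail and the overridable FETCH variable are assumptions made for illustration; the `.data[]` jq filter is the one used in the diff.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_endpoints_or_fail and FETCH are hypothetical, not taken from the PR.

# Print one endpoint per line from the JSON list at $1.
# Return 1 (and print nothing) if the download fails or the list is empty,
# so the caller can fail the job instead of using a hardcoded list.
fetch_endpoints_or_fail() {
  local url body endpoints
  url=$1
  # FETCH defaults to curl; -f turns HTTP errors into a non-zero exit status.
  body=$(${FETCH:-curl -fsS} "$url") || return 1
  endpoints=$(printf '%s' "$body" | jq -r '.data[]' 2>/dev/null) || return 1
  [ -n "$endpoints" ] || return 1
  printf '%s\n' "$endpoints"
}

# Usage (commented out so that sourcing this file makes no network calls):
# endpoints=$(fetch_endpoints_or_fail "https://api.staging.audius.co/content") || exit 1
# contentEndpoints=($endpoints)
```

With this shape the sad path is an explicit non-zero return, so the caller decides whether to alert Slack and stop, rather than continuing with a stale list.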

              discoveryFallbackEndpoints=("https://discoveryprovider.staging.audius.co" "https://discoveryprovider2.staging.audius.co" "https://discoveryprovider3.staging.audius.co" "https://discoveryprovider4.staging.audius.co" "https://discoveryprovider5.staging.audius.co")

              # Pass each fallback list as one space-separated argument, since fetchEndpoints reads only $2
              contentEndpoints=($(fetchEndpoints "https://api.staging.audius.co/content" "${contentFallbackEndpoints[*]}"))
              discoveryEndpoints=($(fetchEndpoints "https://api.staging.audius.co/discovery" "${discoveryFallbackEndpoints[*]}"))

              slack_message=""

              # Append a line to slack_message for every endpoint that is unreachable or not on $VERSION
              compareVersions() {
                for endpoint in "$@"; do
                  if [ "$endpoint" == "FETCH_ERROR" ]; then
                    continue
                  fi
                  response=$(curl -s -o /dev/null -w "%{http_code}" "$endpoint/health_check")
                  if [ "$response" -eq 200 ]; then
                    endpointVersion=$(curl -s "$endpoint/health_check" | jq -r '.data.version')
                    if [ "$endpointVersion" != "$VERSION" ]; then
                      slack_message+="\n$endpoint (behind at v$endpointVersion)"
                    fi
                  else
                    slack_message+="\n$endpoint (error status=$response)"
                  fi
                done
              }

              compareVersions "${contentEndpoints[@]}"
              compareVersions "${discoveryEndpoints[@]}"

              # Send Slack message if any node is behind
              if [ -n "$slack_message" ]; then
                json_content="{ \"blocks\": [ { \"type\": \"section\", \"text\": { \"type\": \"mrkdwn\", \"text\": \"Please set these nodes back on auto-upgrade if they're not in use:$slack_message\" } } ] }"
                curl -f -X POST -H 'Content-type: application/json' \
                  --data "$json_content" \
                  "$SLACK_DAILY_DEPLOY_WEBHOOK"
              fi

              # Also send a message if the API Gateway is down
              if [[ " ${contentEndpoints[*]} " =~ " FETCH_ERROR " ]] || [[ " ${discoveryEndpoints[*]} " =~ " FETCH_ERROR " ]]; then
                json_content="{ \"blocks\": [ { \"type\": \"section\", \"text\": { \"type\": \"mrkdwn\", \"text\": \"Note: api.staging.audius.co is offline, so a hardcoded list was used to check for offline/out-of-date nodes.\" } } ] }"
                curl -f -X POST -H 'Content-type: application/json' \
                  --data "$json_content" \
                  "$SLACK_DAILY_DEPLOY_WEBHOOK"
              fi
            ) || handle_error

workflows:
  setup:
    when:
@@ -516,3 +609,12 @@ workflows:
        - equal: ['release-client-create-branch', << pipeline.schedule.name >>]
    jobs:
      - generate-client-release

  notify-stuck-stage-nodes:
    when:
      and:
        - equal: [scheduled_pipeline, << pipeline.trigger_source >>]
        - equal: ['notify-stuck-stage-nodes', << pipeline.schedule.name >>]
    jobs:
      - notify-stuck-stage-nodes-job:
          context: [slack-secrets]
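
The compareVersions function in the job mixes the curl calls with the version comparison and builds the Slack JSON by hand, which makes the logic hard to exercise outside CI. A small sketch that separates the comparison and the payload construction from the network calls follows. The names report_stale_nodes and slack_payload, and the "endpoint=version" argument format, are assumptions made for illustration and are not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch only: report_stale_nodes and slack_payload are hypothetical helpers.

# $1 is the expected version; remaining args are "endpoint=version" pairs.
# Prints one line per endpoint whose version differs from the expected one.
report_stale_nodes() {
  local expected pair endpoint version
  expected=$1
  shift
  for pair in "$@"; do
    endpoint=${pair%=*}    # text before the last '='
    version=${pair##*=}    # text after the last '='
    if [ "$version" != "$expected" ]; then
      printf '%s (behind at v%s)\n' "$endpoint" "$version"
    fi
  done
}

# Wrap $1 in a Slack section block, escaping backslashes, double quotes,
# and newlines (other control characters are not handled here).
slack_payload() {
  local text
  text=$(printf '%s\n' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk '{ printf "%s%s", sep, $0; sep = "\\n" }')
  printf '{ "blocks": [ { "type": "section", "text": { "type": "mrkdwn", "text": "%s" } } ] }\n' "$text"
}
```

In the job, the output of report_stale_nodes could be passed to slack_payload and the result handed to `curl --data`. Since the job already relies on jq, building the payload with `jq -n --arg` would be a sturdier alternative to hand escaping.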