O+M 2024-01-01 #4574

Closed
12 tasks
hkdctol opened this issue Jan 2, 2024 · 4 comments
Labels
O&M Operations and maintenance tasks for the Data.gov platform

Comments

@hkdctol
Contributor

hkdctol commented Jan 2, 2024

As part of the day-to-day operation of Data.gov, there are many Operations and Maintenance (O&M) responsibilities. Instead of having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers only East Coast business hours. If you are unavailable, please note in Slack when you will be unavailable and ask someone to take on the role for that time.

Check the O&M Rotation Schedule for future planning.

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Note: Catalog Auto Tasks
You will need to update the chart values manually. Click the Action link in each issue, grab the values from the monitor task output, and check the runtime.

Weekly Checklist

Monthly Checklist

Ad-hoc Checklist

  • Audit/review applications on Cloud Foundry and determine what can be stopped and/or deleted.

Reference

@hkdctol hkdctol moved this to 🏗 In Progress [8] in data.gov team board Jan 2, 2024
@hkdctol hkdctol added the O&M Operations and maintenance tasks for the Data.gov platform label Jan 2, 2024
@Jin-Sun-tts
Contributor

Tuesday 01/02

https://github.com/GSA/data.gov/

notes:

  • Extended the expiration date for some known issues ignored by Snyk to avoid vulnerability-check warnings.
    Update .snyk catalog.data.gov#1192
    Update .snyk inventory-app#681

  • The egress restart failed on logstack-space-drain when restarting all apps in the prod-egress space. This app does not need a running instance; testing on development showed that setting its instance count to 0 avoids the issue.

Check Catalog Auto Tasks

Check Harvesting Emails

  • Catalog:

Other


notes:

Checked the old log messages for the "location not accessible" warning caused by Solr restarts. It has been happening more often recently. We will continue monitoring the frequency of this error, and we already have a ticket to tune the Solr restart speed.

Checked catalog and inventory in production; both work fine.

Also checked the Solr leader and followers; all are working normally.

@FuhuXia
Member

FuhuXia commented Jan 3, 2024

Wednesday 01/03

It appears that all Socrata-backed JSON sources (e.g., montgomerycountymd-json and dot-socrata-data-json) are returning 500 errors. This started 3 weeks ago and they have not recovered yet. We might need to contact Socrata directly or through the affected agencies, such as Montgomery County, Maryland, and DOT.
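A quick way to confirm which sources are still failing is to request each source URL and keep the ones that answer with a 5xx status. The following is only a minimal sketch: the function names are hypothetical, the source URLs would have to come from the harvest source configuration, and network failures other than HTTP errors are not handled.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is what we report.
        return err.code


def failing_sources(
    sources: Dict[str, str],
    get_status: Callable[[str], int] = http_status,
) -> Dict[str, int]:
    """Map source name -> status for every source answering with a 5xx.

    `sources` maps a harvest source name to its URL (supplied by the caller).
    `get_status` is injectable so the check can be exercised offline.
    """
    results = {}
    for name, url in sources.items():
        status = get_status(url)
        if 500 <= status <= 599:
            results[name] = status
    return results
```

Passing a stub for `get_status` makes it possible to try the filter without touching the network.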

The daily tracking update job shows that the visited dataset count has doubled to 13k and the time spent has also doubled to 1.2 hours, compared to before the holiday season. I assume we are counting some bot traffic in the daily tracking. We will need to revisit the exclude bot traffic ticket and update the bot exclusion list.
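For illustration, bot exclusion by user agent could look like the sketch below. It assumes the tracking data exposes a user-agent string per visit, which this thread does not confirm, and the patterns are hypothetical placeholders rather than the actual exclusion list.

```python
import re
from typing import Iterable, Optional, Pattern

# Hypothetical patterns for illustration only; not the team's real list.
BOT_PATTERNS = [r"bot\b", r"crawler", r"spider"]


def compile_exclusions(patterns: Iterable[str]) -> Pattern[str]:
    """Combine the exclusion patterns into one case-insensitive regex."""
    return re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)


def is_bot(user_agent: Optional[str], exclusions: Pattern[str]) -> bool:
    """True if the user agent matches any exclusion pattern."""
    return bool(exclusions.search(user_agent or ""))


def count_human_visits(
    user_agents: Iterable[Optional[str]], exclusions: Pattern[str]
) -> int:
    """Count visits whose user agent is not on the exclusion list."""
    return sum(1 for ua in user_agents if not is_bot(ua, exclusions))
```

Comparing the filtered count against the raw count for one day would give a rough idea of how much of the 13k is bot traffic.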

The db-solr-sync job is being held up by an erroneous dataset. We need to re-open this ticket and fix the USDA JSON.

@hkdctol
Contributor Author

hkdctol commented Jan 3, 2024

I will find a contact.

@Jin-Sun-tts
Contributor

Jin-Sun-tts commented Jan 4, 2024

Thursday 01/04

Extended the expiration date for a known vulnerability issue.

Deleted the logstack-space-drain instance in prod-egress to avoid the restart issue.

Issue resolved in #4535 (comment); db-solr-sync is back to normal.

  • DB-Solr Sync:
    0 packages need to be removed from Solr
    0 packages need to be updated/added to Solr
    55 packages without harvest_object need to be mannually deleted
    Finished 543s
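Since the chart values are copied by hand from task output, a small parser can pull the numbers out of a summary like the one above. This is a sketch that assumes the output keeps exactly the wording shown in this comment; the key names are made up for illustration.

```python
import re
from typing import Dict

# Patterns mirror the summary lines quoted in this comment.
SUMMARY_PATTERNS = {
    "to_remove": re.compile(r"(\d+) packages need to be removed from Solr"),
    "to_update": re.compile(r"(\d+) packages need to be updated/added to Solr"),
    "without_harvest_object": re.compile(r"(\d+) packages without harvest_object"),
    "seconds": re.compile(r"Finished (\d+)s"),
}


def parse_sync_summary(text: str) -> Dict[str, int]:
    """Extract the counts and runtime from a DB-Solr Sync summary.

    Keys whose line is missing from the text are left out of the result.
    """
    result = {}
    for key, pattern in SUMMARY_PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[key] = int(match.group(1))
    return result
```

A missing key in the result is a hint that the job output changed format or the job did not finish.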

@Jin-Sun-tts Jin-Sun-tts moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Jan 8, 2024
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Jan 18, 2024
@github-project-automation github-project-automation bot moved this from 🗄 Closed to ✔ Done in data.gov team board Sep 3, 2024
@btylerburton btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Sep 3, 2024
4 participants