O+M 2024-01-01 #4574
Comments
Tuesday 01/02
https://github.com/GSA/data.gov/ notes:
Check Catalog Auto Tasks
Check Harvesting Emails
Other notes:
Checked the old log messages for the "location not accessible" warning caused by Solr restarts. It has been happening more often recently. We will continue monitoring the frequency of this error, and we already have a ticket to adjust the Solr restart speed.
Checked catalog and inventory production; both work fine.
Also checked the Solr leader and followers; all work as normal.
Wednesday 01/03
It appears that all Socrata-backed JSON sources (e.g., montgomerycountymd-json and dot-socrata-data-json) are returning 500 errors. This started 3 weeks ago and has not recovered yet. We might need to contact Socrata directly or through affected agencies, such as Montgomery County, Maryland, and DOT.
The daily tracking update job shows that the visited dataset count has doubled to 13k and the time spent has also doubled to 1.2 hours, compared to before the holiday season. I assume we are counting some bot traffic in the daily tracking. We will need to revisit the exclude-bot-traffic ticket to update the bot exclusion list.
The db-solr-sync job is being held up by an erroneous dataset. We need to re-open this ticket and fix the USDA JSON.
I will find a contact.
Thursday 01/04
Extended the expiration date for a known vulnerability issue.
Deleted the logstack-space-drain instance in prod-egress to avoid the restart issue.
Issue resolved in #4535 (comment); db-solr-sync is back to normal.
As part of the day-to-day operation of Data.gov, there are many Operations and Maintenance (O&M) responsibilities. Instead of having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you are unavailable, please note in Slack when you will be unavailable and ask someone to take on the role for that time.
Check the O&M Rotation Schedule for future planning.
Acceptance criteria
You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.
Daily Checklist
Weekly Checklist
Monthly Checklist
Ad-hoc Checklist
Reference