O+M 2024-01-01 #4574

hkdctol · 2024-01-02T13:19:22Z

As part of day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Instead of having the entire team watching notifications and risking some notifications slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role which rotates each sprint. This is not meant to be a 24/7 responsibility, only East Coast business hours. If you are unavailable, please note when you will be unavailable in Slack and ask for someone to take on the role for that time.

Check the O&M Rotation Schedule for future planning.

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Check Production State/Actions

Note: Catalog Auto Tasks
You will need to update the chart values manually. Click the Action link in each issue, take the values from the monitor task output, and check the runtime.

Check auto-generated O&M tickets in the No Status column
Check Harvesting Emails
New Relic Alerts Triaged
Triage DMARC Report from Google

Weekly Checklist

DB-Solr Sync
Audit Log (more info on AU-3 and AU-6 Log auditing)
Tracking Update
- NOTE: This job will consistently time out, but it is processing results (more details: change tracking update from nightly job to weekly #4345)
Check Catalog Solr
Catalog Dupe Check

Monthly Checklist

Invicti Scan

Ad-hoc Checklist

Audit/review applications on Cloud Foundry and determine what can be stopped and/or deleted.

Reference

Watch for user email requests
Watch #datagov-alerts and Vulnerable dependency notifications (daily email reports) for critical alerts.
Monitor and improve Data.gov O&M Dashboard
Update and revise Data.gov O&M Tasks


Jin-Sun-tts · 2024-01-02T22:31:54Z

Tuesday 01/02

https://github.com/GSA/data.gov/

notes:

Extended the expiration date for some known issues ignored by Snyk to avoid vulnerability check warnings.
Update .snyk catalog.data.gov#1192
Update .snyk inventory-app#681
Restarting egress failed for logstack-space-drain when restarting all apps in the prod-egress space. This app does not need a running instance; testing on development showed that setting the instance count to 0 avoids the issue.

Check Catalog Auto Tasks

Check Harvesting Emails

Catalog:

Other

New Relic Alerts

notes:

Checked the old log messages for the "location not accessible" warning caused by Solr restarts. It has been happening more often recently. We will continue monitoring the frequency of this error, and there is already a ticket to tune the Solr restart speed.

Checked catalog and inventory production; both work fine.

Also checked the Solr leader and followers; all are working normally.

FuhuXia · 2024-01-03T17:29:46Z

Wednesday 01/03

It appears that all Socrata-backed JSON sources (e.g., montgomerycountymd-json and dot-socrata-data-json) are returning 500 errors. This started 3 weeks ago and they have not recovered yet. We might need to contact Socrata directly or through the affected agencies, such as Montgomery County, Maryland and DOT.

The daily tracking update job shows that the visited dataset count has doubled to 13k and the time spent has also doubled to 1.2 hours, compared to the pre-holiday season. I assume we are counting some bot traffic in the daily tracking. We will need to revisit the "exclude bot traffic" ticket to update the bot exclusion list.

The db-solr-sync job is being held up by an erroneous dataset. We need to re-open this ticket and fix the USDA JSON.
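A quick way to confirm which harvest sources are still failing is to request each source URL and collect the ones that answer with a 5xx status. The following is only a minimal sketch: the function names and the `sources` mapping are hypothetical, and the real source URLs would have to come from the harvest source configuration.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=30):
    """Return the HTTP status code for a GET request to url.

    Connection-level failures (urllib.error.URLError) are not handled
    here and will propagate to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the code is on the exception.
        return err.code


def failing_sources(sources, get_status=fetch_status):
    """Return {name: status} for sources that answer with a 5xx status.

    `sources` maps a harvest source name to its URL (hypothetical input).
    `get_status` is injectable so the check can be tested offline.
    """
    failures = {}
    for name, url in sources.items():
        status = get_status(url)
        if 500 <= status < 600:
            failures[name] = status
    return failures
```

Passing a stub for `get_status` keeps the check testable without network access; with the default it issues one GET per source.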

dataset index error: "Geometry not valid JSON Extra data" #4535 (comment)

hkdctol · 2024-01-03T17:33:02Z

I will find a contact.

Jin-Sun-tts · 2024-01-04T16:23:09Z

Thursday 01/04

Extended the expiration date for a known vulnerability issue.

Deleted the logstack-space-drain instance in prod-egress to avoid the restart issue.

Issue resolved in #4535 (comment); db-solr-sync is back to normal.

DB-Solr Sync:
0 packages need to be removed from Solr
0 packages need to be updated/added to Solr
55 packages without harvest_object need to be mannually deleted
Finished 543s
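For the weekly report, the summary above can be turned into numbers with a small parser. This is a sketch that assumes the job output keeps the line shapes shown here ("N packages ..." and "Finished Ns"); the function names are hypothetical and not part of the sync job itself.

```python
import re

# Count lines such as "0 packages need to be removed from Solr".
COUNT_LINE = re.compile(r"^(\d+) packages (.+)$")
# Closing line such as "Finished 543s".
FINISHED_LINE = re.compile(r"^Finished (\d+)s$")


def parse_sync_summary(text):
    """Return (counts, seconds) parsed from a db-solr-sync summary.

    counts maps each description (the text after "N packages ") to N;
    seconds is the reported duration, or None if no "Finished" line exists.
    """
    counts = {}
    seconds = None
    for raw in text.splitlines():
        line = raw.strip()
        match = COUNT_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
            continue
        match = FINISHED_LINE.match(line)
        if match:
            seconds = int(match.group(1))
    return counts, seconds


def needs_attention(counts):
    """Return the descriptions whose package count is non-zero."""
    return [desc for desc, n in counts.items() if n > 0]
```

Any description returned by `needs_attention` is a line worth following up on before the next sync run.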

hkdctol added this to data.gov team board Jan 2, 2024

hkdctol moved this to 🏗 In Progress [8] in data.gov team board Jan 2, 2024

hkdctol added the O&M Operations and maintenance tasks for the Data.gov platform label Jan 2, 2024

hkdctol assigned Jin-Sun-tts Jan 2, 2024

FuhuXia mentioned this issue Jan 4, 2024

Implement Grok Rules in logstack application #4234

Closed

6 tasks

Jin-Sun-tts moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Jan 8, 2024

hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Jan 18, 2024

btylerburton closed this as completed Sep 3, 2024

github-project-automation bot moved this from 🗄 Closed to ✔ Done in data.gov team board Sep 3, 2024

btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Sep 3, 2024
