O+M 2023-06-23 #4363

Closed
10 tasks
nickumia-reisys opened this issue Jun 20, 2023 · 3 comments
Labels
O&M Operations and maintenance tasks for the Data.gov platform

Comments

@nickumia-reisys

As part of the day-to-day operation of Data.gov, there are many Operations and Maintenance (O&M) responsibilities. Instead of having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers only East Coast business hours. If you are unavailable, please note in Slack when you will be unavailable and ask someone to take on the role for that time.

Miscs

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Check Production State/Actions

Note: Catalog Auto Tasks
You will need to update the chart values manually. Click the Action link in each issue, grab the values from the monitor task output, and check the runtime.

Weekly Checklist

@nickumia-reisys

nickumia-reisys commented Jun 20, 2023

Day 1 Summary

  • Fixed Snyk issues on catalog
  • Enabled monitoring of the harvesting command on GitHub Actions
  • Investigated catalog performance slowdown with @FuhuXia
    • Solr (prod) EFS is okay; we may be able to lower provisioned throughput
    • Solr (prod) restarts are okay
    • Solr (prod) load balancers are configured properly
    • Catalog response time in NR has been 2-3x higher since June 1
      • Ongoing testing with different configuration and code changes
  • Washington State and Connecticut State harvest jobs took exceedingly long to finish; we suspect the app/harvester was simply busy. Gather finished within seconds, but fetch didn't start until a day later. Even after fetch finished within minutes, the harvester did not consider the job complete until it was forcibly closed.
    • More investigation needs to be done
  • Tracking update took ~12 hours on Sunday. Since it exceeded the 6-hour runtime limit on GitHub Actions, we looked at the cloud.gov logs instead: https://logs.fr.cloud.gov
  • Catalog restart had a weird cancellation

@nickumia-reisys

Day 2 + 3 Summary

  • Fixed Snyk errors on inventory
  • Fixed harvest URL for IMLS harvest source (reference)
  • db-solr-sync was unexpectedly interrupted on 6/22. It did some work, but then exited with code 143 (SIGTERM) after cloud.gov terminated it.
  • Minor CI outage due to a Cloud Foundry CLI download issue
  • Solr Follower 1 and 2 restarted 6/22
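
For anyone reading the 143 above in task logs: by shell convention, an exit status above 128 usually means the process was killed by signal number (status minus 128), so 143 corresponds to signal 15, SIGTERM. The helper below is only an illustrative sketch of that arithmetic; `describe_exit_code` is a hypothetical name and is not part of any Data.gov or cloud.gov tooling.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a shell-style exit status (hypothetical helper, for illustration only)."""
    if code == 0:
        return "success"
    if code > 128:
        # Convention: 128 + N means the process was terminated by signal N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            return f"exit code {code} (unknown signal {signum})"
        return f"terminated by {name} (signal {signum})"
    return f"exited with error code {code}"


print(describe_exit_code(143))
```

This only decodes the status; it does not say why the platform sent the signal, which still has to come from the cloud.gov logs.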

@nickumia-reisys

nickumia-reisys commented Jun 23, 2023

Day 4 Summary

  • Solr follower 1 restarted 6/23
  • Confirmed IMLS harvest source fixed
  • Found issue with Alaska Department of Natural Resources, IRM harvest source (reference)
  • Found and resolved issue with EAC data harvest source (reference)
  • Found and partially resolved issue with Alaska Division of Geological and Geophysical Surveys harvest source (reference)
  • Found and partially resolved issue with Arkansas Geographic Information Office harvest source (reference)
  • Found and resolved issue with City of Baton Rouge Data.json harvest source (reference)
  • Found and resolved issue with EnergyStar harvest source (reference)
  • Performed log review

@github-project-automation github-project-automation bot moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Jun 26, 2023
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Jul 6, 2023
@nickumia-reisys nickumia-reisys added the O&M Operations and maintenance tasks for the Data.gov platform label Oct 9, 2023