Rework GHES resource allocation docs (#52171)
Co-authored-by: Ryan Trauntvein <djdefi@github.com>
Co-authored-by: Stoney <19228888+ThatStoney@users.noreply.github.com>
Co-authored-by: hubwriter <hubwriter@github.com>
4 people authored Sep 6, 2024
1 parent 5381196 commit 54469db
Showing 4 changed files with 189 additions and 12 deletions.
content/admin/monitoring-and-managing-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard.md

@@ -27,20 +27,75 @@ shortTitle: Access the monitor dashboard

![Screenshot of the header of the {% data variables.enterprise.management_console %}. A tab, labeled "Monitor", is highlighted with an orange outline.](/assets/images/enterprise/management-console/monitor-dash-link.png)

## Troubleshooting common resource allocation problems on your appliance
1. In HA and cluster environments, you can switch between nodes by selecting a different hostname from the dropdown.

{% note %}
## Using the monitor dashboard

**Note**: Because regularly polling {% data variables.location.product_location %} with continuous integration (CI) or build servers can effectively cause a denial of service attack that results in problems, we recommend using webhooks to push updates. For more information, see "[AUTOTITLE](/get-started/exploring-integrations/about-webhooks)".
The page visualizes metrics which can be useful for troubleshooting performance issues and better understanding how your {% data variables.product.prodname_ghe_server %} appliance is being used. The data behind the graphs is gathered by the `collectd` service and sampled every 10 seconds.

{% endnote %}
Within the pre-built dashboard, you can find various sections that group graphs for different types of system resources.

Use the monitor dashboard to stay informed on your appliance's resource health and make decisions on how to fix high usage issues.
Building your own dashboards and alerts requires forwarding the data to an external instance by enabling `collectd` forwarding. For more information, see "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/configuring-collectd-for-your-instance)."

| Problem | Possible cause(s) | Recommendations |
| -------- | ----------------- | --------------- |
| High CPU usage | VM contention from other services or programs running on the same host | If possible, reconfigure other services or programs to use fewer CPU resources. To increase total CPU resources for the VM, see "[AUTOTITLE](/admin/enterprise-management/updating-the-virtual-machine-and-physical-resources/increasing-cpu-or-memory-resources)." |
| High memory usage | VM contention from other services or programs running on the same host | If possible, reconfigure other services or programs to use less memory. To increase the total memory available on the VM, see "[AUTOTITLE](/admin/enterprise-management/updating-the-virtual-machine-and-physical-resources/increasing-cpu-or-memory-resources)." |
| Low disk space availability | Large binaries or log files consuming disk space | If possible, host large binaries on a separate server, and compress or archive log files. If necessary, increase disk space on the VM by following the steps for your platform in "[AUTOTITLE](/admin/enterprise-management/updating-the-virtual-machine-and-physical-resources/increasing-storage-capacity)." |
| Higher than usual response times | Often caused by one of the above issues | Identify and fix the underlying issues. If response times remain high, contact us by visiting {% data variables.contact.contact_ent_support %}. |
| Elevated error rates | Software issues | Contact us by visiting {% data variables.contact.contact_ent_support %} and include your support bundle. For more information, see "[Providing data to {% data variables.product.prodname_enterprise %} Support](/enterprise/{{ currentVersion}}/admin/guides/enterprise-support/providing-data-to-github-support#creating-and-sharing-support-bundles)." |
## About the metrics on the monitor dashboard

### System health

The system health graphs provide a general overview of services and system resource utilization. The CPU, memory, and load average graphs are useful for identifying trends or times where provisioned resource saturation has occurred. For more information, see "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/recommended-alert-thresholds)."

### Processes

The processes graph section looks deeper into the major individual services which make up the {% data variables.product.prodname_ghe_server %} appliance. Looking at these services individually can show how usage trends impact system resources over time.

### Authentication

The authentication graphs break down the rates at which users and applications are authenticating to the {% data variables.product.prodname_ghe_server %} appliance. We also track the protocol or service type, such as Git or API, for these authentications, which is useful for identifying broad user activity trends. The authentication graphs can help you find interesting trends or timeframes to look at when diving deeper into authentication and API request logs.

### LDAP

LDAP graphs will only display data if LDAP authentication is enabled on the {% data variables.product.prodname_ghe_server %} appliance. For more information, see "[AUTOTITLE](/admin/managing-iam/using-ldap-for-enterprise-iam/using-ldap)." These graphs can help you to identify slow responses from your LDAP server, as well as the overall volume of LDAP password based authentications.

### App servers

The application servers section provides insight into the activity of {% data variables.product.prodname_ghe_server %} services which provide data to users and integrations.

### App request/response

The **App request/response** section looks at the rate of requests, how quickly those requests are responded to, and with what status they returned.

### Actions

The graphs break down different metrics about {% data variables.product.prodname_actions %} on {% data variables.location.product_location %}, including an overview of web requests to {% data variables.product.prodname_actions %} services.

### Background jobs

The background jobs graphs show the number of tasks queued for background processing on the {% data variables.product.prodname_ghe_server %} appliance.

### Network

The network interface graphs can be useful for profiling user activity and the throughput of traffic in and out of the {% data variables.product.prodname_ghe_server %} appliance.

### Storage

{% data variables.product.prodname_ghe_server %} repository performance is very dependent on the underlying storage system. Low latency, local SSD disks provide the highest performance. For more information on the {% data variables.product.prodname_enterprise %} storage architecture, see "[AUTOTITLE](/enterprise-server@3.14/admin/overview/system-overview)."

### Appliance-specific system services

System services graphs contain data related to the major databases on {% data variables.product.prodname_ghe_server %}. These are the persistent databases MySQL and Elasticsearch, as well as Redis and Memcached, which contain ephemeral data.

* Memcached: Provides a layer of in-memory caching for web and API operations. Memcached helps to provide quicker response times for users and integrations interacting with the system.
* MySQL: The primary database in {% data variables.product.prodname_ghe_server %}. User, issue, and other non-git or search related metadata is stored within MySQL.
* Nomad Jobs: {% data variables.product.prodname_ghe_server %} utilizes Nomad internally as the workload orchestrator, where the CPU and memory usage of individual services can be seen.
* Redis: The database mainly contains the background job queues, as well as session state information.
* Kafka-Lite: Kafka broker service for job processing.
* Elasticsearch: Powers the built-in search features in {% data variables.product.prodname_ghe_server %}.
* Custom hooks: Graphs related to pre-receive hook execution.
* Git fetch caching: {% data variables.product.prodname_ghe_server %} will attempt to cache intensive operations, such as Git pack-objects, when multiple identical requests arrive in quick succession.
* MinIO: Storage used by some {% data variables.product.prodname_ghe_server %} services.
* Packages: Requests powering {% data variables.product.prodname_registry %}.
* SecretScanning: Services powering {% data variables.product.prodname_secret_scanning_caps %} features.
* CodeScanning: Services powering {% data variables.product.prodname_code_scanning_caps %} features.
* Cluster: Graphs related to {% data variables.product.prodname_ghe_server %} high availability or clustering.
* Babeld: Git proxy.
* Alive: Service powering live updates.
* ghes-manage: Service powering the GHES Manage API.

@@ -21,6 +21,7 @@ children:
- /collectd-metrics-for-github-enterprise-server
- /monitoring-using-snmp
- /about-system-logs
- /troubleshooting-resource-allocation-problems
- /generating-a-health-check-for-your-enterprise
shortTitle: Monitor your instance
---
@@ -0,0 +1,121 @@
---
title: Troubleshooting resource allocation problems
intro: Troubleshooting common resource allocation issues that may occur on your {% data variables.product.prodname_ghe_server %} appliance.
redirect_from:
- /enterprise/admin/installation/troubleshooting-resource-allocation-problems
versions:
ghes: '*'
type: how_to
topics:
- Enterprise
- Fundamentals
- Infrastructure
- Monitoring
- Performance
- Troubleshooting
shortTitle: Troubleshooting resource allocation problems
---

## Troubleshooting common resource allocation problems on your appliance

{% note %}

**Note**: Regularly making repeated requests (polling) to {% data variables.location.product_location %} from continuous integration (CI) systems, build servers, or any other clients (such as Git or API clients) can overwhelm the system. This can lead to a denial of service (DoS) attack, causing significant performance issues and resource saturation.

To avoid these problems, we strongly recommend using webhooks to receive updates. Webhooks allow the system to push updates to you automatically, eliminating the need for constant polling. Additionally, consider using conditional requests and caching strategies to minimize unnecessary requests. Avoid running jobs in large, simultaneous batches (thundering herds) and instead wait for webhook events to trigger actions.

For more information, see "[AUTOTITLE](/get-started/exploring-integrations/about-webhooks)."

{% endnote %}

We recommend using the monitor dashboard to stay informed on your appliance's resource health and make decisions on how to fix high usage issues, such as the ones outlined on this page.

For system-critical issues, and prior to making modifications to your appliance, we highly recommend contacting us by visiting {% data variables.contact.contact_ent_support %} and including your support bundle. For more information, see "[Providing data to {% data variables.product.prodname_enterprise %} Support](/enterprise/{{ currentVersion}}/admin/guides/enterprise-support/providing-data-to-github-support#creating-and-sharing-support-bundles)."
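
If you need to create a support bundle to share with support, one minimal sketch is to stream it from the appliance over the administrative SSH port (replace `HOSTNAME` with your instance's hostname; the linked support documentation covers the authoritative steps and options):

```shell
# Stream a support bundle from the appliance to your workstation over the
# administrative SSH port (122). Replace HOSTNAME with your instance's hostname.
ssh -p 122 admin@HOSTNAME -- 'ghe-support-bundle -o' > support-bundle.tgz
```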

## High CPU usage

### Possible causes

* The CPU of your instance is under-provisioned for your workload.
* Upgrading to a new {% data variables.product.prodname_ghe_server %} release often increases CPU and memory usage due to new features. Additionally, post-upgrade migration or reconciliation background jobs can temporarily degrade performance until they complete.
* Elevated requests against Git or API. Increased requests to Git or API can occur due to various factors, such as excessive repository cloning, CI/CD processes, or unintentional usage by API scripts or new workloads.
* Increased number of [GitHub Actions jobs](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard#actions).
* Elevated number of Git commands executed against a large repository.

### Recommendations

* Ensure CPU cores are [provisioned appropriately](/admin/installing-your-enterprise-server/setting-up-a-github-enterprise-server-instance/installing-github-enterprise-server-on-aws#minimum-requirements).
* [Set alert thresholds](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/recommended-alert-thresholds).
* After an upgrade, check whether background upgrade jobs have completed by running `ghe-check-background-upgrade-jobs` (see the sketch after this list).
* Use webhooks instead of polling.
* Use [API rate-limiting](/admin/configuring-settings/configuring-user-applications-for-your-enterprise/configuring-rate-limits).
* Analyze Git usage by checking [current operations](/admin/administering-your-instance/administering-your-instance-from-the-command-line/command-line-utilities#ghe-btop) and [Git traffic](/admin/administering-your-instance/administering-your-instance-from-the-command-line/command-line-utilities#ghe-governor).
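
As a starting point for the checks above, the following is a rough sketch of commands you might run from the administrative shell; both utilities are covered in the command-line documentation linked above:

```shell
# Confirm that post-upgrade background jobs have finished running.
ghe-check-background-upgrade-jobs

# Watch current Git operations in real time to spot repositories or clients
# generating unusually heavy traffic.
ghe-btop
```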

## High memory usage

### Possible causes

* Memory of your instance is under-provisioned.
* Elevated requests against Git or API. Increased requests to Git or API can occur due to various factors, such as excessive repository cloning, CI/CD processes, or unintentional usage by API scripts or new workloads.
* Individual services exceeding their expected memory usage and running Out Of Memory (OOM).
* Increased background job processing.

### Recommendations

* Ensure the memory of your instance is provisioned appropriately for your workload. Your data volume and usage over time may grow to exceed the [minimum requirements](/admin/installing-your-enterprise-server/setting-up-a-github-enterprise-server-instance/installing-github-enterprise-server-on-aws#minimum-requirements).
* Within the Nomad graphs, identify services with out-of-memory trends, which are often followed by a jump in free memory after the service is restarted. For more information, see "[AUTOTITLE](/enterprise-server@3.14/admin/monitoring-and-managing-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard#appliance-specific-system-services)."
* Check logs for processes going out of memory by running `rg -z 'kernel: Out of memory: Killed process' /var/log/syslog*`, as shown in the sketch after this list (for this, first log in to the administrative shell using SSH - see "[AUTOTITLE](/enterprise-server@3.14/admin/administering-your-instance/administering-your-instance-from-the-command-line/accessing-the-administrative-shell-ssh).")
* Ensure the correct ratio of memory to CPU is met (at least `6.5:1`).
* Check the number of tasks queued for background processing - see "[AUTOTITLE](/enterprise-server@3.14/admin/monitoring-and-managing-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard#background-jobs)."
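
A minimal sketch of the memory checks above, run from the administrative shell:

```shell
# Check current memory headroom on the appliance.
free -m

# Search the system logs, including rotated ones, for kernel OOM kills.
rg -z 'kernel: Out of memory: Killed process' /var/log/syslog*
```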

## Low disk space availability

Low disk space on either storage volume, the one mounted to the root filesystem path (`/`) or the one mounted to the user filesystem path (`/data/user`), can cause stability problems for your instance.

Keep in mind that the root storage volume is split into two equally-sized partitions. One of the partitions will be mounted as the root filesystem (`/`). The other partition is only mounted as `/mnt/upgrade` during upgrades and rollbacks of upgrades, to facilitate easier rollbacks if necessary. For more information, see "[AUTOTITLE](/admin/overview/system-overview#storage-architecture)."
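
To see this layout on your own instance, standard Linux tooling from the administrative shell is enough; a quick sketch:

```shell
# Show the block devices and partitions backing the root and user filesystems.
lsblk

# Show current usage of the root and user filesystem mounts.
df -h / /data/user
```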

### Possible causes

* Service failure causing an increased volume of logs
* High disk usage through organic traffic

### Recommendations

* Check the disk usage of the `/var/log` folder by running `sudo du -csh /var/log/*`, or manually force a log rotation with `sudo logrotate -f /etc/logrotate.conf` (see the sketch after this list).
* Check the disk for large files that have been deleted but still have open file handles, by running `ghe-check-disk-usage`.
* Increase disk storage capacity - see "[AUTOTITLE](/enterprise-server@3.14/admin/monitoring-and-managing-your-instance/updating-the-virtual-machine-and-physical-resources/increasing-storage-capacity)."
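
The checks above can be combined into a quick triage pass from the administrative shell, for example:

```shell
# Identify which log directories are consuming the most space.
sudo du -csh /var/log/*

# Report disk usage, including space still held by deleted files with open handles.
ghe-check-disk-usage

# Force a log rotation if /var/log is the main consumer.
sudo logrotate -f /etc/logrotate.conf
```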

## Higher than usual response times

### Possible causes

* Elevated requests against Git or API. Increased requests to Git or API can occur due to various factors, such as excessive repository cloning, CI/CD processes, or unintentional usage by API scripts or new workloads.
* Slow database queries.
* Elevated Elasticsearch resource usage after an upgrade.
* Reaching IOPS quotas on disk and/or heavy IO contention.
* Saturated workers.
* Webhook delivery delays.

### Recommendations

* Look for spikes or sustained numbers in the **Disk pending operations: Number of operations queued** graphs.
* Check the **App request/response** panel to see if only certain services are affected.
* After an upgrade, check whether background upgrade jobs have completed, by running `ghe-check-background-upgrade-jobs`.
* Check the database logs for slow queries in `/var/log/github/exceptions.log` (for this, first log in to the administrative shell using SSH - see "[AUTOTITLE](/enterprise-server@3.14/admin/administering-your-instance/administering-your-instance-from-the-command-line/accessing-the-administrative-shell-ssh)"). For example, list the top 10 slow requests by URL: `grep SlowRequest /var/log/github/exceptions.log | jq '.url' | sort | uniq -c | sort -rn | head` (see the sketch after this list).
* Check the **Queued requests** graph for certain workers and consider adjusting their active worker count.
* Move to storage disks with higher IOPS/throughput.
* Check the number of tasks queued for background processing - see "[AUTOTITLE](/enterprise-server@3.14/admin/monitoring-and-managing-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard#background-jobs)."
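
As a sketch of the slow-query check above; the same pipeline also works against the `github-logs/` directory of an extracted support bundle:

```shell
# List the top 10 slow requests by URL from the exceptions log.
grep SlowRequest /var/log/github/exceptions.log \
  | jq '.url' | sort | uniq -c | sort -rn | head
```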

## Elevated error rates

### Possible causes

* Elevated requests against Git or API. Increased requests to Git or API can occur due to various factors, such as excessive repository cloning, CI/CD processes, or unintentional usage by API scripts or new workloads.
* Failing `haproxy` service or non-availability of individual services.
* Failed repository network maintenance over time.

### Recommendations

* Check the **App request/response** panel to see if only certain services are affected.
* Check the `haproxy` logs and try to identify whether bad actors may be the cause (see the sketch after this list).
* Check for failed repository network maintenance jobs (visit `http(s)://[hostname]/stafftools/networks`).
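
A rough sketch for reviewing `haproxy` traffic, assuming the logs are written to `/var/log/haproxy.log` on your appliance and use the default HTTP log format (verify the field positions against your own log lines before relying on the counts):

```shell
# Count requests per client address to spot potential bad actors.
awk '{print $6}' /var/log/haproxy.log | cut -d: -f1 | sort | uniq -c | sort -rn | head
```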
