PLG Cloud.gov #3192

Merged
merged 99 commits into from
Nov 1, 2024
Conversation

elipe17

@elipe17 elipe17 commented Sep 18, 2024

Summary of Changes

  • Add manifests to deploy PLG as binaries via binary buildpack
  • Add local vs deployed configurations for PLG apps
  • Added routing based on dev environment for the interim
  • Added deploy script for PLG and PG exporters
  • Fixed local Nginx proxying to Grafana

Pull request closes #3046

Considerations:

  • We need to move/should we move all PLG apps into the production env?
  • We need to actually calculate memory requirements for PLG. My assumption is Loki: 2-4GB, Prometheus: 2-4GB (calculator), Grafana: 1GB, 3 PG exporters: 3 * 24MB, 6 Backend Promtails: 6 * 64MB. [1]
  • We can/should hook Grafana up to our RDS to have a more persistent set of datasources, dashboards, etc...
  • Need to play around with cloud.gov provisioned dashboards. I couldn't get it to work so I just uploaded them manually.
  • Need to re-evaluate getting promtail to run with frontend
  • Need to update/add our Promtail pipelines to parse log messages better so we can filter our logs dashboard more effectively [3]
  • Consider moving deployment of all apps to docker containers to make life with PLG and frontend/backend deployment a bit easier.
  • Need to verify logs in Loki are queryable from days, weeks, months, etc ago
  • While TDP lives in cloud.gov, it might be worth seeing whether we can set up a log drain through cloud.gov to Loki. Then we would get all of the cloud.gov logs as well.
  • Consider adding metric exporting for Elastic
  • Consider adding metric exporting for machine stats (cpu, memory, etc) [5]
  • Consider using Grafana Alerting vs AlertManager [2]
  • Prometheus does not have persistent storage. Need to integrate with something like Mimir to support S3 based metric storage. [4]

How to Test

PLG is deployed in the dev environment for the moment. If you would like to browse Grafana, I have opened a public route to it for the interim until we decide where PLG is going to live. Reach out to me for the username and password. Once you're logged in, feel free to browse the dashboards. Note that the Logs dashboard only has logging information as far back as 09/25/2024 at ~9:40am ET, since that is when Promtail had its first successful exports.

Deliverables

More details on how deliverables herein are assessed included here.

Deliverable 1: Accepted Features

Checklist of ACs:

  • Prometheus, Loki, and Grafana are deployed in cloud.gov
  • Grafana is connected to Prometheus and Loki and can query the logs from Loki
  • Backend apps can push logs to Loki
  • Prometheus can pull metrics from backend apps and postgres exporters
  • Loki has persistent log storage via S3
  • Testing Checklist has been run and all tests pass
  • README is updated, if necessary

Deliverable 2: Tested Code

  • Are all areas of code introduced in this PR meaningfully tested?
    • If this PR introduces backend code changes, are they meaningfully tested?
    • If this PR introduces frontend code changes, are they meaningfully tested?
  • Are code coverage minimums met?
    • Frontend coverage: [insert coverage %] (see CodeCov Report comment in PR)
    • Backend coverage: [insert coverage %] (see CodeCov Report comment in PR)

Deliverable 3: Properly Styled Code

  • Are backend code style checks passing on CircleCI?
  • Are frontend code style checks passing on CircleCI?
  • Are code maintainability principles being followed?

Deliverable 4: Accessible

  • Does this PR complete the epic?
  • Are links included to any other gov-approved PRs associated with epic?
  • Does PR include documentation for Raft's a11y review?
  • Did automated and manual testing with iamjolly and ttran-hub using Accessibility Insights reveal any errors introduced in this PR?

Deliverable 5: Deployed

  • Was the code successfully deployed via automated CircleCI process to development on Cloud.gov?

Deliverable 6: Documented

  • Does this PR provide background for why coding decisions were made?
  • If this PR introduces backend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces frontend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces dependencies, are their licenses documented?
  • Can reviewer explain and take ownership of these elements presented in this code review?

Deliverable 7: Secure

  • Does the OWASP Scan pass on CircleCI?
  • Do manual code review and manual testing detect any new security issues?
  • If new issues detected, is investigation and/or remediation plan documented?

Deliverable 8: User Research

Research product(s) clearly articulate(s):

  • the purpose of the research
  • methods used to conduct the research
  • who participated in the research
  • what was tested and how
  • impact of research on TDP
  • (if applicable) final design mockups produced for TDP development

@elipe17 elipe17 self-assigned this Sep 18, 2024

codecov bot commented Sep 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.52%. Comparing base (9bd3deb) to head (ce8df47).
Report is 101 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #3192      +/-   ##
===========================================
+ Coverage    90.65%   91.52%   +0.86%     
===========================================
  Files          299      297       -2     
  Lines         8490     8415      -75     
  Branches       794      608     -186     
===========================================
+ Hits          7697     7702       +5     
+ Misses         676      603      -73     
+ Partials       117      110       -7     
Flag Coverage Δ
dev-backend 91.37% <ø> (+0.98%) ⬆️
dev-frontend 92.66% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
tdrs-backend/tdpservice/settings/common.py 99.33% <ø> (ø)

... and 12 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cb51a68...ce8df47. Read the comment docs.

@elipe17 elipe17 added the Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI label Sep 19, 2024
@elipe17 elipe17 added Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI and removed Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI labels Sep 19, 2024
@elipe17 elipe17 marked this pull request as ready for review September 19, 2024 14:38
- update volume mount
- update promtail to scrape new location
- update backend log file location
- Remove docker container scrape config
- Testing running bogus command for nginx
export GF_DATABASE_NAME=grafana
export GF_DATABASE_USER=$(echo $VCAP_SERVICES | jq -r '."aws-rds"[0].credentials.username')
export GF_DATABASE_PASSWORD=$(echo $VCAP_SERVICES | jq -r '."aws-rds"[0].credentials.password')
wget https://dl.grafana.com/oss/release/grafana-11.2.0.linux-amd64.tar.gz
Collaborator

Created #3232 for this.
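For readers less familiar with jq, the `GF_DATABASE_USER` and `GF_DATABASE_PASSWORD` exports in the diff excerpt above read the first `aws-rds` entry from the `VCAP_SERVICES` JSON. A minimal Python sketch of the same lookup follows; the function name and the sample payload are hypothetical and only illustrate the shape the jq paths expect.

```python
import json


def rds_credentials(vcap_services_json):
    """Return (username, password) from the first "aws-rds" service binding.

    Mirrors the jq paths '."aws-rds"[0].credentials.username' and
    '."aws-rds"[0].credentials.password' used in the export lines.
    """
    services = json.loads(vcap_services_json)
    credentials = services["aws-rds"][0]["credentials"]
    return credentials["username"], credentials["password"]


# Hypothetical sample payload; real values come from the bound service.
sample_vcap = json.dumps(
    {
        "aws-rds": [
            {"credentials": {"username": "example_user", "password": "example_password"}}
        ]
    }
)

user, password = rds_credentials(sample_vcap)
print(user)
```

In a deployed app the JSON string would come from the `VCAP_SERVICES` environment variable rather than a literal.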

@ADPennington
Collaborator

@elipe17 can you remind me: is there a relationship between this PR and #3199? I see some file changes here that bump-up RAM resources, so I'd like to have OCIO approval before we increase costs if possible. reference to our questions

@elipe17
Author

elipe17 commented Oct 16, 2024

@elipe17 can you remind me: is there a relationship between this PR and #3199? I see some file changes here that bump-up RAM resources, so I'd like to have OCIO approval before we increase costs if possible. reference to our questions

@ADPennington, there is a relationship between this and the ADR. The ADR provides a brief "why" with respect to the introduction of PLG and where it lives in the boundary diagram. The ADR does not cover the RAM cost of PLG; that could definitely be added. There is no increase in RAM allocation for existing apps (frontend/backend/etc). However, I did provide general "guesstimates" of the RAM needed for PLG to run smoothly while monitoring all of our deployments. That number is an aggregate for PLG in its entirety; it is not required per space or anything like that. As an absolute maximum we would be looking at 10GB of memory, and 3.5GB on the lowest end. I am pretty confident we will skew towards the low end. I just need to do some math and better estimation to figure it out.

@ADPennington ADPennington removed the Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI label Oct 16, 2024
@elipe17
Author

elipe17 commented Oct 16, 2024

@elipe17 can you remind me: is there a relationship between this PR and #3199? I see some file changes here that bump-up RAM resources, so I'd like to have OCIO approval before we increase costs if possible. reference to our questions

@ADPennington I also wanted to add that merging this won't incur any new cost. Everything is already deployed in the dev space with the resources we currently have. The cost implications will come into question when 3222 is worked.

@ADPennington ADPennington added the Blocked Label for Pull Requests that are currently blocked by a dependency label Oct 17, 2024

@raftmsohani raftmsohani left a comment

LGTM!

@elipe17
Author

elipe17 commented Oct 24, 2024

Below are my true calculations for the required memory to deploy PLG in production.

Prometheus is scraping about 2400 series from our dev environment. You can determine this by logging into Grafana and querying the prometheus_tsdb_head_series metric. Some simple math says 2400 * 3 is about 7200 series in total for Prometheus to manage with our current configuration across all deployed environments. The dev team would like to (in the future) add other series to be tracked for other monitoring and alerting solutions (CPU, memory, Elastic usage, etc...). Looking ahead, it would make sense to pad Prometheus' memory in preparation for those potential additions. We should expect to have ~2 or 3 times as many series when those are added; to be safe I assume 3 times as many, or 21,600 series across all environments. Using the tool provided here I calculated that Prometheus will need ~100MB of memory to manage all of our current and future series (see screenshot below). From there we add the memory that the dev-deployed Prometheus app is currently using (see screenshot below) and see that the app will require a minimum of 180MB of memory. We should pad this value for safety and availability. Thus, the current configuration of Prometheus with 512MB of memory should be sufficient for our use cases.

Screenshot 2024-10-24 at 9 45 15 AM

Screenshot 2024-10-24 at 9 47 53 AM
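
The series arithmetic above can be checked with a short sketch. The variable names are illustrative; the only inputs are the figures quoted in this comment, and the 3x multipliers are the same assumptions stated there.

```python
# Figures quoted in the comment above.
DEV_SERIES = 2400           # prometheus_tsdb_head_series observed in dev
ENVIRONMENTS = 3            # multiplier used in the "2400 * 3" estimate
FUTURE_GROWTH_FACTOR = 3    # conservative padding for future metrics

current_total_series = DEV_SERIES * ENVIRONMENTS
projected_total_series = current_total_series * FUTURE_GROWTH_FACTOR

MIN_REQUIRED_MB = 180       # estimated minimum from the comment
CONFIGURED_MB = 512         # memory configured for Prometheus in this PR

# How many times the estimated minimum fits in the configured memory.
headroom_ratio = CONFIGURED_MB / MIN_REQUIRED_MB

print(current_total_series, projected_total_series, round(headroom_ratio, 2))
```

With these inputs the configured 512MB is a bit under three times the estimated minimum, which is the padding the comment argues for.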

For Loki, we don't have as nice of a tool and we have to do a little bit of estimation. On my local machine, at rest with no logging or log querying happening, Loki consumes ~100MB of memory. When parsing a 50MB file with logging set to its most verbose setting (the case in dev/staging), and Grafana querying for new logs every 5 seconds (as fast as it will go), Loki's memory consumption peaked at ~250MB. With that in mind, when Loki is receiving logs from all environments, there is a possibility that its memory could increase by a factor of at least 6 (one for each backend app). The likelihood that all backend apps are parsing files at the same time is very low, but there is a good chance that one backend app from each environment could parse a hefty file at the same time. With that said, starting Loki with 1GB of memory would be my recommendation. We should also keep in mind that there are use cases that cause Loki's memory consumption to grow without bound. I was able to reproduce this locally, and it is something for future consideration.
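
One way to read the Loki estimate above, as a rough sketch: assume the ~250MB peak observed for a single parsing workload scales linearly with the number of backend apps parsing at once. Both the linear scaling and the count of three environments (taken from the 2400 * 3 calculation earlier) are assumptions for illustration, not measurements.

```python
PEAK_SINGLE_APP_MB = 250    # local peak while parsing one 50MB file
BACKEND_APPS = 6            # "factor of at least 6" from the comment
ENVIRONMENTS = 3            # assumed: one busy backend app per environment
RECOMMENDED_MB = 1024       # the 1GB starting recommendation

# Assumed linear scaling of the observed peak.
likely_peak_mb = PEAK_SINGLE_APP_MB * ENVIRONMENTS
worst_case_peak_mb = PEAK_SINGLE_APP_MB * BACKEND_APPS

print(likely_peak_mb, worst_case_peak_mb)
print(likely_peak_mb <= RECOMMENDED_MB < worst_case_peak_mb)
```

Under these assumptions the 1GB recommendation covers the likely case (one busy app per environment) but not the unlikely case where every backend app parses a large file at once.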

For Grafana, no matter what I did in the deployed or local environment, its memory consumption never exceeded 150MB. Grafana's documentation also suggests that the minimum memory should be set at 512MB, which is how it is configured in this PR and is my recommended configuration.

Including the Promtail deployments and the Postgres exporters, we should expect an initial total memory requirement for PLG of 2.5GB:
512MB (Prometheus)
512MB (Grafana)
512MB (PG exporters & Promtails)
1GB (Loki)
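
As a quick check of the total, the four allocations above can be summed in a small sketch (the dictionary keys are just labels for the line items listed):

```python
# Per-component allocations from the list above, in MB.
allocations_mb = {
    "prometheus": 512,
    "grafana": 512,
    "pg_exporters_and_promtails": 512,
    "loki": 1024,
}

total_mb = sum(allocations_mb.values())
total_gb = total_mb / 1024

print(total_mb, total_gb)
```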

Because all of these apps are stateless, we can always scale their memory up or down as we learn more and more about their needs with respect to our workloads while they are deployed.

cc. @ADPennington, @andrew-jameson

@elipe17 elipe17 mentioned this pull request Oct 24, 2024
elipe17 and others added 4 commits October 24, 2024 11:20
* pypi cfg for nexus

* Changes for docker/apt install

* new registry in package-lock

* anonymous curl works better, auth fails

* remove kib-dash, made backend-back use bash in task file

* change backend system to buster

* Postgres buster not available. Upgrade everything to bullseye

* Merge thought to intentionally remove prometheus. Fixing pipfiles

* typo

* Cleanup of exploratory cf-check code

* cleanup pt2 cf-check

---------

Co-authored-by: andrew-jameson <ajameson@teamraft.com>
@ADPennington ADPennington removed Blocked Label for Pull Requests that are currently blocked by a dependency OCIO Review labels Nov 1, 2024
Collaborator

@ADPennington ADPennington left a comment

thanks for pairing on review @elipe17 🚀

  • RAM quota updated in org, so we can bump when needed
  • uid is not a secret
  • let's check the zap scan results next week (after this merges) to see if it catches anything related to header settings

@elipe17 elipe17 merged commit f2f91ea into develop Nov 1, 2024
17 checks passed
@elipe17 elipe17 deleted the 3046-plg-cloud branch November 1, 2024 15:18
Development

Successfully merging this pull request may close these issues.

PLG deployed in Cloud.gov
5 participants