PLG Cloud.gov #3192

elipe17 · 2024-09-18T19:40:08Z

Summary of Changes

Add manifests to deploy PLG as binaries via binary buildpack
Add local vs deployed configurations for PLG apps
Added routing based on dev environment for the interim
Added deploy script for PLG and PG exporters
Fixed local Nginx proxying to Grafana

Pull request closes #3046

Considerations:

Should we move all PLG apps into the production env?
We need to actually calculate memory requirements for PLG. My assumption is Loki: 2-4GB, Prometheus: 2-4GB (calculator), Grafana: 1GB, 3 PG exporters: 3 * 24MB, 6 Backend Promtails: 6 * 64MB. [1]
~~We can/should hook Grafana up to our RDS to have a more persistent set of datasources, dashboards, etc...~~
Need to play around with cloud.gov provisioned dashboards. I couldn't get them to work, so I uploaded the dashboards manually.
Need to re-evaluate getting promtail to run with frontend
Need to update/add our Promtail pipelines to parse log messages better so we can filter our logs dashboard more effectively [3]
Consider moving deployment of all apps to docker containers to make life with PLG and frontend/backend deployment a bit easier.
~~Need to verify logs in Loki are queryable from days, weeks, months, etc ago~~
While TDP lives in cloud.gov, it might be worth seeing if we can set up a log drain through cloud.gov to Loki. Then we would get all of the cloud.gov logs as well.
Consider adding metric exporting for Elastic
Consider adding metric exporting for machine stats (cpu, memory, etc) [5]
Consider using Grafana Alerting vs AlertManager [2]
Prometheus does not have persistent storage. Need to integrate with something like Mimir to support S3 based metric storage. [4]
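
For reference, the initial per-app guesses in the list above can be totaled with a few lines of Python. This is only a sketch: it assumes binary units (1GB = 1024MB), and the dictionary keys are illustrative labels, not app names from the manifests.

```python
# Rough totals for the initial memory guesses in the considerations list.
MB_PER_GB = 1024  # assumption: binary units

# (low, high) guesses per component, in MB, copied from the list above.
estimates_mb = {
    "loki": (2 * MB_PER_GB, 4 * MB_PER_GB),
    "prometheus": (2 * MB_PER_GB, 4 * MB_PER_GB),
    "grafana": (1 * MB_PER_GB, 1 * MB_PER_GB),
    "pg_exporters": (3 * 24, 3 * 24),
    "backend_promtails": (6 * 64, 6 * 64),
}


def total_range_mb(estimates):
    """Sum the low and high guesses across all components."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high


low_mb, high_mb = total_range_mb(estimates_mb)
print(f"{low_mb / MB_PER_GB:.2f} GB to {high_mb / MB_PER_GB:.2f} GB")
```

These are the pre-measurement guesses only, so the range should be read as a starting point rather than a requirement.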

How to Test

PLG is deployed in the dev environment for the moment. If you would like to browse Grafana, I have opened a public route to it for the interim until we decide where PLG is going to live. Reach out to me for the username and password. Once you're logged in, feel free to browse the dashboards. Note: the Logs dashboard only has logging information as far back as 09/25/2024 at ~9:40am ET, since that is when Promtail had its first successful exports.

Deliverables

More details on how deliverables herein are assessed are included here.

Deliverable 1: Accepted Features

Checklist of ACs:

Prometheus, Loki, and Grafana are deployed in cloud.gov
Grafana is connected to Prometheus and Loki and can query the logs from Loki
Backend apps can push logs to Loki
Prometheus can pull metrics from backend apps and postgres exporters
Loki has persistent log storage via S3
Testing Checklist has been run and all tests pass
README is updated, if necessary

Deliverable 2: Tested Code

Are all areas of code introduced in this PR meaningfully tested?
- If this PR introduces backend code changes, are they meaningfully tested?
- If this PR introduces frontend code changes, are they meaningfully tested?
Are code coverage minimums met?
- Frontend coverage: [insert coverage %] (see CodeCov Report comment in PR)
- Backend coverage: [insert coverage %] (see CodeCov Report comment in PR)

Deliverable 3: Properly Styled Code

Are backend code style checks passing on CircleCI?
Are frontend code style checks passing on CircleCI?
Are code maintainability principles being followed?

Deliverable 4: Accessible

Does this PR complete the epic?
Are links included to any other gov-approved PRs associated with epic?
Does PR include documentation for Raft's a11y review?
Did automated and manual testing with iamjolly and ttran-hub using Accessibility Insights reveal any errors introduced in this PR?

Deliverable 5: Deployed

Was the code successfully deployed via automated CircleCI process to development on Cloud.gov?

Deliverable 6: Documented

Does this PR provide background for why coding decisions were made?
If this PR introduces backend code, is that code easy to understand and sufficiently documented, both inline and overall?
If this PR introduces frontend code, is that code easy to understand and sufficiently documented, both inline and overall?
If this PR introduces dependencies, are their licenses documented?
Can reviewer explain and take ownership of these elements presented in this code review?

Deliverable 7: Secure

Does the OWASP Scan pass on CircleCI?
Do manual code review and manual testing detect any new security issues?
If new issues detected, is investigation and/or remediation plan documented?

Deliverable 8: User Research

Research product(s) clearly articulate(s):

the purpose of the research
methods used to conduct the research
who participated in the research
what was tested and how
impact of research on TDP
(if applicable) final design mockups produced for TDP development

codecov · 2024-09-18T19:48:33Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.52%. Comparing base (9bd3deb) to head (ce8df47).
Report is 101 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #3192      +/-   ##
===========================================
+ Coverage    90.65%   91.52%   +0.86%     
===========================================
  Files          299      297       -2     
  Lines         8490     8415      -75     
  Branches       794      608     -186     
===========================================
+ Hits          7697     7702       +5     
+ Misses         676      603      -73     
+ Partials       117      110       -7

Flag	Coverage Δ
dev-backend	`91.37% <ø> (+0.98%)`	⬆️
dev-frontend	`92.66% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
tdrs-backend/tdpservice/settings/common.py	`99.33% <ø> (ø)`

... and 12 files with indirect coverage changes

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cb51a68...ce8df47. Read the comment docs.

andrew-jameson · 2024-10-14T13:30:42Z

tdrs-backend/plg/grafana/manifest.yml

+      export GF_DATABASE_NAME=grafana
+      export GF_DATABASE_USER=$(echo $VCAP_SERVICES | jq -r '."aws-rds"[0].credentials.username')
+      export GF_DATABASE_PASSWORD=$(echo $VCAP_SERVICES | jq -r '."aws-rds"[0].credentials.password')
+      wget https://dl.grafana.com/oss/release/grafana-11.2.0.linux-amd64.tar.gz


Created #3232 for this.
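
For readers less familiar with `jq`, the credential lookups in the manifest lines above amount to reading the first `aws-rds` entry out of the `VCAP_SERVICES` JSON. A minimal Python sketch of the same lookup follows; the function name and the sample payload are hypothetical, and only the `aws-rds[0].credentials.username/password` path is taken from the manifest.

```python
import json


def rds_credentials(vcap_services_json):
    """Return (username, password) from the first bound aws-rds service.

    Mirrors the jq path used in the manifest:
    ."aws-rds"[0].credentials.username / .password
    """
    services = json.loads(vcap_services_json)
    credentials = services["aws-rds"][0]["credentials"]
    return credentials["username"], credentials["password"]


# Hypothetical sample payload; on the platform this comes from $VCAP_SERVICES.
sample_vcap = json.dumps(
    {
        "aws-rds": [
            {"credentials": {"username": "example_user", "password": "example_pass"}}
        ]
    }
)

user, password = rds_credentials(sample_vcap)
print(user)
```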

ADPennington · 2024-10-16T14:35:48Z

@elipe17 can you remind me: is there a relationship between this PR and #3199? I see some file changes here that bump-up RAM resources, so I'd like to have OCIO approval before we increase costs if possible. reference to our questions

elipe17 · 2024-10-16T15:09:05Z

> @elipe17 can you remind me: is there a relationship between this PR and #3199? I see some file changes here that bump-up RAM resources, so I'd like to have OCIO approval before we increase costs if possible. reference to our questions

@ADPennington, there is a relationship between this and the ADR. The ADR provides a brief "why" with respect to the introduction of PLG and where it lives in the boundary diagram. The ADR does not cover the cost with respect to RAM for PLG; that could definitely be added. There is no increase in RAM allocation for existing apps (frontend/backend/etc). However, I did provide general "guesstimates" of the RAM needed for PLG to run smoothly while monitoring all of our deployments. That number is an aggregate needed for PLG in its entirety; it is not required per space or anything like that. As an absolute maximum we would be looking at 10GB of memory, and 3.5GB on the lowest end. I am pretty confident we will skew towards the low end. I just need to do some math and better estimation to figure it out.

elipe17 · 2024-10-16T15:26:45Z

> @elipe17 can you remind me: is there a relationship between this PR and #3199? I see some file changes here that bump-up RAM resources, so I'd like to have OCIO approval before we increase costs if possible. reference to our questions

@ADPennington I also wanted to add that merging this won't incur any new cost. Everything is already deployed in the dev space with the resources we currently have. The cost implications will come into question when 3222 is worked.

raftmsohani

LGTM!

elipe17 · 2024-10-24T14:53:29Z

Below are my true calculations for the required memory to deploy PLG in production.

Prometheus is scraping about 2400 series from our dev environment. You can determine this by logging into Grafana and querying the `prometheus_tsdb_head_series` metric. Some simple math says 2400 * 3 is about 7200 series in total for Prometheus to manage with our current configuration across all deployed environments. The dev team would like to (in the future) track additional series for other monitoring and alerting solutions (CPU, memory, Elastic usage, etc...). Looking ahead, it would make sense to pad Prometheus' memory in preparation for those potential additions. We should expect to have ~2 or 3 times as many series when those are added; to be safe I assume 3 times as many, or 21,600 series across all environments. Using the tool provided here I calculated that Prometheus will need ~100MB of memory to manage all of our current and future series (see screenshot below). From there we add the memory that the dev deployed Prometheus app is currently using (see screenshot below) and see that the app will require a minimum of 180MB of memory. We should pad this value for safety and availability. Thus, the current configuration of Prometheus with 512MB of memory should be sufficient for our use cases.
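
The series projection above can be reproduced with a short Python sketch. It reads the series count from a Prometheus instant-query response (the `/api/v1/query` endpoint of the Prometheus HTTP API) and applies the two multipliers from this comment. The function names, the `http://localhost:9090` base URL, and the sample response body are assumptions for illustration; they are not part of this PR.

```python
import json
from urllib.parse import urlencode


def head_series_query_url(base_url):
    """Build an instant-query URL for the prometheus_tsdb_head_series metric."""
    return f"{base_url}/api/v1/query?" + urlencode(
        {"query": "prometheus_tsdb_head_series"}
    )


def head_series_from_response(body):
    """Extract the series count from an instant-query JSON response.

    Instant vectors carry each sample as [timestamp, "value-as-string"].
    """
    payload = json.loads(body)
    return int(float(payload["data"]["result"][0]["value"][1]))


def projected_series(per_env_series, multiplier=3, growth_factor=3):
    """Apply the comment's math: per-environment count * 3, then * 3 for growth."""
    return per_env_series * multiplier * growth_factor


# Hypothetical response body shaped like a Prometheus instant-query result.
sample_body = json.dumps(
    {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {
                    "metric": {"__name__": "prometheus_tsdb_head_series"},
                    "value": [1729781609.0, "2400"],
                }
            ],
        },
    }
)

dev_series = head_series_from_response(sample_body)
print(head_series_query_url("http://localhost:9090"))
print(projected_series(dev_series, growth_factor=1), projected_series(dev_series))
```

The resulting series count is what gets fed into the memory calculator; the calculator's own formula is not reproduced here.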

For Loki, we don't have as nice a tool, so we have to do a little bit of estimation. On my local machine, at rest with no logging or log querying happening, Loki consumes ~100MB of memory. When parsing a 50MB file with logging set to its most verbose setting (the case in dev/staging), and Grafana querying for new logs every 5 seconds (as fast as it will go), Loki's memory consumption peaked at ~250MB. With that in mind, when Loki is receiving logs from all environments, there is a possibility that its memory could increase by a factor of at least 6 (one for each backend app). The likelihood that all backend apps are parsing files at the same time is very low, but there is a good chance that one backend app from each environment could parse a hefty file at the same time. With that said, starting Loki with 1GB of memory would be my recommendation. We should also keep in mind that there are use cases that cause Loki's memory consumption to grow without bound. I was able to reproduce this locally, and it is something for future consideration.
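
To make the Loki estimate easier to play with, here is a small sketch that scales the two local measurements above. It assumes a simple linear model in which every concurrent file parse adds the same overhead as the single measured parse; that model is my simplification for illustration, not something measured in the deployed environment.

```python
IDLE_MB = 100  # Loki at rest, from the local measurement above
PEAK_MB = 250  # peak while one 50MB file was parsed with verbose logging


def loki_memory_estimate_mb(concurrent_parses, idle_mb=IDLE_MB, peak_mb=PEAK_MB):
    """Assumed linear model: each concurrent parse adds (peak - idle) MB."""
    if concurrent_parses < 0:
        raise ValueError("concurrent_parses must be non-negative")
    return idle_mb + concurrent_parses * (peak_mb - idle_mb)


# One parse, one parse per environment, and all six backend apps at once.
for parses in (1, 3, 6):
    print(parses, loki_memory_estimate_mb(parses))
```

Under this assumed model, even the unlikely case of all six backend apps parsing at once stays at or below the 1GB starting point.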

For Grafana, no matter what I did in the deployed or local environment, its memory consumption never exceeded 150MB. Grafana's documentation also suggests that the minimum memory should be set at 512MB, which is how it is configured in this PR and is my recommended configuration.

Including the Promtail deployments and the Postgres exporters, we should expect an initial total memory requirement for PLG of 2.5GB:
512MB (Prometheus)
512MB (Grafana)
512MB (PG exporters & Promtails)
1GB (Loki)
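
As a quick check of the total, the four allocations above sum as follows (a sketch assuming 1GB = 1024MB; the dictionary keys are illustrative labels only):

```python
# Recommended starting allocations from the list above, in MB.
allocations_mb = {
    "prometheus": 512,
    "grafana": 512,
    "pg_exporters_and_promtails": 512,
    "loki": 1024,
}

total_mb = sum(allocations_mb.values())
total_gb = total_mb / 1024  # assumption: binary units
print(f"{total_mb} MB = {total_gb} GB")
```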

Because all of these apps are stateless, we can always scale their memory up or down as we learn more and more about their needs with respect to our workloads while they are deployed.

cc. @ADPennington, @andrew-jameson

* pypi cfg for nexus
* Changes for docker/apt install
* new registry in package-lock
* anonymous curl works better, auth fails
* remove kib-dash, made backend-back use bash in task file
* change backend system to buster
* Postgres buster not available. Upgrade everything to bullseye
* Merge thought to intentionally remove prometheus. Fixing pipfiles
* typo
* Cleanup of exploratory cf-check code
* cleanup pt2 cf-check

Co-authored-by: andrew-jameson <ajameson@teamraft.com>

ADPennington

thanks for pairing on review @elipe17 🚀

RAM quota updated in org, so we can bump when needed
uid is not a secret
let's check the zap scan results next week (after this merges) to see if it catches anything related to header settings

elipe17 added 8 commits September 16, 2024 11:20

- Add log file rollover

fd4e974

- Add initial configs for cloud deployments

feec150

- initial config for grafana deploy

8805b94

- Remove empty file

cd4b739

- move data sources to template file

c935ce0

- general deploy routine for pg and grafana

b7bd9e4

- added deploy routine for prometheus

092df96

- Added deploy routine for loki

1440b7f

elipe17 self-assigned this Sep 18, 2024

elipe17 added 5 commits September 19, 2024 08:55

- Initial update for promtail sidecars

c6edd4a

- allow deploy no matter test state

19b48a1

- Update deploy scripts to prepare promtail config

0d2518a

- add quotes

e495aac

- Update frontend to write error log to file

8388f40

- UPdate path for promtail

elipe17 added the Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI label Sep 19, 2024

-- for faster turnaround

57de9d4

elipe17 added Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI and removed Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI labels Sep 19, 2024

elipe17 marked this pull request as ready for review September 19, 2024 14:38

elipe17 added 8 commits September 19, 2024 11:18

- add ignore for file generation

3da3526

- update volume mount - update promtail to scrape new location - update backend log file location

- Move limits to per process

4797ced

- Remove docker container scrape config

- update disk quota to match backend

6b1eedd

- Uping promtail memory

30ca7fc

- Explicitely execute nginx

9926c4d

- Testing less memory

79d578d

- Testing running bogus command for nginx

- Tell nginx to reload

5bb32d2

- try removing nginx command

2316312

elipe17 added the QASP Review label Oct 9, 2024

elipe17 requested a review from ADPennington October 9, 2024 20:06

Merge branch 'develop' into 3046-plg-cloud

bb44fb9

andrew-jameson reviewed Oct 14, 2024

Merge branch 'develop' into 3046-plg-cloud

10b2b9c

ADPennington removed the Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI label Oct 16, 2024

ADPennington added the Blocked Label for Pull Requests that are currently blocked by a dependency label Oct 17, 2024

raftmsohani approved these changes Oct 18, 2024

elipe17 added 2 commits October 18, 2024 12:44

Merge branch 'develop' of https://github.com/raft-tech/TANF-app into …

37a56ea

…3046-plg-cloud

Merge branch '3046-plg-cloud' of https://github.com/raft-tech/TANF-app …

32c07df

…into 3046-plg-cloud

ADPennington added the OCIO Review label Oct 18, 2024

elipe17 mentioned this pull request Oct 24, 2024

Doc/3199 monitoring adr #3210

Merged

elipe17 and others added 4 commits October 24, 2024 11:20

Merge branch 'develop' into 3046-plg-cloud

ce65bec

Merge branch 'develop' into 3046-plg-cloud

f8cf20a

Merge branch 'develop' into 3046-plg-cloud

4fec7f2

elipe17 force-pushed the 3046-plg-cloud branch from 1414394 to 4fec7f2 Compare October 30, 2024 20:32

Merge branch 'develop' into 3046-plg-cloud

ce8df47

elipe17 mentioned this pull request Oct 31, 2024

Investigate and/or implement Promtail deployment with frontend for centralized logging #3256

Open

7 tasks

ADPennington removed Blocked Label for Pull Requests that are currently blocked by a dependency OCIO Review labels Nov 1, 2024

ADPennington approved these changes Nov 1, 2024

ADPennington added Ready to Merge and removed QASP Review labels Nov 1, 2024

elipe17 merged commit f2f91ea into develop Nov 1, 2024
17 checks passed

elipe17 deleted the 3046-plg-cloud branch November 1, 2024 15:18
