7/24/2024 Production Deploy #1211

Merged: 37 commits, Jul 25, 2024
Commits
9ec0ff6
Upgrade terraform-cloudgov, remove recursive_delete in dev module
jskinne3 Jul 18, 2024
56c87e9
handling throttling exception
Jul 19, 2024
875d378
fix ExpiringDict caching solution
Jul 19, 2024
4d24b82
clean up
Jul 19, 2024
aa63003
cleanup
Jul 22, 2024
f61a47a
code review feedback
Jul 22, 2024
968c144
Merge pull request #1197 from GSA/jskinne3-update-terraform-cloudgov-…
ccostino Jul 22, 2024
e97b567
code review feedback
Jul 22, 2024
946b1e9
fix flake8
Jul 22, 2024
f1e0990
remove print statement
Jul 22, 2024
b82e387
Merge pull request #1198 from GSA/notify-api-1140
ccostino Jul 22, 2024
67b43ff
Getting the logging where I think it will be most useful.
xlorepdarkhelm Jul 22, 2024
32500a1
Fixing some things.
xlorepdarkhelm Jul 22, 2024
26c6d39
Merge pull request #1200 from GSA/grrr
ccostino Jul 22, 2024
a98c8f6
Bump cachetools from 5.3.3 to 5.4.0
dependabot[bot] Jul 22, 2024
bc830d5
Merge pull request #1187 from GSA/dependabot/pip/cachetools-5.4.0
ccostino Jul 22, 2024
0e86972
Bump boto3 from 1.34.143 to 1.34.144
dependabot[bot] Jul 22, 2024
fbd34a4
Merge pull request #1188 from GSA/dependabot/pip/boto3-1.34.144
ccostino Jul 22, 2024
1dcc6a7
Bump exceptiongroup from 1.2.1 to 1.2.2
dependabot[bot] Jul 22, 2024
cfb9c60
Merge pull request #1189 from GSA/dependabot/pip/exceptiongroup-1.2.2
ccostino Jul 22, 2024
bce064f
Start troubleshooting section, add errors seen in sandboxing
jskinne3 Jul 22, 2024
3eb3415
Notes about push command, URLs for Sandbox
jskinne3 Jul 22, 2024
2ac0ae7
Update table of contents
jskinne3 Jul 22, 2024
be360cd
Complete link to Troubleshooting section
jskinne3 Jul 22, 2024
0a10738
Merge pull request #1203 from GSA/jskinne3-notes-on-working-sandbox
ccostino Jul 23, 2024
213eee3
Increased API and worker app memory to 4GB
ccostino Jul 23, 2024
bf587aa
set delivery receipt delay to 30 seconds (from 120 seconds)
Jul 23, 2024
03ee2e4
Bump pytest from 8.2.2 to 8.3.1
dependabot[bot] Jul 23, 2024
7cbedfb
Merge pull request #1204 from GSA/increase-prod-memory
ccostino Jul 23, 2024
2df00a1
Merge pull request #1205 from GSA/notify-api-512
ccostino Jul 23, 2024
29ada9a
Merge pull request #1209 from GSA/dependabot/pip/pytest-8.3.1
ccostino Jul 24, 2024
d6b5961
Fixing hopefully
xlorepdarkhelm Jul 24, 2024
e15f892
Update Python runtime version for cloud.gov
ccostino Jul 24, 2024
cfa8c91
Merge pull request #1210 from GSA/update-python-runtime
ccostino Jul 24, 2024
6f9e0cf
Merge pull request #1201 from GSA/admin-1701_Logging_set_up_around_se…
ccostino Jul 24, 2024
76aef32
Bump botocore from 1.34.144 to 1.34.148
dependabot[bot] Jul 24, 2024
9cdd8c3
Merge pull request #1212 from GSA/dependabot/pip/botocore-1.34.148
ccostino Jul 25, 2024
1 change: 1 addition & 0 deletions app/__init__.py
@@ -252,6 +252,7 @@ def register_blueprint(application):


def init_app(app):

    @app.before_request
    def record_request_details():
        g.start = monotonic()
71 changes: 71 additions & 0 deletions app/aws/s3.py
@@ -19,6 +19,77 @@
JOBS_CACHE_MISSES = "JOBS_CACHE_MISSES"


def list_s3_objects():
    bucket_name = current_app.config["CSV_UPLOAD_BUCKET"]["bucket"]
    access_key = current_app.config["CSV_UPLOAD_BUCKET"]["access_key_id"]
    secret_key = current_app.config["CSV_UPLOAD_BUCKET"]["secret_access_key"]
    region = current_app.config["CSV_UPLOAD_BUCKET"]["region"]
    session = Session(
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name=region,
    )
    s3 = session.client("s3")

    try:
        response = s3.list_objects_v2(Bucket=bucket_name)
        while True:
            for obj in response.get("Contents", []):
                yield obj["Key"]
            if "NextContinuationToken" in response:
                response = s3.list_objects_v2(
                    Bucket=bucket_name,
                    ContinuationToken=response["NextContinuationToken"],
                )
            else:
                break
    except Exception as e:
        current_app.logger.error(
            f"An error occurred while regenerating cache #notify-admin-1200 {e}"
        )
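
The continuation-token loop in `list_s3_objects` can be exercised without AWS access. The following is a minimal sketch, not the code from this PR: `FakeS3Client` and `iter_keys` are hypothetical names, and the fake only mimics the `Contents` and `NextContinuationToken` fields that the loop above reads from each `list_objects_v2` response.

```python
from typing import Any, Dict, Iterator, List, Optional


class FakeS3Client:
    """Hypothetical stand-in for an S3 client; serves keys in fixed-size pages."""

    def __init__(self, keys: List[str], page_size: int = 2) -> None:
        self._keys = keys
        self._page_size = page_size

    def list_objects_v2(
        self, Bucket: str, ContinuationToken: Optional[str] = None
    ) -> Dict[str, Any]:
        # The token is simply the index of the next key to serve.
        start = int(ContinuationToken) if ContinuationToken else 0
        end = start + self._page_size
        response: Dict[str, Any] = {
            "Contents": [{"Key": key} for key in self._keys[start:end]]
        }
        if end < len(self._keys):
            response["NextContinuationToken"] = str(end)
        return response


def iter_keys(client: Any, bucket_name: str) -> Iterator[str]:
    """Yield every object key, following continuation tokens until none is returned."""
    response = client.list_objects_v2(Bucket=bucket_name)
    while True:
        for obj in response.get("Contents", []):
            yield obj["Key"]
        token = response.get("NextContinuationToken")
        if token is None:
            break
        response = client.list_objects_v2(
            Bucket=bucket_name, ContinuationToken=token
        )


# Three keys with a page size of two forces one follow-up request.
keys = list(iter_keys(FakeS3Client(["a/1.csv", "a/2.csv", "b/3.csv"]), "demo-bucket"))
```

Because the function is a generator, keys are fetched page by page as the caller iterates rather than being loaded all at once.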


def get_s3_files():
    current_app.logger.info("Regenerate job cache #notify-admin-1200")
    bucket_name = current_app.config["CSV_UPLOAD_BUCKET"]["bucket"]
    access_key = current_app.config["CSV_UPLOAD_BUCKET"]["access_key_id"]
    secret_key = current_app.config["CSV_UPLOAD_BUCKET"]["secret_access_key"]
    region = current_app.config["CSV_UPLOAD_BUCKET"]["region"]
    session = Session(
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name=region,
    )
    objects = list_s3_objects()

    s3res = session.resource("s3", config=AWS_CLIENT_CONFIG)
    current_app.logger.info(
        f"JOBS cache length before regen: {len(JOBS)} #notify-admin-1200"
    )
    for object in objects:
        # We put our csv files in the format "service-{service_id}-notify/{job_id}"
        try:
            object_arr = object.split("/")
            job_id = object_arr[1]  # get the job_id
            job_id = job_id.replace(".csv", "")  # we just want the job_id
            if JOBS.get(job_id) is None:
                object = (
                    s3res.Object(bucket_name, object)
                    .get()["Body"]
                    .read()
                    .decode("utf-8")
                )
                if "phone number" in object.lower():
                    JOBS[job_id] = object
        except LookupError as le:
            # perhaps our key is not formatted as we expected. If so skip it.
            current_app.logger.error(f"LookupError {le} #notify-admin-1200")

    current_app.logger.info(
        f"JOBS cache length after regen: {len(JOBS)} #notify-admin-1200"
    )


def get_s3_file(bucket_name, file_location, access_key, secret_key, region):
    s3_file = get_s3_object(bucket_name, file_location, access_key, secret_key, region)
    return s3_file.get()["Body"].read().decode("utf-8")
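
`get_s3_files` derives the cache key from object keys of the form `service-{service_id}-notify/{job_id}`. As a rough sketch of that parsing step in isolation (the helper name `job_id_from_key` is hypothetical and not part of this PR), the version below returns `None` for malformed keys instead of relying on a `LookupError`, and strips only a trailing `.csv` rather than every occurrence as `str.replace` does:

```python
from typing import Optional


def job_id_from_key(key: str) -> Optional[str]:
    """Extract the job id from a 'service-{service_id}-notify/{job_id}.csv' key.

    Returns None when the key has no non-empty second path segment.
    """
    parts = key.split("/")
    if len(parts) < 2 or not parts[1]:
        return None
    job_id = parts[1]
    # Remove the extension only when it is a suffix of the segment.
    if job_id.endswith(".csv"):
        job_id = job_id[: -len(".csv")]
    return job_id


example = job_id_from_key("service-abc-notify/1234.csv")
```

Whether a `None` return or the existing exception handling is preferable depends on how much logging is wanted for unexpected keys.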
18 changes: 17 additions & 1 deletion app/celery/provider_tasks.py
@@ -2,6 +2,7 @@
import os
from datetime import timedelta

from botocore.exceptions import ClientError
from flask import current_app
from sqlalchemy.orm.exc import NoResultFound

@@ -22,7 +23,7 @@

# This is the amount of time to wait after sending an sms message before we check the aws logs and look for delivery
# receipts
-DELIVERY_RECEIPT_DELAY_IN_SECONDS = 120
+DELIVERY_RECEIPT_DELAY_IN_SECONDS = 30


@notify_celery.task(
@@ -62,6 +63,21 @@ def check_sms_delivery_receipt(self, message_id, notification_id, sent_at):
            provider_response=provider_response,
        )
        raise self.retry(exc=ntfe)
    except ClientError as err:
        # Probably a ThrottlingException but could be something else
        error_code = err.response["Error"]["Code"]
        provider_response = (
            f"{error_code} while checking sms receipt -- still looking"
        )
        status = "pending"
        carrier = ""
        update_notification_status_by_id(
            notification_id,
            status,
            carrier=carrier,
            provider_response=provider_response,
        )
        raise self.retry(exc=err)

    if status == "success":
        status = NotificationStatus.DELIVERED
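
The new `except ClientError` branch reads the error code from `err.response["Error"]["Code"]`, which is the structure botocore attaches to `ClientError`. The sketch below shows that lookup in isolation so it can run without botocore installed: `FakeClientError` and `describe_receipt_error` are hypothetical names, and the `"Unknown"` fallback is an addition for illustration, not behavior of this PR.

```python
class FakeClientError(Exception):
    """Hypothetical stand-in that mirrors the `.response` dict on botocore's ClientError."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.response = {"Error": {"Code": code, "Message": message}}


def describe_receipt_error(err: Exception) -> str:
    """Build the provider_response text, tolerating a missing error code."""
    response = getattr(err, "response", {}) or {}
    error_code = response.get("Error", {}).get("Code", "Unknown")
    return f"{error_code} while checking sms receipt -- still looking"


throttled = describe_receipt_error(FakeClientError("ThrottlingException", "Rate exceeded"))
```

Recording the code in `provider_response` keeps throttling distinguishable from other AWS errors when reading notification records later.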
5 changes: 5 additions & 0 deletions app/celery/tasks.py
@@ -441,6 +441,11 @@ def send_inbound_sms_to_service(self, inbound_sms_id, service_id):
)


@notify_celery.task(name="regenerate-job-cache")
def regenerate_job_cache():
    s3.get_s3_files()


@notify_celery.task(name="process-incomplete-jobs")
def process_incomplete_jobs(job_ids):
    jobs = [dao_get_job_by_id(job_id) for job_id in job_ids]
5 changes: 5 additions & 0 deletions app/config.py
@@ -249,6 +249,11 @@ class Config(object):
                "schedule": crontab(hour=6, minute=0),
                "options": {"queue": QueueNames.PERIODIC},
            },
            "regenerate-job-cache": {
                "task": "regenerate-job-cache",
                "schedule": crontab(minute="*/30"),
                "options": {"queue": QueueNames.PERIODIC},
            },
            "cleanup-unfinished-jobs": {
                "task": "cleanup-unfinished-jobs",
                "schedule": crontab(hour=4, minute=5),
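
For the `crontab(minute="*/30")` entry added to `app/config.py`: a step expression of this form matches the minutes of the hour that are divisible by the step, so with the hour left at its default the task is queued at minute 0 and minute 30 of every hour. The snippet below is a plain-Python sketch of that arithmetic, not Celery's own parser, and `minutes_matching_step` is a hypothetical helper.

```python
from typing import List


def minutes_matching_step(step: int) -> List[int]:
    """Return the minutes of an hour (0-59) selected by a '*/step' cron field."""
    return [minute for minute in range(60) if minute % step == 0]


fire_minutes = minutes_matching_step(30)
# Every hour of the day is eligible, so multiply by 24.
runs_per_day = len(fire_minutes) * 24
```

Each run lists the whole upload bucket, so the interval trades cache freshness against S3 request volume.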
2 changes: 2 additions & 0 deletions app/service/rest.py
@@ -201,6 +201,8 @@ def get_service_by_id(service_id):
    fetched = dao_fetch_service_by_id(service_id)

    data = service_schema.dump(fetched)

    current_app.logger.info(f'>> SERVICE: {data["id"]}; {data}')
    return jsonify(data=data)


4 changes: 2 additions & 2 deletions deploy-config/production.yml
@@ -1,8 +1,8 @@
env: production
web_instances: 2
-web_memory: 2G
+web_memory: 4G
worker_instances: 1
-worker_memory: 2G
+worker_memory: 4G
scheduler_memory: 256M
public_api_route: notify-api.app.cloud.gov
admin_base_url: https://beta.notify.gov
49 changes: 41 additions & 8 deletions docs/all.md
@@ -60,9 +60,13 @@
- [Data Storage Policies \& Procedures](#data-storage-policies--procedures)
- [Potential PII Locations](#potential-pii-locations)
- [Data Retention Policy](#data-retention-policy)
-- [Debug messages not being sent](#debug-messages-not-being-sent)
-  - [Getting the file location and tracing what happens](#getting-the-file-location-and-tracing-what-happens)
-  - [Viewing the csv file](#viewing-the-csv-file)
+- [Troubleshooting](#troubleshooting)
+  - [Debug messages not being sent](#debug-messages-not-being-sent)
+    - [Getting the file location and tracing what happens](#getting-the-file-location-and-tracing-what-happens)
+    - [Viewing the csv file](#viewing-the-csv-file)
+  - [Deployment / app push problems](#deployment--app-push-problems)
+    - [Routes cannot be mapped to destinations in different spaces](#routes-cannot-be-mapped-to-destinations-in-different-spaces)
+    - [API request failed](#api-request-failed)


# Infrastructure overview
@@ -449,7 +453,10 @@ If this is the first time you have used Terraform in this repository, you will f
```
cf push --vars-file deploy-config/sandbox.yml --var NEW_RELIC_LICENSE_KEY=$NEW_RELIC_LICENSE_KEY
```

The real `push` command has more var arguments than the single one above. Get their values from a Notify team member.
1. Visit the URL of the app you just deployed
   * Admin https://notify-sandbox.app.cloud.gov/
   * API https://notify-api-sandbox.app.cloud.gov/

# Database management

@@ -1327,11 +1334,12 @@ Seven (7) days by default. Each service can be set with a custom policy via `Ser

Data cleanup is controlled by several tasks in the `nightly_tasks.py` file, kicked off by Celery Beat.

+# Troubleshooting

-# Debug messages not being sent
+## Debug messages not being sent


-## Getting the file location and tracing what happens
+### Getting the file location and tracing what happens


Ask the user to provide the csv file name. Either the csv file they uploaded, or the one that is autogenerated when they do a one-off send and is visible in the UI
@@ -1340,7 +1348,7 @@ Starting with the admin logs, search for this file name. When you find it, the

In the api logs, search by job_id. Either you will see evidence of the job failing and retrying over and over (in which case search for a stack trace using timestamp), or you will ultimately get to a log line that links the job_id to a message_id. In this case, now search by message_id. You should be able to find the actual result from AWS, either success or failure, with hopefully some helpful info.

-## Viewing the csv file
+### Viewing the csv file

If you need to view the questionable csv file on production, run the following command:

@@ -1355,11 +1363,36 @@ locally, just do:
poetry run flask command download-csv-file-by-name -f <file location in admin logs>
```

-## Debug steps
+### Debug steps

1. Either send a message and capture the csv file name, or get a csv file name from a user
2. Using the log tool at logs.fr.cloud.gov, use filters to limit what you're searching on (cf.app is 'notify-admin-production' for example) and then search with the csv file name in double quotes over the relevant time period (last 5 minutes if you just sent a message, or else whatever time the user sent at)
3. When you find the log line, you should also find the job_id and the s3 file location. Save these somewhere.
4. To get the csv file contents, you can run the command above. This command currently prints to the notify-api log, so after you run the command,
you need to search in notify-api-production for the last 5 minutes with the logs sorted by timestamp. The contents of the csv file unfortunately appear on separate lines so it's very important to sort by time.
5. If you want to see where the message actually failed, search with cf.app is notify-api-production using the job_id that you saved in step #3. If you get far enough, you might see one of the log lines has a message_id. If you see it, you can switch and search on that, which should tell you what happened in AWS (success or failure).
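
Step 4 above notes that the csv contents arrive as separate log lines and must be sorted by time. If the matching entries are exported from the log tool, reassembly could look like the sketch below. The entry shape is an assumption made for illustration: the `timestamp` and `message` field names, the ISO 8601 timestamp format, and the sample phone numbers are hypothetical and may not match what logs.fr.cloud.gov actually exports.

```python
from datetime import datetime
from typing import Dict, List


def reassemble_csv(entries: List[Dict[str, str]]) -> str:
    """Join log messages into one text block, ordered by their timestamps."""
    ordered = sorted(
        entries, key=lambda entry: datetime.fromisoformat(entry["timestamp"])
    )
    return "\n".join(entry["message"] for entry in ordered)


# Hypothetical export, deliberately out of order.
sample_entries = [
    {"timestamp": "2024-07-22T15:04:05.000300", "message": "15555550101"},
    {"timestamp": "2024-07-22T15:04:05.000100", "message": "phone number"},
    {"timestamp": "2024-07-22T15:04:05.000200", "message": "15555550100"},
]
csv_text = reassemble_csv(sample_entries)
```

Entries with identical timestamps keep their input order, because Python's sort is stable.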

## Deployment / app push problems

### Routes cannot be mapped to destinations in different spaces

During `cf push` you may see

```
For application 'notify-api-sandbox': Routes cannot be mapped to destinations in different spaces
```

:ghost: This indicates a ghost route squatting on a route you need to create. In the cloud.gov web interface, check for incomplete deployments. They might be holding on to a route. Delete them. Also, check the list of routes (from the CloudFoundry icon in the left sidebar) for routes without an associated app. If they look like a route your app would need to create, delete them.

### API request failed

After pushing the Admin app, you might see this in the logs

```
{"name": "app", "levelname": "ERROR", "message": "API unknown failed with status 503 message Request failed", "pathname": "/home/vcap/app/app/__init__.py", ...
```

This indicates that the Admin and API apps are unable to talk to each other because of either a missing route or a missing network policy. The apps require [container-to-container networking](https://cloud.gov/docs/management/container-to-container/) to communicate. List `cf network-policies` and compare the output to our other deployed envs. If you find a policy is missing, you might have to create a network policy with something like:
```
cf add-network-policy notify-admin-sandbox notify-api-sandbox --protocol tcp --port 61443
```
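
When scanning Admin logs for the 503 error described above, the JSON log lines can be filtered by their `levelname` field. This is a small sketch under the assumption that each log line is a complete JSON object with the `levelname` and `message` keys visible in the truncated example; `error_messages` is a hypothetical helper, and lines that are not valid JSON are skipped.

```python
import json
from typing import Iterable, List


def error_messages(log_lines: Iterable[str]) -> List[str]:
    """Collect the `message` of every JSON log line whose level is ERROR."""
    messages = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # Not a JSON log line; ignore it.
        if isinstance(record, dict) and record.get("levelname") == "ERROR":
            messages.append(record.get("message", ""))
    return messages


sample_lines = [
    '{"name": "app", "levelname": "ERROR", "message": "API unknown failed with status 503 message Request failed"}',
    '{"name": "app", "levelname": "INFO", "message": "ok"}',
    "plain text line",
]
found = error_messages(sample_lines)
```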
44 changes: 22 additions & 22 deletions poetry.lock
