Send Slack alerts when found modified legal case (ADR/AF/MUR) #4957

Merged
lbeaufort merged 2 commits into develop from feature/error_handle
Sep 30, 2021

@fec-jli fec-jli commented Sep 26, 2021

Summary

While reloading modified legal cases (MUR/AF/ADR), we are experiencing two kinds of issues.

Ref tickets:

#4364
#4361

Changes included in this PR

  • New cron task that sends Slack alerts once a day when there are modified legal cases (MUR/AF/ADR) in production.
  • Add 'local' to APP_NAME when testing locally.
  • Add comments to each cron task.
  • Clean up some code and change single-quoted string values to double-quoted.

Required reviewers

One backend developer is OK; two are better.

Impacted areas of the application

General components of the application that this PR will affect:
Celery-worker cron task

How to test

Option 1: Local test (hard to set up locally, but makes the cron job easy to test):

        "send_alert_legal_case": {
            "task": "webservices.tasks.legal_docs.send_alert_most_recent_legal_case",
            "schedule": crontab(minute="*/2"),
        },
  • Modify the query RECENTLY_MODIFIED_CASES_SEND_ALERT in tasks/legal_docs.py (line 41):
    change the 'n hour' interval so that the query returns some modified cases.
SELECT case_no, case_type, pg_date, published_flg
    FROM fecmur.cases_with_parsed_case_serial_numbers_vw
    WHERE pg_date >= NOW() - 'n hour'::INTERVAL
    AND case_type='MUR'
    ORDER BY case_serial
  • Start celery-beat and celery-worker, then check for the message in the Slack #test-bot channel.
    It should look something like this:
...
ADR 945 found published at 2021-09-22 10:32:43.113921
ADR 1001 found published at 2021-09-16 09:39:32.261229
AF 2470 found published at 2021-09-16 15:00:58.298312
AF 3357 found published at 2021-09-16 15:01:05.270154
MUR 7220 found published at 2021-09-24 15:39:58.709447
MUR 7647 found published at 2021-09-24 09:53:37.252871
in fec | api | local
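
As a rough illustration of how an alert like the one above could be assembled, here is a minimal sketch. It is not the code from this PR: `ModifiedCase`, `build_alert_message`, and the "unpublished" wording are hypothetical names and choices made for this example, and only the line format and the `fec | api | <space>` footer are taken from the sample output.

```python
from datetime import datetime
from typing import Iterable, NamedTuple


class ModifiedCase(NamedTuple):
    """One row of the modified-cases query (hypothetical container)."""
    case_type: str       # "ADR", "AF" or "MUR"
    case_no: str
    pg_date: datetime    # last modified timestamp
    published_flg: bool


def build_alert_message(cases: Iterable[ModifiedCase], space: str,
                        org: str = "fec", app: str = "api") -> str:
    """Format one line per case, followed by an 'in org | app | space' footer.

    Returns an empty string when there are no cases, so the caller can
    skip sending an alert. The 'unpublished' label is an assumption.
    """
    lines = []
    for case in cases:
        status = "published" if case.published_flg else "unpublished"
        lines.append(
            f"{case.case_type} {case.case_no} found {status} at {case.pg_date}"
        )
    if not lines:
        return ""
    lines.append(f"in {org} | {app} | {space}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        ModifiedCase("ADR", "945", datetime(2021, 9, 22, 10, 32, 43, 113921), True),
        ModifiedCase("MUR", "7220", datetime(2021, 9, 24, 15, 39, 58, 709447), True),
    ]
    print(build_alert_message(sample, space="local"))
```

Posting the resulting string to Slack is left out on purpose, since the helper used for that in the repository is not shown in this conversation.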

Note: if you haven't set up Elasticsearch locally, you can comment out this task schedule to avoid some error messages.

        "refresh_legal_docs": {
            "task": "webservices.tasks.legal_docs.refresh_most_recent_legal_doc",
            "schedule": crontab(minute="*/5", hour="10-23"),
        },

Option 2: Deploy on dev:

  • Create your own branch and modify the query RECENTLY_MODIFIED_CASES_SEND_ALERT in tasks/legal_docs.py (line 41): change the 'n hour' interval so that the query returns some modified cases.
  • In tasks/legal_docs.py, line 41, set SLACK_BOTS = "#test-bot"
  • In tasks/__init__.py, change the schedule of the send_alert_legal_case task to "*/2"
  • Deploy your test branch to dev.
  • You should see the alert message in the Slack channel.
    It should look something like this:
...
ADR 945 found published at 2021-09-22 10:32:43.113921
ADR 1001 found published at 2021-09-16 09:39:32.261229
AF 2470 found published at 2021-09-16 15:00:58.298312
AF 3357 found published at 2021-09-16 15:01:05.270154
MUR 7220 found published at 2021-09-24 15:39:58.709447
MUR 7647 found published at 2021-09-24 09:53:37.252871
in fec | api | dev

Task Schedule Management Diagram:
https://docs.google.com/drawings/d/1RjDRBGRzi6iZOqTSGTgyEgWPmOk5g5HWKAYOmrdD8GU/edit

Send message to slack when found modified cases.
Change string value with single quote to double quote
@fec-jli fec-jli changed the title from "[WIP]Add local to APP_NAME,Add comment to cron task." to "[WIP]Send Slack alerts when found modified legal case (ADR/AF/MUR)" Sep 26, 2021
codecov-commenter commented Sep 26, 2021

Codecov Report

Merging #4957 (1af499b) into develop (4a92a43) will decrease coverage by 0.14%.
The diff coverage is 5.88%.

@@             Coverage Diff             @@
##           develop    #4957      +/-   ##
===========================================
- Coverage    85.97%   85.83%   -0.15%     
===========================================
  Files           81       81              
  Lines         7582     7597      +15     
===========================================
+ Hits          6519     6521       +2     
- Misses        1063     1076      +13     
Impacted Files                     Coverage Δ
webservices/tasks/legal_docs.py     0.00% <0.00%> (ø)
webservices/tasks/refresh.py        0.00% <0.00%> (ø)
webservices/tasks/utils.py         75.00% <0.00%> (ø)
webservices/tasks/__init__.py      69.23% <60.00%> (ø)
webservices/rest.py                91.08% <0.00%> (+0.55%) ⬆️

fec-jli commented Sep 26, 2021

Mini postmortem: missing MUR_7713 issue (15:15, 9/16/2021)

  1. 15:15, Jason published 3 MURs (7180, 7698, 7713) through the SMUR application.
  2. 16:04, Jason reported that MUR_7713 did not appear on the FEC website but did appear in EQS. MUR_7180 and MUR_7698 appeared on both EQS and the FEC website within 10 minutes.
  3. 16:20, we checked the Aurora database: MUR_7713 was not in the fecmur.case table on the prod, stage, or dev spaces.
  4. 16:36, Jason unpublished and republished MUR_7713 (thanks, Jason). This time we saw MUR_7713 in the database on all 3 spaces (see database result below).
  5. 16:44, Pat manually reloaded MUR_7713 to Elasticsearch in prod, because celery-beat had not triggered the task after 8 minutes. MUR_7713 then appeared on the FEC website.
  6. On the stage and dev spaces, celery-beat/celery-worker didn't trigger the task until
    after 18:00, about 2 hours later. MUR_7713 then appeared on the stage and dev websites.
    https://stage.fec.gov/data/legal/matter-under-review/7713/
    https://dev.fec.gov/data/legal/matter-under-review/7713/

Anil (Contractor) response (09/20/2021 12:11pm):

There was a deadlock on the fecmur.entity Oracle table during publish on 9/16. This is the first we have seen of this error that I can recall.
2021-09-16 15:16:10,748 ERROR [smurExecutor-5] g.f.s.s.PublishingServiceImpl [PublishingServiceImpl.java:456] smurapp-prod-use1-001.itss.fec.gov Failed to publish MUR-7713 from postgres. Cause: org.postgresql.util.PSQLException: ERROR: deadlock detected
Detail: Process 29496 waits for ShareLock on transaction 80657975; blocked by process 29502.
Process 29502 waits for ShareLock on transaction 80658001; blocked by process 29496.
Hint: See server log for query details.
Where: while inserting index tuple (13,117) in relation "entity"
SQL statement "INSERT INTO FECMUR.ENTITY AS A (
entity_id,
first_name,
middle_name,
last_name,
prefix,
suffix,
name,
TYPE)
VALUES (3102668,
E'James',
E'E. "Trey"',
E'Trainor',
E'Comm''r',
E'',
E'Trainor, James E. "Trey"',
E'Personnel')
ON CONFLICT (entity_id)
DO UPDATE SET (
first_name,
middle_name,
last_name,
prefix,
suffix,
name,
type) = (
E'James',
E'E. "Trey"',
E'Trainor',
E'Comm''r',
E'',
E'Trainor, James E. "Trey"',
E'Personnel')
WHERE A.entity_id = 3102668"
PL/pgSQL function inline_code_block line 360 at SQL statement
org.springframework.dao.DeadlockLoserDataAccessException: StatementCallback; SQL

fec-jli commented Sep 27, 2021

Analyze MUR 7875 in Kibana logs:
1) 2021-09-20 14:12:55.935 EST -- published MUR 7875 (request to SMUR application and update database, published_flg=false)
2) 2021-09-20 14:15:00.106 EST -- MUR 7875 found modified at 2021-09-20 14:12:55.935 (Celery beat --> task)
3) 2021-09-20 14:15:00.337 EST -- Found an unpublished case - deleting MUR 7875
4) 2021-09-20 14:15:01.438 EST -- deleting error: mur not exist (not in ES yet, ignore error)

5) 2021-09-20 14:20:00.103 EST -- MUR 7875 found modified at 2021-09-20 14:12:55.935 (Celery task again after 5 mins)
6) 2021-09-20 14:20:00.205 EST -- Found an unpublished case - deleting MUR 7875 (try to delete again)
7) 2021-09-20 14:20:00.954 EST -- deleting error: mur not exist (not in ES yet, ignore error)

....
8) 2021-09-20 14:40:00.102 EST -- MUR 7875 found modified at 2021-09-20 14:36:20.306 (published_flg=true)
9) 2021-09-20 14:40:01.449 EST -- Loading MUR 7875 (first time to load to ES)
...... re-load to ES every 5 mins

10) 2021-09-20 18:35:00.106 EST -- MUR 7875 found modified at 2021-09-20 14:36:20.306 (last time Celery task run)
11) 2021-09-20 18:35:01.249 EST -- Loading MUR 7875 (last time load to ES, beat schedule ends at 19:00)

--------- Questions ----------
1) From 14:15:00 to 18:35:00:
the celery-worker task ran 51 times (3 times with published_flg=false, 48 times with published_flg=true)
and loaded the case to Elasticsearch 48 times.
celery-beat schedule: 1 run/5 min ==> 12 runs/hour ==> 48 runs/4 hours
This overloads both celery-worker and Elasticsearch.

2) Worst case: if a case is published at 10am,
that one case will be processed 96 times and loaded to ES 96 times.

3) Can we consider reducing the schedule frequency?
Running every 10 minutes would cut the load by 50%.
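
The run counts above can be sanity-checked with a few lines of Python. This is only a sketch: `count_runs` is a hypothetical helper written for this comment, and it assumes that a `minute="*/N"` crontab fires on every minute divisible by N, with the window taken from the first and last published_flg=true runs in the log (14:40 and 18:35).

```python
from datetime import datetime, timedelta


def count_runs(first: datetime, last: datetime, every_minutes: int) -> int:
    """Count firings of a minute='*/N' schedule in [first, last], inclusive.

    Assumes the schedule fires whenever the minute of the hour is
    divisible by every_minutes, which holds for values that divide 60.
    """
    runs = 0
    tick = first.replace(second=0, microsecond=0)
    while tick <= last:
        if tick.minute % every_minutes == 0:
            runs += 1
        tick += timedelta(minutes=1)
    return runs


if __name__ == "__main__":
    first_load = datetime(2021, 9, 20, 14, 40)
    last_load = datetime(2021, 9, 20, 18, 35)
    print("every 5 min: ", count_runs(first_load, last_load, 5))
    print("every 10 min:", count_runs(first_load, last_load, 10))
```

Under these assumptions the 5-minute schedule gives 48 runs in that window and a 10-minute schedule gives 24, which is consistent with the 50% reduction suggested in question 3.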

[Screenshots: mur7875_1, mur7875_2, mur7875_3, mur7875_4, mur7875_5]

fec-jli commented Sep 27, 2021

Analyze MUR 6828
From 15:00:00 to 19:00:03:
the celery-worker task ran 49 times
and loaded the case to Elasticsearch 46 times.
[Screenshots: mur_6828_1, mur_6828_2]

@fec-jli fec-jli changed the title from "[WIP]Send Slack alerts when found modified legal case (ADR/AF/MUR)" to "Send Slack alerts when found modified legal case (ADR/AF/MUR)" Sep 27, 2021
pkfec commented Sep 28, 2021

@fec-jli It appears that the Elasticsearch service is required when testing this PR in a local environment. Can you update your Option 1 test instructions?

@pkfec pkfec left a comment

I have tested this in my local environment. New alerts appeared in Slack #test-bot with the case numbers and timestamps of the cases most recently published in the Aurora DB.

Works as expected. Awesome work @fec-jli

@lbeaufort lbeaufort left a comment

Great work, @fec-jli!

@lbeaufort lbeaufort merged commit 61fe6f6 into develop Sep 30, 2021
@lbeaufort lbeaufort deleted the feature/error_handle branch October 20, 2021 17:10