Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TDR's GET /repository/v1/snapshots takes too long #46

Closed
hannes-ucsc opened this issue Oct 3, 2021 · 3 comments
Closed

TDR's GET /repository/v1/snapshots takes too long #46

hannes-ucsc opened this issue Oct 3, 2021 · 3 comments
Assignees
Labels
debt [type] A defect incurring continued engineering cost infra [subject] Project infrastructure like CI/CD, build and deployment scripts orange [process] Done by the Azul team spike:2 [process] Spike estimate of two points

Comments

@hannes-ucsc
Copy link
Collaborator

Azul IT failure in sandbox, subtest catalog_complete:

https://gitlab.dev.singlecell.gi.ucsc.edu/ucsc/azul/-/jobs/30266#L537

TDR takes 20s for the first page, and 10s into waiting for the second page, the Lambda execution times out and so does API Gateway, hence the 504.

CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-sandbox, /aws/apigateway/azul-service-sandbox
start-time: 2021-10-03T05:22:20.657Z
end-time: 2021-10-03T05:35:24.325Z
query-string:

fields @timestamp, @message
# | filter (@message like 'Received' and @message like 'GET') or status = 504
| filter @requestId = 'ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c'
| sort @timestamp desc
| limit 200

@timestamp @message
2021-10-03 05:23:36.229 END RequestId: ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c
2021-10-03 05:23:36.229 REPORT RequestId: ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c Duration: 31032.17 ms Billed Duration: 31000 ms Memory Size: 1024 MB Max Memory Used: 149 MB
2021-10-03 05:23:36.229 2021-10-03T05:23:36.228Z ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c Task timed out after 31.03 seconds
2021-10-03 05:23:25.546 [DEBUG] 2021-10-03T05:23:25.546Z ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'offset': 100, 'limit': 100}, headers=None, body=None)
2021-10-03 05:23:25.545 [DEBUG] 2021-10-03T05:23:25.545Z ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c _request(…) -> b'{"total":138,"filteredTotal":138,"items":[{"id":"93dc1e76-8f1c-4949-8f9b-07a087f3ce7b","name":"WorkflowAnalysisSnapshot_bam","description":"Snapshot for public dataset 1000 Genomes.","createdDate":"2020-04-27T19:45:07.628232Z","profileId":"390e7a85-d47f...'
2021-10-03 05:23:05.209 [DEBUG] 2021-10-03T05:23:05.209Z ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'offset': 0, 'limit': 100}, headers=None, body=None)
2021-10-03 05:23:05.198 [INFO] 2021-10-03T05:23:05.197Z ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c Authenticated request as OAuth2(access_token='REDACTED')
2021-10-03 05:23:05.197 [INFO] 2021-10-03T05:23:05.197Z ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c Received GET request for '/index/files', with query {"catalog": "it2", "filters": "{}", "size": "100"} and headers {"accept-encoding": "identity", "authorization": "Bearer REDACTED", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-country": "US", "host": "service.sandbox.dev.singlecell.gi.ucsc.edu", "user-agent": "python-urllib3/1.26.5", "via": "1.1 ca8d1424de70ce439236d37048e65f54.cloudfront.net (CloudFront)", "x-amz-cf-id": "FbxN6vVYXryNxRL9AkNMfj03HVQSxiUVF119vgzBIIva4M3bO_kuLQ==", "x-amzn-trace-id": "Root=1-61593e39-0fb134250b0f9d29436726d7", "x-forwarded-for": "52.20.215.168, 70.132.60.159", "x-forwarded-port": "443", "x-forwarded-proto": "https"}.
2021-10-03 05:23:05.193 START RequestId: ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c Version: $LATEST

@hannes-ucsc hannes-ucsc changed the title TDR GET /repository/v1/snapshots TDR GET /repository/v1/snapshots takes too long Oct 3, 2021
@hannes-ucsc hannes-ucsc changed the title TDR GET /repository/v1/snapshots takes too long TDR's GET /repository/v1/snapshots takes too long Oct 3, 2021
@hannes-ucsc hannes-ucsc added the orange [process] Done by the Azul team label Oct 3, 2021
hannes-ucsc added a commit to DataBiosphere/azul that referenced this issue Oct 3, 2021
Make sure that all dev snapshots fit on one page.
@hannes-ucsc
Copy link
Collaborator Author

hannes-ucsc commented Oct 3, 2021

Spike to mine CloudWatch and Gitlab logs for dev and sandbox to see how prevalent this is. I want to understand if this is just a temporary condition or if this request has been slow for a while now. The number of snapshots increased significantly as part of @aherbst-broad team's effort to copy snapshots from prod for #17.

@amarjandu amarjandu added infra [subject] Project infrastructure like CI/CD, build and deployment scripts and removed infra [subject] Project infrastructure like CI/CD, build and deployment scripts labels Oct 5, 2021
@theathorn theathorn added debt [type] A defect incurring continued engineering cost spike:2 [process] Spike estimate of two points labels Oct 6, 2021
@achave11-ucsc
Copy link

Spike report:

Failures per day in dev:
CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-dev
Relative: 12M
query-string:

filter @message like "Task timed out after"
| stats count(*) as exceptionCount by bin(24h) as day
| sort day desc
day exceptionCount
2021-10-09 1
2021-10-06 1
2021-10-05 2
2021-10-04 45
2021-10-03 32
2021-10-02 49
2021-09-30 1
2021-09-28 16
2021-09-26 1
2021-09-25 1
2021-09-21 1
2021-09-20 1
2021-09-19 1
2021-09-17 1
2021-09-14 16
2021-09-13 1
2021-09-03 13
2021-09-02 70
2021-08-09 231
2021-08-07 1049
2021-08-06 1
2021-08-04 4
2021-08-03 9
2021-08-02 14
2021-07-25 3
2021-07-23 3
2021-07-22 3
2021-07-20 1
2021-07-14 2
2021-07-13 2
2021-07-10 2
2021-06-30 2
2021-05-01 1
2021-03-17 1
2020-12-17 8
2020-12-16 4
2020-12-12 1
2020-11-17 11
2020-11-16 6
2020-11-13 1
2020-11-10 1
2020-11-04 1
2020-10-19 1

From the above we can deduce that this sort of failures might have been lingering even before the increase in number of snapshots. Next, a subset of failing executions of 2021-08-09 for dev to make a few observations.

CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-dev
start-time: 2021-08-07T13:30:24
end-time: 2021-08-09T12:02:15
query-string:

fields @timestamp, @message
# | filter @message like "Task timed out after"
| filter (@requestId like 'eb8a3f23' 
or @requestId like 'a45564d6'
or @requestId like 'fb5d42f7'
or @requestId like '2ebb6ab8'
or @requestId like 'f68dfbe9'
or @requestId like 'b80cf29a'
or @requestId like 'ee044031'
or @requestId like 'e4ffd794'
or @requestId like 'eb39ea99'
or @requestId like '1f06eb67'
or @requestId like '82437c90'
or @requestId like 'dfdeaf51'
or @requestId like '44e15480'
or @requestId like 'f5d56e44')
| sort @timestamp asc
| limit 23
@timestamp @message
2021-08-09 17:35:02.055 [INFO] 2021-08-09T17:35:02.055Z 44e15480-dc1d-4173-ad4f-0454f74de8b9 Received HEAD request for '/index/files', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:02.085 [INFO] 2021-08-09T17:35:02.084Z dfdeaf51-44dc-40fe-9819-6528ee9f9698 Received HEAD request for '/index/bundles', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:02.797 [INFO] 2021-08-09T17:35:02.797Z 82437c90-075e-431e-8ac5-ebc82cde8174 Received HEAD request for '/index/samples', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:02.945 [DEBUG] 2021-08-09T17:35:02.944Z 44e15480-dc1d-4173-ad4f-0454f74de8b9 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:03.095 [DEBUG] 2021-08-09T17:35:03.095Z dfdeaf51-44dc-40fe-9819-6528ee9f9698 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:03.754 [DEBUG] 2021-08-09T17:35:03.753Z 82437c90-075e-431e-8ac5-ebc82cde8174 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:33.086 2021-08-09T17:35:33.086Z 44e15480-dc1d-4173-ad4f-0454f74de8b9 Task timed out after 31.03 seconds

2021-08-09 17:35:33.115 2021-08-09T17:35:33.115Z dfdeaf51-44dc-40fe-9819-6528ee9f9698 Task timed out after 31.03 seconds

2021-08-09 17:35:33.824 2021-08-09T17:35:33.823Z 82437c90-075e-431e-8ac5-ebc82cde8174 Task timed out after 31.03 seconds

2021-08-09 17:35:51.482 [INFO] 2021-08-09T17:35:51.482Z 1f06eb67-a089-4bd3-b927-1c825231f7da Received HEAD request for '/index/samples', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:51.600 [INFO] 2021-08-09T17:35:51.600Z eb39ea99-c001-4ff2-9d16-cc31479c6895 Received HEAD request for '/index/files', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:51.622 [INFO] 2021-08-09T17:35:51.622Z e4ffd794-3d5c-4764-8899-b15e6a986247 Received HEAD request for '/index/bundles', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:52.463 [DEBUG] 2021-08-09T17:35:52.463Z 1f06eb67-a089-4bd3-b927-1c825231f7da _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:52.578 [DEBUG] 2021-08-09T17:35:52.578Z eb39ea99-c001-4ff2-9d16-cc31479c6895 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:52.620 [DEBUG] 2021-08-09T17:35:52.619Z e4ffd794-3d5c-4764-8899-b15e6a986247 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:55.405 [INFO] 2021-08-09T17:35:55.405Z f5d56e44-27fe-4369-a95a-0b067a02dc97 Received HEAD request for '/index/samples', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:55.419 [INFO] 2021-08-09T17:35:55.419Z ee044031-f96b-4dec-b6a8-df7426fee01d Received HEAD request for '/index/files', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:55.452 [INFO] 2021-08-09T17:35:55.452Z b80cf29a-76d4-4f30-8c41-dd3c0f33fa22 Received HEAD request for '/index/bundles', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:56.352 [DEBUG] 2021-08-09T17:35:56.351Z f5d56e44-27fe-4369-a95a-0b067a02dc97 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:56.371 [DEBUG] 2021-08-09T17:35:56.370Z ee044031-f96b-4dec-b6a8-df7426fee01d _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:56.431 [DEBUG] 2021-08-09T17:35:56.431Z b80cf29a-76d4-4f30-8c41-dd3c0f33fa22 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:36:19.070 [INFO] 2021-08-09T17:36:19.070Z f68dfbe9-e99f-4f2e-9964-1d6b8780e680 Received HEAD request for '/index/bundles', with query {"size": "1"} and headers {…}.
2021-08-09 17:36:19.179 [INFO] 2021-08-09T17:36:19.178Z 2ebb6ab8-5ae0-4029-8cfc-e5ff94bbf9bf Received HEAD request for '/index/files', with query {"size": "1"} and headers {"accept": "*/*", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-country": "US", "host": "service.dev.singlecell.gi.ucsc.edu", "user-agent": "python-requests/2.22.0", "via": "1.1 284f4720cb0b553d351a28630b21584a.cloudfront.net (CloudFront)", "x-amz-cf-id": "ckhRx3xho-1ghhyI4PAiXrfNFyuzBlVUMh9pLhMnlfqngEQIRgzVIw==", "x-amzn-trace-id": "Root=1-61116793-0da7f27d59016054706444f0", "x-forwarded-for": "3.89.136.165, 70.132.59.82", "x-forwarded-port": "443", "x-forwarded-proto": "https"}.

From the above one is able to see the quick succession in which these failing request are executed. Non failing request in the same interval also need to be accounted. There were no IT failures for dev in GitLab for this observed period. Given that most of the errors happen for HEAD requests with parameter of size=1 (health status checks?), it is acceptable for the errors to not be represented by IT failures.

While the provided data doesn't lead to conclusive results or the actual cause of the failure, a few observations can be made.

  1. Failures happen for health requests (HEAD request w/ size=1) beside normal runs and IT test.
  2. All HEAD request with size=1 seem to list all snapshots from TDR, (is it absolutely necessary?)
  3. Failures seem to appear on requests that happen in quick succession (see times in table above). (Or they failed because they simultaneously requested TDR snapshot listing?).
  4. While listing TDR snapshots through service (indirectly) failed occasionally, the IT specific to that functionality seems to have been consistently successful (visible in sandbox runs).

Infrequent failures in IT for sandbox (which disappear after a re-run or two) seem to be caused by this sporadic failure in TDR snapshot list. This infrequent failures aren't similar in shape, meaning that sometimes catalog_complete fails other times is catalog_complete among other variations. Making it difficult to establish a connection with the cause of the failures.

Failures per day in sandbox:
CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-sandbox
Relative: 12M
query-string:

filter @message like "Task timed out after"
| stats count(*) as exceptionCount by bin(24h) as day
| sort day desc
day exceptionCount
2021-10-09 1
2021-10-06 2
2021-10-04 52
2021-10-03 53
2021-10-02 54
2021-10-01 1
2021-09-29 1
2021-09-28 13
2021-09-26 1
2021-09-24 1
2021-09-19 1
2021-09-16 1
2021-09-14 16
2021-09-03 4
2021-09-02 82
2021-08-09 281
2021-08-07 1203
2021-08-06 2
2021-08-05 3
2021-08-04 8
2021-08-03 9
2021-08-02 7
2021-07-31 1
2021-07-25 2
2021-07-23 2
2021-07-22 4
2021-07-21 1
2021-01-30 4
2021-01-12 1
2021-01-09 29
2021-01-08 47
2020-11-02 1
2020-10-22 1

We can observe that sandbox and dev deployments experience higher failures on the same days.

@achave11-ucsc
Copy link

achave11-ucsc commented Oct 12, 2021

@hannes-ucsc: "Based on @achave11's analysis this issue isn't very prevalent at this time. The change from 7bd4d4e also helps alleviate the issue."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
debt [type] A defect incurring continued engineering cost infra [subject] Project infrastructure like CI/CD, build and deployment scripts orange [process] Done by the Azul team spike:2 [process] Spike estimate of two points
Projects
None yet
Development

No branches or pull requests

4 participants