[QUERY] Application took 15 minutes to update checkpoint #18014

Closed
neallee2012 opened this issue Dec 8, 2020 · 5 comments
Assignees
Labels
bug This issue requires a change to an existing behavior in the product in order to be resolved. Client This issue points to a problem in the data-plane of the library. Event Hubs needs-author-feedback Workflow: More information is needed from author to address the issue. no-recent-activity There has been no recent activity on this issue.

Comments

@neallee2012

neallee2012 commented Dec 8, 2020

Describe the bug

Receive event: "12/7/2020, 6:21:43.610"
Finish processing event: "12/7/2020, 6:21:43.669"
Update Checkpoint: "12/7/2020, 6:37:26.121"

The checkpoint write was stuck in the application for about 15 minutes. The processor was created with an ownership expiration of partitionOwnershipExpirationInterval(Duration.ofSeconds(150)).
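To make the size of the gap concrete, here is a minimal sketch that computes the intervals between the three timestamps above and compares the stall with the 150-second ownership expiration. The class name `CheckpointGap` and the timestamp pattern are assumptions made for illustration; the log lines carry no AM/PM marker, so a 24-hour clock is assumed (the differences are the same either way).

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the reporter's application.
final class CheckpointGap {
    // Assumed pattern for timestamps such as "12/7/2020, 6:21:43.610".
    private static final DateTimeFormatter LOG_FORMAT =
            DateTimeFormatter.ofPattern("M/d/yyyy, H:mm:ss.SSS");

    // Returns the elapsed time between two log timestamps.
    static Duration between(String earlier, String later) {
        return Duration.between(
                LocalDateTime.parse(earlier, LOG_FORMAT),
                LocalDateTime.parse(later, LOG_FORMAT));
    }

    public static void main(String[] args) {
        Duration processing = between("12/7/2020, 6:21:43.610", "12/7/2020, 6:21:43.669");
        Duration stall = between("12/7/2020, 6:21:43.669", "12/7/2020, 6:37:26.121");
        Duration ownershipExpiration = Duration.ofSeconds(150);

        System.out.println("processing ms: " + processing.toMillis());   // 59
        System.out.println("checkpoint stall ms: " + stall.toMillis());  // 942452
        System.out.println("stall exceeds ownership expiration: "
                + (stall.compareTo(ownershipExpiration) > 0));           // true
    }
}
```

Processing took 59 ms, while the checkpoint update returned roughly 15 minutes 42 seconds later, more than six times the configured ownership expiration.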


To Reproduce

Keep updating the checkpoint; the delay happens occasionally.

Code Snippet

public final Consumer<EventContext> PARTITION_PROCESSOR = eventContext -> {
        Map<String, Object> metricMap = new HashMap<>();

        EventData event = eventContext.getEventData();
        metricMap.put(OWNER_ID, ownerId);
        metricMap.put(OFFSET, event.getOffset());
        metricMap.put(SEQUENCE_NUMBER, event.getSequenceNumber());
        metricMap.put(ENQUEUE_TIME, event.getEnqueuedTime().getEpochSecond());
        metricMap.put(PARTITION_ID, eventContext.getPartitionContext().getPartitionId());
        metricMap.put(EVENT_ID, String.format("%s_%s_%d", metricMap.get(OWNER_ID), metricMap.get(PARTITION_ID), (long) metricMap.get(SEQUENCE_NUMBER)));

        String receivedEventContext = metricMap.entrySet().stream().map((entry) -> (entry.getKey() + "=" + entry.getValue().toString())).sorted().collect(Collectors.joining(" & "));
        LOG.info("receive event: {}={}", EVENT_ID, metricMap.get(EVENT_ID));

        metricMap.put(EVENT_HUB_SINK_COUNT, "0");
        metricMap.put(EVENT_HUB_SINK_TIME, "0");
        metricMap.put(BLOB_STORAGE_SINK_SIZE, "0");
        metricMap.put(BLOB_STORAGE_SINK_TIME, "0");
        metricMap.put(TRANSFORMATION_TIME, "0");

        long lag = Instant.now().getEpochSecond() - (long) metricMap.get(ENQUEUE_TIME);

        statsd.recordGaugeValue("xxxxxxx", lag, "partition_id:" + metricMap.get(PARTITION_ID));

        metricMap.put(EXECUTION_TIME, getExecutionTime(() -> processEvent(event.getBody(), metricMap)));

        String executionDetails = metricMap.entrySet().stream().map((entry) -> (entry.getKey() + "=" + entry.getValue().toString())).sorted().collect(Collectors.joining(" & "));
        LOG.info("finish event processing: " + executionDetails);

        eventContext.updateCheckpoint();
        LOG.info("update checkpoint: {}={}", EVENT_ID, metricMap.get(EVENT_ID));
    };

Expected behavior

It should either fail or succeed in a shorter time.

Setup (please complete the following information):

  • OS: Linux
  • Version of the Library used: 5.3.1
@YijunXieMS
Contributor

Hi @alzimmermsft,
The updateCheckpoint in the Event Hubs blob checkpoint store calls the blob storage API blobAsyncClient.setMetadata(metadata);
Among many calls, one was stuck for 15 minutes and then succeeded. Other calls made around the same time on different threads succeeded immediately. Do you know of any similar cases?

Another question: can we configure a timeout for the above API? If it timed out much sooner, say within a few seconds, updateCheckpoint would not block for 15 minutes. An error is much better than a 15-minute wait.
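As an illustration of the "fail fast instead of waiting" behavior asked about here, the following is a minimal caller-side sketch that bounds a blocking call with a deadline using only the JDK. It is not the SDK's timeout mechanism: `BoundedCall` and `runWithin` are hypothetical names, and the sleeping task merely stands in for a call that hangs. Cancelling the future only interrupts the worker thread; whether an underlying network request is actually aborted depends on the code being called.

```java
import java.time.Duration;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of any Azure SDK.
final class BoundedCall {
    // Daemon threads, so a task that never returns cannot keep the JVM alive.
    private static final ExecutorService WORKERS = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "bounded-call");
        thread.setDaemon(true);
        return thread;
    });

    // Returns true if the task finished within the timeout, false if it was abandoned.
    static boolean runWithin(Runnable task, Duration timeout) {
        Future<?> future = WORKERS.submit(task);
        try {
            future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; effective only if the task honors interruption
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        Runnable fast = () -> { };
        Runnable stuck = () -> {
            try {
                Thread.sleep(60_000); // stand-in for a call that hangs
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println("fast completed: " + runWithin(fast, Duration.ofSeconds(2)));
        System.out.println("stuck completed: " + runWithin(stuck, Duration.ofMillis(200)));
    }
}
```

A handler could wrap its checkpoint call the same way and log or retry when `runWithin` returns false, at the cost of one extra thread per in-flight call.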

@alzimmermsft
Member

Based on its dependencies, azure-messaging-eventhubs-checkpointstore-blob resolves to azure-core-http-netty 1.5.4. Newer releases of that library (1.6.0 and above) added logic to time out requests and responses after a certain period of inactivity. Upgrading to a newer version of azure-storage-blob or azure-core-http-netty should bring in this functionality, which should reduce the chance of hitting a 15-minute wait in the application.

@YijunXieMS
Contributor

@neallee2012 Following Alan's answer above, could you add azure-storage-blob 12.9.0 to your pom.xml as an interim step until we release a new version of the blob checkpoint store that uses storage 12.9.0?

<!-- https://mvnrepository.com/artifact/com.azure/azure-storage-blob -->
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-storage-blob</artifactId>
    <version>12.9.0</version>
</dependency>

@srnagar
Member

srnagar commented Feb 9, 2021

@neallee2012 Do you have any updates? Did you try the new storage blob version?

@ghost

ghost commented Mar 2, 2021

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

@ghost ghost closed this as completed Mar 16, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023