Error on retrieving write buffer from log stream #10141

Closed
Zelldon opened this issue Aug 22, 2022 · 7 comments · Fixed by #23371
Assignees
Labels
component/partition-transitions · component/zeebe · good first issue · kind/bug · severity/low · version:8.6.4

Comments

@Zelldon
Member

Zelldon commented Aug 22, 2022

Describe the bug

On becoming leader, the LogStream seems to be closed and throws an exception when the CommandAPI wants to create a new writer.

To Reproduce

Not 100% sure, but it looks like it is related to concurrent transitions: becoming leader and going inactive at the same time.
In general, I can't really see an impact/effect, since the cluster was deleted almost immediately afterward.

(screenshots: metrics, metrics2)

After the error happens, we also see a lot of occurrences of #8606, which might be related.

Expected behavior

No error; the case should be handled more gracefully.

Log/Stacktrace

Full Stacktrace

java.lang.RuntimeException: Actor is closed
	at io.camunda.zeebe.logstreams.impl.log.LogStreamImpl.newLogStreamWriter(LogStreamImpl.java:203) ~[zeebe-logstreams-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.logstreams.impl.log.LogStreamImpl.newLogStreamRecordWriter(LogStreamImpl.java:181) ~[zeebe-logstreams-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.broker.transport.commandapi.CommandApiServiceImpl.lambda$onBecomingLeader$3(CommandApiServiceImpl.java:112) ~[zeebe-broker-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorControl.lambda$call$0(ActorControl.java:123) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorJob.invoke(ActorJob.java:86) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorJob.execute(ActorJob.java:45) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorTask.execute(ActorTask.java:119) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorThread.executeCurrentTask(ActorThread.java:106) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorThread.doWork(ActorThread.java:87) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]
	at io.camunda.zeebe.scheduler.ActorThread.run(ActorThread.java:198) ~[zeebe-scheduler-8.1.0-alpha4.jar:8.1.0-alpha4]

Logs: https://drive.google.com/drive/u/0/folders/1DKM8gPL92xcdWeVZNfLQYWEdCHKGij4r

Error group: https://console.cloud.google.com/errors/detail/COaxsKTpuvHoOQ;service=zeebe;time=P7D?project=camunda-cloud-240911

Environment:

  • OS: ultrachaos
  • Zeebe Version: 8.1.0-alpha4
  • Configuration: trial
@Zelldon Zelldon added the kind/bug and severity/low labels Aug 22, 2022
@megglos
Contributor

megglos commented Sep 2, 2022

Triage: It shouldn't be a severe issue; the cluster should recover from this. Still, it might indicate an underlying problem.
We can close it and wait for error reporting to reopen it.

@megglos megglos closed this as completed Sep 2, 2022
@remcowesterhoud
Contributor

remcowesterhoud commented Sep 28, 2022

This occurred again on 8.1.0-alpha5. @megglos Do you want me to reopen the issue now that it reappeared?
https://console.cloud.google.com/errors/detail/COaxsKTpuvHoOQ;service=zeebe;time=P7D?project=camunda-cloud-240911

@megglos megglos reopened this Sep 28, 2022
@Zelldon
Member Author

Zelldon commented Nov 28, 2022

@Zelldon
Member Author

Zelldon commented Dec 27, 2022

@deepthidevaki
Contributor

Happened in 8.1.13 during shutdown. No impact. Shutdown was completed successfully.

@romansmirnov romansmirnov added the component/zeebe label Mar 5, 2024
@npepinpe npepinpe added the good first issue label May 13, 2024
@npepinpe
Member

Here's a possible reproducer:

/*
 * Copyright Camunda Services GmbH and/or licensed to Camunda Services GmbH under
 * one or more contributor license agreements. See the NOTICE file distributed
 * with this work for additional information regarding copyright ownership.
 * Licensed under the Camunda License 1.0. You may not use this file
 * except in compliance with the Camunda License 1.0.
 */
package io.camunda.zeebe.broker.transport.commandapi;

import static org.assertj.core.api.Assertions.assertThat;

import io.camunda.zeebe.broker.system.configuration.QueryApiCfg;
import io.camunda.zeebe.engine.state.QueryService;
import io.camunda.zeebe.logstreams.impl.flowcontrol.RateLimit;
import io.camunda.zeebe.logstreams.util.ListLogStorage;
import io.camunda.zeebe.logstreams.util.SyncLogStream;
import io.camunda.zeebe.scheduler.future.ActorFuture;
import io.camunda.zeebe.scheduler.testing.ControlledActorSchedulerExtension;
import io.camunda.zeebe.stream.api.StreamClock;
import io.camunda.zeebe.transport.ServerTransport;
import java.time.Duration;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.RegisterExtension;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoSettings;
import org.mockito.quality.Strictness;

@MockitoSettings(strictness = Strictness.STRICT_STUBS)
final class CommandApiServiceImplTest {
  @RegisterExtension
  private final ControlledActorSchedulerExtension scheduler =
      new ControlledActorSchedulerExtension();

  @Mock private ServerTransport transport;
  @Mock private QueryService queryService;

  @Test
  void shouldThrowException() {
    // given
    final var logStream =
        SyncLogStream.builder()
            .withActorSchedulingService(scheduler.scheduler())
            .withClock(StreamClock.system())
            .withMaxFragmentSize(1024 * 1024)
            .withLogStorage(new ListLogStorage())
            .withPartitionId(1)
            .withWriteRateLimit(RateLimit.disabled())
            .withLogName("test")
            .build();
    final var service =
        new CommandApiServiceImpl(transport, scheduler.scheduler(), new QueryApiCfg());
    scheduler.submitActor(service);
    scheduler.workUntilDone();

    // when - `onBecomingLeader` enqueues a call to the underlying actor, but the log stream 
    // is closed before the actor is scheduled (via workUntilDone), so an exception is thrown
    final ActorFuture<?> result = service.onBecomingLeader(1, 1, logStream, queryService);
    logStream.close();
    scheduler.workUntilDone();

    // then
    assertThat(result).failsWithin(Duration.ofSeconds(1));
  }
}

As for solving the issue, we have two options:

  1. Call onBecomingLeader not with a LogStream, but with the writer already created. This is now possible since the LogStream is entirely synchronous. However, this is not easily backportable.
  2. Have the caller of CommandApiServiceImpl#onBecomingLeader handle errors properly: if it fails, there is a certain class of exceptions (e.g. the log stream is closed) that may be OK to ignore (see the sketch below). For this, though, we need to propagate the error from the log stream all the way up to the caller.

Additionally, in CommandApiServiceImpl#onBecomingLeader (and possibly onBecomingFollower?), an uncaught exception is not handled at all, and the returned future is never completed. We should also ensure that the returned future is always completed.
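
For option 2, a rough sketch of what the caller-side handling could look like; the names transitionFuture and LOG are placeholders, and treating RecoverablePartitionTransitionException as the "ignorable" class is an assumption for illustration, not what the code currently does:

service
    .onBecomingLeader(partitionId, term, logStream, queryService)
    .onComplete(
        (ok, error) -> {
          if (error == null) {
            // the writer and subscriptions were registered, nothing more to do
            return;
          }
          if (error instanceof RecoverablePartitionTransitionException) {
            // the log stream was closed because the transition was cancelled;
            // a later transition will re-run this step, so we only log it
            LOG.debug("Ignoring failure of cancelled leader transition", error);
          } else {
            // anything else is unexpected and should fail the transition
            transitionFuture.completeExceptionally(error);
          }
        });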

@entangled90 entangled90 self-assigned this Oct 9, 2024
@entangled90
Contributor

entangled90 commented Oct 9, 2024

Solution 2 is probably the most general and will likely be useful for other issues as well, while solution 1 is more or less only valid for this specific situation. However, solution 2 requires being more careful about which exceptions are thrown, etc.
I see that a couple of exceptions are already used to classify recoverable vs. unrecoverable errors:

  • UnrecoverableException
  • RecoverablePartitionTransitionException

Regarding the last point, the current code creates a CompletableActorFuture that is manually completed at the end of the task (if no exception is thrown), ignoring the future returned by actor.call().

public ActorFuture<Void> onBecomingLeader(
      final int partitionId,
      final long term,
      final LogStream logStream,
      final QueryService queryService) {
    final CompletableActorFuture<Void> future = new CompletableActorFuture<>();
    actor.call(
        () -> {
          leadPartitions.add(partitionId);
          queryHandler.addPartition(partitionId, queryService);
          serverTransport.subscribe(partitionId, RequestType.QUERY, queryHandler);

          final var logStreamWriter = logStream.newLogStreamWriter();
          commandHandler.addPartition(partitionId, logStreamWriter);
          serverTransport.subscribe(partitionId, RequestType.COMMAND, commandHandler);
          future.complete(null);
        });
    return future;
  }

If we eliminate the first future and return the one from actor.call() instead, that future is always completed, even in case of exceptions (the actor takes care of it).
Can the first future be safely removed?
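
For illustration, a sketch of that variant, assuming actor.call(Runnable) returns an ActorFuture<Void> that is failed when the runnable throws (the writer is also created first here, so a failure leaves no partial subscriptions behind):

public ActorFuture<Void> onBecomingLeader(
      final int partitionId,
      final long term,
      final LogStream logStream,
      final QueryService queryService) {
    // return the actor.call() future directly: if newLogStreamWriter() throws
    // because the log stream is already closed, the actor fails this future
    // instead of leaving a manually created one incomplete forever
    return actor.call(
        () -> {
          // create the writer first, so a failure here leaves no half-registered
          // subscriptions behind
          final var logStreamWriter = logStream.newLogStreamWriter();

          leadPartitions.add(partitionId);
          queryHandler.addPartition(partitionId, queryService);
          serverTransport.subscribe(partitionId, RequestType.QUERY, queryHandler);

          commandHandler.addPartition(partitionId, logStreamWriter);
          serverTransport.subscribe(partitionId, RequestType.COMMAND, commandHandler);
        });
  }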

entangled90 added a commit that referenced this issue Oct 11, 2024
When a transition to leader is cancelled, the logStream will be
closed, throwing an exception when creating a writer.

The returned future is now failed when such an error happens instead of
never completing.
Moreover, the writer is instantiated as a first step, so in case of
errors no subscriptions to the other components are registered.

closes #10141
entangled90 added a commit that referenced this issue Oct 18, 2024
entangled90 added a commit that referenced this issue Oct 21, 2024
entangled90 added a commit that referenced this issue Oct 21, 2024
github-merge-queue bot pushed a commit that referenced this issue Oct 21, 2024
…23371)

## Description
When a transition to leader is cancelled, the logStream will be
closed, throwing an exception when creating a writer.

To avoid concurrency issues, the `CommandApiServiceImpl` is no longer a
`PartitionListener` but a `PartitionTransitionStep`: this should avoid
the race, since transition steps are executed serially.

## Related issues
closes #10141
backport-action pushed a commit that referenced this issue Oct 21, 2024

closes #10141

(cherry picked from commit a00267f)
github-merge-queue bot pushed a commit that referenced this issue Oct 21, 2024
… change is cancelled (#23775)

# Description
Backport of #23371 to `stable/8.6`.

relates to #10141
original author: @entangled90
github-merge-queue bot pushed a commit that referenced this issue Oct 22, 2024
… change is cancelled (#23774)

# Description
Backport of #23371 to `stable/8.5`.

relates to #10141
original author: @entangled90