Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Race Condition in WorkflowDefinitionActivity Causing Execution Failures #5314

Closed
sfmskywalker opened this issue May 1, 2024 · 1 comment · Fixed by #5315
Closed
Assignees
Labels
bug Something isn't working elsa 3 This issue is specific to Elsa 3
Milestone

Comments

@sfmskywalker
Copy link
Member

Description

When running workflows in parallel with caching enabled, there is an observed race condition where WorkflowDefinitionActivity instances fail to execute properly due to internal state changes. This issue particularly arises when the cached workflow definitions expire while a workflow instance is still executing, leading to discrepancies between the workflow graph in the execution context and newly materialized workflow definition objects.

Problem Details

During the execution of workflows:

  • A WorkflowExecutionContext is created.
  • This context initializes a Graph of ActivityNodes.
  • Each ActivityNode references an IActivity instance derived from the Workflow object.
  • Lookups are performed by IActivity instance into the Graph.

Issues occur when:

  • The cached Workflow entry is evicted and a new Workflow object is materialized during ongoing workflow execution.
  • This results in WorkflowDefinitionActivity's InitializeAsync method being invoked more than once, causing the Root property to be overwritten with a new instance.
  • The newly scheduled execution of this Root instance fails as it does not match the existing workflow graph.

Symptoms

  • Workflows suddenly stop executing with no errors or exceptions.
  • This problem may still occur with caching disabled if there's code that produces new Workflow objects.

Reproduction and Findings

  • A component test was designed to reproduce this race condition by sending numerous HTTP requests in parallel while invalidating the cache.
  • The test confirmed that when WorkflowDefinitionActivity dynamically sets its Root property based on the cached workflow definition, concurrent executions can interfere with each other, leading to execution failures.

Solution

The fix involves:

  • Ensuring that workflow objects and their collections of activities, once loaded into memory, do not change their internal state.
  • Throwing an exception when an activity scheduled for execution is not part of the workflow graph.
  • Optimizing the building of the workflow graph by constructing it concurrently with the Workflow object materialization, thereby reducing the overhead and potential for error in subsequent executions.

The solution has been validated by both the component test and practical applications within the Elsa reference app.

Impact

This fix not only addresses the root cause of the race condition but also improves performance by reducing the frequency of graph constructions per workflow execution.

@sfmskywalker sfmskywalker added the bug Something isn't working label May 1, 2024
@sfmskywalker sfmskywalker self-assigned this May 1, 2024
@sfmskywalker sfmskywalker added the elsa 3 This issue is specific to Elsa 3 label May 1, 2024
Copy link

glime-ai bot commented May 1, 2024

Based on the detailed description of the problem, it seems that the race condition is caused by the concurrent execution of workflows where the cached workflow definitions are being evicted and new instances are materialized during the ongoing execution. This leads to inconsistencies between the workflow graph in the execution context and the newly materialized workflow definition objects.

To address this issue, the proposed solution involves making sure that once workflow objects and their collections of activities are loaded into memory, they do not change their internal state. Additionally, an exception should be thrown when an activity scheduled for execution is not part of the workflow graph. By optimizing the building of the workflow graph to occur concurrently with the Workflow object materialization, the overhead and potential for error in subsequent executions are reduced.

This solution should prevent workflows from stopping unexpectedly and improve overall performance by minimizing the need to reconstruct the workflow graph for each execution. The fix has been validated through component testing and practical application within the Elsa reference app, indicating that it effectively resolves the race condition issue.

If you are experiencing this issue, it is recommended to update to the latest version of Elsa where this fix has been implemented. If the problem persists or if you have a different scenario, please provide additional details so that further assistance can be offered.

This comment was generated by Glime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working elsa 3 This issue is specific to Elsa 3
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

1 participant